This function implements an interface between R and the LTRharvest command line tool to predict putative LTR retrotransposons from R.

LTRharvest(
  genome.file,
  index.file = NULL,
  range = c(0, 0),
  seed = 30,
  minlenltr = 100,
  maxlenltr = 3500,
  mindistltr = 4000,
  maxdistltr = 25000,
  similar = 70,
  mintsd = 4,
  maxtsd = 20,
  vic = 60,
  overlaps = "no",
  xdrop = 5,
  mat = 2,
  mis = -2,
  ins = -3,
  del = -3,
  motif = NULL,
  motifmis = 0,
  output.path = NULL,
  verbose = TRUE
)

Arguments

genome.file

path to the genome file in fasta format.

index.file

specify the name of the enhanced suffix array index file that is computed by suffixerator. This option can be used if the suffix array index file was generated previously, e.g. during an earlier call of this function. In that case the index file does not need to be re-computed for new analyses. This is particularly useful when running LTRharvest with different parameter settings.

range

define the genomic interval in which predicted LTR retrotransposons shall be reported. If range[1] = 1000 and range[2] = 10000, candidates are only reported if they start after position 1000 and end before position 10000 in their respective sequence coordinates. If range[1] = 0 and range[2] = 0, i.e. range = c(0, 0) (default), the entire genome is scanned.
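
The reporting rule for range can be sketched in base R as follows. This is only an illustration of the semantics described above, not the code used by LTRharvest; the helper filter_by_range and the columns start and end are hypothetical names.

```r
# Hypothetical helper illustrating the 'range' semantics on a table of candidates.
# 'candidates' is assumed to have numeric columns 'start' and 'end'.
filter_by_range <- function(candidates, range = c(0, 0)) {
  stopifnot(length(range) == 2)
  # range = c(0, 0) means: no restriction, report everything
  if (all(range == 0)) return(candidates)
  # keep candidates that start after range[1] and end before range[2]
  keep <- candidates$start > range[1] & candidates$end < range[2]
  candidates[keep, , drop = FALSE]
}

candidates <- data.frame(start = c(500, 1500, 8000),
                         end   = c(6000, 7000, 12000))
filter_by_range(candidates, range = c(1000, 10000))
```

With these made-up coordinates only the second candidate lies completely inside the interval.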

seed

the minimum length for the exact maximal repeats. Only repeats with the specified minimum length are considered in all subsequent analyses. Default is seed = 30.

minlenltr

minimum LTR length. Default is minlenltr = 100.

maxlenltr

maximum LTR length. Default is maxlenltr = 3500.

mindistltr

minimum distance of LTR starting positions. Default is mindistltr = 4000.

maxdistltr

maximum distance of LTR starting positions. Default is maxdistltr = 25000.

similar

minimum similarity value between the two LTRs in percent. Default is similar = 70.

mintsd

minimum target site duplication (TSD) length. If no search for TSDs shall be performed, then specify mintsd = NULL. Default is mintsd = 4.

maxtsd

maximum target site duplication (TSD) length. If no search for TSDs shall be performed, then specify maxtsd = NULL. Default is maxtsd = 20.

vic

number of nucleotide positions left and right (the vicinity) of the predicted boundary of an LTR that will be searched for TSDs and/or one motif (if specified). Default is vic = 60.
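
To make the roles of mintsd, maxtsd and vic more concrete, here is a naive base R sketch that looks for the longest word shared by the two vicinity sequences. It is an assumption-laden illustration, not the search performed by LTRharvest; find_tsd, left_flank and right_flank are hypothetical names, and the flanks stand for the vic nucleotides around the 5' and 3' boundaries.

```r
# Hypothetical helper: naive search for a target site duplication (TSD),
# i.e. a word of length mintsd..maxtsd present in both flanking regions.
find_tsd <- function(left_flank, right_flank, mintsd = 4, maxtsd = 20) {
  stopifnot(mintsd <= maxtsd)
  # try the longest allowed length first so the longest shared word wins
  for (k in seq(maxtsd, mintsd)) {
    if (k > nchar(left_flank) || k > nchar(right_flank)) next
    starts <- seq_len(nchar(left_flank) - k + 1)
    words <- substring(left_flank, starts, starts + k - 1)
    for (w in words) {
      if (grepl(w, right_flank, fixed = TRUE)) return(w)
    }
  }
  NA_character_  # no shared word of an allowed length
}

find_tsd("ccATTGCgg", "ttATTGCaa")
```

In this toy input the flanks share the 5-nucleotide word "ATTGC", which is what the sketch returns.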

overlaps

specify how overlapping LTR retrotransposon predictions shall be treated. If overlaps = "no" is selected, neither nested nor overlapping predictions will be reported in the output. If overlaps = "best" is selected, then for two or more nested or overlapping predictions only the LTR retrotransposon prediction with the highest similarity between its LTRs will be reported. If overlaps = "all" is selected, all LTR retrotransposon predictions will be reported, whether or not they are nested and/or overlapping. Default is overlaps = "no".
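
The three overlaps modes can be sketched on a small table of predictions. This is a minimal illustration under the assumption that predictions are grouped into clusters of mutually overlapping or nested intervals; resolve_overlaps and the columns start, end and similarity are hypothetical and do not reflect the internal implementation.

```r
# Hypothetical helper illustrating overlaps = "no" / "best" / "all".
# 'pred' is assumed to have numeric columns 'start', 'end' and 'similarity'.
resolve_overlaps <- function(pred, overlaps = c("no", "best", "all")) {
  overlaps <- match.arg(overlaps)
  if (overlaps == "all" || nrow(pred) < 2) return(pred)
  pred <- pred[order(pred$start), ]
  # a new cluster starts when a prediction begins after the furthest end seen so far
  furthest_end <- cummax(pred$end)
  new_cluster <- c(TRUE, pred$start[-1] > furthest_end[-nrow(pred)])
  cluster <- cumsum(new_cluster)
  if (overlaps == "no") {
    # drop every prediction that shares a cluster with another one
    cluster_size <- ave(cluster, cluster, FUN = length)
    return(pred[cluster_size == 1, ])
  }
  # overlaps == "best": keep the highest LTR similarity within each cluster
  is_best <- ave(pred$similarity, cluster,
                 FUN = function(s) seq_along(s) == which.max(s)) == 1
  pred[is_best, ]
}

pred <- data.frame(start      = c(100, 5000, 6000, 20000),
                   end        = c(4000, 9000, 12000, 26000),
                   similarity = c(90, 85, 95, 80))
resolve_overlaps(pred, "no")    # drops both overlapping predictions
resolve_overlaps(pred, "best")  # keeps the one with similarity 95
resolve_overlaps(pred, "all")   # keeps all four predictions
```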

xdrop

specify the xdrop value (> 0) for extending a seed repeat in both directions allowing for matches, mismatches, insertions, and deletions. The xdrop extension process stops as soon as the extension involving matches, mismatches, insertions, and deletions has a score smaller than T - X, where T denotes the largest score seen so far and X is the xdrop value. Default is xdrop = 5.

mat

specify the positive match score for the X-drop extension process. Default is mat = 2.

mis

specify the negative mismatch score for the X-drop extension process. Default is mis = -2.

ins

specify the negative insertion score for the X-drop extension process. Default is ins = -3.

del

specify the negative deletion score for the X-drop extension process. Default is del = -3.
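
The interplay of xdrop, mat and mis can be illustrated with a deliberately simplified extension that only considers matches and mismatches. Insertions and deletions (ins, del) are left out, so this is a gap-free sketch of the stopping rule rather than the extension performed by LTRharvest; xdrop_extend is a hypothetical helper.

```r
# Hypothetical helper: gap-free X-drop extension to the right of a seed.
# 'a' and 'b' are the sequences following the two copies of the seed repeat.
xdrop_extend <- function(a, b, xdrop = 5, mat = 2, mis = -2) {
  a <- strsplit(a, "")[[1]]
  b <- strsplit(b, "")[[1]]
  score <- 0
  best_score <- 0  # T: the largest score seen so far
  best_len <- 0    # extension length at which T was reached
  for (i in seq_len(min(length(a), length(b)))) {
    score <- score + if (a[i] == b[i]) mat else mis
    if (score > best_score) {
      best_score <- score
      best_len <- i
    }
    # stop as soon as the running score is smaller than T - X
    if (score < best_score - xdrop) break
  }
  list(length = best_len, score = best_score)
}

xdrop_extend("ACGTACGTTTTT", "ACGTACGAAAAA")
```

Here the first seven positions match (score 14); three consecutive mismatches then push the running score to 8, which is below 14 - 5, so the extension is cut back to length 7.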

motif

specify 2 nucleotides for the starting motif and 2 nucleotides for the ending motif at the beginning and the end of each LTR, respectively. Only palindromic motif sequences, where the motif sequence is equal to its complementary sequence read backwards, are allowed, e.g. motif = "tgca". Type the nucleotides without any space separating them. If this option is set but no allowed number of mismatches is specified by the argument motifmis, a search for the exact motif will be conducted. If motif = NULL (default), no motif is specified and candidate pairs will not be screened for potential motifs.
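
Whether a motif satisfies the palindrome requirement can be checked before calling the function. The following base R sketch tests that a 4-nucleotide motif equals its reverse complement; is_palindromic_motif is a hypothetical helper and not part of the package.

```r
# Hypothetical helper: TRUE if 'motif' consists of 4 nucleotides and
# is identical to its complementary sequence read backwards.
is_palindromic_motif <- function(motif) {
  motif <- tolower(motif)
  if (!grepl("^[acgt]{4}$", motif)) return(FALSE)
  complement <- chartr("acgt", "tgca", motif)
  rev_complement <- paste(rev(strsplit(complement, "")[[1]]), collapse = "")
  identical(motif, rev_complement)
}

is_palindromic_motif("tgca")  # TRUE
is_palindromic_motif("tgaa")  # FALSE
```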

motifmis

allowed number of mismatches in the motif specified by the motif argument. The number of mismatches must lie in the interval [0, 3]. Default is motifmis = 0.

output.path

a path/folder to store all results returned by LTRharvest. If output.path = NULL (Default) then a folder with the name of the input genome file will be generated in the current working directory of R and all results are then stored in this folder.

verbose

logical value indicating whether or not detailed information shall be printed on the console.

Value

The LTRharvest function generates the following output files:

  • *_BetweenLTRSeqs.fsa : DNA sequences of the region between the LTRs in fasta format.

  • *_Details.tsv : A spreadsheet containing detailed information about the predicted LTRs.

  • *_FullLTRRetrotransposonSeqs.fsa : DNA sequences of the entire predicted LTR retrotransposon.

  • *_index.fsa : The suffix array index file used to predict putative LTR retrotransposons with LTRharvest.

  • *_Prediction.gff : A spreadsheet containing additional detailed information about the predicted LTRs (partially redundant with the *_Details.tsv file).

The ' * ' is a placeholder for the name of the input genome file.
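
To check which of these files are present in a result folder, one could match on the file-name suffixes alone, which avoids any assumption about how the ' * ' prefix is derived. The helper find_ltrharvest_outputs and the dummy file name used in the demo are hypothetical.

```r
# Hypothetical helper: locate LTRharvest output files in 'folder'
# by the file-name suffixes listed above.
find_ltrharvest_outputs <- function(folder) {
  suffixes <- c(
    "_BetweenLTRSeqs.fsa",
    "_Details.tsv",
    "_FullLTRRetrotransposonSeqs.fsa",
    "_index.fsa",
    "_Prediction.gff"
  )
  files <- list.files(folder, full.names = TRUE)
  found <- lapply(suffixes, function(s) files[endsWith(files, s)])
  names(found) <- suffixes
  found
}

# Demo on a temporary folder containing one dummy file
demo_dir <- tempfile("ltrharvest_demo")
dir.create(demo_dir)
file.create(file.path(demo_dir, "genome_Details.tsv"))
outputs <- find_ltrharvest_outputs(demo_dir)
lengths(outputs)
```

Suffixes with a count of zero indicate output files that are missing from the folder.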

Details

The LTRharvest function provides an interface to the LTRharvest command line tool and furthermore takes care of the entire folder handling, output parsing, and data processing of the LTRharvest prediction.

Internally, a folder named output.path_ltrharvest is generated and all results returned by LTRharvest are stored in this folder. These files (see section Value) are then parsed and returned as a list of data.frames by this function.

LTRharvest can be used independently or as an initial pre-computation step for detecting LTR retrotransposons with LTRdigest.

References

D Ellinghaus, S Kurtz and U Willhoeft. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics (2008). 9:18.

Most argument specifications are adapted from the User manual of LTRharvest.

Author

Hajk-Georg Drost

Examples

if (FALSE) {

# Run LTRharvest for the partial H. sapiens Y chromosome using default parameters
LTRharvest(genome.file = system.file("Hsapiens_ChrY.fa", package = "LTRpred"))
}