Run LTRharvest to predict putative LTR Retrotransposons

This function implements an interface between R and the LTRharvest command line tool to predict putative LTR retrotransposons from R.

LTRharvest(
  genome.file,
  index.file = NULL,
  range = c(0, 0),
  seed = 30,
  minlenltr = 100,
  maxlenltr = 3500,
  mindistltr = 4000,
  maxdistltr = 25000,
  similar = 70,
  mintsd = 4,
  maxtsd = 20,
  vic = 60,
  overlaps = "no",
  xdrop = 5,
  mat = 2,
  mis = -2,
  ins = -3,
  del = -3,
  motif = NULL,
  motifmis = 0,
  output.path = NULL,
  verbose = TRUE
)

Arguments

genome.file	path to the genome file in `fasta` format.
index.file	specify the name of the enhanced suffix array index file that is computed by `suffixerator`. This option can be used if the index file was generated previously, e.g. during an earlier call of this function. In that case the suffix array index file does not need to be re-computed for new analyses, which is particularly useful when running `LTRharvest` with different parameter settings. Default is `index.file = NULL`.
range	define the genomic interval in which predicted LTR retrotransposons shall be reported. If `range[1] = 1000` and `range[2] = 10000`, then candidates are only reported if they start after position 1000 and end before position 10000 in their respective sequence coordinates. If `range[1] = 0` and `range[2] = 0`, i.e. `range = c(0,0)` (default), then the entire genome is scanned.
seed	the minimum length for the exact maximal repeats. Only repeats with the specified minimum length are considered in all subsequent analyses. Default is `seed = 30`.
minlenltr	minimum LTR length. Default is `minlenltr = 100`.
maxlenltr	maximum LTR length. Default is `maxlenltr = 3500`.
mindistltr	minimum distance of LTR starting positions. Default is `mindistltr = 4000`.
maxdistltr	maximum distance of LTR starting positions. Default is `maxdistltr = 25000`.
similar	minimum similarity value between the two LTRs in percent. Default is `similar = 70`.
mintsd	minimum target site duplication (TSD) length. If no search for TSDs shall be performed, then specify `mintsd = NULL`. Default is `mintsd = 4`.
maxtsd	maximum target site duplication (TSD) length. If no search for TSDs shall be performed, then specify `maxtsd = NULL`. Default is `maxtsd = 20`.
vic	number of nucleotide positions left and right (the vicinity) of the predicted boundary of a LTR that will be searched for TSDs and/or one motif (if specified). Default is `vic = 60`.
overlaps	specify how overlapping LTR retrotransposon predictions shall be treated. If `overlaps = "no"` is selected, then neither nested nor overlapping predictions will be reported in the output. If `overlaps = "best"` is selected, then in the case of two or more nested or overlapping predictions, only the LTR retrotransposon prediction with the highest similarity between its LTRs will be reported. If `overlaps = "all"` is selected, then all LTR retrotransposon predictions will be reported whether or not they are nested and/or overlapping. Default is `overlaps = "no"`.
xdrop	specify the xdrop value (> 0) for extending a seed repeat in both directions allowing for matches, mismatches, insertions, and deletions. The xdrop extension process stops as soon as the extension involving matches, mismatches, insertions, and deletions has a score smaller than T - X, where T denotes the largest score seen so far and X is the xdrop value. Default is `xdrop = 5`.
mat	specify the positive match score for the X-drop extension process. Default is `mat = 2`.
mis	specify the negative mismatch score for the X-drop extension process. Default is `mis = -2`.
ins	specify the negative insertion score for the X-drop extension process. Default is `ins = -3`.
del	specify the negative deletion score for the X-drop extension process. Default is `del = -3`.
motif	specify 2 nucleotides for the starting motif and 2 nucleotides for the ending motif at the beginning and the end of each LTR, respectively. Only palindromic motif sequences - where the motif sequence is equal to its complementary sequence read backwards - are allowed, e.g. `motif = "tgca"`. Type the nucleotides without any space separating them. If this option is set but no allowed number of mismatches is specified by the argument `motifmis`, then a search for the exact motif will be conducted. If `motif = NULL` (default), then no motif is specified and candidate pairs will not be screened for potential motifs.
motifmis	allowed number of mismatches in the motif specified in `motif`. The number of mismatches needs to be in the range [0,3]. Default is `motifmis = 0`.
output.path	a path/folder to store all results returned by `LTRharvest`. If `output.path = NULL` (default), then a folder with the name of the input genome file will be generated in the current working directory of R and all results are then stored in this folder.
verbose	logical value indicating whether or not detailed information shall be printed on the console. Default is `verbose = TRUE`.
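
The stopping rule described for `xdrop` can be illustrated with a small R sketch. This is not the LTRharvest implementation: it is a simplified, ungapped extension to the right of a seed that scores only matches and mismatches (the real extension also allows insertions and deletions, scored by `ins` and `del`). The function name `xdrop_extend` and its return format are hypothetical and exist only for this illustration.

```r
# Simplified, ungapped X-drop extension to the right of a seed.
# Assumption: only matches (`mat`) and mismatches (`mis`) are scored;
# LTRharvest itself also considers insertions and deletions.
xdrop_extend <- function(a, b, xdrop = 5, mat = 2, mis = -2) {
  a <- strsplit(a, "")[[1]]
  b <- strsplit(b, "")[[1]]
  n <- min(length(a), length(b))

  score    <- 0  # running score of the current extension
  best     <- 0  # T: largest score seen so far
  best_len <- 0  # extension length at which T was reached

  for (i in seq_len(n)) {
    step  <- if (a[i] == b[i]) mat else mis
    score <- score + step
    if (score > best) {
      best     <- score
      best_len <- i
    }
    # Stop as soon as the score drops below T - X.
    if (score < best - xdrop) break
  }

  list(length = best_len, score = best)
}

# Eight matching positions followed by mismatches only:
res <- xdrop_extend("acgtacgtTTTT", "acgtacgtGGGG")
res$length  # 8
res$score   # 16
```

With the default `xdrop = 5`, a single mismatch inside an otherwise matching region does not stop the extension, whereas a run of three mismatches after the best-scoring position does.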

Value

The LTRharvest function generates the following output files:

*_BetweenLTRSeqs.fsa : DNA sequences of the region between the LTRs in fasta format.
*_Details.tsv : A spreadsheet containing detailed information about the predicted LTRs.
*_FullLTRRetrotransposonSeqs.fsa : DNA sequences of the entire predicted LTR retrotransposon.
*_index.fsa : The suffix array index file used to predict putative LTR retrotransposons with LTRharvest.
*_Prediction.gff : An annotation file in GFF format containing additional detailed information about the predicted LTRs (partially redundant with the *_Details.tsv file).

The ' * ' is a placeholder for the name of the input genome file.

Details

The LTRharvest function provides an interface to the LTRharvest command line tool and furthermore takes care of the entire folder handling, output parsing, and data processing of the LTRharvest prediction.

Internally, a folder named output.path_ltrharvest is generated and all files produced by LTRharvest are stored in this folder. These files (see section Value) are then parsed and returned by this function as a list of data.frames.
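
As a hedged sketch of how a previously computed index file might be reused across parameter settings, the following R code runs the function once per setting and inspects the returned list. The file paths, the `settings` list, the `run_` folder prefix, and the helper `run_setting` are hypothetical; whether `index.file` expects exactly this kind of path is an assumption, as is the `LTRpred::` namespace (taken from the package name used in the Examples section). The calls are guarded so that nothing runs unless the package and both files are available.

```r
# Hypothetical paths; replace them with your own files.
genome <- "path/to/Hsapiens_ChrY.fa"
index  <- "path/to/Hsapiens_ChrY_index.fsa"  # assumed to come from an earlier run

# Parameter settings to compare (argument names as in the usage above).
settings <- list(
  strict  = list(similar = 85),
  relaxed = list(similar = 70)
)

# Run one setting, writing each run to its own output folder.
run_setting <- function(name) {
  do.call(
    LTRpred::LTRharvest,
    c(
      list(
        genome.file = genome,
        index.file  = index,
        output.path = paste0("run_", name)
      ),
      settings[[name]]
    )
  )
}

# Only executed if LTRpred is installed and both files exist; the
# LTRharvest command line tool must also be installed.
if (requireNamespace("LTRpred", quietly = TRUE) &&
    file.exists(genome) && file.exists(index)) {
  ids     <- stats::setNames(names(settings), names(settings))
  results <- lapply(ids, run_setting)
  str(results, max.level = 2)  # one list of data.frames per setting
}
```

Giving each run its own `output.path` keeps the result files of one setting from being overwritten by the next.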

LTRharvest can be used independently or as an initial pre-computation step for detecting LTR retrotransposons with LTRdigest.

References

D Ellinghaus, S Kurtz and U Willhoeft. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics (2008). 9:18.

Most argument specifications are adapted from the User manual of LTRharvest.

Author

Hajk-Georg Drost

Examples

if (FALSE) {

# Run LTRharvest for the partial H. sapiens Y chromosome using standard parameters
LTRharvest(genome.file = system.file("Hsapiens_ChrY.fa", package = "LTRpred"))
}
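
Because `motif` only accepts palindromic sequences (equal to their own reverse complement), it can be convenient to check a candidate motif before passing it to the function. The helper below is a minimal base R sketch; `is_palindromic_motif` is a hypothetical name and is not part of the package.

```r
# Check whether a 4-nucleotide motif equals its reverse complement.
# Hypothetical helper for illustration; not part of LTRpred.
is_palindromic_motif <- function(motif) {
  motif <- tolower(motif)
  # Require exactly four unambiguous nucleotides.
  if (!grepl("^[acgt]{4}$", motif)) {
    return(FALSE)
  }
  complement <- chartr("acgt", "tgca", motif)
  reverse_complement <- paste(rev(strsplit(complement, "")[[1]]), collapse = "")
  identical(motif, reverse_complement)
}

is_palindromic_motif("tgca")  # TRUE
is_palindromic_motif("tgaa")  # FALSE
```

A motif that passes this check can then be supplied as `motif = "tgca"`, optionally together with `motifmis`.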