Main pipeline to perform sufficient LTR retrotransposon predictions for any genome of interest.
LTRpred( genome.file = NULL, centromere_start = NULL, centromere_end = NULL, ltr_age_estimation = TRUE, model = "K80", mutation_rate = 1.3 * 1e-07, index.file.harvest = NULL, index.file.digest = NULL, LTRdigest.gff = NULL, tabout.file = NULL, LTRharvest.folder = NULL, LTRpred.folder = NULL, orf.file = NULL, annotate = NULL, Dfam.db = NULL, dfam.eval = 0.001, dfam.file = NULL, cluster = FALSE, clust.sim = 0.9, clust.file = NULL, copy.number.est = FALSE, fix.chr.name = FALSE, cn.eval = 1e-05, range = c(0, 0), seed = 30, minlenltr = 100, maxlenltr = 3500, mindistltr = 4000, maxdistltr = 25000, similar = 70, mintsd = 4, maxtsd = 20, vic = 60, overlaps = "no", xdrop = 5, mat = 2, mis = -2, ins = -3, del = -3, motif = NULL, motifmis = 0, aaout = "yes", aliout = "yes", pptlen = c(8, 30), uboxlen = c(3, 30), pptradius = 30, trnas = NULL, pbsalilen = c(11, 30), pbsoffset = c(0, 5), pbstrnaoffset = c(0, 5), pbsmaxedist = 1, pbsradius = 30, hmms = NULL, pdomevalcutoff = 1e-05, pbsmatchscore = 5, pbsmismatchscore = -10, pbsinsertionscore = -20, pbsdeletionscore = -20, pfam.ids = NULL, cores = 1, dfam.cores = NULL, hmm.cores = NULL, orf.style = 7, min.codons = 200, trans.seqs = FALSE, output.path = NULL, quality.filter = TRUE, n.orfs = 0, job_num = 1, verbose = TRUE )
genome.file | path to the genome file in |
---|---|
centromere_start | a numeric vector storing the centromere start coordinates in the |
centromere_end | a numeric vector storing the centromere end coordinates in the |
ltr_age_estimation | a logical value indicating wether or not ltr-age estimation shall be performed (default |
model | a model as specified in
|
mutation_rate | a mutation rate per site per year. For retrotransposons the default is \(mutation_rate = 1.3 * 10E-8\) (Wicker and Keller, 2007). |
index.file.harvest | specify the name of the enhanced suffix array index file that is computed
by |
index.file.digest | specify the name of the enhanced suffix array index file that is computed
by |
LTRdigest.gff | path to the LTRdigest generated GFF file, in case LTRdigest files were pre-computed previously. |
tabout.file | path to the LTRdigest generated tabout file file, in case LTRdigest files were pre-computed previously. |
LTRharvest.folder | either |
LTRpred.folder | name/path of/to an existing LTRpred folder. |
orf.file | path to the file generated by |
annotate | annotation database that shall be queried to annotate predicted LTR transposons.
Default is |
Dfam.db | folder path to the local Dfam database or |
dfam.eval | E-value threshhold to perform HMMer search against the Dfam database. |
dfam.file | path to pre-computed |
cluster | shall predicted transposons be clustered with |
clust.sim | cluster reject if sequence similarity is lower than this threshold when performing clustering with |
clust.file | file path to pre-computed clustering file generated by |
copy.number.est | shall copy number estimation (including solo LTR prediction) be performed? Default is |
fix.chr.name | shall chromosome names be fixed? |
cn.eval | evalue for copy number estimation (BLAST hit threshold). |
range | define the genomic interval in which predicted LTR transposons shall be reported
. In case |
seed | the minimum length for the exact maximal repeats. Only repeats with the specified minimum length are considered in all subsequent analyses. Default is |
minlenltr | minimum LTR length. Default is |
maxlenltr | maximum LTR length. Default is |
mindistltr | minimum distance of LTR starting positions. Default is |
maxdistltr | maximum distance of LTR starting positions. Default is |
similar | minimum similarity value between the two LTRs in percent. |
mintsd | minimum target site duplications (TSDs) length. If no search for TSDs
shall be performed, then specify |
maxtsd | maximum target site duplications (TSDs) length. If no search for TSDs
shall be performed, then specify |
vic | number of nucleotide positions left and right (the vicinity) of the predicted
boundary of a LTR that will be searched for TSDs and/or one motif (if specified).
Default is |
overlaps | specify how overlapping LTR retrotransposon predictions shall be treated.
If |
xdrop | specify the xdrop value (> 0) for extending a seed repeat in both directions
allowing for matches, mismatches, insertions, and deletions. The xdrop extension process
stops as soon as the extension involving matches, mismatches, insersions, and deletions
has a score smaller than T - X, where T denotes the largest score seen so far. Default is |
mat | specify the positive match score for the X-drop extension process. Default is |
mis | specify the negative mismatch score for the X-drop extension process. Default is |
ins | specify the negative insertion score for the X-drop extension process. Default is |
del | specify the negative deletion score for the X-drop extension process. Default is |
motif | specify 2 nucleotides for the starting motif and 2 nucleotides for the ending
motif at the beginning and the ending of each LTR, respectively.
Only palindromic motif sequences - where the motif sequence is equal to its complementary
sequence read backwards - are allowed, e.g. |
motifmis | allowed number of mismatches in the TSD motif specified in |
aaout | shall the protein sequence of the HMM matches to the predicted LTR transposon
be generated as fasta file or not. Options are |
aliout | shall the alignment of the protein sequence of the HMM matches to the predicted LTR transposon
be generated as fasta file or not. Options are |
pptlen | a two dimensional numeric vector specifying the minimum and maximum allowed
lengths for PPT predictions. If a purine-rich region that does not fulfill this range is
found, it will be discarded. Default is |
uboxlen | a two dimensional numeric vector specifying the minimum and maximum allowed
lengths for U-box predictions. If a T-rich region preceding a PPT that does not fulfill the PPT length criteria is
found, it will be discarded. Default is |
pptradius | a numeric value specifying the area around the 3' LTR beginning to be
considered when searching for PPT. Default value is |
trnas | path to the fasta file storing the unique tRNA sequences that shall be matched to the predicted LTR transposon (tRNA library). |
pbsalilen | a two dimensional numeric vector specifying the minimum and maximum allowed
lengths for PBS/tRNA alignments. If the local alignments are shorter or longer than this
range, it will be discarded. Default is |
pbsoffset | a two dimensional numeric vector specifying the minimum and maximum allowed
distance between the start of the PBS and the 3' end of the 5' LTR. Local alignments not
fulfilling this criteria will be discarded. Default is |
pbstrnaoffset | a two dimensional numeric vector specifying the minimum and maximum allowed
PBS/tRNA alignment offset from the 3' end of the tRNA. Local alignments not
fulfilling this criteria will be discarded. Default is |
pbsmaxedist | a numeric value specifying the maximal allowed unit edit distance in a local PBS/tRNA alignment. |
pbsradius | a numeric value specifying the area around the 5' LTR end to be
considered when searching for PBS Default value is |
hmms | a character string or a character vector storing either the hmm files for
searching internal domains between the LTRs of predicted LTR transposons or a vector of
Pfam IDs from http://pfam.xfam.org/ that are downloaded and used to search for corresponding protein domains
within the predicted LTR transposons. As an option users can rename all of their hmm files
so that they start for example with the name |
pdomevalcutoff | a numeric value specifying the E-value cutoff for corresponding HMMER searches. All hits that do not fulfill this criteria are discarded. Default is |
pbsmatchscore | specify the match score used in the PBS/tRNA Smith-Waterman alignment.
Default is |
pbsmismatchscore | specify the mismatch score used in the PBS/tRNA Smith-Waterman alignment.
Default is |
pbsinsertionscore | specify the insertion score used in the PBS/tRNA Smith-Waterman alignment.
Default is |
pbsdeletionscore | specify the deletion score used in the PBS/tRNA Smith-Waterman alignment.
Default is |
pfam.ids | a character vector storing the Pfam IDs from http://pfam.xfam.org/ that shall be downloaded and used to perform protein domain searches within the sequences between the predicted LTRs. |
cores | number of cores to be used for multicore processing. |
dfam.cores | number of cores to be used for multicore processing when running Dfam query (in case |
hmm.cores | number of cores to be used for multicore processing when performing hmmer protein search with |
orf.style | type of predicting open reading frames (see documentation of USEARCH). |
min.codons | minimum number of codons in the predicted open reading frame. |
trans.seqs | logical value indicating wheter or not predicted open reading frames shall be translated and the corresponding protein sequences stored in the output folder. |
output.path | a path/folder to store all results returned by |
quality.filter | shall false positives be filtered out as much as possible? Default is |
n.orfs | minimum number of Open Reading Frames that must be found between the LTRs (if |
job_num | a job number in case this function is run in parallel mode in |
verbose | shall further information be printed on the console or not. |
The LTRpred
function generates the following output files:
*_BetweenLTRSeqs.fsa : DNA sequences predicted LTR retrotransposons, in particular of the region between the LTRs in fasta format.
*_Details.tsv : A spread sheet containing detailed information about the predicted LTRs.
*_FullLTRRetrotransposonSeqs.fsa : DNA sequences of the entire predicted LTR retrotransposon.
*_index.fsa.* : The suffixarray index file used to predict putative LTR retrotransposons.
*_Prediction.gff : A spread sheet containing detailed additional information about the predicted LTRs (partially redundant with the *_Details.tsv file).
*_index_ltrdigest.fsa : The suffixarray index file used to predict putative LTR retrotransposonswith LTRdigest
.
*_LTRdigestPrediction.gff : A spread sheet containing detailed information about the predicted LTRs.
*-ltrdigest_tabout.csv : A spread sheet containing additional detailed information about the predicted LTRs.
*-ltrdigest_complete.fas : The full length DNA sequences of all predicted LTR transposons.
*-ltrdigest_conditions.csv : Contains information about the parameters used for a given
LTRdigest
run.
*-ltrdigest_pbs.fas : Stores the predicted PBS sequences for the putative LTR retrotransposons.
*-ltrdigest_ppt.fas : Stores the predicted PPT sequences for the putative LTR retrotransposons.
*-ltrdigest_5ltr.fas and *-ltrdigest_3ltr.fas: Stores the predicted 5' and 3' LTR sequences. Note: If the direction of the putative retrotransposon could be predicted, these
files will contain the corresponding 3' and 5' LTR sequences. If no direction could be
predicted, forward direction with regard to the original sequence will be assumed by
LTRdigest
, i.e. the 'left' LTR will be considered the 5' LTR.
*-ltrdigest_pdom_<domainname>.fas : Stores the DNA sequences of the HMM matches to the LTR retrotransposon candidates.
*-ltrdigest_pdom_<domainname>_aa.fas : Stores the concatenated protein sequences of the HMM matches to the LTR retrotransposon candidates.
*-ltrdigest_pdom_<domainname>_ali.fas : Stores the alignment information for all matches of the given protein domain model to the translations of all candidates.
*_ORF_prediction_nt.fsa : Stores the predicted open reading frames within the predicted LTR transposons as DNA sequence.
*_ORF_prediction_aa.fsa : Stores the predicted open reading frames within the predicted LTR transposons as protein sequence.
*_LTRpred.gff : Stores the LTRpred
predicted LTR transposons in GFF format.
*_LTRpred.bed : Stores the LTRpred
predicted LTR transposons in BED format.
*_LTRpred_DataSheet.tsv : Stores the output table as data sheet.
The ' * ' is an place holder for the name of the input genome file.
If annotate = "Dfam" In case the Dfam annotation option is specified the following additional files are generated and stored in the LTRpred result folder:
*-ltrdigest_complete.fas_DfamAnnotation.out : a data table storing the output of the HMMer search of predicted retrotransposons against the Dfam database.
if cluster = TRUE
*-ltrdigest_complete.fas_CLUSTpred.blast6out : usearch cluster result in BLAST table format.
*-ltrdigest_complete.fas_CLUSTpred.log : log file of usearch run.
*-ltrdigest_complete.fas_CLUSTpred.uc : usearch cluster output file.
This function provides the main pipeline to perform de novo
LTR transposon
predictions.
Quality Control
ltr.similarity
: Minimum similarity between LTRs. All TEs not matching this
criteria are discarded.
n.orfs
: minimum number of Open Reading Frames that must be found between the
LTRs. All TEs not matching this criteria are discarded.
PBS or Protein Match
: elements must either have a predicted Primer Binding
Site or a protein match of at least one protein (Gag, Pol, Rve, ...) between their LTRs. All TEs not matching this criteria are discarded.
The relative number of N's (= nucleotide not known) in TE <= 0.1. The relative number of N's is computed as follows: absolute number of N's in TE / width of TE.
R Edgar. Search and clustering orders of magnitude faster than BLAST. Bioinformatics (2010) 26 (19): 2460-2461.
D Ellinghaus, S Kurtz and U Willhoeft. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics (2008). 9:18.
S Steinbiss et al. Fine-grained annotation and classification of de novo predicted LTR retrotransposons. Nucl. Acids Res. (2009) 37 (21): 7002-7013.
Hajk-Georg Drost
if (FALSE) { # generate de novo LTR transposon prediction LTRpred(genome.file = system.file("Hsapiens_ChrY.fa", package = "LTRpred"), trnas = system.file("hg38-tRNAs.fa", package = "LTRpred"), hmms = paste0(system.file("HMMs/", package = "LTRpred"),"hmm_*")) # run LTRpred with pre-computed predictions from LTRdigest() genome <- system.file("TAIR10_chr_all_LTRdigestPrediction.gff",package = "LTRpred") tabout <- system.file("TAIR10_chr_all-ltrdigest_tabout.csv",package = "LTRpred") orf.pred <- system.file("nt.fa",package = "LTRpred") LTRpred(LTRdigest.gff = genome, tabout.file = tabout, orf.file = orf.pred) }