Main pipeline to perform sufficient LTR retrotransposon predictions for any genome of interest.

LTRpred(
  genome.file = NULL,
  centromere_start = NULL,
  centromere_end = NULL,
  ltr_age_estimation = TRUE,
  model = "K80",
  mutation_rate = 1.3 * 1e-07,
  index.file.harvest = NULL,
  index.file.digest = NULL,
  LTRdigest.gff = NULL,
  tabout.file = NULL,
  LTRharvest.folder = NULL,
  LTRpred.folder = NULL,
  orf.file = NULL,
  annotate = NULL,
  Dfam.db = NULL,
  dfam.eval = 0.001,
  dfam.file = NULL,
  cluster = FALSE,
  clust.sim = 0.9,
  clust.file = NULL,
  copy.number.est = FALSE,
  fix.chr.name = FALSE,
  cn.eval = 1e-05,
  range = c(0, 0),
  seed = 30,
  minlenltr = 100,
  maxlenltr = 3500,
  mindistltr = 4000,
  maxdistltr = 25000,
  similar = 70,
  mintsd = 4,
  maxtsd = 20,
  vic = 60,
  overlaps = "no",
  xdrop = 5,
  mat = 2,
  mis = -2,
  ins = -3,
  del = -3,
  motif = NULL,
  motifmis = 0,
  aaout = "yes",
  aliout = "yes",
  pptlen = c(8, 30),
  uboxlen = c(3, 30),
  pptradius = 30,
  trnas = NULL,
  pbsalilen = c(11, 30),
  pbsoffset = c(0, 5),
  pbstrnaoffset = c(0, 5),
  pbsmaxedist = 1,
  pbsradius = 30,
  hmms = NULL,
  pdomevalcutoff = 1e-05,
  pbsmatchscore = 5,
  pbsmismatchscore = -10,
  pbsinsertionscore = -20,
  pbsdeletionscore = -20,
  pfam.ids = NULL,
  cores = 1,
  dfam.cores = NULL,
  hmm.cores = NULL,
  orf.style = 7,
  min.codons = 200,
  trans.seqs = FALSE,
  output.path = NULL,
  quality.filter = TRUE,
  n.orfs = 0,
  job_num = 1,
  verbose = TRUE
)

Arguments

genome.file

path to the genome file in fasta format.

centromere_start

a numeric vector storing the centromere start coordinates in the genome.file. The position in the numeric vector should correspond to the chromosome name in the genome.file fasta file. If centromere_start = NULL (default), then no centromeres will be drawn.

centromere_end

a numeric vector storing the centromere end coordinates in the genome.file. The position in the numeric vector should correspond to the chromosome name in the genome.file fasta file. If centromere_end = NULL (default), then no centromeres will be drawn.

ltr_age_estimation

a logical value indicating wether or not ltr-age estimation shall be performed (default ltr_age_estimation = TRUE).

model

a model as specified in dist.dna: a character string specifying the evolutionary model to be used - must be one of

  • K80 (the default)

  • raw

  • N

  • TS

  • TV

  • JC69

  • F81

  • K81

  • F84

  • BH87

  • T92

  • TN93

  • GG95

  • logdet

  • paralin

mutation_rate

a mutation rate per site per year. For retrotransposons the default is \(mutation_rate = 1.3 * 10E-8\) (Wicker and Keller, 2007).

index.file.harvest

specify the name of the enhanced suffix array index file that is computed by suffixerator for the use of LTRharvest. This often can be used in case the suffix file was previously generated, e.g. during a previous call of this function. In this case the suffix array index file does not need to be re-computed for new analyses. This is particularly useful when running LTRpred with different parameter settings. A possible index file name could be BASENAME_index.fsa.

index.file.digest

specify the name of the enhanced suffix array index file that is computed by suffixerator for the use of LTRdigest. This often can be used in case the suffix file was previously generated, e.g. during a previous call of this function. In this case the suffix array index file does not need to be re-computed for new analyses. This is particularly useful when running LTRpred with different parameter settings. A possible index file name could be BASENAME_index_ltrdigest.fsa.

LTRdigest.gff

path to the LTRdigest generated GFF file, in case LTRdigest files were pre-computed previously.

tabout.file

path to the LTRdigest generated tabout file file, in case LTRdigest files were pre-computed previously.

LTRharvest.folder

either LTRharvest.folder = NULL (default) or LTRharvest.folder = "skip" to skip moving a LTRharvest folder if it is not present.

LTRpred.folder

name/path of/to an existing LTRpred folder.

orf.file

path to the file generated by ORFpred, in case the orf prediction file was generated previously.

annotate

annotation database that shall be queried to annotate predicted LTR transposons. Default is annotate = NULL indicating that no annotation query is being performed. Possible options are: annotate = "Dfam" (here the Dfam database must be stored locally and a nhammer search is performed against the Dfam database) or annotate = "Repbase" (here the Repbase database must be stored locally and a blastn search is performed against the Repbase database). Please consult the vignettes for details.

Dfam.db

folder path to the local Dfam database or Dfam.db = "download" in case the Dfam database shall be automatically downloaded before performing query analyses.

dfam.eval

E-value threshhold to perform HMMer search against the Dfam database.

dfam.file

path to pre-computed dfam.query output file. Can only be used in combination with annotate = "Dfam".

cluster

shall predicted transposons be clustered with CLUSTpred.

clust.sim

cluster reject if sequence similarity is lower than this threshold when performing clustering with CLUSTpred.

clust.file

file path to pre-computed clustering file generated by CLUSTpred.

copy.number.est

shall copy number estimation (including solo LTR prediction) be performed? Default is copy.number.est = FALSE.

fix.chr.name

shall chromosome names be fixed?

cn.eval

evalue for copy number estimation (BLAST hit threshold).

range

define the genomic interval in which predicted LTR transposons shall be reported . In case range[1] = 1000 and range[2] = 10000 then candidates are only reported if they start after position 1000 and end before position 10000 in their respective sequence coordinates. If range[1] = 0 and range[2] = 0, so range = c(0,0) (default) then the entire genome is being scanned.

seed

the minimum length for the exact maximal repeats. Only repeats with the specified minimum length are considered in all subsequent analyses. Default is seed = 30.

minlenltr

minimum LTR length. Default is minlenltr = 100.

maxlenltr

maximum LTR length. Default is maxlenltr = 3500.

mindistltr

minimum distance of LTR starting positions. Default is mindistltr = 4000.

maxdistltr

maximum distance of LTR starting positions. Default is maxdistltr = 25000.

similar

minimum similarity value between the two LTRs in percent. similar = 70.

mintsd

minimum target site duplications (TSDs) length. If no search for TSDs shall be performed, then specify mintsd = NULL. Default is mintsd = 4.

maxtsd

maximum target site duplications (TSDs) length. If no search for TSDs shall be performed, then specify maxtsd = NULL. Default is maxtsd = 20.

vic

number of nucleotide positions left and right (the vicinity) of the predicted boundary of a LTR that will be searched for TSDs and/or one motif (if specified). Default is vic = 60.

overlaps

specify how overlapping LTR retrotransposon predictions shall be treated. If overlaps = "no" is selected, then neither nested nor overlapping predictions will be reported in the output. In case overlaps = "best" is selected then in the case of two or more nested or overlapping predictions, solely the LTR retrotransposon prediction with the highest similarity between its LTRs will be reported. If overlaps = "all" is selected then all LTR retrotransposon predictions will be reported whether there are nested and/or overlapping predictions or not. Default is overlaps = "best".

xdrop

specify the xdrop value (> 0) for extending a seed repeat in both directions allowing for matches, mismatches, insertions, and deletions. The xdrop extension process stops as soon as the extension involving matches, mismatches, insersions, and deletions has a score smaller than T - X, where T denotes the largest score seen so far. Default is xrop = 5.

mat

specify the positive match score for the X-drop extension process. Default is mat = 2.

mis

specify the negative mismatch score for the X-drop extension process. Default is mis = -2.

ins

specify the negative insertion score for the X-drop extension process. Default is ins = -3.

del

specify the negative deletion score for the X-drop extension process. Default is del = -3.

motif

specify 2 nucleotides for the starting motif and 2 nucleotides for the ending motif at the beginning and the ending of each LTR, respectively. Only palindromic motif sequences - where the motif sequence is equal to its complementary sequence read backwards - are allowed, e.g. motif = "tgca". Type the nucleotides without any space separating them. If this option is not selected by the user, candidate pairs will not be screened for potential motifs. If this options is set but no allowed number of mismatches is specified by the argument motifmis and a search for the exact motif will be conducted. If motif = NULL then no explicit motif is being specified.

motifmis

allowed number of mismatches in the TSD motif specified in motif. The number of mismatches needs to be between [0,3]. Default is motifmis = 0.

aaout

shall the protein sequence of the HMM matches to the predicted LTR transposon be generated as fasta file or not. Options are aaout = "yes" or aaout = "no".

aliout

shall the alignment of the protein sequence of the HMM matches to the predicted LTR transposon be generated as fasta file or not. Options are aaout = "yes" or aaout = "no".

pptlen

a two dimensional numeric vector specifying the minimum and maximum allowed lengths for PPT predictions. If a purine-rich region that does not fulfill this range is found, it will be discarded. Default is pptlen = c(8,30) (minimum = 8; maximum = 30).

uboxlen

a two dimensional numeric vector specifying the minimum and maximum allowed lengths for U-box predictions. If a T-rich region preceding a PPT that does not fulfill the PPT length criteria is found, it will be discarded. Default is uboxlen = c(3,30) (minimum = 3; maximum = 30).

pptradius

a numeric value specifying the area around the 3' LTR beginning to be considered when searching for PPT. Default value is pptradius = 30.

trnas

path to the fasta file storing the unique tRNA sequences that shall be matched to the predicted LTR transposon (tRNA library).

pbsalilen

a two dimensional numeric vector specifying the minimum and maximum allowed lengths for PBS/tRNA alignments. If the local alignments are shorter or longer than this range, it will be discarded. Default is pbsalilen = c(11,30) (minimum = 11; maximum = 30).

pbsoffset

a two dimensional numeric vector specifying the minimum and maximum allowed distance between the start of the PBS and the 3' end of the 5' LTR. Local alignments not fulfilling this criteria will be discarded. Default is pbsoffset = c(0,5) (minimum = 0; maximum = 5).

pbstrnaoffset

a two dimensional numeric vector specifying the minimum and maximum allowed PBS/tRNA alignment offset from the 3' end of the tRNA. Local alignments not fulfilling this criteria will be discarded. Default is pbstrnaoffset = c(0,5) (minimum = 0; maximum = 5).

pbsmaxedist

a numeric value specifying the maximal allowed unit edit distance in a local PBS/tRNA alignment.

pbsradius

a numeric value specifying the area around the 5' LTR end to be considered when searching for PBS Default value is pbsradius = 30.

hmms

a character string or a character vector storing either the hmm files for searching internal domains between the LTRs of predicted LTR transposons or a vector of Pfam IDs from http://pfam.xfam.org/ that are downloaded and used to search for corresponding protein domains within the predicted LTR transposons. As an option users can rename all of their hmm files so that they start for example with the name hmms = "hmm_*". This way all files starting with hmm_ will be considered for the subsequent protein domain search. In case Pfam IDs are specified, the LTRpred function will automatically download the corresponding HMM files and use them for further protein domain searches. In case users prefer to specify Pfam IDs please specify them in the pfam.ids parmeter and choose hmms = NULL.

pdomevalcutoff

a numeric value specifying the E-value cutoff for corresponding HMMER searches. All hits that do not fulfill this criteria are discarded. Default is pdomevalcutoff = 1E-5.

pbsmatchscore

specify the match score used in the PBS/tRNA Smith-Waterman alignment. Default is pbsmatchscore = 5.

pbsmismatchscore

specify the mismatch score used in the PBS/tRNA Smith-Waterman alignment. Default is pbsmismatchscore = -10.

pbsinsertionscore

specify the insertion score used in the PBS/tRNA Smith-Waterman alignment. Default is pbsinsertionscore = -20.

pbsdeletionscore

specify the deletion score used in the PBS/tRNA Smith-Waterman alignment. Default is pbsdeletionscore = -20.

pfam.ids

a character vector storing the Pfam IDs from http://pfam.xfam.org/ that shall be downloaded and used to perform protein domain searches within the sequences between the predicted LTRs.

cores

number of cores to be used for multicore processing.

dfam.cores

number of cores to be used for multicore processing when running Dfam query (in case annotate = "Dfam").

hmm.cores

number of cores to be used for multicore processing when performing hmmer protein search with LTRdigest.

orf.style

type of predicting open reading frames (see documentation of USEARCH).

min.codons

minimum number of codons in the predicted open reading frame.

trans.seqs

logical value indicating wheter or not predicted open reading frames shall be translated and the corresponding protein sequences stored in the output folder.

output.path

a path/folder to store all results returned by LTRharvest, LTRdigest, and LTRpred. If output.path = NULL (Default) then a folder with the name of the input genome file will be generated in the current working directory of R and all results are then stored in this folder.

quality.filter

shall false positives be filtered out as much as possible? Default is quality.filter = TRUE. See Description for details.

n.orfs

minimum number of Open Reading Frames that must be found between the LTRs (if quality.filter = TRUE). See Details for further information on quality control.

job_num

a job number in case this function is run in parallel mode in LTRpred.meta.

verbose

shall further information be printed on the console or not.

Value

The LTRpred function generates the following output files:

  • *_BetweenLTRSeqs.fsa : DNA sequences predicted LTR retrotransposons, in particular of the region between the LTRs in fasta format.

  • *_Details.tsv : A spread sheet containing detailed information about the predicted LTRs.

  • *_FullLTRRetrotransposonSeqs.fsa : DNA sequences of the entire predicted LTR retrotransposon.

  • *_index.fsa.* : The suffixarray index file used to predict putative LTR retrotransposons.

  • *_Prediction.gff : A spread sheet containing detailed additional information about the predicted LTRs (partially redundant with the *_Details.tsv file).

  • *_index_ltrdigest.fsa : The suffixarray index file used to predict putative LTR retrotransposonswith LTRdigest.

  • *_LTRdigestPrediction.gff : A spread sheet containing detailed information about the predicted LTRs.

  • *-ltrdigest_tabout.csv : A spread sheet containing additional detailed information about the predicted LTRs.

  • *-ltrdigest_complete.fas : The full length DNA sequences of all predicted LTR transposons.

  • *-ltrdigest_conditions.csv : Contains information about the parameters used for a given LTRdigest run.

  • *-ltrdigest_pbs.fas : Stores the predicted PBS sequences for the putative LTR retrotransposons.

  • *-ltrdigest_ppt.fas : Stores the predicted PPT sequences for the putative LTR retrotransposons.

  • *-ltrdigest_5ltr.fas and *-ltrdigest_3ltr.fas: Stores the predicted 5' and 3' LTR sequences. Note: If the direction of the putative retrotransposon could be predicted, these files will contain the corresponding 3' and 5' LTR sequences. If no direction could be predicted, forward direction with regard to the original sequence will be assumed by LTRdigest, i.e. the 'left' LTR will be considered the 5' LTR.

  • *-ltrdigest_pdom_<domainname>.fas : Stores the DNA sequences of the HMM matches to the LTR retrotransposon candidates.

  • *-ltrdigest_pdom_<domainname>_aa.fas : Stores the concatenated protein sequences of the HMM matches to the LTR retrotransposon candidates.

  • *-ltrdigest_pdom_<domainname>_ali.fas : Stores the alignment information for all matches of the given protein domain model to the translations of all candidates.

  • *_ORF_prediction_nt.fsa : Stores the predicted open reading frames within the predicted LTR transposons as DNA sequence.

  • *_ORF_prediction_aa.fsa : Stores the predicted open reading frames within the predicted LTR transposons as protein sequence.

  • *_LTRpred.gff : Stores the LTRpred predicted LTR transposons in GFF format.

  • *_LTRpred.bed : Stores the LTRpred predicted LTR transposons in BED format.

  • *_LTRpred_DataSheet.tsv : Stores the output table as data sheet.

The ' * ' is an place holder for the name of the input genome file.

If annotate = "Dfam" In case the Dfam annotation option is specified the following additional files are generated and stored in the LTRpred result folder:

  • *-ltrdigest_complete.fas_DfamAnnotation.out : a data table storing the output of the HMMer search of predicted retrotransposons against the Dfam database.

if cluster = TRUE

  • *-ltrdigest_complete.fas_CLUSTpred.blast6out : usearch cluster result in BLAST table format.

  • *-ltrdigest_complete.fas_CLUSTpred.log : log file of usearch run.

  • *-ltrdigest_complete.fas_CLUSTpred.uc : usearch cluster output file.

Details

This function provides the main pipeline to perform de novo LTR transposon predictions.

Quality Control

  • ltr.similarity: Minimum similarity between LTRs. All TEs not matching this criteria are discarded.

  • n.orfs: minimum number of Open Reading Frames that must be found between the LTRs. All TEs not matching this criteria are discarded.

  • PBS or Protein Match: elements must either have a predicted Primer Binding Site or a protein match of at least one protein (Gag, Pol, Rve, ...) between their LTRs. All TEs not matching this criteria are discarded.

  • The relative number of N's (= nucleotide not known) in TE <= 0.1. The relative number of N's is computed as follows: absolute number of N's in TE / width of TE.

References

R Edgar. Search and clustering orders of magnitude faster than BLAST. Bioinformatics (2010) 26 (19): 2460-2461.

D Ellinghaus, S Kurtz and U Willhoeft. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics (2008). 9:18.

S Steinbiss et al. Fine-grained annotation and classification of de novo predicted LTR retrotransposons. Nucl. Acids Res. (2009) 37 (21): 7002-7013.

See also

Author

Hajk-Georg Drost

Examples


if (FALSE) {
# generate de novo LTR transposon prediction
LTRpred(genome.file = system.file("Hsapiens_ChrY.fa", package = "LTRpred"),
          trnas       = system.file("hg38-tRNAs.fa", package = "LTRpred"),
          hmms        = paste0(system.file("HMMs/", package = "LTRpred"),"hmm_*"))

# run LTRpred with pre-computed predictions from LTRdigest()  
genome <- system.file("TAIR10_chr_all_LTRdigestPrediction.gff",package = "LTRpred")   
tabout <- system.file("TAIR10_chr_all-ltrdigest_tabout.csv",package = "LTRpred")
orf.pred <- system.file("nt.fa",package = "LTRpred")              
LTRpred(LTRdigest.gff = genome,
        tabout.file   = tabout,
        orf.file      = orf.pred)
}