Run LTRdigest to predict putative LTR Retrotransposons

This function provides an R interface to the LTRdigest command line tool for predicting putative LTR retrotransposons.

LTRdigest(
  input.gff3,
  genome.file,
  aaout = "yes",
  aliout = "yes",
  pptlen = c(8, 30),
  uboxlen = c(3, 30),
  pptradius = 30,
  trnas = NULL,
  pbsalilen = c(11, 30),
  pbsoffset = c(0, 5),
  pbstrnaoffset = c(0, 5),
  pbsmaxedist = 1,
  pbsradius = 30,
  hmms = NULL,
  pdomevalcutoff = 1e-05,
  pbsmatchscore = 5,
  pbsmismatchscore = -10,
  pbsinsertionscore = -20,
  pbsdeletionscore = -20,
  pfam.ids = NULL,
  cores = 1,
  index.file = NULL,
  output.path = NULL
)

Arguments

input.gff3	path to the prediction file in gff3 format returned by `LTRharvest`.
genome.file	path to the genome file in `fasta` format.
aaout	whether the protein sequences of the HMM matches to the predicted LTR transposons are written to a fasta file. Options are `aaout = "yes"` or `aaout = "no"`.
aliout	whether the alignments of the protein sequences of the HMM matches to the predicted LTR transposons are written to a fasta file. Options are `aliout = "yes"` or `aliout = "no"`.
pptlen	a numeric vector of length two specifying the minimum and maximum allowed lengths for PPT predictions. A purine-rich region whose length falls outside this range is discarded. Default is `pptlen = c(8,30)` (minimum = 8; maximum = 30).
uboxlen	a numeric vector of length two specifying the minimum and maximum allowed lengths for U-box predictions. A T-rich region preceding a PPT whose length falls outside this range is discarded. Default is `uboxlen = c(3,30)` (minimum = 3; maximum = 30).
pptradius	a numeric value specifying the area around the 3' LTR beginning to be considered when searching for PPT. Default value is `pptradius = 30`.
trnas	path to the fasta file storing the unique tRNA sequences that shall be matched to the predicted LTR transposon (tRNA library).
pbsalilen	a numeric vector of length two specifying the minimum and maximum allowed lengths for PBS/tRNA alignments. Local alignments shorter or longer than this range are discarded. Default is `pbsalilen = c(11,30)` (minimum = 11; maximum = 30).
pbsoffset	a numeric vector of length two specifying the minimum and maximum allowed distance between the start of the PBS and the 3' end of the 5' LTR. Local alignments not fulfilling this criterion are discarded. Default is `pbsoffset = c(0,5)` (minimum = 0; maximum = 5).
pbstrnaoffset	a numeric vector of length two specifying the minimum and maximum allowed PBS/tRNA alignment offset from the 3' end of the tRNA. Local alignments not fulfilling this criterion are discarded. Default is `pbstrnaoffset = c(0,5)` (minimum = 0; maximum = 5).
pbsmaxedist	a numeric value specifying the maximal allowed unit edit distance in a local PBS/tRNA alignment. Default is `pbsmaxedist = 1`.
pbsradius	a numeric value specifying the area around the 5' LTR end to be considered when searching for PBS. Default value is `pbsradius = 30`.
hmms	a character string or character vector giving the HMM files used to search for internal protein domains between the LTRs of predicted LTR transposons. A wildcard can be used: for example, if all HMM files are renamed so that they start with `hmm_`, then `hmms = "hmm_*"` selects every file starting with `hmm_` for the subsequent protein domain search. To search by Pfam ID instead, specify the IDs from http://pfam.xfam.org/ in the `pfam.ids` parameter and choose `hmms = NULL`; the `LTRpred` function then downloads the corresponding HMM files automatically and uses them for the protein domain search.
pdomevalcutoff	a numeric value specifying the E-value cutoff for the corresponding HMMER searches. All hits that do not fulfill this criterion are discarded. Default is `pdomevalcutoff = 1E-5`.
pbsmatchscore	specify the match score used in the PBS/tRNA Smith-Waterman alignment. Default is `pbsmatchscore = 5`.
pbsmismatchscore	specify the mismatch score used in the PBS/tRNA Smith-Waterman alignment. Default is `pbsmismatchscore = -10`.
pbsinsertionscore	specify the insertion score used in the PBS/tRNA Smith-Waterman alignment. Default is `pbsinsertionscore = -20`.
pbsdeletionscore	specify the deletion score used in the PBS/tRNA Smith-Waterman alignment. Default is `pbsdeletionscore = -20`.
pfam.ids	a character vector storing the Pfam IDs from http://pfam.xfam.org/ that shall be downloaded and used to perform protein domain searches within the sequences between the predicted LTRs.
cores	number of cores to be used for multicore processing.
index.file	specify the name of the enhanced suffix array index file that is computed by `suffixerator`. This option can be used when the suffix array index file was generated previously, e.g. during a previous call of this function, so that it does not need to be re-computed for new analyses. This is particularly useful when running `LTRdigest` with different parameter settings.
output.path	a path/folder to store all results returned by `LTRdigest`. If `output.path = NULL` (Default) then a folder with the name of the input genome file will be generated in the current working directory of R and all results are then stored in this folder.
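
To preview which files a wildcard such as `hmms = "hmm_*"` would select, the pattern can be expanded in R before calling `LTRdigest`. The following is a minimal sketch, assuming the wildcard is expanded like a shell glob; the directory and file names are made-up placeholders and not files shipped with LTRpred.

```r
# Create a throwaway directory with placeholder files (illustration only;
# these are empty files, not real HMM profiles).
hmm_dir <- file.path(tempdir(), "hmm_demo")
dir.create(hmm_dir, showWarnings = FALSE)
invisible(file.create(file.path(
  hmm_dir, c("hmm_RVT_1.hmm", "hmm_rve.hmm", "notes.txt")
)))

# The pattern that would be passed as the `hmms` argument.
hmms_pattern <- file.path(hmm_dir, "hmm_*")

# Sys.glob() expands the wildcard; only files starting with "hmm_" match.
matched <- basename(Sys.glob(hmms_pattern))
print(matched)
```

Files that do not carry the `hmm_` prefix (here `notes.txt`) are not matched, which is why renaming the HMM files consistently is a convenient way to select them.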

Value

The LTRdigest function generates the following output files:

*_index_ltrdigest.fsa : The suffix array index file used to predict putative LTR retrotransposons with LTRdigest.
*_LTRdigestPrediction.gff : A spreadsheet containing detailed information about the predicted LTRs.
*-ltrdigest_tabout.csv : A spreadsheet containing additional detailed information about the predicted LTRs.
*-ltrdigest_complete.fas : The full length DNA sequences of all predicted LTR transposons.
*-ltrdigest_conditions.csv : Contains information about the parameters used for a given LTRdigest run.
*-ltrdigest_pbs.fas : Stores the predicted PBS sequences for the putative LTR retrotransposons.
*-ltrdigest_ppt.fas : Stores the predicted PPT sequences for the putative LTR retrotransposons.
*-ltrdigest_5ltr.fas and *-ltrdigest_3ltr.fas : Store the predicted 5' and 3' LTR sequences. Note: If the direction of the putative retrotransposon could be predicted, these files contain the corresponding 5' and 3' LTR sequences. If no direction could be predicted, LTRdigest assumes the forward direction with regard to the original sequence, i.e. the 'left' LTR is considered the 5' LTR.
*-ltrdigest_pdom_<domainname>.fas : Stores the DNA sequences of the HMM matches to the LTR retrotransposon candidates.
*-ltrdigest_pdom_<domainname>_aa.fas : Stores the concatenated protein sequences of the HMM matches to the LTR retrotransposon candidates.
*-ltrdigest_pdom_<domainname>_ali.fas : Stores the alignment information for all matches of the given protein domain model to the translations of all candidates.

The ' * ' is a placeholder for the name of the input genome file.
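
The naming scheme above can be turned into a quick check that a run produced the expected files. This is a sketch only: `ltrdigest_output_names()` is a hypothetical helper (not part of LTRpred), the prefix is assumed to be the genome file name without its extension, and the output folder name `Hsapiens_ChrY_ltrdigest` is an assumption. Domain-specific `pdom` files are omitted because their names depend on the HMMs used.

```r
# Hypothetical helper: build the expected output file names for a prefix,
# following the patterns listed in the Value section.
ltrdigest_output_names <- function(prefix) {
  c(
    paste0(prefix, "_index_ltrdigest.fsa"),
    paste0(prefix, "_LTRdigestPrediction.gff"),
    paste0(
      prefix, "-ltrdigest_",
      c("tabout.csv", "complete.fas", "conditions.csv",
        "pbs.fas", "ppt.fas", "5ltr.fas", "3ltr.fas")
    )
  )
}

# Assumption: the prefix is the genome file name without its extension.
prefix <- tools::file_path_sans_ext(basename("Hsapiens_ChrY.fa"))
expected <- ltrdigest_output_names(prefix)

# Assumption: results were written to this folder; adjust to your output.path.
out_dir <- "Hsapiens_ChrY_ltrdigest"
missing_files <- expected[!file.exists(file.path(out_dir, expected))]
print(missing_files)
```

An empty `missing_files` vector would indicate that all listed files are present in the assumed folder.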

Details

The LTRdigest function is a wrapper function that calls the LTRdigest command line tool from R.
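
Because the function calls an external program, it can help to confirm that the program is reachable before starting a long run. The sketch below assumes that LTRdigest is provided by the GenomeTools executable `gt`, as in standard GenomeTools installations; `tool_available()` is a hypothetical helper and not part of LTRpred.

```r
# Hypothetical helper: TRUE if an executable with this name is on the PATH.
# Sys.which() returns "" for tools that cannot be found.
tool_available <- function(tool) {
  unname(nzchar(Sys.which(tool)))
}

# Assumption: LTRdigest is invoked through the GenomeTools binary `gt`.
if (!tool_available("gt")) {
  message("The 'gt' executable was not found on the PATH; ",
          "install GenomeTools before running LTRdigest().")
}
```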

References

S Steinbiss et al. Fine-grained annotation and classification of de novo predicted LTR retrotransposons. Nucl. Acids Res. (2009) 37 (21): 7002-7013.

Author

Hajk-Georg Drost

Examples

if (FALSE) {
# Run LTRharvest for the human Y chromosome using standard parameters
LTRharvest(genome.file = system.file("Hsapiens_ChrY.fa", package = "LTRpred"))

# Run LTRdigest for the human Y chromosome using standard parameters
LTRdigest(input.gff3  = "Hsapiens_ChrY_ltrharvest/Hsapiens_ChrY_Prediction.gff",
          genome.file = system.file("Hsapiens_ChrY.fa", package = "LTRpred"),
          trnas       = system.file("hg38-tRNAs.fa", package = "LTRpred"),
          hmms        = paste0(system.file("HMMs/", package = "LTRpred"),"hmm_*"))
}