This function takes the file paths to the genomes folder and LTRpred.meta output folder as input and eliminates false positive retrotransposon predictions on a metagenomic scale.

quality.filter.meta(
  kingdom,
  genome.folder,
  ltrpred.meta.folder,
  sim,
  cut.range = 2,
  n.orfs,
  strategy,
  update = FALSE
)

Arguments

kingdom

a character string specifying the kingdom of life to which the genomes annotated with LTRpred.meta belong, e.g. kingdom = "Plants". If genomes from a variety of kingdoms were annotated, users can, for example, specify kingdom = "Various".

genome.folder

path to folder storing the genome assembly files that were used for LTRpred.meta predictions.

ltrpred.meta.folder

path to folder storing the LTRpred.meta output files.

sim

LTR similarity threshold. Only putative LTR transposons that fulfill this LTR similarity threshold will be retained.

cut.range

a numeric value indicating the interval size for binning LTR similarities.
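
To illustrate what an interval size of cut.range = 2 means, here is a minimal base R sketch that bins a few made-up LTR similarity values into intervals of width 2. The example values, the lower bound of 70, and the choice of right-closed intervals are assumptions for illustration only; the binning performed internally by quality.filter.meta may differ in these details.

```r
# Illustrative values only; not taken from any real LTRpred run.
sim            <- 70   # assumed lower bound of the similarity range (percent)
cut.range      <- 2    # interval size, as in the default shown above
ltr_similarity <- c(71.3, 84.9, 90.0, 99.2, 100)

# Break points from the threshold up to 100 in steps of cut.range.
breaks <- seq(sim, 100, by = cut.range)

# Assign each similarity value to its interval (right-closed by assumption).
bins <- cut(ltr_similarity, breaks = breaks, include.lowest = TRUE, right = TRUE)

# Count elements per similarity bin.
bin_counts <- table(bins)
print(bin_counts[bin_counts > 0])
```

A smaller cut.range gives more, narrower bins; a larger value gives a coarser summary.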

n.orfs

minimum number of ORFs detected in the putative LTR transposon.

strategy

quality filter strategy. Options are

  • strategy = "default" : see section Quality Control

  • strategy = "stringent" : in addition to the filter criteria specified in section Quality Control, the filter criterion (!is.na(protein_domain)) | (dfam_target_name != "unknown") is applied
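
The additional "stringent" condition can be sketched on a toy table in base R. The column names protein_domain and dfam_target_name come from the expression above; the rows, IDs, and Dfam target names are invented for illustration, and this is not the package's internal code.

```r
# Toy annotation table; all values are hypothetical.
te <- data.frame(
  ID               = c("a", "b", "c", "d"),
  protein_domain   = c("RVT_1", NA, NA, "rve"),
  dfam_target_name = c("unknown", "Copia-1", "unknown", "Gypsy-2"),
  stringsAsFactors = FALSE
)

# Keep elements that have a protein domain annotation OR a known Dfam hit.
stringent <- (!is.na(te$protein_domain)) | (te$dfam_target_name != "unknown")

kept_ids <- te$ID[stringent]
print(kept_ids)
```

Element "c" is dropped because it has neither a protein domain nor a known Dfam target.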

update

should existing _SimilarityMatrix.csv and _GenomeInfo.csv files be updated (update = TRUE), or can the existing files be reused (update = FALSE)?

Value

A list with two elements, sim_file and gm_file. Each list element stores a data.frame:

  • sim_file (similarity file)

  • gm_file (genome metrics file)
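
A call and a first look at the returned list might look like the following sketch. It assumes the LTRpred package is installed and that both folders exist; the folder names and the threshold values (sim = 70, n.orfs = 0) are hypothetical placeholders, not recommended settings. The guard makes the snippet a no-op when those assumptions do not hold.

```r
# Hypothetical folder names; replace with your own paths.
genome_dir <- "genomes"
meta_dir   <- "ltrpred_meta_results"

if (requireNamespace("LTRpred", quietly = TRUE) &&
    dir.exists(genome_dir) && dir.exists(meta_dir)) {
  res <- LTRpred::quality.filter.meta(
    kingdom             = "Plants",
    genome.folder       = genome_dir,
    ltrpred.meta.folder = meta_dir,
    sim                 = 70,        # illustrative threshold
    cut.range           = 2,
    n.orfs              = 0,         # illustrative threshold
    strategy            = "default",
    update              = FALSE
  )

  # Both list elements are data.frames.
  str(res$sim_file)  # similarity file
  str(res$gm_file)   # genome metrics file
}
```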

Details

Quality Control

  • ltr.similarity: minimum similarity between LTRs. All TEs not matching this criterion are discarded.

  • n.orfs: minimum number of Open Reading Frames that must be found between the LTRs. All TEs not matching this criterion are discarded.

  • PBS or Protein Match: elements must have either a predicted Primer Binding Site or a match to at least one protein (Gag, Pol, Rve, ...) between their LTRs. All TEs not matching this criterion are discarded.

  • Relative number of N's: the relative number of N's (= nucleotide not known) in the TE must be <= 0.1. The relative number of N's is computed as follows: absolute number of N's in TE / width of TE.
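
The four criteria above can be sketched as a single logical filter in base R. This is a minimal illustration on a made-up table, not the package's implementation: the column names (ltr_similarity, orfs, PBS_start, protein_domain, n_abs, width), the thresholds, and all values are assumptions.

```r
# Toy prediction table; column names and values are hypothetical.
te <- data.frame(
  ID             = c("te1", "te2", "te3", "te4", "te5"),
  ltr_similarity = c(98, 85, 99, 95, 97),       # percent similarity between LTRs
  orfs           = c(2, 1, 0, 3, 1),            # ORFs found between the LTRs
  PBS_start      = c(120, NA, 300, NA, 50),     # NA = no predicted PBS
  protein_domain = c("RVT_1", "rve", NA, NA, NA),
  n_abs          = c(0, 10, 5, 0, 900),         # absolute number of N's in TE
  width          = c(5000, 6000, 4000, 7000, 6000),
  stringsAsFactors = FALSE
)

sim    <- 90  # illustrative similarity threshold
n.orfs <- 1   # illustrative minimum number of ORFs

# Relative number of N's = absolute number of N's in TE / width of TE.
n_rel <- te$n_abs / te$width

keep <- te$ltr_similarity >= sim &                           # LTR similarity
  te$orfs >= n.orfs &                                        # minimum ORFs
  (!is.na(te$PBS_start) | !is.na(te$protein_domain)) &       # PBS or protein match
  n_rel <= 0.1                                               # at most 10% N's

filtered <- te[keep, ]
print(filtered$ID)
```

In this toy table only te1 passes: te2 fails the similarity threshold, te3 has no ORF, te4 has neither a PBS nor a protein match, and te5 contains 15% N's.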

Author

Hajk-Georg Drost