A helper function that applies the quality.filter function to LTRpred annotations of diverse species while probing a range of LTR similarity thresholds.

generate.multi.quality.filter.meta(
  kingdom,
  genome.folder,
  ltrpred.meta.folder,
  sim.options,
  cut.range.options,
  n.orfs = 0,
  strategy = "default",
  update = FALSE
)

Arguments

kingdom

the taxonomic kingdom of the species for which LTRpred annotations are stored in the genome.folder.

genome.folder

a file path to a folder storing the genome assembly files in FASTA format that were used to generate the LTRpred annotations of diverse species from the same taxonomic kingdom.

ltrpred.meta.folder

a file path to a folder storing LTRpred annotations of diverse species from the same taxonomic kingdom.

sim.options

a numeric vector storing the LTR similarity thresholds that shall be probed.

cut.range.options

a numeric vector storing the similarity cut range thresholds that shall be probed.
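As a rough illustration of what probing several thresholds can look like, the sketch below builds a grid from two example vectors. The values are hypothetical, and whether the function pairs every sim.options value with every cut.range.options value is an assumption made for this sketch, not something stated in this page.

```r
# Hypothetical example values; not defaults of the function.
sim.options <- c(70, 80, 90)
cut.range.options <- c(2, 5)

# Assumption: every similarity threshold is paired with every cut range.
param.grid <- expand.grid(
  sim = sim.options,
  cut.range = cut.range.options
)

# One row per (sim, cut.range) combination that would be probed.
print(param.grid)
```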

n.orfs

minimum number of open reading frames a predicted retroelement shall possess.

strategy

quality filter strategy. Options are

  • strategy = "default" : see section Quality Control

  • strategy = "stringent" : in addition to the filter criteria specified in section Quality Control, the filter criterion (!is.na(protein_domain)) | (dfam_target_name != "unknown") is applied
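The additional "stringent" criterion can be illustrated on a toy table. This is a minimal sketch in base R: the column names protein_domain and dfam_target_name come from the criterion above, while the rows and values are invented for illustration and do not reproduce the internal implementation.

```r
# Toy annotation table; the rows are made up for illustration only.
annot <- data.frame(
  id = c("a", "b", "c", "d"),
  protein_domain = c("RVT_1", NA, NA, "rve"),
  dfam_target_name = c("unknown", "Copia-1", "unknown", "Gypsy-2"),
  stringsAsFactors = FALSE
)

# Keep elements that have a protein domain OR a known Dfam target.
keep <- (!is.na(annot$protein_domain)) | (annot$dfam_target_name != "unknown")
stringent <- annot[keep, ]

# Element "c" has neither a protein domain nor a known Dfam hit and is dropped.
print(stringent$id)
```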

update

should existing _SimilarityMatrix.csv and _GenomeInfo.csv files be updated (update = TRUE), or can the existing files be reused (update = FALSE)?

Value

A list with two elements, sim_file and gm_file. Each list element stores a data.frame:

  • sim_file (similarity file)

  • gm_file (genome metrics file)

Details

Quality Control

  • ltr.similarity: minimum similarity between the LTRs. All TEs not matching this criterion are discarded.

  • n.orfs: minimum number of open reading frames that must be found between the LTRs. All TEs not matching this criterion are discarded.

  • PBS or Protein Match: elements must have either a predicted Primer Binding Site or a match to at least one protein (Gag, Pol, Rve, ...) between their LTRs. All TEs not matching this criterion are discarded.

  • N content: the relative number of N's (= nucleotide not known) in a TE must be <= 0.1. The relative number of N's is computed as follows: absolute number of N's in TE / width of TE.
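The four criteria above can be sketched as a single logical filter on a toy table. This is only an illustration in base R: all column names (ltr_similarity, orfs, has_pbs, has_protein, n_count, width) and the threshold values are hypothetical and are not taken from the LTRpred output schema or from the function's defaults.

```r
# Toy prediction table; column names and values are hypothetical.
pred <- data.frame(
  id             = c("te1", "te2", "te3", "te4"),
  ltr_similarity = c(98, 85, 99, 97),
  orfs           = c(2, 1, 0, 3),
  has_pbs        = c(TRUE, FALSE, TRUE, FALSE),
  has_protein    = c(FALSE, FALSE, TRUE, TRUE),
  n_count        = c(10, 0, 5, 900),
  width          = c(5000, 4000, 6000, 6000),
  stringsAsFactors = FALSE
)

sim.threshold <- 90  # example LTR similarity threshold
min.orfs <- 1        # example value for n.orfs

# Relative number of N's = absolute number of N's / width of TE.
pred$n_prop <- pred$n_count / pred$width

keep <- pred$ltr_similarity >= sim.threshold &  # LTR similarity
  pred$orfs >= min.orfs &                       # minimum number of ORFs
  (pred$has_pbs | pred$has_protein) &           # PBS or protein match
  pred$n_prop <= 0.1                            # N content

filtered <- pred[keep, ]
print(filtered$id)
```

In this toy data te2 fails the similarity threshold, te3 has too few ORFs, and te4 exceeds the N content limit (900 / 6000 = 0.15), so only te1 remains.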

Author

Hajk-Georg Drost