Read a file in USEARCH cluster format generated by either USEARCH or VSEARCH.

read.uc(uc.file)

Arguments

uc.file

Path to a file in USEARCH cluster format (*.uc file extension).

Value

A data frame (tibble) with the following columns:

  • Type: Record type 'S', 'H', 'C', or 'N'.

  • Cluster: Cluster number (0-based).

  • Size: Sequence length (for 'S', 'N', and 'H' records) or cluster size (for 'C' records).

  • Perc_Ident: For 'H' records, percent identity with target.

  • Strand: For 'H' records, the strand: '+' or '-' for nucleotides; '.' for proteins.

  • Query: Query id.

  • Target: Target id.

Details

Record type:

  • Type 'H' : Hit. Represents an alignment between the query sequence and the target sequence. In clustering, an 'H' record indicates the cluster assignment for the query.

  • Type 'S' : Centroid (clustering only). There is exactly one 'S' record for each cluster; it gives the label of the centroid (representative) sequence in the Query column.

  • Type 'C' : Cluster record (clustering only). The Size column gives the cluster size, and the Query column gives the query id that corresponds to this cluster.

  • Type 'N' : No hit (database search without clustering only). Indicates that no hits for the query were found in the target database. In clustering, a query without hits becomes the centroid of a new cluster and generates an 'S' record instead of an 'N' record.
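
The record types above can be separated with ordinary data frame subsetting. The following is a minimal sketch on a hand-made toy data frame that mimics the documented columns; all ids and values are invented for illustration and do not come from a real *.uc file. It also assumes that the cluster size in a 'C' record counts the centroid together with its hits.

```r
# Toy data frame mimicking the columns returned by read.uc().
# All ids and numbers are invented for illustration.
uc <- data.frame(
  Type       = c("S", "H", "H", "S", "C", "C"),
  Cluster    = c(0L, 0L, 0L, 1L, 0L, 1L),
  Size       = c(500L, 498L, 495L, 300L, 3L, 1L),
  Perc_Ident = c("*", "98.5", "97.0", "*", "*", "*"),
  Strand     = c("*", "+", "-", "*", "*", "*"),
  Query      = c("seqA", "seqB", "seqC", "seqD", "seqA", "seqD"),
  Target     = c("*", "seqA", "seqA", "*", "*", "*"),
  stringsAsFactors = FALSE
)

# Split the table by record type.
centroids <- uc[uc$Type == "S", ]
hits      <- uc[uc$Type == "H", ]
clusters  <- uc[uc$Type == "C", ]

# Count members per cluster as one centroid ('S') plus its hits ('H')
# (assumption: this is how the 'C' record size is defined).
members <- table(uc$Cluster[uc$Type %in% c("S", "H")])
members_per_cluster <- as.integer(members[as.character(clusters$Cluster)])
```

In this toy table the counts derived from the 'S' and 'H' records agree with the Size values of the 'C' records, which can serve as a quick consistency check on a parsed file.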

Author

Hajk-Georg Drost

Examples

# read example *.uc file
test.uc <- read.uc(system.file("test.uc", package = "LTRpred"))

# look at the format in R
head(test.uc)
#> # A tibble: 6 × 7
#>   Type  Cluster  Size Perc_Ident Strand Query                             Target
#>   <chr>   <int> <int> <chr>      <chr>  <chr>                             <chr> 
#> 1 S           0 24827 *          *      2_CHROMOSOME_dumped__3426174_345… *     
#> 2 S           1 24521 *          *      5_CHROMOSOME_dumped__2380969_240… *     
#> 3 S           2 24339 *          *      3_CHROMOSOME_dumped__1657651_168… *     
#> 4 S           3 23686 *          *      3_CHROMOSOME_dumped__14117876_14… *     
#> 5 S           4 23241 *          *      3_CHROMOSOME_dumped__14762490_14… *     
#> 6 S           5 22083 *          *      3_CHROMOSOME_dumped__15779153_15… *
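
In the output above, Perc_Ident is a character column because records other than 'H' carry '*' as a placeholder. A minimal sketch of one way to obtain numeric identities for the 'H' records follows; it uses an invented toy data frame rather than the package's example file, and the object names are hypothetical.

```r
# Toy data frame with only the two columns needed here (values invented).
uc <- data.frame(
  Type       = c("S", "H", "H"),
  Perc_Ident = c("*", "98.5", "97.0"),
  stringsAsFactors = FALSE
)

# Keep only hit records, where Perc_Ident holds a number stored as text.
hits <- uc[uc$Type == "H", ]

# Convert to numeric after filtering, so the '*' placeholders
# never reach as.numeric() and no NA coercion warning is raised.
hits$Perc_Ident <- as.numeric(hits$Perc_Ident)

mean_ident <- mean(hits$Perc_Ident)
```

Filtering before converting is a design choice: converting the whole column first would turn every '*' into NA with a warning.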