Read a file in USEARCH cluster format generated by either USEARCH or VSEARCH.
read.uc(uc.file)
uc.file | path to file in USEARCH cluster format ( |
---|
A dataframe storing the following columns:
Type:
Record type 'S', 'H', 'C', or 'N'.
Cluster:
Cluster number (0-based).
Size:
Sequence length (for 'S', 'N', and 'H' records) or cluster size (for 'C' records).
Perc_Ident:
For 'H' records, percent identity with target.
Strand:
For 'H' records, the strand: '+' or '-' for nucleotides; '.' for proteins.
Query:
Query id.
Target:
Target id.
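The column layout above can be illustrated with a minimal base-R sketch. It assumes the standard ten-field, tab-separated .uc record layout (fields 6 to 8 are not part of the returned columns) and uses a made-up three-record snippet with hypothetical sequence labels (`seqA`, `seqB`). It is not the implementation of `read.uc()`, which may differ in detail.

```r
# Hypothetical .uc records (tab-separated, ten fields each); labels are invented.
uc_lines <- c(
  "S\t0\t120\t*\t*\t*\t*\t*\tseqA\t*",
  "H\t0\t118\t97.5\t+\t0\t0\t118M2D\tseqB\tseqA",
  "C\t0\t2\t*\t*\t*\t*\t*\tseqA\t*"
)

# Read every field as character so that '*' placeholders survive unchanged.
raw <- read.delim(
  text = uc_lines, header = FALSE, sep = "\t",
  colClasses = "character", stringsAsFactors = FALSE
)

# Keep the seven documented columns (fields 1-5, 9 and 10).
uc <- data.frame(
  Type       = raw[[1]],
  Cluster    = as.integer(raw[[2]]),
  Size       = as.integer(raw[[3]]),
  Perc_Ident = raw[[4]],
  Strand     = raw[[5]],
  Query      = raw[[9]],
  Target     = raw[[10]],
  stringsAsFactors = FALSE
)

print(uc)
```

Perc_Ident is kept as character here because non-'H' records carry '*' in that field; convert it with `as.numeric()` only after subsetting to 'H' records.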
Details:
Record type:
Type 'H' :
Hit. Represents an alignment between the query sequence and the target sequence. For clustering, 'H' indicates the cluster assignment for the query.
Type 'S' :
Centroid (clustering only). There is exactly one 'S' record for each cluster; it gives the centroid (representative) sequence label in the Query column.
Type 'C' :
Cluster record (clustering only). The Size column specifies the cluster size and the Query column the query id that corresponds to this cluster.
Type 'N' :
No hit (for database search without clustering only). Indicates that no hits for the query were found in the target database. In the case of clustering, a query without hits becomes the centroid of a new cluster and generates an 'S' record instead of an 'N' record.
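As a rough sketch of how the record types relate to one another, the snippet below splits a clustering result by Type and checks that the size reported in each 'C' record matches the number of 'S' and 'H' records in that cluster. The data frame is hand-built with hypothetical labels and values; in practice it would be the object returned by `read.uc()`.

```r
# Hand-built stand-in for a read.uc() result (all values are invented).
uc <- data.frame(
  Type       = c("S", "H", "H", "S", "C", "C"),
  Cluster    = c(0L, 0L, 0L, 1L, 0L, 1L),
  Size       = c(120L, 118L, 119L, 95L, 3L, 1L),
  Perc_Ident = c("*", "97.5", "99.1", "*", "*", "*"),
  Strand     = c("*", "+", "-", "*", "*", "*"),
  Query      = c("seqA", "seqB", "seqC", "seqD", "seqA", "seqD"),
  Target     = c("*", "seqA", "seqA", "*", "*", "*"),
  stringsAsFactors = FALSE
)

# 'S' records: one centroid label per cluster.
centroids <- uc[uc$Type == "S", c("Cluster", "Query")]
names(centroids)[2] <- "Centroid"

# 'H' records: cluster members; percent identity is numeric only here.
hits <- uc[uc$Type == "H", ]
hits$Perc_Ident <- as.numeric(hits$Perc_Ident)

# 'C' records: cluster sizes as reported by the clustering tool.
cluster_sizes <- uc[uc$Type == "C", c("Cluster", "Size", "Query")]

# Each cluster consists of its centroid ('S') plus its hits ('H').
member_counts <- table(uc$Cluster[uc$Type %in% c("S", "H")])
sizes_match <- all(
  as.integer(member_counts[as.character(cluster_sizes$Cluster)]) ==
    cluster_sizes$Size
)

print(centroids)
print(sizes_match)
```

Subsetting by Type before converting Perc_Ident avoids coercion warnings from the '*' placeholders in 'S' and 'C' records.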
Hajk-Georg Drost
# read example *.uc file
test.uc <- read.uc(system.file("test.uc", package = "LTRpred"))
# look at the format in R
head(test.uc)
#> # A tibble: 6 × 7
#>   Type  Cluster  Size Perc_Ident Strand Query                             Target
#>   <chr>   <int> <int> <chr>      <chr>  <chr>                             <chr>
#> 1 S           0 24827 *          *      2_CHROMOSOME_dumped__3426174_345… *
#> 2 S           1 24521 *          *      5_CHROMOSOME_dumped__2380969_240… *
#> 3 S           2 24339 *          *      3_CHROMOSOME_dumped__1657651_168… *
#> 4 S           3 23686 *          *      3_CHROMOSOME_dumped__14117876_14… *
#> 5 S           4 23241 *          *      3_CHROMOSOME_dumped__14762490_14… *
#> 6 S           5 22083 *          *      3_CHROMOSOME_dumped__15779153_15… *