readUc: Convert .uc Files to Dataframe


readUc    R Documentation

Convert .uc Files to Dataframe

Description

Reads .uc files (USEARCH Cluster Format) generated by the VSEARCH clustering and alignment algorithms.

Usage

readUc(file, output = "cluster")

Arguments

file

The file path of the .uc file.

output

The type of analysis that was carried out to produce the .uc file.

  • If output is specified as "cluster", VSEARCH clustering was carried out.

  • If output is specified as "alignment", VSEARCH pairwise global alignment was carried out.

Note that clustering produces one "H" record for each sequence, and one "C" record for each cluster, while an alignment produces an "H" record for each alignment (see details).

Details

USEARCH cluster format is a tab-separated text format that contains clustering and/or alignment information for a set of sequences. For each sequence, a record type ("H", "C" or "N") is given, describing the type of "hit" recorded in the dataframe. These refer to:

  • H - Hit - for alignments, indicates an identified alignment of two supplied sequences. For clustering, indicates the cluster assignment for a query.

  • C - Cluster record - a record for each cluster generated.

  • N - No hit - indicates that no cluster was assigned or no alignment was found with a target sequence. For clustering, a query with no hits becomes the centroid of a new cluster.
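
As a rough illustration of these record types, the sketch below parses a few made-up .uc records with base R and counts each type. The example records, the ten-field layout and the column names (type, cluster, size and so on) are assumptions made for illustration; they are not necessarily the names or the method used by readUc.

```r
# Hypothetical .uc content: ten tab-separated fields per record.
ucLines <- c(
  "H\t0\t120\t98.3\t+\t0\t0\t118M2D\tseq2\tseq1",
  "N\t*\t95\t*\t*\t*\t*\t*\tseq3\t*",
  "C\t0\t2\t*\t*\t*\t*\t*\tseq1\t*"
)

# Illustrative column names only; readUc may label its columns differently.
uc <- read.table(
  text = ucLines,
  sep = "\t",
  header = FALSE,
  quote = "",
  comment.char = "",
  stringsAsFactors = FALSE,
  col.names = c("type", "cluster", "size", "identity", "strand",
                "unused1", "unused2", "alignment", "query", "target")
)

# Count how many records of each type are present.
recordCounts <- table(uc$type)
print(recordCounts)

# Keep only the hit ("H") records, which carry a target sequence ID.
hits <- uc[uc$type == "H", ]
```

Filtering on the record type in this way is one simple route to separating per-sequence hits from per-cluster summary records.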

Additionally, for each record a "compressed alignment" is generated. This represents the alignment in a compact format using the letters "M", "D", and "I", each preceded by the number of consecutive alignment columns of that type. The letter types are as follows:

  • "M" - Match - Identical bases between the query and target sequence

  • "D" - Deletion - A gap in the target sequence

  • "I" - Insertion - A gap in the query sequence

An example of this would be "13M", referring to 13 consecutive matches between the query and target sequence.
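
To make the format concrete, the following minimal sketch splits a compressed alignment string into its runs using base R. The helper name decodeAlignment is hypothetical and is not part of packFinder; it also assumes that a letter with no preceding number stands for a single column.

```r
# Hypothetical helper: split a compressed alignment into run lengths and letters.
# Assumption: a letter with no preceding number counts as a run of length 1.
decodeAlignment <- function(aln) {
  # Extract tokens made of an optional count followed by M, D or I.
  tokens <- regmatches(aln, gregexpr("[0-9]*[MDI]", aln))[[1]]
  types <- sub("^[0-9]*", "", tokens)
  counts <- sub("[MDI]$", "", tokens)
  counts[counts == ""] <- "1"
  data.frame(
    count = as.integer(counts),
    type = types,
    stringsAsFactors = FALSE
  )
}

runs <- decodeAlignment("10M2D5MI3M")
print(runs)

# Total number of alignment columns covered by each letter type.
columnsByType <- tapply(runs$count, runs$type, sum)
print(columnsByType)
```

With this helper, decodeAlignment("13M") yields a single row with a count of 13 and the type "M".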

Value

A dataframe containing the converted .uc file. The fields contained within are as follows:

  • Record type - "H, C or N", see details for further information.

  • Cluster designation (output = "cluster" only)

  • Sequence length, or cluster size

  • Percent identity to target

  • The nucleotide strand (output = "cluster" only)

  • A compressed alignment - see details for further information.

  • ID of query sequence

  • ID of target sequence ("H" records only)

Author(s)

Jack Gisby

References

VSEARCH may be downloaded from https://github.com/torognes/vsearch. See https://www.ncbi.nlm.nih.gov/pubmed/27781170 for further information.

See Also

tirClust, packAlign, readBlast, packClust

Examples

# Read the example .uc file shipped with packFinder into a dataframe
library(packFinder)

readUc(system.file(
    "extdata", 
    "packMatches.uc", 
    package = "packFinder"
))


jackgisby/packFinder documentation built on July 19, 2022, 2:25 a.m.