FormGroups: Forms Groups By Rank


View source: R/FormGroups.R

Description

Agglomerates sequences into groups within a specified size range based on taxonomic rank.

Usage

FormGroups(dbFile,
           tblName = "Seqs",
           goalSize = 50,
           minGroupSize = 25,
           maxGroupSize = 5000,
           includeNames = FALSE,
           add2tbl = FALSE,
           verbose = TRUE)

Arguments

dbFile

A SQLite connection object or a character string specifying the path to the database file.

tblName

Character string specifying the table where the rank information is located.

goalSize

Number of sequences required in each group to stop adding more sequences.

minGroupSize

Minimum number of sequences in each group required to stop trying to recombine with a larger group.

maxGroupSize

Maximum number of sequences in each group allowed to continue agglomeration.

includeNames

Logical indicating whether to include the formal scientific name in the group name.

add2tbl

Logical or a character string specifying the table name in which to add the result.

verbose

Logical indicating whether to display progress.

Details

FormGroups uses the “rank” field in the tblName table of dbFile to group sequences with similar taxonomic rank. Rank information must be present in tblName, such as that created by default when importing sequences from a GenBank-formatted file.

Rank information contains the formal scientific name on the first line, followed by the taxonomic lineage on subsequent lines. When includeNames is TRUE, the formal scientific name is appended to the end of the group name; otherwise, only the taxonomic lineage is used as the group name.
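As a rough illustration of that layout, the base R sketch below splits one rank string into its scientific name and lineage. The rank string is hypothetical, and the semicolon-separated lineage with a trailing period is an assumption modeled on GenBank records; this is not code taken from FormGroups itself.

```r
# Hypothetical rank string: scientific name on the first line,
# lineage on the following lines (layout assumed, GenBank-style).
rank <- paste("Escherichia coli",
              "Bacteria; Proteobacteria; Gammaproteobacteria;",
              "Enterobacterales; Enterobacteriaceae; Escherichia.",
              sep = "\n")

rank_lines <- strsplit(rank, "\n", fixed = TRUE)[[1]]

# First line: formal scientific name
sci_name <- trimws(rank_lines[1])

# Remaining lines: taxonomic lineage, joined back into a single string
lineage <- paste(trimws(rank_lines[-1]), collapse = " ")
lineage <- sub("\\.$", "", lineage)  # drop the trailing period

# Individual taxa, from the broadest to the most specific
taxa <- trimws(strsplit(lineage, ";", fixed = TRUE)[[1]])

print(sci_name)
print(taxa)
```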

The algorithm ascends the taxonomic tree, agglomerating taxa into groups until the goalSize is reached. If the group size is below minGroupSize, further agglomeration is attempted with a larger group. If additional agglomeration results in a group larger than maxGroupSize, the agglomeration is undone so that the group is smaller. Setting minGroupSize to goalSize avoids the creation of polyphyletic groups. Note that this approach may often result in paraphyletic groups.
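The idea of ascending the tree until a group is large enough can be sketched in base R on toy data. This is a deliberately simplified illustration, not DECIPHER's implementation: it applies only a goal size, ignores minGroupSize and maxGroupSize, and the lineages and the goal_size variable are hypothetical.

```r
# Toy lineages, one per sequence (hypothetical data)
lineages <- c(rep("A;B;C", 3), rep("A;B;D", 2), rep("A;E;F", 6), "A;E;G")
goal_size <- 4  # stand-in for goalSize

split_lin <- strsplit(lineages, ";", fixed = TRUE)
depth <- max(lengths(split_lin))
group <- rep(NA_character_, length(lineages))

# Walk from the deepest level toward the root. At each level, sequences that
# are still ungrouped are assigned to their taxon once that taxon holds at
# least goal_size of them.
for (d in depth:1) {
  prefix <- vapply(split_lin,
                   function(x) paste(x[seq_len(min(d, length(x)))], collapse = ";"),
                   character(1))
  open <- is.na(group)
  counts <- table(prefix[open])
  full <- names(counts)[counts >= goal_size]
  hit <- open & prefix %in% full
  group[hit] <- prefix[hit]
}

# Anything left over falls back to the top-level taxon
group[is.na(group)] <- prefix[is.na(group)]

print(table(group))
```

In this toy run the lone "A;E;G" sequence ends up in group "A" while its sister taxon "A;E;F" forms its own group, which shows how ascending the tree can leave a paraphyletic group behind.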

Value

A data.frame with the rank and corresponding group name as identifier. Quotes are stripped from group names to prevent the problems they may cause. The origin gives the rank preceding the identifier. The count denotes the number of sequences corresponding to each rank. If add2tbl is not FALSE, the “identifier” and “origin” columns are updated in dbFile.

Author(s)

Erik Wright eswright@pitt.edu

See Also

IdentifyByRank

Examples

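library(DECIPHER)
# Path to the example database shipped with DECIPHER
db <- system.file("extdata", "Bacteria_175seqs.sqlite", package="DECIPHER")
# Group sequences by rank, aiming for about 10 sequences per group
g <- FormGroups(db, goalSize=10, minGroupSize=5, maxGroupSize=20)
head(g)
# Total number of sequences assigned to each group
tapply(g$count, g$identifier, sum)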

DECIPHER documentation built on Nov. 8, 2020, 8:30 p.m.