Determine the Number of Bases, Nonbases, and Width of Each Sequence

Description

Counts the number of bases (A, C, G, T) and ambiguities/degeneracies in each sequence.

Usage

IdLengths(dbFile,
          tblName = "Seqs",
          identifier = "",
          type = "DNAStringSet",
          add2tbl = FALSE,
          batchSize = 10000,
          processors = 1,
          verbose = TRUE)

Arguments

dbFile

A SQLite connection object or a character string specifying the path to the database file.

tblName

Character string specifying the table where the sequences are located.

identifier

Optional character string used to narrow the search results to those matching a specific identifier. If "" then all identifiers are selected.

type

The type of XStringSet being processed. This should be (an abbreviation of) one of "DNAStringSet" or "RNAStringSet".

add2tbl

Logical or a character string specifying the table name in which to add the result.

batchSize

Integer specifying the number of sequences to process at a time.

processors

The number of processors to use, or NULL to automatically detect and use all available processors.

verbose

Logical indicating whether to display progress.

Value

A data.frame with the number of bases (“A”, “C”, “G”, or “T”), nonbases, and width of each sequence. The width is defined as the sum of bases and nonbases in each sequence. The row.names of the data.frame correspond to the "row_names" column of the table tblName in dbFile.
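
As a rough illustration of how the three quantities relate, the following base R sketch computes them for plain character strings. It is not the package's implementation: count_lengths is a hypothetical helper, and it assumes that nonbases are the IUPAC ambiguity codes (M, R, W, S, Y, K, V, H, D, B, N) and that any other character, such as a gap, is ignored.

```r
# Hypothetical helper, not part of DECIPHER: counts bases and IUPAC
# ambiguity codes in a (named) character vector of DNA sequences.
count_lengths <- function(seqs) {
  nms <- names(seqs)
  seqs <- toupper(seqs)

  # Count the characters of each string that belong to the set `chars`
  count_in <- function(x, chars) {
    vapply(strsplit(x, "", fixed = TRUE),
           function(s) sum(s %in% chars),
           integer(1))
  }

  bases <- count_in(seqs, c("A", "C", "G", "T"))
  # Assumption: nonbases are the IUPAC ambiguity codes; gaps are ignored
  nonbases <- count_in(seqs, strsplit("MRWSYKVHDBN", "", fixed = TRUE)[[1]])

  data.frame(bases = bases,
             nonbases = nonbases,
             width = bases + nonbases,  # width = bases + nonbases
             row.names = nms)
}

seqs <- c(s1 = "ACGTACGT", s2 = "ACGNNRYT", s3 = "AC--GTN")
res <- count_lengths(seqs)
print(res)
```

In this sketch the third sequence has a width of 5 rather than 7, because the two gap characters are counted as neither bases nor nonbases.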

Author(s)

Erik Wright <DECIPHER@cae.wisc.edu>

References

ES Wright (2016) "Using DECIPHER v2.0 to Analyze Big Biological Sequence Data in R". The R Journal, 8(1), 352-359.

See Also

Add2DB

Examples

db <- system.file("extdata", "Bacteria_175seqs.sqlite", package="DECIPHER")
l <- IdLengths(db)
head(l)
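
The identifier argument can restrict the count to one group of sequences. The sketch below is one way to try this, assuming the installed DECIPHER version matches the signature shown in Usage, that the DBI and RSQLite packages are available, and that the example database stores its identifiers in the "identifier" column of the default "Seqs" table. The names con, ids, l_all, and l_sub are illustrative only.

```r
library(DECIPHER)

db <- system.file("extdata", "Bacteria_175seqs.sqlite", package = "DECIPHER")

# Look up the identifiers stored in the default "Seqs" table
con <- DBI::dbConnect(RSQLite::SQLite(), db)
ids <- DBI::dbGetQuery(con, "SELECT DISTINCT identifier FROM Seqs")$identifier
DBI::dbDisconnect(con)

# Lengths for all sequences, then only for the first identifier found
l_all <- IdLengths(db, verbose = FALSE)
l_sub <- IdLengths(db, identifier = ids[1], verbose = FALSE)

nrow(l_all)     # number of sequences in the table
nrow(l_sub)     # number of sequences matching the chosen identifier
summary(l_sub)  # distribution of each returned column
```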