Finds chimeras present in a database of sequences. Makes use of a reference database of (presumed to be) good quality sequences.
FindChimeras(dbFile,
tblName = "Seqs",
identifier = "",
dbFileReference,
tblNameReference = "Seqs",
batchSize = 100,
minNumFragments = 20000,
tb.width = 5,
multiplier = 20,
minLength = 30,
minCoverage = 0.6,
overlap = 100,
minSuspectFragments = 4,
showPercentCoverage = FALSE,
add2tbl = FALSE,
maxGroupSize = -1,
minGroupSize = 25,
excludeIDs = NULL,
processors = 1,
verbose = TRUE)
dbFile
A SQLite connection object or a character string specifying the path to the database file to be checked for chimeric sequences.
tblName
Character string specifying the table in which to check for chimeras.
identifier
Optional character string used to narrow the search results to those matching a specific identifier. If "" then all identifiers are selected.
dbFileReference
A SQLite connection object or a character string specifying the path to the reference database file of (presumed to be) good quality sequences. A 16S reference database is available from http://DECIPHER.codes.
tblNameReference
Character string specifying the table with reference sequences.
batchSize
Number of sequences to tile with fragments at a time.
minNumFragments
Number of suspect fragments to accumulate before searching through other groups.
tb.width
A single integer [1..14] giving the number of nucleotides at the start of each fragment that are part of the trusted band.
multiplier
A single integer specifying how many times more fragments must be found out-of-group than in-group for a sequence to be considered a chimera.
minLength
Minimum length of a chimeric region for the sequence to be considered a chimera.
minCoverage
Minimum fraction of coverage necessary in a chimeric region.
overlap
Number of nucleotides at the end of the sequence that the chimeric region must overlap in order to be considered a chimera.
minSuspectFragments
Minimum number of suspect fragments belonging to another group required to consider a sequence a chimera.
showPercentCoverage
Logical indicating whether to list the percent coverage of suspect fragments in each chimeric region in the output.
add2tbl
Logical or a character string specifying the table name in which to add the result.
maxGroupSize
Maximum number of sequences searched in a group. A value of less than 0 means the search is unlimited.
minGroupSize
The minimum number of sequences in a group to be considered as part of the search for chimeras. May need to be set to a small value for reference databases with mostly small groups.
excludeIDs
Optional character vector of
processors
The number of processors to use, or
verbose
Logical indicating whether to display progress.
FindChimeras works by finding suspect fragments that are uncommon in the group where the sequence belongs, but very common in another group where the sequence does not belong. Each sequence in the dbFile is tiled into short sequence segments called fragments. If the fragments are infrequent in their respective group in the dbFileReference then they are considered suspect. If enough suspect fragments from a sequence meet the specified constraints then the sequence is flagged as a chimera.
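The idea of tiling a sequence and comparing fragment frequencies between groups can be illustrated with a small base R sketch. This is not DECIPHER's implementation: the helper names `tile_fragments` and `group_hits`, the fragment width of 8, the toy sequences, and the simple "absent in-group, present out-of-group" rule are all assumptions made for illustration.

```r
# Illustrative only: overlapping k-mers stand in for DECIPHER's fragments.
tile_fragments <- function(x, width = 8L) {
  starts <- seq_len(nchar(x) - width + 1L)
  substring(x, starts, starts + width - 1L)
}

# Count how many sequences in a group contain each fragment.
group_hits <- function(fragments, group_seqs) {
  vapply(fragments,
         function(f) sum(grepl(f, group_seqs, fixed = TRUE)),
         integer(1), USE.NAMES = FALSE)
}

# Toy groups: the query joins the first half of a group A sequence
# to the second half of a group B sequence.
group_A <- c("ACGTACGTTTGACCAGTAGC", "ACGTACGTTTGACCAGTAGG")
group_B <- c("TTGGCCAATCGGATTACAGA", "TTGGCCAATCGGATTACAGT")
query   <- paste0(substr(group_A[1], 1, 10), substr(group_B[1], 11, 20))

frags   <- tile_fragments(query)
in_grp  <- group_hits(frags, group_A)   # the query is labeled as group A
out_grp <- group_hits(frags, group_B)

# Simplified rule: a fragment is suspect when it is absent from its own
# group but present in another group.
suspect <- in_grp == 0L & out_grp > 0L
data.frame(fragment = frags, in_grp, out_grp, suspect)
```

In this toy case the suspect fragments cluster at the end of the query, which is the part taken from group B.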
The default parameters are optimized for full-length 16S sequences (> 1,000 nucleotides). Shorter 16S sequences require two parameters that differ from the defaults: minCoverage = 0.2 and minSuspectFragments = 2.
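A call using the two overrides for shorter 16S sequences might look like the following sketch. It reuses the example database as its own reference purely for illustration, and only runs the search if DECIPHER is installed; the variable name `short_read_args` is an assumption.

```r
# Overrides for shorter 16S sequences, as stated in the text above.
short_read_args <- list(minCoverage = 0.2, minSuspectFragments = 2)

if (requireNamespace("DECIPHER", quietly = TRUE)) {
  suppressPackageStartupMessages(library(DECIPHER))
  db <- system.file("extdata", "Bacteria_175seqs.sqlite", package = "DECIPHER")
  # The example database doubles as the reference here for illustration only.
  chimeras <- do.call(FindChimeras,
                      c(list(db, dbFileReference = db), short_read_args))
}
```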
Groups are determined by the identifier present in each database. For this reason, the groups in the dbFile should exist in the groups of the dbFileReference. The reference database is assumed to contain many sequences of only good quality.
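Before running a search, it may help to confirm that every group in the input also appears in the reference. The sketch below does this in base R on two hypothetical identifier vectors; the helper `missing_groups` and the example identifiers are assumptions, and in practice the identifiers would be read from each database's table.

```r
# Return identifiers present in the query set but absent from the reference.
missing_groups <- function(query_ids, reference_ids) {
  setdiff(unique(query_ids), unique(reference_ids))
}

# Hypothetical identifier columns, one entry per sequence.
query_ids     <- c("Bacillus", "Bacillus", "Vibrio", "Nostoc")
reference_ids <- c("Bacillus", "Vibrio", "Vibrio", "Escherichia")

# Groups listed here have no counterpart in the reference database.
missing_groups(query_ids, reference_ids)
```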
If no reference database is available, the input database can serve as its own reference. Removing the chimeras that are found and then repeating the process iteratively can result in a clean reference database.
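The iterative clean-up can be pictured as a loop that stops once a pass flags nothing new. The following is a minimal sketch of that control flow only: `iterative_clean`, the `detect` argument, and `toy_detect` are hypothetical stand-ins, and a real `detect` would wrap a FindChimeras run against the current reference.

```r
# `detect` is a hypothetical stand-in: given the current reference IDs it
# returns the IDs flagged as chimeric (in practice, a FindChimeras run).
iterative_clean <- function(ids, detect, max_iter = 10L) {
  for (i in seq_len(max_iter)) {
    flagged <- intersect(detect(ids), ids)
    if (length(flagged) == 0L) break   # converged: nothing left to remove
    ids <- setdiff(ids, flagged)
  }
  ids
}

# Toy detector: flags the largest remaining ID while any ID exceeds 3.
toy_detect <- function(ids) if (any(ids > 3)) max(ids) else integer(0)

cleaned <- iterative_clean(1:6, toy_detect)
```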
For non-16S sequences it may be necessary to optimize the parameters for the particular sequences. The simplest way to perform an optimization is to experiment with different input parameters on artificial chimeras such as those created using CreateChimeras. Adjusting input parameters until the maximum number of artificial chimeras are identified is the easiest way to determine new defaults.
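One way to organize such an experiment is a small grid search. The sketch below shows only the bookkeeping: `tune_parameters`, `count_detected`, and `toy_score` are hypothetical names, and the toy scorer returns made-up numbers so the code runs on its own. A real scorer would run FindChimeras on artificial chimeras with the given settings and return how many were flagged.

```r
# `count_detected` is a hypothetical scorer taking the two parameters and
# returning the number of artificial chimeras that were identified.
tune_parameters <- function(grid, count_detected) {
  grid$detected <- mapply(count_detected,
                          grid$minCoverage, grid$minSuspectFragments)
  grid[which.max(grid$detected), , drop = FALSE]
}

# Candidate settings to try (values chosen for illustration).
grid <- expand.grid(minCoverage = c(0.2, 0.4, 0.6),
                    minSuspectFragments = c(2, 4))

# Toy scorer so the sketch is runnable; replace with a real evaluation.
toy_score <- function(minCoverage, minSuspectFragments) {
  round(100 - 50 * minCoverage - 5 * minSuspectFragments)
}

best <- tune_parameters(grid, toy_score)
```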
A data.frame containing only the sequences that meet the specifications for being chimeric. The chimera column contains information on the chimeric region and to which group it belongs. The row.names of the data.frame correspond to those of the sequences in the dbFile.
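Because the row names match those of the sequences in the dbFile, they can be used to separate flagged sequences from the rest. The sketch below uses a mock result with placeholder text in the chimera column; the mock values, `all_rows`, and the row names are assumptions, and the actual content of the chimera column is produced by FindChimeras.

```r
# Mock result with placeholder content; real output comes from FindChimeras.
res <- data.frame(chimera = c("<region info>", "<region info>"),
                  row.names = c("12", "47"),
                  stringsAsFactors = FALSE)

all_rows <- as.character(1:50)           # hypothetical row names in dbFile
flagged  <- rownames(res)                # sequences reported as chimeric
keep     <- setdiff(all_rows, flagged)   # sequences to retain
```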
Erik Wright eswright@pitt.edu
ES Wright et al. (2012) "DECIPHER: A Search-Based Approach to Chimera Identification for 16S rRNA Sequences." Applied and Environmental Microbiology, doi:10.1128/AEM.06516-11.
library(DECIPHER)
db <- system.file("extdata", "Bacteria_175seqs.sqlite", package="DECIPHER")
# It is necessary to set dbFileReference to the file path of the
# 16S reference database available from http://DECIPHER.codes
chimeras <- FindChimeras(db, dbFileReference=db)
No chimeras found.
Time difference of 0.2 secs