| filter_sequences | R Documentation |
Filters DNA sequences by minimum read count within a PCR replicate, minimum proportion within a PCR replicate, and number of detections across PCR replicates.
filter_sequences(
  input_files,
  samples,
  PCR_replicates,
  output_file,
  minimum_reads.PCR_replicate = 1,
  minimum_reads.sequence = 1,
  minimum_proportion.sequence = 0.005,
  binomial_test.enabled = TRUE,
  binomial_test.p.adjust.method = "none",
  binomial_test.alpha_level = 0.05,
  minimum_PCR_replicates = 2,
  delimiter.read_counts = ": ",
  delimiter.PCR_replicates = ", "
)
input_files |
A character vector of file paths to input FASTA files. DNA sequences in the input FASTA files are assumed to be summarized by frequency of occurrence, with each FASTA header line beginning with "Frequency: " and followed by the sequence's read count. Output FASTA files from truncate_and_merge_pairs have this format and can be used directly with this function. |
samples |
A character vector of sample identifiers, with one element for each element of |
PCR_replicates |
A character vector of PCR replicate identifiers, with one element for each element of |
output_file |
String specifying path to output file of filtered sequences in CSV format. |
minimum_reads.PCR_replicate |
Numeric. PCR replicates which contain fewer reads than this value are discarded and do not contribute detections to any sequence. The default is 1. |
minimum_reads.sequence |
Numeric. For a sequence to be considered detected within a PCR replicate, the sequence's read count within the PCR replicate must match or exceed this value. The default is 1. |
minimum_proportion.sequence |
Numeric. For a sequence to be considered detected within a PCR replicate, the proportion of reads in the PCR replicate comprised by the sequence must exceed this value. If |
binomial_test.enabled |
Logical. If |
binomial_test.p.adjust.method |
String specifying the p-value adjustment method for multiple hypothesis testing. p-value adjustments are performed within each PCR replicate for each sample. Passed to the |
binomial_test.alpha_level |
Numeric. The alpha level used in deciding whether the proportion of reads in a PCR replicate comprised by a sequence significantly exceeds a minimum threshold required for detection. See the |
minimum_PCR_replicates |
Numeric. The minimum number of PCR replicates in which a sequence must be detected in order to be considered present (i.e., not erroneous) in a sample. The default is 2. |
delimiter.read_counts |
String specifying the delimiter between PCR replicate identifiers and sequence read counts in the Read_count_by_PCR_replicate field of the output CSV file (see details section). The default is ": ". |
delimiter.PCR_replicates |
String specifying the delimiter between PCR replicates in the Read_count_by_PCR_replicate field of the output CSV file (see details section). The default is ", ". |
For each set of input polymerase chain reaction (PCR) replicate FASTA files associated with a sample, writes out DNA sequences which are detected across a minimum number of PCR replicates (minimum_PCR_replicates argument). Detection within a PCR replicate is defined as a sequence having at least a minimum read count and exceeding a minimum proportion of reads (minimum_reads.sequence and minimum_proportion.sequence arguments, respectively). When binomial_test.enabled = TRUE, a sequence must significantly exceed the minimum proportion within a PCR replicate at the provided alpha level (binomial_test.alpha_level argument) based on a one-sided binomial test (i.e., binomial_test with alternative = "greater"). Within a PCR replicate, p-values can be adjusted for multiple hypothesis testing by setting the binomial_test.p.adjust.method argument (see stats::p.adjust.methods and p.adjust in the stats package). PCR replicates which contain fewer than a minimum number of reads are discarded (minimum_reads.PCR_replicate argument) and do not contribute detections to any sequence.
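The within-replicate detection rule described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the read counts are made up, pbinom stands in for the package's vectorized binomial_test (assuming the one-sided "greater" test is the upper binomial tail), and "significantly exceeds" is assumed to mean an adjusted p-value strictly below the alpha level.

```r
# Hypothetical read counts for four sequences in one PCR replicate.
read_counts <- c(seq_A = 900, seq_B = 80, seq_C = 15, seq_D = 5)
total_reads <- sum(read_counts)  # 1000 reads in the replicate

# Thresholds mirroring the documented defaults.
minimum_reads.sequence <- 1
minimum_proportion.sequence <- 0.005
alpha_level <- 0.05

# One-sided binomial test (alternative = "greater"):
# P(X >= x) when the true proportion equals the minimum proportion.
p_values <- pbinom(q = read_counts - 1,
                   size = total_reads,
                   prob = minimum_proportion.sequence,
                   lower.tail = FALSE)

# Adjust p-values within the PCR replicate ("none" leaves them unchanged).
p_adjusted <- p.adjust(p = p_values, method = "none")

# Assumed rule: detected if the read count meets the minimum and the
# adjusted p-value falls strictly below the alpha level.
detected <- read_counts >= minimum_reads.sequence & p_adjusted < alpha_level
print(detected)
```

In this sketch seq_D sits exactly at the 0.5% threshold (5 of 1000 reads), so it does not significantly exceed the minimum proportion and is not counted as detected, while seq_C at 1.5% is.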
DNA sequences in the input FASTA files are assumed to be summarized by frequency of occurrence, with each FASTA header line beginning with "Frequency: " and followed by the sequence's read count. Output FASTA files from truncate_and_merge_pairs have this format and can be used directly with this function. Each input FASTA file is assumed to contain the DNA sequence reads for a single PCR replicate for a single sample.
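To make the expected input format concrete, the following sketch writes and parses a small FASTA file summarized by frequency of occurrence. The file contents are hypothetical, and the sketch assumes each header consists only of ">Frequency: " followed by the read count and that each sequence occupies a single line; real files may differ.

```r
# Write a small hypothetical FASTA file in the assumed format.
fasta_path <- tempfile(fileext = ".fasta")
writeLines(c(">Frequency: 120", "ACGTACGT",
             ">Frequency: 30", "ACGTTCGT"),
           con = fasta_path)

# Read the file back and separate header lines from sequence lines.
fasta_lines <- readLines(fasta_path)
is_header <- startsWith(fasta_lines, ">")

# Strip the ">Frequency: " prefix to recover each sequence's read count.
read_counts <- as.integer(sub(pattern = "^>Frequency: ",
                              replacement = "",
                              x = fasta_lines[is_header]))

# Assumes one sequence line per header.
sequences <- fasta_lines[!is_header]

summary_table <- data.frame(Sequence = sequences,
                            Read_count = read_counts,
                            stringsAsFactors = FALSE)
print(summary_table)
```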
For pipeline calibration purposes, a data frame containing the unfiltered DNA sequences with their read counts, proportions, and p-values in each PCR replicate is invisibly returned (see return value section). The primary output of this function is the written CSV file of filtered sequences (described below), but the returned data frame can help when calibrating or troubleshooting filtering parameters. For that reason, it is invisibly returned even if the error "Filtering removed all sequences" is received.
For the primary output, this function writes a CSV file of filtered DNA sequences with the following field definitions:
Sample: The sample name.
Sequence: The DNA sequence.
Detections_across_PCR_replicates: The number of PCR replicates the sequence was detected in.
Read_count_by_PCR_replicate: The sequence's read count in each PCR replicate the sequence was detected in.
Sequence_read_count: The sequence's total read count across the PCR replicates the sequence was detected in. Calculated as the sum of the read counts in the Read_count_by_PCR_replicate field.
Sample_read_count: The sample's total read count across all sequences detected in the PCR replicates. Calculated as the sum of the read counts in the Sequence_read_count field associated with the sample.
Proportion_of_sample: The proportion of sample reads comprised by the sequence. Calculated by dividing the Sequence_read_count field by the Sample_read_count field. Equivalent to the weighted average of the sequence's proportion in each PCR replicate, with weights given by the proportion of the sample's total reads contained in each PCR replicate.
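The field definitions above can be worked through with a small example. The read counts below are hypothetical, format_counts is an illustrative helper rather than a package function, and the weighted-average check assumes each PCR replicate's total is taken over the detected reads that make up Sample_read_count.

```r
# Hypothetical detected read counts (rows: sequences, columns: PCR replicates).
# NA marks a PCR replicate in which the sequence was not detected.
counts <- rbind(seq_1 = c(P01 = 100, P02 = 300, P03 = 50),
                seq_2 = c(P01 = 50, P02 = 100, P03 = NA))

# Illustrative helper: build the Read_count_by_PCR_replicate field
# using the documented default delimiters.
format_counts <- function(x,
                          delimiter.read_counts = ": ",
                          delimiter.PCR_replicates = ", ") {
  x <- x[!is.na(x)]
  paste0(names(x), delimiter.read_counts, x,
         collapse = delimiter.PCR_replicates)
}

Read_count_by_PCR_replicate <- apply(counts, 1, format_counts)
Detections_across_PCR_replicates <- rowSums(!is.na(counts))
Sequence_read_count <- rowSums(counts, na.rm = TRUE)
Sample_read_count <- sum(Sequence_read_count)
Proportion_of_sample <- Sequence_read_count / Sample_read_count

# Check the weighted-average equivalence (under the assumption above).
replicate_totals <- colSums(counts, na.rm = TRUE)
weights <- replicate_totals / Sample_read_count
counts_zero <- counts
counts_zero[is.na(counts_zero)] <- 0
proportion_by_replicate <- sweep(counts_zero, 2, replicate_totals, "/")
weighted_average <- as.vector(proportion_by_replicate %*% weights)

print(Read_count_by_PCR_replicate)
print(Proportion_of_sample)
print(weighted_average)
```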
Invisibly returns a data frame containing unfiltered DNA sequences with their read counts, proportions, and p-values in each PCR replicate. While the primary output of this function is the written CSV file of filtered sequences described in the details section, the invisibly returned data frame of unfiltered sequences can be helpful when calibrating or troubleshooting filtering parameters. To aid in troubleshooting filtering parameters, the data frame is invisibly returned even if the error "Filtering removed all sequences" is received. Field definitions for the invisibly returned data frame of unfiltered sequences are:
Sample: The sample name.
PCR_replicate: The PCR replicate identifier.
Sequence: The DNA sequence.
Read_count.sequence: The sequence's read count within the PCR replicate.
Read_count.PCR_replicate: The number of reads in the PCR replicate.
Proportion_of_PCR_replicate.observed: The proportion of reads in the PCR replicate comprised by the sequence.
Proportion_of_PCR_replicate.null (Field only present if binomial_test.enabled = TRUE): The null hypothesis for a one-sided binomial test (inherited from the minimum_proportion.sequence argument). See the p.value field below.
p.value (Field only present if binomial_test.enabled = TRUE): The p-value from a one-sided binomial test of whether the proportion of reads in the PCR replicate comprised by the sequence exceeds the null hypothesis (i.e., binomial_test with alternative = "greater").
p.value.adjusted (Field only present if binomial_test.enabled = TRUE): The p-value from the one-sided binomial test adjusted for multiple comparisons within each PCR replicate for each sample. See the p.value_adjustment_method field below.
p.value_adjustment_method (Field only present if binomial_test.enabled = TRUE): The p-value adjustment method (inherited from the binomial_test.p.adjust.method argument).
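As a rough illustration of how the returned data frame might be used for calibration, the sketch below re-applies an alpha level and a replicate threshold to a hand-built data frame. The rows are hypothetical and include only a subset of the documented fields; the Detected and Retained columns are illustrative names, and the read-count and PCR replicate size filters are ignored for brevity.

```r
# Hypothetical stand-in for the invisibly returned data frame,
# restricted to the fields used here.
unfiltered <- data.frame(
  Sample = "S01",
  PCR_replicate = c("P01", "P02", "P03", "P01", "P02"),
  Sequence = c("ACGTACGT", "ACGTACGT", "ACGTACGT", "ACGTTCGT", "ACGTTCGT"),
  p.value.adjusted = c(1e-10, 1e-8, 1e-9, 0.01, 0.40),
  stringsAsFactors = FALSE
)

# Candidate filtering parameters to evaluate.
alpha_level <- 0.05
minimum_PCR_replicates <- 2

# Flag detections within each PCR replicate (assumed rule: p < alpha).
unfiltered$Detected <- unfiltered$p.value.adjusted < alpha_level

# Count detections across PCR replicates for each sample and sequence.
detections <- aggregate(Detected ~ Sample + Sequence,
                        data = unfiltered,
                        FUN = sum)

# Sequences detected in enough PCR replicates would be retained.
detections$Retained <- detections$Detected >= minimum_PCR_replicates
print(detections)
```

Repeating this with different values of alpha_level or minimum_PCR_replicates shows which sequences would survive each parameter combination without rerunning the full function.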
A manuscript describing these methods is in preparation.
binomial_test for performing vectorized one-sided binomial tests.
truncate_and_merge_pairs for truncating and merging read pairs prior to sequence filtering.
local_taxa_tool for performing geographically-conscious taxonomic assignment of filtered sequences.
# Get example FASTA files.
input_files <- system.file("extdata",
                           paste0(rep(x = paste0("S0", 1:3), each = 3),
                                  "P0", 1:3, ".fasta"),
                           package = "LocaTT",
                           mustWork = TRUE)

# Create path for temporary output file.
output_file <- tempfile(fileext = ".csv")

# Specify samples.
samples <- rep(x = paste0("S0", 1:3), each = 3)

# Specify replicates.
PCR_replicates <- rep(x = paste0("P0", 1:3), times = 3)

# Filter sequences.
filter_sequences(input_files = input_files,
                 samples = samples,
                 PCR_replicates = PCR_replicates,
                 output_file = output_file)