filter_sequences: Filter DNA Sequences by PCR Replicates
In LocaTT: Geographically-Conscious Taxonomic Assignment for Metabarcoding

Description

Filters DNA sequences by minimum read count within a PCR replicate, minimum proportion within a PCR replicate, and number of detections across PCR replicates.

Usage

filter_sequences(
  input_files,
  samples,
  PCR_replicates,
  output_file,
  minimum_reads.PCR_replicate = 1,
  minimum_reads.sequence = 1,
  minimum_proportion.sequence = 0.005,
  binomial_test.enabled = TRUE,
  binomial_test.p.adjust.method = "none",
  binomial_test.alpha_level = 0.05,
  minimum_PCR_replicates = 2,
  delimiter.read_counts = ": ",
  delimiter.PCR_replicates = ", "
)

Arguments

`input_files`	A character vector of file paths to input FASTA files. DNA sequences in the input FASTA files are assumed to be summarized by frequency of occurrence, with each FASTA header line beginning with "Frequency: " and followed by the sequence's read count. Output FASTA files from `truncate_and_merge_pairs` have this format and can be used directly with this function. Each input FASTA file is assumed to contain the DNA sequence reads for a single PCR replicate for a single sample.
`samples`	A character vector of sample identifiers, with one element for each element of `input_files`.
`PCR_replicates`	A character vector of PCR replicate identifiers, with one element for each element of `input_files`.
`output_file`	String specifying path to output file of filtered sequences in CSV format.
`minimum_reads.PCR_replicate`	Numeric. PCR replicates which contain fewer reads than this value are discarded and do not contribute detections to any sequence. The default is `1` (i.e., no PCR replicates discarded).
`minimum_reads.sequence`	Numeric. For a sequence to be considered detected within a PCR replicate, the sequence's read count within the PCR replicate must match or exceed this value. The default is `1` (i.e., no filtering by minimum read count within PCR replicates).
`minimum_proportion.sequence`	Numeric. For a sequence to be considered detected within a PCR replicate, the proportion of reads in the PCR replicate comprised by the sequence must exceed this value. If `binomial_test.enabled = TRUE`, then this value instead serves as the null hypothesis proportion for a one-sided binomial test, and detection requires the observed proportion to significantly exceed it. See the `binomial_test.enabled` argument below. The default is `0.005` (i.e., 0.5%). To disable sequence filtering by minimum proportion within PCR replicates, set to `0`.
`binomial_test.enabled`	Logical. If `TRUE` (the default), then for a sequence to be considered detected within a PCR replicate, the proportion of reads in the PCR replicate comprised by the sequence must significantly exceed the value of the `minimum_proportion.sequence` argument at the provided alpha level (`binomial_test.alpha_level` argument) based on a one-sided binomial test (i.e., `binom.test` with `alternative = "greater"`). Optionally, p-values within a PCR replicate can be adjusted for multiple hypothesis testing by setting the `binomial_test.p.adjust.method` argument below. To disable significance testing, set to `FALSE` (minimum proportion filtering will still occur if `minimum_proportion.sequence > 0`, see above).
`binomial_test.p.adjust.method`	String specifying the p-value adjustment method for multiple hypothesis testing. p-value adjustments are performed within each PCR replicate for each sample. Passed to the `method` argument of `p.adjust` in the `stats` package. Available methods are contained within the `stats::p.adjust.methods` vector. If `"none"` (the default), then p-value adjustments are not performed. Ignored if `binomial_test.enabled = FALSE`.
`binomial_test.alpha_level`	Numeric. The alpha level used in deciding whether the proportion of reads in a PCR replicate comprised by a sequence significantly exceeds a minimum threshold required for detection. See the `binomial_test.enabled` argument. The default is `0.05`. Ignored if `binomial_test.enabled = FALSE`.
`minimum_PCR_replicates`	Numeric. The minimum number of PCR replicates in which a sequence must be detected in order to be considered present (i.e., not erroneous) in a sample. The default is `2`.
`delimiter.read_counts`	String specifying the delimiter between PCR replicate identifiers and sequence read counts in the Read_count_by_PCR_replicate field of the output CSV file (see details section). The default is `": "`.
`delimiter.PCR_replicates`	String specifying the delimiter between PCR replicates in the Read_count_by_PCR_replicate field of the output CSV file (see details section). The default is `", "`.
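
As an illustration of how the two delimiter arguments shape the Read_count_by_PCR_replicate field, the following sketch parses a field value back into per-replicate read counts. The field value shown is hypothetical, and its layout (identifier, read-count delimiter, count, joined by the replicate delimiter) is an assumption inferred from the argument descriptions above rather than output copied from the function.

```r
# Hypothetical field value, assuming the default delimiters ": " and ", ".
field <- "P01: 120, P02: 98, P03: 143"

# Split the field into one entry per PCR replicate.
entries <- strsplit(field, split = ", ", fixed = TRUE)[[1]]

# Split each entry into its PCR replicate identifier and read count.
parts <- strsplit(entries, split = ": ", fixed = TRUE)

# Build a named numeric vector of read counts keyed by PCR replicate.
read_counts <- setNames(
  as.numeric(vapply(parts, `[`, character(1), 2)),
  vapply(parts, `[`, character(1), 1)
)

print(read_counts)
```

If identifiers could themselves contain the default delimiters, choosing different values for `delimiter.read_counts` and `delimiter.PCR_replicates` would keep the field unambiguous to parse.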

Details

For each set of input polymerase chain reaction (PCR) replicate FASTA files associated with a sample, writes out DNA sequences which are detected across a minimum number of PCR replicates (`minimum_PCR_replicates` argument). Detection within a PCR replicate is defined as a sequence having at least a minimum read count and exceeding a minimum proportion of reads (`minimum_reads.sequence` and `minimum_proportion.sequence` arguments, respectively). When `binomial_test.enabled = TRUE`, a sequence must significantly exceed the minimum proportion within a PCR replicate at the provided alpha level (`binomial_test.alpha_level` argument) based on a one-sided binomial test (i.e., `binom.test` with `alternative = "greater"`). Within a PCR replicate, p-values can be adjusted for multiple hypothesis testing by setting the `binomial_test.p.adjust.method` argument (see `stats::p.adjust.methods` and `p.adjust` in the `stats` package). PCR replicates which contain fewer than a minimum number of reads are discarded (`minimum_reads.PCR_replicate` argument) and do not contribute detections to any sequence.
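
The within-replicate detection rule described above can be sketched with base R as follows. This is a minimal illustration and not the package's internal code: the read counts are hypothetical, and comparing p-values to the alpha level with `<=` (rather than `<`) is an assumption.

```r
# Hypothetical read counts for four sequences in one PCR replicate.
read_counts <- c(seq_A = 897, seq_B = 80, seq_C = 15, seq_D = 8)
total_reads <- sum(read_counts)

# Filtering parameters, set to the documented defaults.
minimum_reads.sequence <- 1
minimum_proportion.sequence <- 0.005
alpha_level <- 0.05

# One-sided binomial test of whether each sequence's proportion exceeds
# the minimum proportion (the null hypothesis).
p_values <- vapply(read_counts, function(x) {
  binom.test(x = x, n = total_reads, p = minimum_proportion.sequence,
             alternative = "greater")$p.value
}, numeric(1))

# Optional adjustment for multiple testing within the PCR replicate
# ("none" leaves the p-values unchanged).
p_adjusted <- p.adjust(p_values, method = "none")

# Detection with the binomial test enabled (assumes p <= alpha counts as
# significant).
detected <- read_counts >= minimum_reads.sequence & p_adjusted <= alpha_level

# For comparison, detection by the plain proportion threshold alone.
passes_threshold_only <- read_counts / total_reads > minimum_proportion.sequence

print(data.frame(read_counts, p_adjusted, detected, passes_threshold_only))
```

In this sketch, seq_D makes up 0.8% of reads and so passes the plain 0.5% threshold, but its excess over the threshold is not statistically significant at the 0.05 level, so it is not counted as detected when the binomial test is applied.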

DNA sequences in the input FASTA files are assumed to be summarized by frequency of occurrence, with each FASTA header line beginning with "Frequency: " and followed by the sequence's read count. Output FASTA files from truncate_and_merge_pairs have this format and can be used directly with this function. Each input FASTA file is assumed to contain the DNA sequence reads for a single PCR replicate for a single sample.
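
To make the expected input format concrete, the sketch below writes a small hypothetical FASTA file and reads the read counts back from its header lines. It assumes that each header consists of the usual FASTA ">" character followed by "Frequency: " and the read count only, and that each sequence occupies a single line; real files may differ in these details.

```r
# Write a small hypothetical FASTA file in the assumed format.
fasta_path <- tempfile(fileext = ".fasta")
writeLines(c(">Frequency: 120", "ACGTACGT",
             ">Frequency: 3", "ACGTTCGT"),
           fasta_path)

# Read the file and separate header lines from sequence lines.
fasta_lines <- readLines(fasta_path)
is_header <- startsWith(fasta_lines, ">")

# Strip the ">Frequency: " prefix to recover each sequence's read count.
read_counts <- as.numeric(sub("^>Frequency: ", "", fasta_lines[is_header]))

# Pair each sequence with its read count (assumes one line per sequence).
parsed <- data.frame(Sequence = fasta_lines[!is_header],
                     Read_count = read_counts)

print(parsed)
```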

For pipeline calibration purposes, a data frame containing unfiltered DNA sequences with their read counts, proportions, and p-values in each PCR replicate is invisibly returned (see return value section). The primary output of this function is the written CSV file of filtered sequences (described below), but the invisibly returned data frame can be helpful when calibrating or troubleshooting filtering parameters. For this reason, the data frame is invisibly returned even if the error "Filtering removed all sequences" is raised.

For the primary output, this function writes a CSV file of filtered DNA sequences with the following field definitions:

Sample: The sample name.
Sequence: The DNA sequence.
Detections_across_PCR_replicates: The number of PCR replicates the sequence was detected in.
Read_count_by_PCR_replicate: The sequence's read count in each PCR replicate the sequence was detected in.
Sequence_read_count: The sequence's total read count across the PCR replicates the sequence was detected in. Calculated as the sum of the read counts in the Read_count_by_PCR_replicate field.
Sample_read_count: The sample's total read count across all sequences detected in the PCR replicates. Calculated as the sum of the read counts in the Sequence_read_count field associated with the sample.
Proportion_of_sample: The proportion of sample reads comprised by the sequence. Calculated by dividing the Sequence_read_count field by the Sample_read_count field. Equivalent to the weighted average of the sequence's proportion in each PCR replicate, with weights given by the proportion of the sample's total reads contained in each PCR replicate.
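
The equivalence noted for the Proportion_of_sample field can be checked numerically. The sketch below uses hypothetical read counts and assumes that each PCR replicate's total is counted over the same reads that are summed into Sample_read_count.

```r
# Hypothetical read counts for one sequence in three PCR replicates.
sequence_reads <- c(P01 = 120, P02 = 98, P03 = 143)

# Hypothetical total reads contributing to the sample in each PCR replicate.
replicate_reads <- c(P01 = 400, P02 = 350, P03 = 500)

# Direct calculation: Sequence_read_count divided by Sample_read_count.
proportion_direct <- sum(sequence_reads) / sum(replicate_reads)

# Weighted average of the per-replicate proportions, weighted by each
# replicate's share of the sample's total reads.
proportions <- sequence_reads / replicate_reads
weights <- replicate_reads / sum(replicate_reads)
proportion_weighted <- sum(proportions * weights)

# The two calculations agree up to floating-point tolerance.
print(all.equal(proportion_direct, proportion_weighted))
```

The replicate totals cancel inside the weighted sum, which is why larger PCR replicates contribute proportionally more to the sample-level proportion.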

Value

Invisibly returns a data frame containing unfiltered DNA sequences with their read counts, proportions, and p-values in each PCR replicate. While the primary output of this function is the written CSV file of filtered sequences described in the details section, the invisibly returned data frame of unfiltered sequences can be helpful when calibrating or troubleshooting filtering parameters. To aid in troubleshooting filtering parameters, the data frame is invisibly returned even if the error "Filtering removed all sequences" is received. Field definitions for the invisibly returned data frame of unfiltered sequences are:

Sample: The sample name.
PCR_replicate: The PCR replicate identifier.
Sequence: The DNA sequence.
Read_count.sequence: The sequence's read count within the PCR replicate.
Read_count.PCR_replicate: The number of reads in the PCR replicate.
Proportion_of_PCR_replicate.observed: The proportion of reads in the PCR replicate comprised by the sequence.
Proportion_of_PCR_replicate.null (Field only present if binomial_test.enabled = TRUE): The null hypothesis proportion for a one-sided binomial test (inherited from the minimum_proportion.sequence argument). See the p.value field below.
p.value (Field only present if binomial_test.enabled = TRUE): The p-value from a one-sided binomial test of whether the proportion of reads in the PCR replicate comprised by the sequence exceeds the null hypothesis proportion (i.e., binom.test with alternative = "greater").
p.value.adjusted (Field only present if binomial_test.enabled = TRUE): The p-value from the one-sided binomial test adjusted for multiple comparisons within each PCR replicate for each sample. See the p.value_adjustment_method field below.
p.value_adjustment_method (Field only present if binomial_test.enabled = TRUE): The p-value adjustment method (inherited from the binomial_test.p.adjust.method argument).
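
As a sketch of how the returned data frame might be used for calibration, the code below builds a hypothetical stand-in with a subset of the fields listed above and counts detections across PCR replicates under two candidate alpha levels. The read counts are invented for illustration, the `count_detections` helper is hypothetical and not part of LocaTT, and treating `p <= alpha` as significant is an assumption.

```r
# Hypothetical stand-in for the invisibly returned data frame
# (one sample, three PCR replicates, two sequences).
unfiltered <- data.frame(
  Sample = "S01",
  PCR_replicate = rep(c("P01", "P02", "P03"), each = 2),
  Sequence = rep(c("ACGTACGT", "ACGTTCGT"), times = 3),
  Read_count.sequence = c(900, 12, 850, 9, 920, 20),
  Read_count.PCR_replicate = 1000
)

# One-sided binomial test p-values against a null proportion of 0.005.
unfiltered$p.value <- mapply(
  function(x, n) {
    binom.test(x = x, n = n, p = 0.005, alternative = "greater")$p.value
  },
  unfiltered$Read_count.sequence,
  unfiltered$Read_count.PCR_replicate
)

# With the "none" adjustment method, adjusted p-values equal the raw ones.
unfiltered$p.value.adjusted <- unfiltered$p.value

# Hypothetical helper: number of PCR replicates in which each sequence
# would be detected at a given alpha level.
count_detections <- function(data, alpha_level) {
  detected <- data[data$p.value.adjusted <= alpha_level, ]
  table(factor(detected$Sequence, levels = unique(data$Sequence)))
}

detections_lenient <- count_detections(unfiltered, alpha_level = 0.05)
detections_strict <- count_detections(unfiltered, alpha_level = 0.001)

# Sequences meeting a minimum of 2 PCR replicates under each alpha level.
print(names(which(detections_lenient >= 2)))
print(names(which(detections_strict >= 2)))
```

Comparing the two results shows how a stricter alpha level can drop a low-abundance sequence below the `minimum_PCR_replicates` requirement while leaving an abundant sequence unaffected.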

References

A manuscript describing these methods is in preparation.

Examples

# Get example FASTA files.
input_files<-system.file("extdata",
                         paste0(rep(x=paste0("S0",1:3),
                                    each=3),
                                "P0",1:3,".fasta"),
                         package="LocaTT",
                         mustWork=TRUE)

# Create path for temporary output file.
output_file<-tempfile(fileext=".csv")

# Specify samples.
samples<-rep(x=paste0("S0",1:3),each=3)

# Specify replicates.
PCR_replicates<-rep(x=paste0("P0",1:3),times=3)

# Filter sequences.
filter_sequences(input_files=input_files,
                 samples=samples,
                 PCR_replicates=PCR_replicates,
                 output_file=output_file)

LocaTT documentation built on June 14, 2026, 1:06 a.m.