filter_sequences: Filter DNA Sequences by PCR Replicates

Description

Filters DNA sequences by minimum read count within a PCR replicate, minimum proportion within a PCR replicate, and number of detections across PCR replicates.

Usage

filter_sequences(
  input_files,
  samples,
  PCR_replicates,
  output_file,
  minimum_reads.PCR_replicate = 1,
  minimum_reads.sequence = 1,
  minimum_proportion.sequence = 0.005,
  binomial_test.enabled = TRUE,
  binomial_test.p.adjust.method = "none",
  binomial_test.alpha_level = 0.05,
  minimum_PCR_replicates = 2,
  delimiter.read_counts = ": ",
  delimiter.PCR_replicates = ", "
)

Arguments

input_files

A character vector of file paths to input FASTA files. DNA sequences in the input FASTA files are assumed to be summarized by frequency of occurrence, with each FASTA header line beginning with "Frequency: " and followed by the sequence's read count. Output FASTA files from truncate_and_merge_pairs have this format and can be used directly with this function. Each input FASTA file is assumed to contain the DNA sequence reads for a single PCR replicate for a single sample.

samples

A character vector of sample identifiers, with one element for each element of input_files.

PCR_replicates

A character vector of PCR replicate identifiers, with one element for each element of input_files.

output_file

String specifying the path to the output file of filtered sequences in CSV format.

minimum_reads.PCR_replicate

Numeric. PCR replicates which contain fewer reads than this value are discarded and do not contribute detections to any sequence. The default is 1 (i.e., no PCR replicates discarded).

minimum_reads.sequence

Numeric. For a sequence to be considered detected within a PCR replicate, the sequence's read count within the PCR replicate must match or exceed this value. The default is 1 (i.e., no filtering by minimum read count within PCR replicates).

minimum_proportion.sequence

Numeric. For a sequence to be considered detected within a PCR replicate, the proportion of reads in the PCR replicate comprised by the sequence must exceed this value. If binomial_test.enabled = TRUE, this value instead serves as the null hypothesis proportion for a one-sided binomial test, and the sequence's proportion must significantly exceed it to satisfy the minimum proportion requirement (see the binomial_test.enabled argument below). The default is 0.005 (i.e., 0.5%). To disable sequence filtering by minimum proportion within PCR replicates, set to 0.

binomial_test.enabled

Logical. If TRUE (the default), then for a sequence to be considered detected within a PCR replicate, the proportion of reads in the PCR replicate comprised by the sequence must significantly exceed the value of the minimum_proportion.sequence argument at the provided alpha level (binomial_test.alpha_level argument) based on a one-sided binomial test (i.e., binomial_test with alternative = "greater"). Optionally, p-values within a PCR replicate can be adjusted for multiple hypothesis testing by setting the binomial_test.p.adjust.method argument below. To disable significance testing, set to FALSE (minimum proportion filtering will still occur if minimum_proportion.sequence > 0, see above).

binomial_test.p.adjust.method

String specifying the p-value adjustment method for multiple hypothesis testing. p-value adjustments are performed within each PCR replicate for each sample. Passed to the method argument of p.adjust in the stats package. Available methods are contained within the stats::p.adjust.methods vector. If "none" (the default), then p-value adjustments are not performed. Ignored if binomial_test.enabled = FALSE.

binomial_test.alpha_level

Numeric. The alpha level used in deciding whether the proportion of reads in a PCR replicate comprised by a sequence significantly exceeds a minimum threshold required for detection. See the binomial_test.enabled argument. The default is 0.05. Ignored if binomial_test.enabled = FALSE.

minimum_PCR_replicates

Numeric. The minimum number of PCR replicates in which a sequence must be detected in order to be considered present (i.e., not erroneous) in a sample. The default is 2.

delimiter.read_counts

String specifying the delimiter between PCR replicate identifiers and sequence read counts in the Read_count_by_PCR_replicate field of the output CSV file (see details section). The default is ": ".

delimiter.PCR_replicates

String specifying the delimiter between PCR replicates in the Read_count_by_PCR_replicate field of the output CSV file (see details section). The default is ", ".
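
The effect of the two delimiter arguments can be illustrated with a minimal sketch. It assumes the Read_count_by_PCR_replicate field is formed by joining each PCR replicate identifier to its read count and then joining the replicates; the code inside filter_sequences is not shown in this documentation, and the identifiers and counts below are made up.

```r
# Hypothetical read counts for one sequence in three PCR replicates.
read_counts <- c(P01 = 120, P02 = 95, P03 = 143)

# Default delimiter values from the Usage section.
delimiter.read_counts <- ": "
delimiter.PCR_replicates <- ", "

# Assumed construction: join each replicate identifier to its read count,
# then join the replicates into a single string.
field <- paste(paste(names(read_counts), read_counts,
                     sep = delimiter.read_counts),
               collapse = delimiter.PCR_replicates)
print(field)

# Split the field back into a named numeric vector.
pairs <- strsplit(strsplit(field, delimiter.PCR_replicates, fixed = TRUE)[[1]],
                  delimiter.read_counts, fixed = TRUE)
parsed <- setNames(as.numeric(vapply(pairs, `[`, character(1), 2)),
                   vapply(pairs, `[`, character(1), 1))
print(parsed)
```

Choosing delimiters that do not occur in the PCR replicate identifiers keeps the field straightforward to split again.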

Details

For each set of input polymerase chain reaction (PCR) replicate FASTA files associated with a sample, writes out DNA sequences which are detected across a minimum number of PCR replicates (minimum_PCR_replicates argument). Detection within a PCR replicate is defined as a sequence having at least a minimum read count and exceeding a minimum proportion of reads (minimum_reads.sequence and minimum_proportion.sequence arguments, respectively). When binomial_test.enabled = TRUE, a sequence must significantly exceed the minimum proportion within a PCR replicate at the provided alpha level (binomial_test.alpha_level argument) based on a one-sided binomial test (i.e., binomial_test with alternative = "greater"). Within a PCR replicate, p-values can be adjusted for multiple hypothesis testing by setting the binomial_test.p.adjust.method argument (see stats::p.adjust.methods and p.adjust in the stats package). PCR replicates which contain fewer than a minimum number of reads are discarded (minimum_reads.PCR_replicate argument) and do not contribute detections to any sequence.
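
The detection rule described above can be sketched in base R. This is an illustration under stated assumptions, not the package's implementation: pbinom from the stats package stands in for binomial_test, the comparison against the alpha level is assumed to be "less than or equal to", and all read counts are made up.

```r
# Hypothetical read counts for four sequences in one PCR replicate.
read_counts <- c(seq_A = 900, seq_B = 80, seq_C = 15, seq_D = 5)
total_reads <- sum(read_counts)  # 1000 reads in the replicate

# Default argument values from the Usage section.
minimum_reads.sequence <- 1
minimum_proportion.sequence <- 0.005
binomial_test.alpha_level <- 0.05

# One-sided binomial test (alternative = "greater"):
# P(X >= k) for X ~ Binomial(n, p0) equals P(X > k - 1).
p_values <- pbinom(q = read_counts - 1,
                   size = total_reads,
                   prob = minimum_proportion.sequence,
                   lower.tail = FALSE)

# Optional adjustment within the PCR replicate ("none" leaves p-values as-is).
p_adjusted <- p.adjust(p_values, method = "none")

# Assumed rule: a sequence is detected if it meets the minimum read count and
# its adjusted p-value is at or below the alpha level.
detected <- read_counts >= minimum_reads.sequence &
  p_adjusted <= binomial_test.alpha_level
print(detected)

# Hypothetical detection results for one sequence across three PCR replicates.
detections <- c(P01 = TRUE, P02 = FALSE, P03 = TRUE)
minimum_PCR_replicates <- 2
present_in_sample <- sum(detections) >= minimum_PCR_replicates
print(present_in_sample)
```

In this sketch, seq_D sits exactly at the minimum proportion (5 of 1000 reads), so it does not significantly exceed the threshold and is not counted as detected.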

DNA sequences in the input FASTA files are assumed to be summarized by frequency of occurrence, with each FASTA header line beginning with "Frequency: " and followed by the sequence's read count. Output FASTA files from truncate_and_merge_pairs have this format and can be used directly with this function. Each input FASTA file is assumed to contain the DNA sequence reads for a single PCR replicate for a single sample.
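
As a rough illustration of the expected input layout, the sketch below writes and parses a small FASTA file. The exact header form (">Frequency: 120") and the one-line-per-sequence layout are assumptions made for this example, and the sequences and counts are made up.

```r
# Write a small hypothetical FASTA file in the assumed header format.
fasta_file <- tempfile(fileext = ".fasta")
writeLines(c(">Frequency: 120", "ACGTACGTAC",
             ">Frequency: 15", "ACGTACGTAA"),
           con = fasta_file)

# Read the file and separate header lines from sequence lines.
# Assumes each sequence occupies a single line after its header.
fasta_lines <- readLines(fasta_file)
is_header <- startsWith(fasta_lines, ">")

# Extract the read count that follows "Frequency: " in each header.
read_counts <- as.numeric(sub("^>Frequency: ", "", fasta_lines[is_header]))

# Pair each sequence with its read count.
sequences <- data.frame(Sequence = fasta_lines[!is_header],
                        Read_count = read_counts)
print(sequences)
```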

For pipeline calibration purposes, a data frame containing unfiltered DNA sequences with their read counts, proportions, and p-values in each PCR replicate is invisibly returned (see return value section). The primary output of this function is the written CSV file of filtered sequences (described below), but the returned data frame of unfiltered sequences can be helpful when calibrating or troubleshooting filtering parameters. For this reason, the data frame is invisibly returned even if the error "Filtering removed all sequences" is received.

For the primary output, this function writes a CSV file of filtered DNA sequences with the following field definitions:

  • Sample: The sample name.

  • Sequence: The DNA sequence.

  • Detections_across_PCR_replicates: The number of PCR replicates the sequence was detected in.

  • Read_count_by_PCR_replicate: The sequence's read count in each PCR replicate the sequence was detected in.

  • Sequence_read_count: The sequence's total read count across the PCR replicates the sequence was detected in. Calculated as the sum of the read counts in the Read_count_by_PCR_replicate field.

  • Sample_read_count: The sample's total read count across all sequences detected in the PCR replicates. Calculated as the sum of the read counts in the Sequence_read_count field associated with the sample.

  • Proportion_of_sample: The proportion of sample reads comprised by the sequence. Calculated by dividing the Sequence_read_count field by the Sample_read_count field. Equivalent to the weighted average of the sequence's proportion in each PCR replicate, with weights given by the proportion of the sample's total reads contained in each PCR replicate.
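
The weighted-average equivalence stated for Proportion_of_sample can be checked with a small worked example. It is a simplified sketch: the read counts are made up, both sequences are assumed to be detected in every PCR replicate, and each replicate's read total is taken as the sum of the retained reads.

```r
# Hypothetical read counts: rows are sequences, columns are PCR replicates.
# For simplicity, both sequences are detected in every replicate.
counts <- matrix(c(120, 95, 143,
                   880, 405, 1357),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("seq_A", "seq_B"), c("P01", "P02", "P03")))

# Direct calculation, following the field definitions.
Sequence_read_count <- rowSums(counts)         # per-sequence totals
Sample_read_count <- sum(Sequence_read_count)  # sample total
Proportion_of_sample <- Sequence_read_count / Sample_read_count

# Weighted average of per-replicate proportions, weighted by each
# replicate's share of the sample's reads.
replicate_totals <- colSums(counts)
replicate_proportions <- sweep(counts, 2, replicate_totals, "/")
weights <- replicate_totals / Sample_read_count
weighted_average <- as.vector(replicate_proportions %*% weights)

print(Proportion_of_sample)
print(weighted_average)
```

Both calculations give the same values because each replicate's total cancels when its proportion is multiplied by its weight.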

Value

Invisibly returns a data frame containing unfiltered DNA sequences with their read counts, proportions, and p-values in each PCR replicate. The primary output of this function is the written CSV file of filtered sequences described in the details section, but the returned data frame of unfiltered sequences can be helpful when calibrating or troubleshooting filtering parameters. For this reason, the data frame is invisibly returned even if the error "Filtering removed all sequences" is received. Field definitions for the invisibly returned data frame of unfiltered sequences are:

  • Sample: The sample name.

  • PCR_replicate: The PCR replicate identifier.

  • Sequence: The DNA sequence.

  • Read_count.sequence: The sequence's read count within the PCR replicate.

  • Read_count.PCR_replicate: The number of reads in the PCR replicate.

  • Proportion_of_PCR_replicate.observed: The proportion of reads in the PCR replicate comprised by the sequence.

  • Proportion_of_PCR_replicate.null (Field only present if binomial_test.enabled = TRUE): The null hypothesis for a one-sided binomial test (inherited from the minimum_proportion.sequence argument). See the p.value field below.

  • p.value (Field only present if binomial_test.enabled = TRUE): The p-value from a one-sided binomial test of whether the proportion of reads in the PCR replicate comprised by the sequence exceeds the null hypothesis (i.e., binomial_test with alternative = "greater").

  • p.value.adjusted (Field only present if binomial_test.enabled = TRUE): The p-value from the one-sided binomial test adjusted for multiple comparisons within each PCR replicate for each sample. See the p.value_adjustment_method field below.

  • p.value_adjustment_method (Field only present if binomial_test.enabled = TRUE): The p-value adjustment method (inherited from the binomial_test.p.adjust.method argument).
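
One way the returned data frame might be used for calibration is sketched below. Because the value is returned invisibly, it has to be assigned to a name to be kept (e.g., unfiltered <- filter_sequences(...)). Here a small made-up data frame with a subset of the documented field names stands in for the real return value, and the "less than or equal to" comparison against the alpha level is an assumption.

```r
# Hypothetical stand-in for the invisibly returned data frame, using the
# documented field names (only the fields needed here are included).
unfiltered <- data.frame(
  Sample = "S01",
  PCR_replicate = rep(c("P01", "P02", "P03"), each = 2),
  Sequence = rep(c("ACGTACGTAC", "ACGTACGTAA"), times = 3),
  Read_count.sequence = c(900, 15, 850, 4, 910, 12),
  p.value.adjusted = c(0, 0.0002, 0, 0.7, 0, 0.04)
)

# Count the PCR replicates in which each sequence would be detected at a
# candidate alpha level (assumed comparison: adjusted p-value <= alpha).
count_detections <- function(data, alpha) {
  detected <- data[data$p.value.adjusted <= alpha, ]
  table(factor(detected$Sequence, levels = unique(data$Sequence)))
}

print(count_detections(unfiltered, alpha = 0.05))
print(count_detections(unfiltered, alpha = 0.01))
```

With minimum_PCR_replicates = 2, the second sequence in this made-up data would be retained at an alpha level of 0.05 but dropped at 0.01.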

References

A manuscript describing these methods is in preparation.

See Also

binomial_test for performing vectorized one-sided binomial tests.

truncate_and_merge_pairs for truncating and merging read pairs prior to sequence filtering.

local_taxa_tool for performing geographically-conscious taxonomic assignment of filtered sequences.

Examples

# Get example FASTA files.
input_files<-system.file("extdata",
                         paste0(rep(x=paste0("S0",1:3),
                                    each=3),
                                "P0",1:3,".fasta"),
                         package="LocaTT",
                         mustWork=TRUE)

# Create path for temporary output file.
output_file<-tempfile(fileext=".csv")

# Specify samples.
samples<-rep(x=paste0("S0",1:3),each=3)

# Specify replicates.
PCR_replicates<-rep(x=paste0("P0",1:3),times=3)

# Filter sequences.
filter_sequences(input_files=input_files,
                 samples=samples,
                 PCR_replicates=PCR_replicates,
                 output_file=output_file)

LocaTT documentation built on June 14, 2026, 1:06 a.m.