remove.cont: Decontaminate metabarcoding data based on optimal regressions

View source: R/decon.R

remove.contR Documentation

Decontaminate metabarcoding data based on optimal regressions

Description

Takes a data frame of metabacroding reads (structured as a column of OTU IDs, followed by at least one column of reads from blanks, followed by columns of reads from samples, optionally followed by a column of taxonomic information). Ii identifies the reads that are from contamination, then removes them. It estimates the number of overlapping OTUs between the sample and the blank, and it chooses the best equation based on that.

Usage

remove.cont(data, numb.blanks = 1, taxa = T, runs = 2,
  regression = 0, low.threshold = 40, up.threshold = 400)

Arguments

data

A data frame of metabarcoding read data consisting of at least 3 columns in this order: a column of unique OTU names/labels, at least one column of read data from a blank sample (this contains your known contaminant reads), at least one column of read data for an actual sample (each column is a sample, each row is an OTU, and each cell is the number of reads). It can optionally include a final column with taxonomy information. If multiple blanks are included (recommended), they must be in consecutive columns, starting with column 2. Individuals must be ordered by group (e.g., species, populations, etc.).

numb.blanks

Numeric (default = 1). Specifies the number of blanks included in the data set (if multiple blanks are included, they must be in consecutive columns, starting with column 2).

taxa

Logical (T/F). Specifies whether or not the last column contains taxonomic information (default = T)

runs

Numeric (default = 2). Specifies the number of times that the function should run the decontamination procedure on the data. Based on simulation results, using two runs is best on average, but using one run is better if there is very little contamination, and using more than two runs is better if there is substantial contamination (see User Guide section 1.4.3).

regression

Numeric (default = 0). Specifies the regression equation used to calculate the constant. 0 = it chooses between regression 1 and regression 2 based on the low.threshold and up.threshold arguments (this is strongly recommended). 1 = it always uses regression 1. 2 = it always uses regression 2. See User Guide section 1.4.2.

low.threshold

Numeric (default = 40). Selects the lower point for switching between regression 1 and regression 2. It uses regression 2 anytime that the estimated overlap is <low.threshold or >up.threshold. It is usually best not to change this value.

up.threshold

Numeric (default = 400). Selects the higher point for switching between regression 1 and regression 2. It uses regression 2 anytime that the estimated overlap is <low.threshold or >up.threshold. It is usually best not to change this value.

Value

A data frame structured the same as the input but with the contaminant reads removed (rows may have been re-ordered). Additionally, all blank columns are condensed into a single mean blank column (the mean of the proportions of blanks times the mean number of reads in the blanks).


donaldtmcknight/microDecon documentation built on Oct. 23, 2023, 10:57 a.m.