filtR
is an R function with the purpose of filtering out error-derived operational taxonomic units (OTUs) or amplicon sequence variants (ASVs) from count tables. filtR
integrates the functionality of two R packages, ALDEx2
[^1] and propr
[^2]. If an OTU/ASV is deemed to be error-derived by filtR, it will be removed from the count table.
[^1]: Fernandes et al (2013) PLOS ONE https://doi.org/10.1371/journal.pone.0067019 [^2]: Quinn et al (2017) Sci Rep. https://dx.doi.org/10.1038%2Fs41598-017-16520-0
The $\rho$ statistic can be used to assess $\beta$-association between a pair of features[^3]. To achieve a perfect $\rho$ value of 1, the log ratio abundances of two features (in our example, bacteria) need to have a constant ratio of 1 across all samples. In other words, in every sample each bacteria will have exactly equal log ratio abundances. In real sequencing data, this perfect one-to-one relationship will never be observed, so a $\rho$ cutoff of 0.7 is used to find features that are strongly associated. As the value of $\rho$ approaches 1, it can be assumed that the two features used to calculate $\rho$ have a constant ratio across the dataset and are therefore associated[^4].
Below is a graphical representation of two features that are associated:
knitr::include_graphics("good_example.png")
The premise of ASVs and OTUs is generally that each individual ASV and OTU, for the most part, should lack association with other ASVs/OTUs because they are independent entities. However, simply a strong association between two features--measured by $\rho$--is not sufficient to identify an ASV/OTU that was generated by errors introduced in the sequencing process. As shown in a study of the microbiota of Chinese individuals[^5], the error-derived OTUs tend to have a difference in abundance of greater than 32-fold when compared to their parent OTU. As shown in Figure 1, the difference in centered log ratio was roughly 5, which works out to a difference in abundance of 32-fold.
In essence, the rarer OTUs that were strongly associated with the prevalent OTU were likely sequences derived from errors generated during sequencing or PCR. Since these OTUs do not represent real biological diversity, labelling them as their own entity is a false positive when assessing overall diversity of a microbiome. filtR
solves this issue by discarding error-derived OTUs.
Install the latest version of filtR
from GitHub with devtools::install_github('bjoris33/filtR')
.
Load in the R package.
library(filtR)
With these two steps completed, we are now ready to use the filtR()
function of the filtR
package.
The format of the data that filtR
requires is a count table with OTUs/ASVs as its rows and samples as the columns.
knitr::include_graphics("count_table.png")
The count table can contain a column for taxonomic assignment, but it is not a requirement. For the purposes of the this vignette, a count table with taxonomic assignments from a study of the relationship between the oral microbiome and caries from the European Bioinformatics Institute (EBI) will be used as an example.
Let's begin by loading in the count table from EBI:
unfilt <- read.table('SRP078943_taxonomy_abundances_SSU_v4.1.tsv') dim(unfilt)
The initial count table (shown in Figure 2) has 319 rows (which represent the OTUs) and 38 columns, which are the 37 samples and one column for the taxonomic assignments for the OTUs.
With a count table selected, we can now use the filtR function:
filtered <- filtR('SRP078943_taxonomy_abundances_SSU_v4.1.tsv') dim(filtered)
As we can see from the output, filtR filtered 4 OTUs from the count table. The original dimensions of the count table were 319 rows and 38 columns. Now that the 4 OTUs are filtered, the dimensions are 315 rows and 38 columns. The count table is in the same format it originally was in, just now with the error-derived OTUs removed.
filtR
is a simple function with 3 modifiable parameters:
count_file
: the file name of the count table file that is being filtered.
rho_CO
: Defaults to 0.7. The value of $\rho$ (range from 0 to 1) that is being used as a cutoff for analysis. Pairs of features with a greater value of $\rho$ will be indexed. If filtR is failing to index any pairs of OTUs/ASVs, this parameter can be lowered. However, this increases the chance that the filter will eliminate 'real' OTUs/ASVs and it is recommended to use the default cutoff or higher.
clr_CO
: Defaults to 5. This filtering parameter sets the minimum difference in centered log ratio between the pairs of OTUs indexed by the $\rho$ cutoff. To increase the number of OTUs/ASVs that are filtered by filtR, this parameter can be lowered. However, this increases the chance that the filter will eliminate 'real' OTUs/ASVs and it is recommended to use the default cutoff or higher.
plot
: Set to TRUE
to create scatter plots of the CLRs of the filtered OTU plotted against the CLRs of the OTU from which they are derived. The black, solid line is the ideal slope of the line for association (slope of 1) and the red dashed line is the slope of the linear model of the data. Defaults to FALSE
.
Let's explore how modifying these filtering parameters individually can change the results of filtering.
If we lower the $\rho$ cutoff to 0.6:
filtered <- filtR('SRP078943_taxonomy_abundances_SSU_v4.1.tsv', rho_CO = 0.6)
Now we have 7 OTUs being filtered from the count table. Because we are lowering the cutoff value below the recommended level, it would be wise to use the plotting functionality of filtR
to check the shape of the data before committing to a looser filter.
If we lower the centered log ratio difference cutoff to 4 from 5 (and keep the $\rho$ cutoff at the default):
filtered <- filtR('SRP078943_taxonomy_abundances_SSU_v4.1.tsv', clr_CO = 4)
With the CLR difference cutoff lowered, 6 OTUs are filtered from the count table. Because we are lowering the cutoff value below the recommended level, it would be wise to use the plotting functionality of filtR
to check the shape of the data before committing to a looser filter.
filtR
To use the plotting functionailty of filtR
set plot
equal to TRUE
(defaults to FALSE
).
filtered <- filtR('SRP078943_taxonomy_abundances_SSU_v4.1.tsv', plot=TRUE)
Each point in this graph is representative of the centered log ratios for the two OTUs for a single sample. The dashed, red line is the linear model of the data and the solid, black line has the same y-intercept as the red line, but has the optimal slope of one. The data points should fit relatively well along a line with slope one for filtR
to filter it. It is recommended to plot your data to ensure filtR
is not erroneously filtering OTUs.
There are specific examples as to where filtR
may be tricked into improperly filtering data. If there are data that have two distinct microbial communities being sequenced (for this example oral and nasal), the data may pass the parameters, but should not be filtered.
knitr::include_graphics("group_separation.png")
However, inspection of these combined nasal and oral sequencing data that are being filtered can reveal OTUs, that probably should be removed (Figure 4). To ensure that erroneous filtering is occurring as infrequently as possible, it is important to separate your samples before applying filtR
if there is an extreme divide in the composition of the communities.
knitr::include_graphics("two_groups_good.png")
Also, there are generally instances where filtR
is choosing to filter and OTU which passes the CLR and $\rho$ cutoffs, but the data do not fit along a line of slope 1 (Figure 5). This highlights the importance of using the plot
functionality of filtR
to ensure that the correct data are being filtered.
knitr::include_graphics("bad_example.png")
When calculating $\rho$, the dispersion, the slope, and the y-intercept of the data are considered. As the dispersion of the data increases, the $\rho$ value decreases. Similarly, as the difference in log-ratio abundance increases between the two features increases, the value of $\rho$ decreases. This can be problematic for filtR
because the features that filtR
looks to remove from count tables are identified using relatively high differences in log-ratio abundance. The relationship between centered log ratio differences and $\rho$ values creates a sliding scale of optimal filtering. As shown below, a pair with a lower $\rho$ value may better fit the model of data we wish to filter than pairs with a higher $\rho$ value because it has a higher difference in log ratio abundance (Figure 6, Figure 7).
knitr::include_graphics("highrho_lowclr.png")
knitr::include_graphics("lowrho_highclr.png")
If the $\rho$ cutoff fails to index any pairs, the following error will be generated:
filtered <- filtR('SRP078943_taxonomy_abundances_SSU_v4.1.tsv', rho_CO = 0.99)
If the CLR difference cutoff is larger than the differences between the indexed pairs, the following error will be generated:
filtered <- filtR('SRP078943_taxonomy_abundances_SSU_v4.1.tsv', clr_CO = 50)
[^3]: Erb et al (2016) Theory Biosci. https://doi.org/10.1007/s12064-015-0220-8 [^4]: Lovell et al (2015) Plos Comput Biol. https://doi.org/10.1371/journal.pcbi.1004075 [^5]: Bian et al (2017) mSphere https://doi.org/10.1128/mSphere.00327-17
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.