Prior to the development of these HTS-SIP analysis methods, DNA- and RNA-SIP experiments that utilized Sanger or high throughput sequencing were usually analyzed with standard statistical processes (e.g. t-tests), in order to identify incorporators. Previous work suggests that these methods generally have low senstivity and/or high false positive rates when applied to sequence data. Here, these analysis methods will be referred to "heavy-SIP" methods. While the work of Youngblut et al., (https://doi.org/10.1101/138719) suggests that HR-SIP analysis methods (eg., MW-HR-SIP) should be used for processing HTS-SIP datasets, the HTSSIP R package provides heavy-SIP methods so researchers have the option of using these methods and making their own comparisons to HR-SIP methods.
heavy-SIP is performed with the heavy_SIP()
function, which consists of multiple possible tests. See ?heavy_SIP
for more details. This vignette demonstrates the use of heavy_SIP()
.
First, let's load some packages including HTSSIP
.
library(dplyr) library(ggplot2) library(HTSSIP)
# adjusted P-value cutoff padj_cutoff = 0.1
For unreplicated datasets (no experiment replicates of controls or treatments), the options are limited on how to identify incorporators.
We will be using a dataset that is already parsed. See HTSSIP introduction vignette for a description on why dataset parsing (all treatment-control comparisons) is needed.
physeq_S2D2_l
First, we'll just focus on 1 treatment-control comparison. Let's get the individual phyloseq object.
physeq = physeq_S2D2_l[[1]]
physeq
Let's check that the samples belong to either a 13C-treatment or 12C-control.
physeq %>% sample_data %>% .$Substrate %>% table
Since this dataset is an unreplicated comparison between treatment & control, we are just going to use the 'binary' method, which will call incorporators if they are present in the "heavy" gradient fractions of the treatment and not present in the "heavy" fractions of the control. Note that the "heavy" fractions are user-defined.
df_res = heavy_SIP(physeq, ex="Substrate=='12C-Con'", comparison='H-v-H', hypo_test='binary') df_res %>% head(n=3)
Since no real statistical test, the "statistic" is just 0 (not an incorporator) or 1 (an incorporator). Also, the "p" and "padj" columns are thus "NA".
How many "incorporators"?
df_res$statistic %>% table
Experimental replicates allows us to use tradional hypothesis testing (e.g., t-tests) for determining significantly differ OTU abundances between treatment and controls. Note that there is a reason why more suffisticated statistical methods have been developed for assessing differentially abundant features in high throughput sequencing datasets (e.g., DESeq2, EdgeR, or MetagenomeSeq). The traditional methods don't account for many challenging aspects of identifying statistically different abundances in sequence data such as i) a high number of multiple hypotheses ii) zero-inflation iii) compositional data (relative abundances; the sum-to-one constraint).
With that said, let's try out these heavy-SIP methods on a replicated dataset, with 3 experimental replicates of the control and treatment (total gradients = 6)
physeq_rep3
physeq_rep3 %>% sample_data %>% head(n=3)
To compare "heavy" fractions in the treatment versus "heavy" fractions in the control, we will use the "H-v-H" comparison method. See ?heavy_SIP
for details on other possible comparisons.
df_res = heavy_SIP(physeq_rep3, ex="Treatment=='12C-Con'", comparison='H-v-H', hypo_test='t-test') df_res %>% head(n=3)
"padj" is p-values adjusted with the Benjamini Hochberg method.
How many incorporators?
df_res %>% filter(padj < padj_cutoff) %>% nrow
No incorporators. Obviously, the sensitivity of this method is pretty low. What's the distribution of p-values?
df_res$p %>% summary
Does anything change when we use a nonparametric test? Here, we will use the Mann Whitney U test (a nonparametric t-test).
df_res = heavy_SIP(physeq_rep3, ex="Treatment=='12C-Con'", comparison='H-v-H', hypo_test='wilcox') df_res %>% head(n=3)
What's the p-value and adjusted-pvalue distribution?
df_res$p %>% summary %>% print df_res$padj %>% summary %>% print
Again, no incorporators. The change in abundances must be pretty dramatic for heavy-SIP methods to ID incorporators, especially when there's many multiple hypotheses.
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.