In nyoungb2/HTSSIP: High Throughput Sequencing of Stable Isotope Probing Data Analysis

Introduction

heavy-SIP method workflow:

Prior to the development of these HTS-SIP analysis methods, DNA- and RNA-SIP experiments that utilized Sanger or high throughput sequencing were usually analyzed with standard statistical processes (e.g. t-tests), in order to identify incorporators. Previous work suggests that these methods generally have low senstivity and/or high false positive rates when applied to sequence data. Here, these analysis methods will be referred to "heavy-SIP" methods. While the work of Youngblut et al., (https://doi.org/10.1101/138719) suggests that HR-SIP analysis methods (eg., MW-HR-SIP) should be used for processing HTS-SIP datasets, the HTSSIP R package provides heavy-SIP methods so researchers have the option of using these methods and making their own comparisons to HR-SIP methods.

heavy-SIP is performed with the heavy_SIP() function, which consists of multiple possible tests. See ?heavy_SIP for more details. This vignette demonstrates the use of heavy_SIP().

Initialization

First, let's load some packages including HTSSIP.

library(dplyr)
library(ggplot2)
library(HTSSIP)

# adjusted P-value cutoff 
padj_cutoff = 0.1

Unreplicated dataset

For unreplicated datasets (no experiment replicates of controls or treatments), the options are limited on how to identify incorporators.

Parsing the dataset

We will be using a dataset that is already parsed. See HTSSIP introduction vignette for a description on why dataset parsing (all treatment-control comparisons) is needed.

physeq_S2D2_l

One treatment-control comparison

First, we'll just focus on 1 treatment-control comparison. Let's get the individual phyloseq object.

physeq = physeq_S2D2_l[[1]]
physeq

Let's check that the samples belong to either a 13C-treatment or 12C-control.

physeq %>% sample_data %>% .$Substrate %>% table

Since this dataset is an unreplicated comparison between treatment & control, we are just going to use the 'binary' method, which will call incorporators if they are present in the "heavy" gradient fractions of the treatment and not present in the "heavy" fractions of the control. Note that the "heavy" fractions are user-defined.

df_res = heavy_SIP(physeq, ex="Substrate=='12C-Con'", 
                   comparison='H-v-H', hypo_test='binary')
df_res %>% head(n=3)

Since no real statistical test, the "statistic" is just 0 (not an incorporator) or 1 (an incorporator). Also, the "p" and "padj" columns are thus "NA".

How many "incorporators"?

df_res$statistic %>% table

Replicated dataset

Experimental replicates allows us to use tradional hypothesis testing (e.g., t-tests) for determining significantly differ OTU abundances between treatment and controls. Note that there is a reason why more suffisticated statistical methods have been developed for assessing differentially abundant features in high throughput sequencing datasets (e.g., DESeq2, EdgeR, or MetagenomeSeq). The traditional methods don't account for many challenging aspects of identifying statistically different abundances in sequence data such as i) a high number of multiple hypotheses ii) zero-inflation iii) compositional data (relative abundances; the sum-to-one constraint).

With that said, let's try out these heavy-SIP methods on a replicated dataset, with 3 experimental replicates of the control and treatment (total gradients = 6)

physeq_rep3

physeq_rep3 %>% sample_data %>% head(n=3)

t-tests

To compare "heavy" fractions in the treatment versus "heavy" fractions in the control, we will use the "H-v-H" comparison method. See ?heavy_SIP for details on other possible comparisons.

df_res = heavy_SIP(physeq_rep3, ex="Treatment=='12C-Con'", 
                   comparison='H-v-H', hypo_test='t-test')
df_res %>% head(n=3)

"padj" is p-values adjusted with the Benjamini Hochberg method.

How many incorporators?

df_res %>%
  filter(padj < padj_cutoff) %>%
  nrow

No incorporators. Obviously, the sensitivity of this method is pretty low. What's the distribution of p-values?

df_res$p %>% summary

Mann Whitney U test

Does anything change when we use a nonparametric test? Here, we will use the Mann Whitney U test (a nonparametric t-test).

df_res = heavy_SIP(physeq_rep3, ex="Treatment=='12C-Con'", 
                   comparison='H-v-H', hypo_test='wilcox')
df_res %>% head(n=3)

What's the p-value and adjusted-pvalue distribution?

df_res$p %>% summary %>% print
df_res$padj %>% summary %>% print

Again, no incorporators. The change in abundances must be pretty dramatic for heavy-SIP methods to ID incorporators, especially when there's many multiple hypotheses.