In nyoungb2/HTSSIP: High Throughput Sequencing of Stable Isotope Probing Data Analysis

Introduction

HR-SIP method workflow:

Remove sparsely occuring OTUs under the assumption that they are sequencing errors.
- This is assuming that DNA from and OTU is likely going to be found in multiple gradient fractions due to differences in G+C content, diffusion, etc.
- So, if an OTU is just detected in only one fraction of a gradient, we can't tell whether it is a sequencing artifact or a real, but rare OTU.
For each OTU, use DESeq2 to detect significant enrichment in 'heavy' gradient fractions.
- 'heavy' gradients are selected ad hoc
- Multiple hypotheses are adjusted for.
- Significantly enriched OTUs are considered to have incorporated isotope (incorporators)

MW-HR-SIP method workflow:

The method is similar to HR-SIP, except instead of 1 'heavy' BD window used, multiple windows are used, but the tradeoff is the higher number of multiple hypotheses that must be corrected for.
- For each OTU, the BD window with highest log2 fold change (l2fc) value is used, so the BD window can vary among OTUs.
This method has been shown to be generally more sensitive (more true positives) than HR-SIP and more specific (less false positives) than q-SIP. See the manuscript "Multiple window High Resolution Stable Isotope Probing (MW-HR-SIP)" for more information.

Dataset

First, let's load some packages including HTSSIP.

library(dplyr)
library(ggplot2)
library(HTSSIP)

See HTSSIP introduction vignette for a description on why dataset parsing (all treatment-control comparisons) is needed.

Let's see the already parsed dataset

physeq_S2D2_l

HR-SIP

Let's set some parameters used later.

# adjusted P-value cutoff 
padj_cutoff = 0.1
# number of cores for parallel processing (increase depending on your computational hardware)
ncores = 2

One treatment-control comparison

First, we'll just run HR-SIP on 1 treatment-control comparison. Let's get the individual phyloseq object.

physeq = physeq_S2D2_l[[1]]
physeq

Let's check that the samples belong to either a 13C-treatment or 12C-control.

physeq %>% sample_data %>% .$Substrate %>% table

OK, we should be ready to run HR-SIP!

Note that the design parameter for HRSIP() is the experimental design parameter for calculating log2 fold change (l2fc) values with DESeq. Here, it's used to distinguish label-treatment and unlabel-control samples.

df_l2fc = HRSIP(physeq,
                design = ~Substrate,
                padj_cutoff = padj_cutoff,
                sparsity_threshold = c(0,0.15,0.3))   # just using 3 thresholds to reduce time
df_l2fc %>% head(n=3)

How many "incorporators"" (rejected hypotheses)?

df_l2fc %>% 
  filter(padj < padj_cutoff) %>%
  group_by() %>%
  summarize(n_incorp_OTUs = OTU %>% unique %>% length)

Let's plot a breakdown of incorporators for each phylum.

# summarizing
df_l2fc_s = df_l2fc %>% 
  filter(padj < padj_cutoff) %>%
  mutate(Rank2 = gsub('^__', '', Rank2)) %>%
  group_by(Rank2) %>%
  summarize(n_incorp_OTUs = OTU %>% unique %>% length) %>%
  ungroup()

# plotting
ggplot(df_l2fc_s, aes(Rank2, n_incorp_OTUs)) +
    geom_bar(stat='identity') +
    labs(x='Phylum', y='Number of incorporators') +
    theme_bw() +
    theme(
      axis.text.x = element_text(angle=45, hjust=1)
    )

All treatment-control comparisons

Let's now run HR-SIP on all treatment-control comparisons in the dataset:

# Number of comparisons
physeq_S2D2_l %>% length

The function plyr::ldply() is useful (compared to lapply()) beccause it can be run in parallel and returns a data.frame object.

# Running in parallel; you may need to change the backend for your environment.
# Or you can just set .parallel=FALSE. 
doParallel::registerDoParallel(ncores)

df_l2fc = plyr::ldply(physeq_S2D2_l, 
                      HRSIP, 
                      design = ~Substrate, 
                      padj_cutoff = padj_cutoff,
                      sparsity_threshold = c(0,0.15,0.3),  # just using 3 thresholds to reduce run time
                      .parallel=TRUE)
df_l2fc %>% head(n=3)

Each specific phyloseq subset (treatment-control comparison) is delimited with the ".id" column.

df_l2fc %>% .$.id %>% unique

For clarity, let's edit these long strings to make them more readable when plotted.

df_l2fc = df_l2fc %>%
  mutate(.id = gsub(' \\| ', '\n', .id))
df_l2fc %>% .$.id %>% unique

How many incorporators (rejected hypotheses) & which sparsity cutoff was used for each comparison?

Note: you could set one sparsity cutoff for all comparisons by setting the sparsity_cutoff to a specific value.

df_l2fc %>% 
  filter(padj < padj_cutoff) %>%
  group_by(.id, sparsity_threshold) %>%
  summarize(n_incorp_OTUs = OTU %>% unique %>% length) %>%
  as.data.frame

How about a breakdown of incorporators for each phylum in each comparision.

# summarizing
df_l2fc_s = df_l2fc %>% 
  filter(padj < padj_cutoff) %>%
  mutate(Rank2 = gsub('^__', '', Rank2)) %>%
  group_by(.id, Rank2) %>%
  summarize(n_incorp_OTUs = OTU %>% unique %>% length) %>%
  ungroup()

# plotting
ggplot(df_l2fc_s, aes(Rank2, n_incorp_OTUs)) +
    geom_bar(stat='identity') +
    labs(x='Phylum', y='Number of incorporators') +
    facet_wrap(~ .id, scales='free') +
    theme_bw() +
    theme(
      axis.text.x = element_text(angle=55, hjust=1)
    )

MW-HR-SIP

MW-HR-SIP is run very similarly to HRSIP, but it uses multiple buoyant density (BD) windows. MW-HR-SIP is performed with the HRSIP() function, but multiple BD windows are specified.

Let's use 3 buoyant density windows (g/ml):

1.70-1.73, 1.72-1.75, 1.74-1.77

windows = data.frame(density_min=c(1.70, 1.72, 1.74), 
                     density_max=c(1.73, 1.75, 1.77))
windows

Running HRSIP with all 3 BD windows. Let's run this in parallel to speed things up.

You can turn off parallel processing by setting the parallel option to FALSE. Also, there's 2 different levels that could be parallelized: either the ldply() or HRSIP(). Here, I'm running HRSIP in parallel, but it may make sense in other situations (eg., many comparisons but few density windows and/or sparsity cutoffs) to use ldply in parallel only.

doParallel::registerDoParallel(ncores)

df_l2fc = plyr::ldply(physeq_S2D2_l, 
                      HRSIP, 
                      density_windows = windows,
                      design = ~Substrate, 
                      padj_cutoff = padj_cutoff,
                      sparsity_threshold = c(0,0.15,0.3), # just using 3 thresholds to reduce run time
                      .parallel = TRUE)
df_l2fc %>% head(n=3)

Let's check that we have all treatment-control comparisons.

df_l2fc %>% .$.id %>% unique

How many incorporators (rejected hypotheses) & which sparsity cutoff was used for each comparison?

Note: one sparsity cutoff could be set for all comparisons by setting the sparsity_cutoff to a specific value.

df_l2fc %>% 
  filter(padj < padj_cutoff) %>%
  group_by(.id, sparsity_threshold) %>%
  summarize(n_incorp_OTUs = OTU %>% unique %>% length) %>%
  as.data.frame

The density windows can vary for each OTU. Let's plot which density windows were used for the OTUs in the dataset.

# summarizing
df_l2fc_s = df_l2fc %>% 
  mutate(.id = gsub(' \\| ', '\n', .id)) %>%
  filter(padj < padj_cutoff) %>%
  mutate(density_range = paste(density_min, density_max, sep='-')) %>% 
  group_by(.id, sparsity_threshold, density_range) %>%
  summarize(n_incorp_OTUs = OTU %>% unique %>% length) 

#plotting
ggplot(df_l2fc_s, aes(.id, n_incorp_OTUs, fill=density_range)) +
    geom_bar(stat='identity', position='fill') +
    labs(x='Control-treatment comparision', y='Fraction of incorporators') +
    scale_y_continuous(expand=c(0,0)) +
    theme_bw() +
    theme(
      axis.text.x = element_text(angle=55, hjust=1)
    )

Different BD windows were used for different treatment-control comparisons because the amount of BD shift likely varied among taxa. For example, if a taxon incorporates 100% 13C isotope, then a very 'heavy' BD window may show a larger l2fc than a less 'heavy' BD window.

Let's look at a breakdown of incorporators for each phylum in each comparision.

# summarizing
df_l2fc_s = df_l2fc %>% 
  mutate(.id = gsub(' \\| ', '\n', .id)) %>%
  filter(padj < padj_cutoff) %>%
  mutate(Rank2 = gsub('^__', '', Rank2)) %>%
  group_by(.id, Rank2) %>%
  summarize(n_incorp_OTUs = OTU %>% unique %>% length) %>%
  ungroup()

# plotting
ggplot(df_l2fc_s, aes(Rank2, n_incorp_OTUs)) +
    geom_bar(stat='identity') +
    labs(x='Phylum', y='Number of incorporators') +
    facet_wrap(~ .id, scales='free') +
    theme_bw() +
    theme(
      axis.text.x = element_text(angle=55, hjust=1)
    )

Note that the MW-HR-SIP method identifies more incorporators than the HR-SIP method (which uses just one BD window).

MW-HR-SIP detects more taxa for 2 main reasons. First, taxa vary in G+C content, so using only 1 BD window likely encompasses BD shifts for taxa of certain G+C contents (eg., ~50% G+C), but may miss other taxa with higher or lower G+C content. Second, taxa can vary in how much isotope was incorporated, which will affect where each taxon's DNA is in the density gradient.