HTSSIP introduction


HTSSIP is a package for analyzing high throughput sequence (HTS) data from DNA- or RNA-based stable isotope probing (SIP) experiments.

I recommend starting with this introductory vignette, then checking out the other vignettes:

HTS-SIP introduction

The goal of HTS-SIP methods is to accurately determine which taxa (e.g., bacterial 16S rRNA OTUs) have incorporated isotopically labed substrate into biomass. By correctly identifying these 'incorporators', a number of fundamental questions about microbial ecology can be investigated, such as:

Background

Note: I'm going to just refer to DNA-SIP, but most of this also applies to RNA-SIP.

Stable Isotope Probing

In theory, if a taxon incorporates isotopically labeled substrate, then its DNA with be 'heavier' than if that same taxon had instead incorporated unlabeled substrate. By 'heavier', I mean that the DNA with have a higher buoyant density (BD) in a CsCl gradient. So, if you were to load isotopically labeled DNA into one CsCl gradient and unlabeled DNA (from the same taxon) in another CsCl gradient, then the labeled DNA with appear shifted to a heavier BD than the unlabeled DNA. The DNA should be more or less normally distributed in each gradient, so picture a Gaussian distribution that shifts from a 'light' BD to a 'heavy' BD between the control and treatment gradients. This is the basics of stable isotope probing, which is a powerful method because it can be used to detect which taxa in an entire community consumed particular substrates (e.g., cellulose).

If 'heavy' DNA is shifted from 'light' DNA, why are both a control and treatment gradient needed instead of just carefully separating out the 'heavy' DNA band from the 'light' DNA band in just one gradient? While this technique used be done in older SIP studies, the main problem is the BD of DNA is also a factor of G+C content (& some other factors). DNA with a higher G+C content has a 'heavier' BD. So, labeled, low G+C DNA can easily overlap with unlabeled high G+C DNA.

Let's consider a simple DNA-SIP experiment. A basic SIP experiment could consist of incubating aliquots of soil with either 13C-glucose (treatment) or 12C-glucose (the corresponding control). The DNA is then extracted from each soil sample, and heavy DNA is separated (at least partially) from light DNA using CsCl gradient ultra centrifugation. Now, the challenge is to accurately determine the location of each taxon's DNA in each of the 2 gradients in order to identify which taxa 'shifted' in the labeled treatment gradient due to isotope incorporation. Prior to the rise of high throughput sequencing (HTS) methods, this task was very difficult due to the limited throughput and taxonomic resolution of molecular fingerprinting and Sanger sequencing.

HTS-SIP

With high throughput sequencing, the taxonomic composition can be assessed at many points along a CsCl gradient by fractionating the gradient and sequencing the DNA in each fraction (note: I'm just going to focus on 16S rRNA sequencing, but fungal-ITS or metagenomic sequencing are also effective HTS-SIP methods). A large number of gradient fractions can be sequenced, which can help pinpoint the exact BD density range that each OTU occupies. For a typical HR-SIP experiment, ~20-30 fractions from each gradient are often sequenced.

Now remember, we are trying to use HTS-SIP to identify 'BD shifts', which indicate isotope incorporation. This is done by comparing labeled-treatment gradients to their corresponding unlabeled control gradients. Optimally, we'd be able to measure the absolute abundance of each taxon's OTU in each gradient fraction in order to get an idea where the Gaussian distribution of that OTU's DNA is in the gradient.

The major challenge of analyzing HTS data is that the data is compositional, in that the total number of sequences in a sample does not provide any information on the total numbers of each taxon in the sample. This is why relative abundances of OTUs are shown in most microbiome papers. Therefore, we have to try to determine the absolute abundance Gaussian distributions of each OTU in each gradient based on compositional 16S rRNA sequence data. Various HTS-SIP data analysis methods have be developed to address this challenge (e.g., HR-SIP and q-SIP). This R package was developed to easily implement on your own data, so that you can compare the results (and possible develop new methods). For more information on SIP and HTS-SIP, check out these references:

HTSSIP package data

The phyloseq R package provides some great functions for importing and analyzing microbiome data in R. So instead of reinventing the wheel, HTSSIP utilizes Phyloseq objects as input. See the GitHub page or the package website for more info about the phyloseq.

The basic HTS-SIP dataset in the HTSSIP package consists of three CsCl 3 gradients, each with ~24 samples (gradient fractions). There are 2 treatment gradients, which contained DNA extracted from microcosms recieving both cellulose and glucose, but only one substrate was 13C-labeled for each treatment. For example, the 13C-cellulose (13C-Cel) treatment recieved 13C-cellulose and 12C-glucose. This experimental design had the benefit of requiring only one 12C-control that recieved 12C-cellulose and 12C-glucose.

First, let's load some packages including HTSSIP.

library(dplyr)
library(HTSSIP)
library(phyloseq)

Now, let's check out the HTSSIP phyloseq object that we will be using in the other vignettes.

physeq_S2D2

The "S2D2" part of the name stands for: 2 substrates assessed at 2 time points.

Experiment overview

The dataset is from a larger SIP experiment in which labeled substrates were incubated with 15 g of soil in microcosms. The microcosms were destructively sampled at certain numbers of days from the initial substrate addition. The DNA was extracted and loaded into CsCl gradients. Importantly, each microcosm recieved the same amount of all substrates used in the experiment, but only one of the substrates was 13C labeled for each labeled treatment, while the unlabeled controls recieved all 12C unlabeled versions. This design allowed for only one 12C control microcosm per time point (instead of 1 control per labeled substrate). Approximately 24 gradient fractions from each gradient were sequenced with MiSeq 16S rRNA sequencing.

The dataset subset that we'll be using consists of samples 6 CsCl gradients:

  1. 12C-Con, Day 3
  2. 13C-Glu, Day 3
  3. 13C-Cel, Day 3
  4. 12C-Con, Day 14
  5. 13C-Glu, Day 14
  6. 13C-Cel, Day 14

Important note: The dataset used in this example has been filtered such that all OTUs have over 300 reads. This is not a standard operation but has been done here to reduce the size of the dataset and the processing time.

Notes:

Sample metadata

Here's a look at (some) of the sample metadata. Note that Bouyant_density is one of the columns. This is necessary for many of the analyses in HTSSIP}

physeq_S2D2 %>% sample_data %>% .[1:4,1:10]

To check the formatting of the phyloseq object:

# Our phyloseq object should be formatted correctly (no errors)
physeq_S2D2 = physeq_format(physeq_S2D2)
physeq_S2D2 %>% sample_data %>% .[1:4,1:10]

# This phyloseq object should NOT be formatted correctly 
data(GlobalPatterns)
GlobalPatterns %>% sample_data %>% .[1:4,1:ncol(.)]
tryCatch(
  physeq_format(GlobalPatterns),
  error = function(e) e
  )

Parsing the dataset

For most of the analyses in HTSSIP, the gradient fraction samples of each labeled-treatment gradient is compared to its corresponding control. For example, the dataset that we've been looking at has 4 comparisons:

  1. 12C-Con_Day3 vs 13C-Glu_Day3
  2. 12C-Con_Day3 vs 13C-Cel_Day3
  3. 12C-Con_Day14 vs 13C-Glu_Day14
  4. 12C-Con_Day14 vs 13C-Cel_Day14

Note that the same 12C-control is used for both treatments because both recieved glucose & cellulose, but only 1 substrate was isotopically labeled in each.

So to subset the data with the phyloseq package, we can use the prune_samples() function to subset the dataset:

m = sample_data(physeq_S2D2)

physeq_13C.Glu_D3 = prune_samples((m$Substrate=='12C-Con' & m$Day==3) | (m$Substrate=='13C-Glu' & m$Day==3), physeq_S2D2)
physeq_13C.Glu_D3

physeq_13C.Cel_D14 = prune_samples((m$Substrate=='12C-Con' & m$Day==3) | (m$Substrate=='13C-Cel' & m$Day==14), physeq_S2D2)
physeq_13C.Cel_D14

Writing a bunch of redundant code to make each subset is not a good method to subset the data if you have a bunch of subsets to make, so HTSSIP includes some functions to help.

The expression used in the prune_samples() to subset the data can be generalized as:

ex = "(Substrate=='12C-Con' & Day=='${Day}') | (Substrate=='${Substrate}' & Day == '${Day}')"

Now we need the parameters for each individual prune_samples() call. Then, we can iteratively insert these parameters into the expression in order create all of the subsets. This can be accomplished with:

params = get_treatment_params(physeq_S2D2, c('Substrate', 'Day'), "Substrate != '12C-Con'")
params

Let's use the expression and parameters to subset the phyloseq object into a list of phyloseq object subsets:

physeq_S2D2_l = phyloseq_subset(physeq_S2D2, params, ex)
physeq_S2D2_l

With a list of phyloseq objects (1 list item per treatment-control comparison), we can use lapply() or a similar function to run analyses on each subset of the dataset.

References for methods in HTSSIP

Session info

sessionInfo()


Try the HTSSIP package in your browser

Any scripts or data that you put into this service are public.

HTSSIP documentation built on Sept. 14, 2019, 1:02 a.m.