knitr::opts_chunk$set(echo = TRUE)
Here we download and analyze the Schubert diarrhea dataset (Schubert et al., 2014). The dataset includes 154 healthy nondiarrheal, 93 Clostridium difficile infection (CDI) associated diarrheal, and 89 non-CDI associated diarrheal stool samples. We'll download the processed OTU tables from Zenodo and unzip them in the data/cdi_schubert_results folder.
Diarrhea comes with broad community restructuring, so we expect many truly differentially abundant OTUs in this dataset.
library(dplyr)
library(ggplot2)
library(magrittr)
library(SummarizedBenchmark)

## load helper functions
for (f in list.files("../R", "\\.(r|R)$", full.names = TRUE)) {
    source(f)
}

## project data/results folders
datdir <- "data"
resdir <- "results"
sbdir <- "../../results/microbiome"
dir.create(datdir, showWarnings = FALSE)
dir.create(resdir, showWarnings = FALSE)
dir.create(sbdir, showWarnings = FALSE)

cdi_data <- file.path(datdir, "cdi_schubert_results.tar.gz")
cdi_dir <- file.path(datdir, "cdi_schubert_results")
dir.create(cdi_dir, showWarnings = FALSE)

otu_result_file <- file.path(resdir, "schubert-otus-results.rds")
otu_bench_file <- file.path(sbdir, "schubert-otus-benchmark.rds")
otu_bench_file_abun <- file.path(sbdir, "schubert-otus-abun-benchmark.rds")
otu_bench_file_uninf <- file.path(sbdir, "schubert-otus-uninf-benchmark.rds")
if (!file.exists(cdi_data)) {
    download.file("https://zenodo.org/record/840333/files/cdi_schubert_results.tar.gz",
                  destfile = cdi_data)
}
if (!file.exists(file.path(cdi_dir, "cdi_schubert.metadata.txt"))) {
    untar(cdi_data, exdir = datdir)
}
Next, we'll read the unzipped OTU table and metadata files into R.
## load OTU table and metadata
otu <- read.table(file.path(cdi_dir, "RDP",
                            "cdi_schubert.otu_table.100.denovo.rdp_assigned"))
meta <- read.csv(file.path(cdi_dir, "cdi_schubert.metadata.txt"), sep = "\t")

## Keep only samples with the right DiseaseState metadata
meta <- filter(meta, DiseaseState %in% c("H", "nonCDI", "CDI"))

## Keep only samples with both metadata and 16S data
keep_samples <- intersect(colnames(otu), meta$sample_id)
otu <- otu[, keep_samples]
meta <- filter(meta, sample_id %in% keep_samples)
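As a quick sanity check (purely illustrative, not part of the original pipeline), the OTU table columns and the metadata rows should now line up:

## sanity check: every OTU-table sample should have metadata (and vice versa)
dim(otu)
nrow(meta)
stopifnot(setequal(colnames(otu), meta$sample_id))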
Since we'll be using OTU-wise covariates, we shouldn't need to perform any filtering/cleaning of the OTUs beyond removing any that are all zeros (which can happen after removing shallow samples). Still, we apply a minimum threshold of 10 reads per OTU across all samples, and also require each OTU to be detected in more than 1% of samples. After removing these shallow OTUs, we also remove any samples with too few reads. We define the minimum number of reads per OTU in min_otu, the minimum detection fraction in minp_otu, and the minimum number of reads per sample in min_sample. Once shallow OTUs and samples have been removed, we'll convert the OTU table to relative abundances.
min_otu <- 10
minp_otu <- 0.01
min_sample <- 100

## Remove OTUs with <= min_otu reads total or detected in <= minp_otu of
## samples, then samples with <= min_sample reads
otu <- otu[rowSums(otu) > min_otu, ]
otu <- otu[rowSums(otu > 0) / ncol(otu) > minp_otu, ]
otu <- otu[, colSums(otu) > min_sample]

## Update metadata with new samples
meta <- dplyr::filter(meta, sample_id %in% colnames(otu))

## Remove empty OTUs
otu <- otu[rowSums(otu) > 0, ]

## Convert to relative abundance (divide each sample by its total reads)
abun_otu <- t(t(otu) / rowSums(t(otu)))

## Add pseudo counts
zeroabun <- 0
abun_otu <- abun_otu + zeroabun
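The double transpose above divides each sample (column) by its total read count (rowSums(t(otu)) is the same as colSums(otu)). As a side note, an equivalent and perhaps more readable formulation uses sweep(); a minimal sketch, assuming otu is coerced to a matrix:

## Equivalent column-wise normalization via sweep()
m <- as.matrix(otu)
rel_sweep <- sweep(m, 2, colSums(m), "/")
all.equal(t(t(m) / colSums(m)), rel_sweep)  # TRUE: same relative abundances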
Next, we need to calculate the p-value, effect size, and standard error for each OTU. Here, we'll compare diarrhea vs. healthy, where diarrhea includes both CDI and nonCDI patients. We'll put these results into a data frame and label the columns with the standardized names for downstream use (pval, SE, effect_size, test_statistic). The test statistic is the one returned by wilcox.test(). Note that the effect here is calculated as the log-fold change of mean abundance in controls relative to cases (i.e., log(mean_abun[controls]/mean_abun[cases])).
While we're at it, we'll also calculate the mean abundance and ubiquity (detection rate) of each OTU. Later, we can assign their values to a new column called ind_covariate for use in downstream steps.
if (!file.exists(otu_result_file)) {
    res <- test_microbiome(abundance = abun_otu, shift = zeroabun,
                           is_case = meta$DiseaseState %in% c("CDI", "nonCDI"))
    saveRDS(res, file = otu_result_file)
} else {
    res <- readRDS(otu_result_file)
}
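test_microbiome() is one of the helper functions sourced from ../R, so its implementation isn't shown here. As a rough sketch of the per-OTU computation described above (the column names follow the conventions we listed, but the shift handling is an assumption, the SE calculation is omitted, and the real helper may differ):

## Illustrative sketch only -- the real test_microbiome() lives in ../R.
## For each OTU, run a Wilcoxon rank-sum test of cases vs. controls and
## collect the standardized summary columns (SE omitted for brevity).
test_microbiome_sketch <- function(abundance, shift = 0, is_case) {
    rows <- lapply(rownames(abundance), function(o) {
        x <- abundance[o, is_case]   # case abundances
        y <- abundance[o, !is_case]  # control abundances
        wt <- wilcox.test(x, y)
        data.frame(
            otu = o,
            pval = wt$p.value,
            test_statistic = unname(wt$statistic),
            ## log-fold change of mean abundance, controls relative to cases;
            ## subtracting `shift` (assumed) undoes any pseudo count added
            effect_size = log((mean(y) - shift) / (mean(x) - shift)),
            ## covariates used later: detection rate, mean abundance if present
            ubiquity = mean(abundance[o, ] > shift),
            mean_abun_present = mean(abundance[o, abundance[o, ] > shift]),
            stringsAsFactors = FALSE
        )
    })
    do.call(rbind, rows)
}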
Add a random (uninformative) covariate.
set.seed(9226)
res$rand_covar <- rnorm(nrow(res))
Finally, let's add phylogeny as covariates, with a separate column for each taxonomic level.
res <- tidyr::separate(res, otu,
                       c("kingdom", "phylum", "class", "order", "family",
                         "genus", "species", "denovo"),
                       sep = ";", remove = FALSE)
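To illustrate the split (with a hypothetical RDP-style label; the exact format of the OTU IDs in this dataset may differ):

## Hypothetical example of an RDP-style OTU ID being split by level
ex <- data.frame(otu = paste("k__Bacteria", "p__Firmicutes", "c__Clostridia",
                             "o__Clostridiales", "f__Lachnospiraceae",
                             "g__Blautia", "s__", "d__denovo123", sep = ";"),
                 stringsAsFactors = FALSE)
tidyr::separate(ex, otu,
                c("kingdom", "phylum", "class", "order", "family",
                  "genus", "species", "denovo"),
                sep = ";", remove = FALSE)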
Here we check whether the covariates do indeed look informative.
strat_hist(res, pvalue="pval", covariate="ubiquity", maxy=20, binwidth=0.05)
rank_scatter(res, pvalue="pval", covariate="ubiquity")
strat_hist(res, pvalue="pval", covariate="mean_abun_present", maxy=17, binwidth=0.05)
rank_scatter(res, pvalue="pval", covariate="mean_abun_present")
Let's look at phylum-level stratification first. A priori, we might expect Proteobacteria to be enriched for low p-values, though that expectation isn't well grounded, and phylogeny may turn out not to be informative at all.
ggplot(res, aes(x = pval)) +
    geom_histogram() +
    facet_wrap(~ phylum, scales = "free")
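Facets with only a handful of OTUs will have noisy histograms, so it helps to check how many OTUs each phylum contributes:

## Number of OTUs per phylum, largest first
dplyr::count(res, phylum, sort = TRUE)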
strat_hist(res, pvalue="pval", covariate="rand_covar", maxy=17, binwidth=0.05)
rank_scatter(res, pvalue="pval", covariate="rand_covar")
Let's use ubiquity as our ind_covariate.
res <- dplyr::mutate(res, ind_covariate = ubiquity)
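A quick check on the chosen covariate (ubiquity is a detection rate, so it should lie in (0, 1]):

summary(res$ind_covariate)  # detection rate; expected to lie in (0, 1]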
First, we'll create an object of BenchDesign class to hold the data, and add the benchmark methods to the BenchDesign object. We remove ASH from the comparison. Then, we'll construct the SummarizedBenchmark object, which will run the functions specified in each method (these are actually sourced in from the helper scripts).
if (!file.exists(otu_bench_file)) {
    bd <- initializeBenchDesign()
    bd <- dropBMethod(bd, "ashq")
    sb <- buildBench(bd, data = res, ftCols = "ind_covariate")
    metadata(sb)$data_download_link <-
        "https://zenodo.org/record/840333/files/cdi_schubert_results.tar.gz"
    saveRDS(sb, file = otu_bench_file)
} else {
    sb <- readRDS(otu_bench_file)
}
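initializeBenchDesign() is defined in the helper scripts and registers the full set of FDR-control methods; it isn't shown here. A minimal sketch in the same spirit, assuming the addBMethod() interface of this version of SummarizedBenchmark and covering only a few simple methods:

## Illustrative sketch only -- the real initializeBenchDesign() adds many
## more methods (IHW, BL, lfdr, Scott's methods, ...).
initializeBenchDesign_sketch <- function() {
    bd <- BenchDesign()
    ## `p = pval` refers to the pval column of the data passed to buildBench()
    bd <- addBMethod(bd, "bonf", p.adjust, p = pval, method = "bonferroni")
    bd <- addBMethod(bd, "bh", p.adjust, p = pval, method = "BH")
    bd <- addBMethod(bd, "qvalue", qvalue::qvalue, p = pval,
                     post = function(x) x$qvalues)
    bd
}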
Next, we'll add the default performance metrics for q-value assays. First, we have to rename the assay to 'qvalue'.
assayNames(sb) <- "qvalue"
sb <- addDefaultMetrics(sb)
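With the assay renamed, per-method rejection counts at a given alpha can also be pulled out directly (the 'qvalue' assay is a features-by-methods matrix):

## number of rejections per method at alpha = 0.10
colSums(assays(sb)[["qvalue"]] <= 0.10, na.rm = TRUE)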
Now, we'll plot the results.
rejections_scatter(sb, as_fraction = FALSE, supplementary = FALSE)
rejection_scatter_bins(sb, covariate = "ind_covariate", supplementary = FALSE)
plotFDRMethodsOverlap(sb, alpha = 0.1, supplementary = FALSE,
                      order.by = "freq", nsets = 100)
covariateLinePlot(sb, alpha = 0.05, covname = "ind_covariate")
Hm, the code now runs. However, there are clearly still some issues:
- ashs rejects all hypotheses (all q-values are essentially 0).
- lfdr and scott-empirical are all NaN (likely related to the df error).
methods <- c("lfdr", "ihw-a10", "bl-df03", "qvalue", "bh", "bonf")
plotCovariateBoxplots(sb, alpha = 0.1, nsets = 6, methods = methods)
## inspect the problem methods: max ashs q-value, and NaN counts
assays(sb)[["qvalue"]][, "ashs"] %>% max()
sum(is.na(assays(sb)[["qvalue"]][, "lfdr"]))
sum(is.na(assays(sb)[["qvalue"]][, "scott-empirical"]))
sum(is.na(assays(sb)[["qvalue"]][, "scott-theoretical"]))
Plotting methods are giving errors for some reason. Let's try to use Alejandro's code instead.
Let's use mean_abun_present as our ind_covariate.
res <- dplyr::mutate(res, ind_covariate = mean_abun_present)
First, we'll create an object of BenchDesign class to hold the data, and add the benchmark methods to the BenchDesign object. We remove ASH from the comparison. Then, we'll construct the SummarizedBenchmark object, which will run the functions specified in each method (these are actually sourced in from the helper scripts).
if (!file.exists(otu_bench_file_abun)) {
    bd <- initializeBenchDesign()
    bd <- dropBMethod(bd, "ashq")
    sb <- buildBench(bd, data = res, ftCols = "ind_covariate")
    metadata(sb)$data_download_link <-
        "https://zenodo.org/record/840333/files/cdi_schubert_results.tar.gz"
    saveRDS(sb, file = otu_bench_file_abun)
} else {
    sb <- readRDS(otu_bench_file_abun)
}
Next, we'll add the default performance metrics for q-value assays. First, we have to rename the assay to 'qvalue'.
assayNames(sb) <- "qvalue"
sb <- addDefaultMetrics(sb)
Now, we'll plot the results.
rejections_scatter(sb, as_fraction = FALSE, supplementary = FALSE)
rejection_scatter_bins(sb, covariate = "ind_covariate", supplementary = FALSE)
plotFDRMethodsOverlap(sb, alpha = 0.1, supplementary = FALSE,
                      order.by = "freq", nsets = 100)
covariateLinePlot(sb, alpha = 0.05, covname = "ind_covariate")
Let's use rand_covar as our ind_covariate.
res <- dplyr::mutate(res, ind_covariate = rand_covar)
First, we'll create an object of BenchDesign class to hold the data, and add the benchmark methods to the BenchDesign object. We remove ASH from the comparison. Then, we'll construct the SummarizedBenchmark object, which will run the functions specified in each method (these are actually sourced in from the helper scripts).
if (!file.exists(otu_bench_file_uninf)) {
    bd <- initializeBenchDesign()
    bd <- dropBMethod(bd, "ashq")
    sb <- buildBench(bd, data = res, ftCols = "ind_covariate")
    metadata(sb)$data_download_link <-
        "https://zenodo.org/record/840333/files/cdi_schubert_results.tar.gz"
    saveRDS(sb, file = otu_bench_file_uninf)
} else {
    sb <- readRDS(otu_bench_file_uninf)
}
Next, we'll add the default performance metrics for q-value assays. First, we have to rename the assay to 'qvalue'.
assayNames(sb) <- "qvalue"
sb <- addDefaultMetrics(sb)
Now, we'll plot the results.
rejections_scatter(sb, as_fraction = FALSE, supplementary = FALSE)
rejection_scatter_bins(sb, covariate = "ind_covariate", supplementary = FALSE)
plotFDRMethodsOverlap(sb, alpha = 0.1, supplementary = FALSE,
                      order.by = "freq", nsets = 100)
covariateLinePlot(sb, alpha = 0.05, covname = "ind_covariate")
Here we compare the method ranks for the different comparisons at alpha = 0.10.
plotMethodRanks(c(otu_bench_file, otu_bench_file_abun, otu_bench_file_uninf),
                colLabels = c("OTU-ubiquity", "OTU-abun", "OTU-uninf"),
                alpha = 0.10, xlab = "Comparison")
sessionInfo()