knitr::opts_chunk$set(echo = TRUE)
For the application of differential gene expression, we look at scRNAseq and bulk RNAseq data. Here we discuss the analyses of scRNAseq.
library(knitr) df=data.frame(datasets=c("GSE84465","GSE94383"), organism=c("human","mouse"), technology=c("smart-seq","smart-seq2"), comparison=c("Tumor core vs Periphery", "LPS vs No stimulation")) kable(df,caption="scRNAseq Datasets") df1=data.frame(DE_method=c("scDD","MAST"), input=c("count","tpm"), output=c("p-val","p-val")) kable(df1,caption="DE methods")
We selected two datasets, one each from mouse and human.
These datasets come from the Conquer
[@soneson2018] scRNAseq data collection. Conquer
provides uniformly processed gene expression data summarized to both count (aggregated from transcript members), and length-scaled TPMs at the gene level. The authors used salmon
to quantify raw reads and tximport
to summarize to gene level. The data comes from Smartseq, a plate based sequencing method which results in deeper sequencing of reads at the cost of numbers of cells.
Here we are interested in performing differential expression analyses to determine which genes have different expression among the two biological groups in each comparison. We also apply two different analysis methods: MAST [@finak2015] and scDD [@korthauer2016]. In this vignette, we will explore the human dataset analysed with MAST.
The downloaded plate data are stored in R objects called MultiAssayExperiment
; this package requires BioConductor release 3.5 and R version 3.4 or greater for installation.
Note that this Rmd also relies on scDD version 1.3.4 or higher (install from Bioconductor devel, or from the master repo on GitHub kdkorthauer/scDD), since additional functionality to skip computation of the classification of genes into patterns was implemented.
library(MultiAssayExperiment) library(SingleCellExperiment) library(dplyr) library(ggplot2) library(cowplot) library(impute) library(scDD) library(EBSeq) library(BiocParallel) if( !packageVersion("scDD") >= '1.3.4'){ stop("This code relies on scDD version 1.3.4 or higher - see note above") } ## load helper functions for (f in list.files("../R", "\\.(r|R)$", full.names = TRUE)) { source(f) } # data and results directories datdir <- "data" resdir <- "results" sbdir <- "../../results/scRNAseq" dir.create(datdir, showWarnings = FALSE) dir.create(resdir, showWarnings = FALSE) dir.create(sbdir, showWarnings = FALSE) # results files resfile_mean <- file.path(sbdir, "mouse-benchmark-scdd-mean.rds") resfile_det <- file.path(sbdir, "mouse-benchmark-scdd-det.rds") resfile_uninf <- file.path(sbdir, "mouse-benchmark-scdd-uninf.rds") # set up parallel backend cores <- 8 BiocParallel::register(MulticoreParam(workers = cores))
We first download the data in the data directory.conquer
provides the processed datasets in convenient .rds files. We download the Mouse data from GSE94383
datfile <- file.path(datdir,"GSE94383.rds") if (!file.exists(datfile)){ download.file(url="http://imlspenticton.uzh.ch/robinson_lab/conquer/data-mae/GSE94383.rds", destfile=datfile) }
For data processing, we follow practices and recommendations from [@lun2016]. Briefly, we perform quality control filtering on cells to ensure that technical artifacts do not distort downstream analysis results. In addition, we also filter out very low-abundance genes, since genes that contain almost all zero counts do not contain enough information for reliable inference. Specifically, we filter out cells with extremely low mapping rate (below 20%), as these represent poor quality cells that may end up having negative size factor estimates during normalization. For the same reason, we also filter cells which have extremely low proportion of genes detected (less than 5%). We also filter out genes that are expressed in fewer than 5% of cells in both groups.
This dataset (from PMC28396000 [@lane2017]) contains cells from a murine macrophage cell line that were studied for the impact of stimulation of the innate immune transcription factor nuclear factor $\kappa$B (NF-$\kappa$B) on gene expression. Specifically, it is of interest to characterize the dynamics of global gene expression on activating immune response. The NF-$\kappa$B was stimulated using a lipopolysaccharide (LPS)-dependent subunit fused to a fluorescent protein. Here we extract and compare two conditions and compare unstimulated cells versus cells stimulated with LPS after 150 minutes.
datfile <- file.path(datdir, "mouse_GSE94383_scdd.rds") if (!file.exists(datfile)){ # load MultiAssayExperiment mousedata=readRDS(file.path(datdir,"GSE94383.rds")) # exclude cells with extremely low mapping rate or too few expressed features # (poor quality cells) # exclude cells with extremely low mapping or feature rate (poor quality cells) lowMap <- which(metadata(mousedata)$salmon_summary$percent_mapped < 20 | colMeans(assay(experiments(mousedata)[[1]], "count") > 0) < 0.05) if (length(lowMap) > 0){ mousedata <- mousedata[, -lowMap] message("Removed ", length(lowMap), " cells with mapping rate below 20%", " or fewer than 5% of genes detected.") } # subset by groups of interest mousedata <- mousedata[, (colData(mousedata)$characteristics_ch1.2 == "condition: No stimulation") | (colData(mousedata)$characteristics_ch1.2 == "condition: LPS" & colData(mousedata)$characteristics_ch1.1 %in% "time point: 150 min") ] cdata=colData(mousedata) cdata$group <- cdata$characteristics_ch1.2 # subset by gene expression TPM & count assays mousedata <- experiments(mousedata)[[1]] assays(mousedata) <- assays(mousedata)[c("TPM", "count")] colData(mousedata) <- cdata # subset by genes with detection rate > 5% in at least one condition levs <- unique(colData(mousedata)$group) mousedata <- mousedata[rowMeans(assay(mousedata, "count")[, colData(mousedata)$group == levs[1]] > 0) > 0.05 | rowMeans(assay(mousedata, "count")[, colData(mousedata)$group == levs[2]] > 0) > 0.05, ] # remove spike-in controls mousedata <- mousedata[!grepl("ERCC-", rownames(mousedata)),] # save SummarizedExperiment saveRDS(mousedata, datfile) }else{ mousedata <- readRDS(datfile) } # look at number of cells in each group table(as.character(colData(mousedata)$group))
For the mouse dataset2, we have a total of r ncol(mousedata)
cells and
r nrow(mousedata)
genes.
We perform differential expression testing using scDD which implicitly models any heterogeneity in expression.
We apply scDD to normalized log-transformed counts.
The p-value results are added to the rowData
slot of the SummarizedExperiment
data objects to ensure that gene metadata stays intact.
# function to add scDD p-values compute_scdd_sc <- function(dat){ # check if already computed if (! "scdd_p" %in% colnames(rowData(dat))){ sce <- SingleCellExperiment(assays=list(count=assay(dat, "count")), colData=data.frame(colData(dat))) # scran normalization libsize <- scran::computeSumFactors(assay(sce, "count")) normcounts(sce) <- GetNormalizedMat(assay(sce, "count"), libsize) res <- results(scDD(sce, condition = "group", prior_param = list(alpha=0.01, mu0=0, s0=0.01, a0=0.01, b0=0.01), testZeroes = TRUE, categorize = FALSE)) rowData(dat)$scdd_p <- res$combined.pvalue }else{ message("scDD p-values already computed.") } return(dat) } mousedata <- compute_scdd_sc(mousedata) # save results saveRDS(mousedata, datfile)
In scrnaseq data, strength of the signal can be a affected by the of level of gene expression.
We will explore two potential covariates related to expression level: mean nonzero expression
and detection rate (defined as the proportion of cells expressing the gene at a nonzero level).
In addition, we'll add a random (uninformative covariate).
These will also be added to the rowData
slot of the SummarizedExperiments
.
add_covariates_scrnaseq <- function(dat){ rowData(dat)$meanExp <- apply(assay(dat, "count"), 1, function(x) mean(x[x>0])) rowData(dat)$detection <- rowMeans(assay(dat, "count") > 0) rowData(dat)$rand_covar <- rnorm(nrow(dat)) return(dat) } set.seed(99764) mousedata <- add_covariates_scrnaseq(mousedata) # save results saveRDS(mousedata, datfile)
For each covariate, we'll examine the covariate diagnostic plots.
rank_scatter(data.frame(rowData(mousedata)), pvalue="scdd_p", covariate="meanExp") + ggtitle("scDD, Covariate 1: Mean Nonzero Expression")
strat_hist(data.frame(rowData(mousedata)), pvalue="scdd_p", covariate="meanExp", maxy=20, main = "scDD, Covariate 1: Mean Nonzero Expression")
The mean nonzero expression appears to be informative and approximately satisfies the assumptions
Next we look at the detection rate covariate.
rank_scatter(data.frame(rowData(mousedata)), pvalue="scdd_p", covariate="detection") + ggtitle("scDD, Covariate 2: Detection Rate")
strat_hist(data.frame(rowData(mousedata)), pvalue="scdd_p", covariate="detection", maxy=20, main = "scDD, Covariate 2: Detection Rate")
Next we look at the random rate covariate.
rank_scatter(data.frame(rowData(mousedata)), pvalue="scdd_p", covariate="rand_covar") + ggtitle("scDD, Covariate 3: Random")
strat_hist(data.frame(rowData(mousedata)), pvalue="scdd_p", covariate="rand_covar", maxy=20, main = "scDD, Covariate 3: Random")
First, we'll create an object of BenchDesign
class to hold the data and
add the benchmark methods to the BenchDesign
object.
bd <- initializeBenchDesign()
Now, we're ready to construct the SummarizedBenchmark
object, which will run
the functions specified in each method (these are actually sourced in from the
helper scripts).
First we'll include the mean nonzero expression covariate.
if (!file.exists(resfile_mean)){ sb1 <- bd %>% buildBench(data.frame(rowData(mousedata)) %>% filter(!is.na(scdd_p)) %>% mutate(pval=scdd_p, ind_covariate=meanExp), ftCols = c("meanExp"), parallel=TRUE) saveRDS(sb1, file = resfile_mean) }else{ sb1 <- readRDS(resfile_mean) }
Now, we'll repeat the multiple testing correction using the detection rate covariate:
if (!file.exists(resfile_det)){ sb2 <- bd %>% buildBench(data.frame(rowData(mousedata)) %>% filter(!is.na(scdd_p)) %>% mutate(pval=scdd_p, ind_covariate=detection), ftCols = c("detection"), parallel=TRUE) saveRDS(sb2, file = resfile_det) }else{ sb2 <- readRDS(resfile_det) }
Now, we'll repeat the multiple testing correction using the random covariate:
if (!file.exists(resfile_uninf)){ sb3 <- bd %>% buildBench(data.frame(rowData(mousedata)) %>% filter(!is.na(scdd_p)) %>% mutate(pval=scdd_p, ind_covariate=rand_covar), ftCols = c("rand_covar"), parallel=TRUE) saveRDS(sb3, file = resfile_uninf) }else{ sb3 <- readRDS(resfile_uninf) }
Next, we'll add the default performance metric for q-value assays and plot the results. We'll start with covariate one.
First, we have to rename the assay to 'qvalue'.
# rename assay to qvalue assayNames(sb1) <- "qvalue" sb1 <- addDefaultMetrics(sb1)
Now, we'll plot the results.
# plot nrejects by method overall and stratified by covariate rejections_scatter(sb1, supplementary=FALSE) + ggtitle("scDD, Covariate 1: Mean Nonzero Expression") rejection_scatter_bins(sb1, covariate="meanExp", bins=4, supplementary=FALSE) + ggtitle("scDD, Covariate 1: Mean Nonzero Expression") # upset plot plotFDRMethodsOverlap(sb1, alpha=0.05, nsets=ncol(sb1), order.by="freq", decreasing=TRUE, supplementary=FALSE)
mcols(sb1)$ind_covariate <- mcols(sb1)$meanExp covariateLinePlot(sb1, alpha=0.05, covname="ind_covariate", nbins=25, trans = "log1p") + ggtitle("scDD, Covariate 1: Mean Nonzero Expression")
Next, we'll look at the performance metrics for the detection rate covariate.
# rename assay to qvalue assayNames(sb2) <- "qvalue" sb2 <- addDefaultMetrics(sb2)
Now, we'll plot the results.
# plot nrejects by method overall and stratified by covariate rejections_scatter(sb2, supplementary=FALSE) + ggtitle("scDD, Covariate 2: Detection Rate") rejection_scatter_bins(sb2, covariate="detection", bins=4, supplementary=FALSE) + ggtitle("scDD, Covariate 2: Detection Rate") # upset plot plotFDRMethodsOverlap(sb2, alpha=0.05, nsets=ncol(sb2), order.by="freq", decreasing=TRUE, supplementary=FALSE)
mcols(sb2)$ind_covariate <- mcols(sb2)$detection covariateLinePlot(sb2, alpha=0.05, covname="ind_covariate", nbins=25) + ggtitle("scDD, Covariate 2: Detection Rate")
Next, we'll look at the performance metrics for the random covariate.
# rename assay to qvalue assayNames(sb3) <- "qvalue" sb3 <- addDefaultMetrics(sb3)
Now, we'll plot the results.
# plot nrejects by method overall and stratified by covariate rejections_scatter(sb3, supplementary=FALSE) + ggtitle("scDD, Covariate 3: Random") rejection_scatter_bins(sb3, covariate="rand_covar", bins=4, supplementary=FALSE) + ggtitle("scDD, Covariate 3: Random") # upset plot plotFDRMethodsOverlap(sb3, alpha=0.05, nsets=ncol(sb3), order.by="freq", decreasing=TRUE, supplementary=FALSE)
mcols(sb3)$ind_covariate <- mcols(sb3)$rand_covar covariateLinePlot(sb3, alpha=0.05, covname="ind_covariate", nbins=25) + ggtitle("scDD, Covariate 3: Random")
Here we compare the method ranks for the two covariates at alpha = 0.10.
plotMethodRanks(c(resfile_mean, resfile_det, resfile_uninf), colLabels = c("Mean nonzero", "Detection rate", "Uninf"), alpha = 0.10, xlab = "Covariate")
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.