knitr::opts_chunk$set(echo = TRUE)

Summary

For the differential gene expression application, we look at both scRNAseq and bulk RNAseq data. Here we discuss the analyses of the scRNAseq data.

library(knitr)
df=data.frame(datasets=c("GSE84465","GSE94383"),
              organism=c("human","mouse"),
              technology=c("smart-seq","smart-seq2"),
              comparison=c("Tumor core vs Periphery",
                           "LPS vs No stimulation"))
kable(df,caption="scRNAseq Datasets")
df1=data.frame(DE_method=c("scDD","MAST"),
               input=c("count","tpm"),
               output=c("p-val","p-val"))
kable(df1,caption="DE methods")

We selected two datasets, one each from mouse and human. These datasets come from the Conquer [@soneson2018] scRNAseq data collection. Conquer provides uniformly processed gene expression data summarized at the gene level both as counts (aggregated from transcript-level estimates) and as length-scaled TPMs. The authors used Salmon to quantify the raw reads and tximport to summarize the transcript-level estimates to the gene level. The data come from Smart-seq, a plate-based sequencing protocol that yields deeper sequencing per cell at the cost of the number of cells.
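
As a rough illustration of this kind of gene-level summarization (not the exact Conquer pipeline; the file paths and tx2gene table below are assumptions), the Salmon-to-gene step with tximport could look like this:

library(tximport)

# Hedged sketch with assumed paths: summarize per-cell Salmon quantifications
# to gene-level estimates.
quant_files <- file.path("salmon", c("cell1", "cell2"), "quant.sf")  # hypothetical quant files
tx2gene <- read.delim("tx2gene.tsv")                                 # transcript-to-gene map (assumed)

# counts aggregated from transcript-level estimates
txi_counts <- tximport(quant_files, type = "salmon", tx2gene = tx2gene)

# length-scaled TPM counts, analogous to the second summary Conquer provides
txi_lstpm <- tximport(quant_files, type = "salmon", tx2gene = tx2gene,
                      countsFromAbundance = "lengthScaledTPM")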

Here we are interested in performing differential expression analyses to determine which genes are differentially expressed between the two biological groups in each comparison. We apply two different analysis methods: MAST [@finak2015] and scDD [@korthauer2016]. In this vignette, we will explore the mouse dataset analysed with scDD.

Workspace setup

The downloaded plate data are stored in R objects of class MultiAssayExperiment; the MultiAssayExperiment package requires Bioconductor release 3.5 and R version 3.4 or greater for installation.

Note that this Rmd also relies on scDD version 1.3.4 or higher (installed from Bioconductor devel, or from the master branch of the kdkorthauer/scDD GitHub repository), since it uses the additional functionality to skip computation of the classification of genes into patterns.
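
A hedged installation sketch (not run) using current tooling is shown below; package sources and exact versions may differ from those originally used.

# Installation sketch (not run); assumes BiocManager and remotes are available.
# install.packages(c("BiocManager", "remotes"))
# BiocManager::install(c("MultiAssayExperiment", "SingleCellExperiment", "scDD"))
# remotes::install_github("kdkorthauer/scDD")   # alternative: master branch on GitHub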

library(MultiAssayExperiment)
library(SingleCellExperiment)
library(SummarizedBenchmark)  # BenchDesign / buildBench / performance metrics
library(dplyr)
library(ggplot2)
library(cowplot)
library(impute)
library(scDD)
library(EBSeq)
library(BiocParallel)

if (packageVersion("scDD") < "1.3.4") {
  stop("This code relies on scDD version 1.3.4 or higher - see note above")
}

## load helper functions
for (f in list.files("../R", "\\.(r|R)$", full.names = TRUE)) {
    source(f)
}

# data and results directories
datdir <- "data"
resdir <- "results"
sbdir <- "../../results/scRNAseq"
dir.create(datdir, showWarnings = FALSE)
dir.create(resdir, showWarnings = FALSE)
dir.create(sbdir, showWarnings = FALSE)

# results files
resfile_mean <- file.path(sbdir, "mouse-benchmark-scdd-mean.rds")
resfile_det <- file.path(sbdir, "mouse-benchmark-scdd-det.rds")
resfile_uninf <- file.path(sbdir, "mouse-benchmark-scdd-uninf.rds")

# set up parallel backend
cores <- 8
BiocParallel::register(MulticoreParam(workers = cores))

Data preparation

Data download

We first download the data into the data directory. Conquer provides the processed datasets as convenient .rds files. We download the mouse data from GSE94383.

datfile <- file.path(datdir,"GSE94383.rds")

if (!file.exists(datfile)){
  download.file(url="http://imlspenticton.uzh.ch/robinson_lab/conquer/data-mae/GSE94383.rds",
    destfile=datfile)
}

Data processing and filtering

For data processing, we follow practices and recommendations from [@lun2016]. Briefly, we perform quality-control filtering on cells to ensure that technical artifacts do not distort downstream analysis results. In addition, we filter out very low-abundance genes, since genes with almost all zero counts do not contain enough information for reliable inference. Specifically, we filter out cells with an extremely low mapping rate (below 20%), as these represent poor-quality cells that may end up with negative size-factor estimates during normalization. For the same reason, we also filter out cells with an extremely low proportion of genes detected (less than 5%). Finally, we filter out genes that are detected in fewer than 5% of cells in both groups.

This dataset (PMID 28396000; [@lane2017]) contains cells from a murine macrophage cell line that were studied for the impact of stimulation of the innate immune transcription factor nuclear factor $\kappa$B (NF-$\kappa$B) on gene expression. Specifically, it is of interest to characterize the dynamics of global gene expression upon activation of the immune response. NF-$\kappa$B was stimulated with lipopolysaccharide (LPS), and a subunit fused to a fluorescent protein was used to follow its activity. Here we extract two conditions and compare unstimulated cells with cells stimulated with LPS for 150 minutes.

datfile <- file.path(datdir, "mouse_GSE94383_scdd.rds")

if (!file.exists(datfile)){

  # load MultiAssayExperiment
  mousedata <- readRDS(file.path(datdir, "GSE94383.rds"))

  # exclude cells with extremely low mapping rate or too few detected features
  # (poor quality cells)
  lowMap <- which(metadata(mousedata)$salmon_summary$percent_mapped < 20 |
                  colMeans(assay(experiments(mousedata)[[1]], "count") > 0) < 0.05)
  if (length(lowMap) > 0){
    mousedata <- mousedata[, -lowMap]
    message("Removed ", length(lowMap), " cells with mapping rate below 20%",
            " or fewer than 5% of genes detected.")
  }

  # subset by groups of interest
  mousedata <- mousedata[, (colData(mousedata)$characteristics_ch1.2 ==
                              "condition: No stimulation") |
                           (colData(mousedata)$characteristics_ch1.2 ==
                              "condition: LPS" &
                            colData(mousedata)$characteristics_ch1.1 %in% 
                              "time point: 150 min") ]
  cdata <- colData(mousedata)
  cdata$group <- cdata$characteristics_ch1.2

  # subset by gene expression TPM & count assays
  mousedata <- experiments(mousedata)[[1]]
  assays(mousedata) <- assays(mousedata)[c("TPM", "count")]
  colData(mousedata) <- cdata

  # subset by genes with detection rate > 5% in at least one condition
  levs <- unique(colData(mousedata)$group)
  mousedata <- mousedata[rowMeans(assay(mousedata, "count")[, 
                            colData(mousedata)$group == levs[1]] > 0) > 0.05 |
                         rowMeans(assay(mousedata, "count")[, 
                            colData(mousedata)$group == levs[2]] > 0) > 0.05, ]

  # remove spike-in controls
  mousedata <- mousedata[!grepl("ERCC-", rownames(mousedata)),]

  # save SummarizedExperiment 
  saveRDS(mousedata, datfile)
}else{
  mousedata <- readRDS(datfile) 
}

# look at number of cells in each group
table(as.character(colData(mousedata)$group))

For the mouse dataset, we have a total of `r ncol(mousedata)` cells and `r nrow(mousedata)` genes.

Data analysis

Differential testing

We perform differential expression testing using scDD which implicitly models any heterogeneity in expression.

We apply scDD to counts normalized with scran size factors. The p-value results are added to the rowData slot of the SummarizedExperiment object so that the gene metadata stays intact.

# function to add scDD p-values
compute_scdd_sc <- function(dat){
  # check if already computed
  if (! "scdd_p" %in% colnames(rowData(dat))){
    sce <- SingleCellExperiment(assays=list(count=assay(dat, "count")),
                                colData=data.frame(colData(dat)))
    # normalize counts using scran size factors
    # (EBSeq::GetNormalizedMat divides each cell's counts by its size factor)
    libsize <- scran::computeSumFactors(assay(sce, "count"))
    normcounts(sce) <- GetNormalizedMat(assay(sce, "count"), libsize)

    res <- results(scDD(sce, 
             condition = "group",
             prior_param = list(alpha=0.01, mu0=0, s0=0.01, a0=0.01, b0=0.01), 
             testZeroes = TRUE,
             categorize = FALSE))
    rowData(dat)$scdd_p <- res$combined.pvalue
  }else{
     message("scDD p-values already computed.")
  }
  return(dat)
}

mousedata <- compute_scdd_sc(mousedata)

# save results
saveRDS(mousedata, datfile)

Covariate Diagnostics

In scRNAseq data, the strength of the signal can be affected by the level of gene expression. We will explore two potential covariates related to expression level: mean nonzero expression and detection rate (defined as the proportion of cells expressing the gene at a nonzero level). In addition, we'll add a random (uninformative) covariate. These will also be added to the rowData slot of the SummarizedExperiment.

add_covariates_scrnaseq <- function(dat){
  rowData(dat)$meanExp <- apply(assay(dat, "count"), 1, 
                                function(x) mean(x[x>0]))
  rowData(dat)$detection <- rowMeans(assay(dat, "count") > 0)
  rowData(dat)$rand_covar <- rnorm(nrow(dat))
  return(dat)
}

set.seed(99764)
mousedata <- add_covariates_scrnaseq(mousedata)

# save results
saveRDS(mousedata, datfile)

Covariate one: Mean Nonzero Expression

For each covariate, we'll examine the covariate diagnostic plots.

rank_scatter(data.frame(rowData(mousedata)), 
             pvalue="scdd_p", covariate="meanExp") + 
  ggtitle("scDD, Covariate 1: Mean Nonzero Expression")
strat_hist(data.frame(rowData(mousedata)), 
           pvalue="scdd_p", covariate="meanExp", maxy=20,
           main = "scDD, Covariate 1: Mean Nonzero Expression") 

The mean nonzero expression covariate appears to be informative and approximately satisfies the diagnostic assumptions.

Covariate two: Detection Rate

Next we look at the detection rate covariate.

rank_scatter(data.frame(rowData(mousedata)), 
             pvalue="scdd_p", covariate="detection") + 
  ggtitle("scDD, Covariate 2: Detection Rate")
strat_hist(data.frame(rowData(mousedata)), 
           pvalue="scdd_p", covariate="detection", maxy=20,
           main = "scDD, Covariate 2: Detection Rate") 

Covariate three: Random

Next we look at the random covariate.

rank_scatter(data.frame(rowData(mousedata)), 
             pvalue="scdd_p", covariate="rand_covar") + 
  ggtitle("scDD, Covariate 3: Random")
strat_hist(data.frame(rowData(mousedata)), 
           pvalue="scdd_p", covariate="rand_covar", maxy=20,
           main = "scDD, Covariate 3: Random") 

Multiple-Testing Correction

First, we'll create an object of BenchDesign class to hold the data and add the benchmark methods to the BenchDesign object.

bd <- initializeBenchDesign()
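
The actual method definitions live in the sourced helper scripts and are not shown here. Purely as an illustrative sketch of the SummarizedBenchmark API they rely on, a stripped-down version of such a helper might look like the following (the method set, labels, and parameters are assumptions, not the exact benchmark configuration):

# Illustrative sketch only -- the real initializeBenchDesign() is sourced from ../R.
sketch_initializeBenchDesign <- function() {
  SummarizedBenchmark::BenchDesign() %>%
    addMethod(label = "bonf",
              func = p.adjust,
              params = rlang::quos(p = pval, method = "bonferroni")) %>%
    addMethod(label = "bh",
              func = p.adjust,
              params = rlang::quos(p = pval, method = "BH")) %>%
    addMethod(label = "qvalue",
              func = qvalue::qvalue,
              post = function(x) x$qvalues,
              params = rlang::quos(p = pval)) %>%
    addMethod(label = "ihw-a05",
              func = IHW::ihw,
              post = IHW::adj_pvalues,
              params = rlang::quos(pvalues = pval,
                                   covariates = ind_covariate,
                                   alpha = 0.05))
}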

Now, we're ready to construct the SummarizedBenchmark object, which will run the functions specified in each method (these are actually sourced in from the helper scripts).

Covariate one: Mean nonzero expression

First we'll include the mean nonzero expression covariate.

if (!file.exists(resfile_mean)){
  sb1 <- bd %>% buildBench(data.frame(rowData(mousedata)) %>% 
                                   filter(!is.na(scdd_p)) %>%
                                   mutate(pval=scdd_p,
                                          ind_covariate=meanExp), 
                          ftCols = c("meanExp"),
                          parallel=TRUE)

  saveRDS(sb1, file = resfile_mean)
}else{
  sb1 <- readRDS(resfile_mean)
}

Covariate two: Detection Rate

Now, we'll repeat the multiple testing correction using the detection rate covariate:

if (!file.exists(resfile_det)){
  sb2 <- bd %>% buildBench(data.frame(rowData(mousedata)) %>% 
                                   filter(!is.na(scdd_p)) %>%
                                   mutate(pval=scdd_p,
                                          ind_covariate=detection), 
                          ftCols = c("detection"),
                          parallel=TRUE)

  saveRDS(sb2, file = resfile_det)
}else{
  sb2 <- readRDS(resfile_det)
}

Covariate three: Random

Now, we'll repeat the multiple testing correction using the random covariate:

if (!file.exists(resfile_uninf)){
  sb3 <- bd %>% buildBench(data.frame(rowData(mousedata)) %>% 
                                   filter(!is.na(scdd_p)) %>%
                                   mutate(pval=scdd_p,
                                          ind_covariate=rand_covar), 
                          ftCols = c("rand_covar"),
                          parallel=TRUE)

  saveRDS(sb3, file = resfile_uninf)
}else{
  sb3 <- readRDS(resfile_uninf)
}

Benchmark Metrics

Next, we'll add the default performance metrics for q-value assays and plot the results. We'll start with covariate one.

Covariate one: Mean Nonzero Expression

First, we have to rename the assay to 'qvalue'.

# rename assay to qvalue
assayNames(sb1) <- "qvalue"
sb1 <- addDefaultMetrics(sb1)

Now, we'll plot the results.

# plot nrejects by method overall and stratified by covariate
rejections_scatter(sb1,
                   supplementary=FALSE) +
  ggtitle("scDD, Covariate 1: Mean Nonzero Expression")

rejection_scatter_bins(sb1, covariate="meanExp", bins=4,
                       supplementary=FALSE) +
  ggtitle("scDD, Covariate 1: Mean Nonzero Expression")

# upset plot 
plotFDRMethodsOverlap(sb1, 
                      alpha=0.05, nsets=ncol(sb1),
                      order.by="freq", decreasing=TRUE,
                      supplementary=FALSE) 
mcols(sb1)$ind_covariate <- mcols(sb1)$meanExp
covariateLinePlot(sb1, alpha=0.05, covname="ind_covariate", nbins=25, 
                 trans = "log1p") +
  ggtitle("scDD, Covariate 1: Mean Nonzero Expression")

Covariate two: Detection Rate

Next, we'll look at the performance metrics for the detection rate covariate.

# rename assay to qvalue
assayNames(sb2) <- "qvalue"
sb2 <- addDefaultMetrics(sb2)

Now, we'll plot the results.

# plot nrejects by method overall and stratified by covariate
rejections_scatter(sb2, supplementary=FALSE) +
  ggtitle("scDD, Covariate 2: Detection Rate")

rejection_scatter_bins(sb2, covariate="detection", bins=4,
                       supplementary=FALSE) +
  ggtitle("scDD, Covariate 2: Detection Rate")

# upset plot 
plotFDRMethodsOverlap(sb2, 
                      alpha=0.05, nsets=ncol(sb2),
                      order.by="freq", decreasing=TRUE,
                      supplementary=FALSE)
mcols(sb2)$ind_covariate <- mcols(sb2)$detection
covariateLinePlot(sb2, alpha=0.05, covname="ind_covariate", nbins=25) +
  ggtitle("scDD, Covariate 2: Detection Rate")

Covariate three: Random

Next, we'll look at the performance metrics for the random covariate.

# rename assay to qvalue
assayNames(sb3) <- "qvalue"
sb3 <- addDefaultMetrics(sb3)

Now, we'll plot the results.

# plot nrejects by method overall and stratified by covariate
rejections_scatter(sb3, supplementary=FALSE) +
  ggtitle("scDD, Covariate 3: Random")

rejection_scatter_bins(sb3, covariate="rand_covar", bins=4,
                       supplementary=FALSE) +
  ggtitle("scDD, Covariate 3: Random")

# upset plot 
plotFDRMethodsOverlap(sb3, 
                      alpha=0.05, nsets=ncol(sb3),
                      order.by="freq", decreasing=TRUE,
                      supplementary=FALSE)
mcols(sb3)$ind_covariate <- mcols(sb3)$rand_covar
covariateLinePlot(sb3, alpha=0.05, covname="ind_covariate", nbins=25) +
  ggtitle("scDD, Covariate 3: Random")

Covariate comparison

Here we compare the method ranks for the three covariates at alpha = 0.10.

plotMethodRanks(c(resfile_mean, resfile_det, resfile_uninf), 
                colLabels = c("Mean nonzero", "Detection rate", "Uninf"), 
                alpha = 0.10, xlab = "Covariate")

Session Information

sessionInfo()

References


