# knitr options
knit_hooks$set(backspace = pkgmaker::hook_backspace())
opts_chunk$set(dpi=300, dev="png")#, dev.args=list(pointsize=15))

IS_PKGDOWN <- nzchar(Sys.getenv('_R_PKGDOWN_RENDER'))
# packages
# pkgmaker::load_project('pancreas/bseq-sc')

# load color theme
COLOR_THEME <- bseqsc:::.ggtheme()

BSEQSCpkg <- 'bseqsc'
if( local_data <- dir.exists(data_dir <- file.path(pkgmaker::find_devpackage('pancreas'), 'docs/data')) ){
  eset <- readRDS(file.path(data_dir, 'GSE50244.rds'))
  eislet <- readRDS(file.path(data_dir, 'islet-eset.rds'))
## How to cite `r BSEQSC` {-}
When using this work in any way, please cite the following work:

> *A single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure*<br />
> <small>M. Baron, A. Veres, S.L. Wolock, A. L. Faust, R. Gaujoux, A. Vetere, J. Hyoje Ryu, B. K. Wagner, S. Shen-Orr, A. M. Klein, D. A. Melton, I. Yanai<br /></small>
> Cell Systems. 2016 Oct 26 [10.1016/j.cels.2016.08.011](https://www.ncbi.nlm.nih.gov/pubmed/27667365)



In order to run the entire deconvolution pipeline, users need to install the r Githubpkg('shenorrlab/bseq-sc', 'bseqsc') R package from GitHub. See the installtion instructions for details.


r BSEQSC uses two types of input data:

r BSEQSC analysis pipeline

Single cell RNA-seq information is used to deconvolve bulk RNA-seq samples by estimating the cell-type proportion of key cell-types. Statistical deconvolution is then used to leverage the variation in cell type frequencies between individuals to estimate average cell-type specific expression in 2 groups of individuals and to compute cell type-specific differential expression.

We used this approach to investigate gene expression differences between diabetics and healthy individuals, using single cell RNA-seq data obtained from an independent set of pancreatic islets (Figure 1). However, it is applicable to other type of tissue, for which both bulk and single cell gene expression data are available.


Sample analysis

We reproduce here the analysis detailed in [@Baron2016].

Data preparation

Bioconductor base package provides the ExpressionSet class, which is a convenient data structure to hold expression data along with sample/feature annotation. Here we use two ExpressionSet objects to handle the bulk and single cell data respectively.

Bulk data

The dataset's GEO entry (GSE50244) contains raw RNA-seq and sample annotation data. For the purpose of this vignette, we will use data that we pre-processed and made available on the data download page.

After pre-processing, the data were filtered to remove unmapped tags, or tags that mapped to pseudogenes, or known ribosomal/mithocondrial genes (list available in file data/genes_mitrib.txt.gz or via function getMITRIB() provided in the package). Moreover, counts from transcripts that mapped to a same gene were averaged. Sample phenotypic annotations were also processed to define glycemic groups based on hba1c measurements, as per the original publication [@Fadista2014].

# download GEO dataset from Github
eset <- readRDS('https://shenorrlab.github.io/bseqsc/data/GSE50244.rds')
# for this analysis we only look at samples with hba1c data
eset <- droplevels(eset[, !is.na(eset$hba1c)])

Single cell data

We generated single cell RNA-seq data of pancreatic islets from 3 healthy individuals, which provided gene expression profiles for 17434 genes in 7729 cells. The raw counts for these data are available on the data download page, in the form of an ExpressionSet:

eislet <- readRDS('https://shenorrlab.github.io/bseqsc/data/islet-eset.rds')

After identification of the cell types by clustering, we extract marker genes for each cell type, i.e. sets of genes that are mostly highly expressed in each cell type, which can be done in multiple ways, see for example details in [@Baron2016].

The r BSEQSCpkg package provides the list of selected marker genes for the pancreatic islets, stored in data object pancreasMarkers:


Estimation of cell type proportions

Once cell type-specific markers have been identified, we build a reference basis matrix from their single cell gene expression, which we use to infer the composition of bulk tissue samples from their total gene expression.

Building the reference basis matrix

The purpose of this matrix is to provide reference expression profiles that together discriminate the cell types of interest. We compute it by averaging, within each cell type, the gene expression of each marker gene across all the cells in the cell type.

Notably, the computation is performed on scaled single cell count data, to better reflect cell type-specific variations in average transcript counts. Indeed, looking at the average counts within each cell type shows how cell types have varying levels of RNA transcripts:

# average counts computed within each cell type in each sample
plotCellTotals(eislet, 'cellType', 'sampleID')

Because we want the basis matrix of reference profiles to reflect these biological differences, we use these average counts to re-scale the CPM data before computing the average gene expression.

B <- bseqsc_basis(eislet, pancreasMarkers, clusters = 'cellType', samples = 'sampleID', ct.scale = TRUE)

The resulting basis matrix should show a clear cell type-specific expression pattern (Figure 2):

plotBasis(B, pancreasMarkers, Colv = NA, Rowv = NA, layout = '_', col = 'Blues')

Estimating proportions

Cell type proportions are estimated using CIBERSORT, a method that was developed to investigate immune infiltrating landscape in tumour tissue [@Newman2015; @Gentles2015] (See section Requirements for setup instructions).

fit <- bseqsc_proportions(eset, B, verbose = TRUE)

To facilitate plotting and model fitting in downstream analysis, we add the estimated proportions as extra phenotypic variables to the ExpressionSet:

pData(eset) <- cbind(pData(eset), t(coef(fit)))

Estimated proportions can then be compared between the two HBA1C groups, within each cell type (Figure 3).

  df <- melt(pData(eset), measure.vars = rownames(coef(fit)))
  ggplot(df, aes(x = hba1c_class2, y = value, color = hba1c_class2)) + geom_boxplot(width = .3) + facet_wrap( ~ variable, nrow = 1, scales = 'free') + scale_x_discrete(breaks = NULL) + scale_color_brewer(type = 'qual', palette = 'Set1', direction = -1) + xlab('') +   ylab('Frequency') + guides(color = guide_legend(''))

Gene expression differences adjusted for proportions

Now that we have estimates for the main cell types, we can use r Biocpkg('EdgeR') to derive a set of genes that are likely to be differentially regulated at the cell type level.

To do so, we fit 2 models of gene expression differences between HBA1C groups (IGT, Diabetic, Normal): * a base model only includes the HBA1C group variable, as well as gender and age covariates; * an extended model that additionally includes the proportions of alpha, beta, ductal and acinar cells.

Genes that are likely to be differentially regulated at the cell type level are those whose p-values improve (i.e. decrease) when adjusting for cell type proportions (red dots on Figure 4). Here we used a p-value threshold of 0.01.

To fit these models we use the utility function fitEdgeR included in the package:

# base
fit_edger <- fitEdgeR(eset, ~ gender + ageN + hba1c_class, coef = c('hba1c_classIGT', 'hba1c_classDiabetic'))
# extended
fit_edger_ext <- fitEdgeR(eset, ~ gender + ageN + alpha + beta + ductal + acinar + hba1c_class
                          , coef = c('hba1c_classIGT', 'hba1c_classDiabetic'))

We then visualize

# gather P-values from both models
edger_pvals <- cbind(fit_edger$table[, 'Symbol', drop = FALSE], Base = fit_edger$table$PValue
               , Adjusted = fit_edger_ext$table$PValue[match(fit_edger$table$Symbol, fit_edger_ext$table$Symbol)]
               , stringsAsFactors = FALSE)

edger_pvals <- mutate(edger_pvals, Regulated = Adjusted <= 0.01 & Adjusted <= Base)

# plot Base vs Adjusted
pvalueScatter(Base ~ Adjusted, edger_pvals, pval.th = 0.01, label.th = 3.5)

Cell type-specific differential expression

Now, we run a cell type-specific differential analysis using csSAM [@Shen-Orr2010], on the subset of selected genes, to which we add a set of 6 ER stress-related genes, that were identified in the single cell data analysis (see details in [@Baron2016]).

csSAM fits a linear model that estimates interaction terms between cell proportions and a group variable. This imposes a constraint on the sample size within each group. For this reason we only inlude into the model the 4 cell types (alpha, beta, ductal and acinar cells), either due to their variation (Figure 3) or biological relevance.

# ER genes
genes_ER <- c('HSPA5', 'MAFA', 'HERPUD1', 'DDIT3', 'UCN3', 'NEUROD1')
# Fit on ER stress genes and genes regulated beyond cell type proportion differences
genes <- union(genes_ER, subset(edger_pvals, Regulated)$Symbol)
# NB: we fix seed for reproducibility
csfit <- bseqsc_csdiff(eset[genes, ] ~ gender + ageN + hba1c_class2 | alpha + beta + ductal + acinar,  verbose = 2, nperms = 5000, .rng = 12345)

We observe a strong signal for up-regulation on beta cells (at FDR 0.05), and milder signals of up and down-regulation on acinar and beta cells respectively (at looser FDR 0.25).

p <- plot(csfit, alt = c('great', 'less'), by = 'alt', xlab = 'Number of significant genes', ylab = 'FDR cutoff')
p$data$Cell.type <- gsub(":Hyper", "", p$data$Cell.type)
p$data$Alternative <- c('greater' = 'Up-regulated', less = 'Down-regulated')[as.character(p$data$Alternative)]
p + theme_get()

We filter these results for genes with the strongest signal in alpha, beta and acinar cells:

cst <- csTopTable(csfit, threshold = c(alpha = 0.05, beta = 0.25, acinar = 0.25), alt = 'min', merge = TRUE, n = Inf, nodups = TRUE)

Session Info



shenorrLab/bseqsc documentation built on May 29, 2019, 9:23 p.m.