In xuranw/MuSiC: Multi-subject single cell deconvolution

%\VignetteEngine{knitr::rmarkdown}

title: "MuSiC: Sample Analysis" output: html_document

Installation

# install devtools if necessary
if (!"devtools" %in% rownames(installed.packages())) {
  install.packages('devtools')
}

# install the MuSiC package
if (!"MuSiC" %in% rownames(installed.packages())) {
  devtools::install_github('xuranw/MuSiC')
}


# load
library(MuSiC)

library(bseqsc)

Examples

We reproduce here the analysis detailed in MuSiC manuscript:

Total gene expression data were obtained from GEO dataset GSE50244, which contains read counts data on gobal transcriptomic analysis of bulk human pancreatic islets to study glucose metabolism in healthy and hyper-glycemic conditions (Fadista et al. 2014).
Single cell RNA-seq data were generated by ArrayExpress dataset E-MTAB-5061, which contains read counts data on single cell mRNA-seq of human pancreatic islets from 10 donors, 6 healthy donors and 4 Type II diabetes donors (Segerstolpe et al. 2016). We used single cell data from 6 healthy donors as reference in deconvolution.
Another single cell RNA-seq data were from Xin et al., which contains read counts data on single cell mRNA-seq of human pancreatic islet endocrine cells from 18 donors, 12 healthy donors and 6 Type II diabetes donors.

Data Preparation

Bioconduct base package provides ExpressionSet class, which is a convenient data structure to hold expression data along with sample/feature annotation. Here we use two ExpressionSet objects to handle the bulk and single cell data respectively. The details of constructing ExpressionSet can be found on this page.

Bulk data

The dataset's GEO entry (GSE50244) contains raw RNA-seq and sample annotation data. For the purpose of this vignette, we will use the read counts data GSE50244bulkeset.rds on the this page. Please download this file.

# Download GSE50244 dataset from Github
readRDSFromWeb <- function(ref) {
  readRDS(gzcon(url(ref)))
}
GSE50244.bulk.eset = readRDSFromWeb("https://xuranw.github.io/MuSiC/data/GSE50244bulkeset.rds")
GSE50244.bulk.eset
#ExpressionSet (storageMode: lockedEnvironment)
#assayData: 32581 features, 89 samples 
#  element names: exprs 
#protocolData: none
#phenoData
#  sampleNames: Sub1 Sub2 ... Sub89 (89 total)
#  varLabels: sampleID SubjectName ... tissue (7 total)
#  varMetadata: labelDescription
#featureData: none
#experimentData: use 'experimentData(object)'
#Annotation:

Single cell data

The single cell data are from ArrayExpression (E-MTAB-5061), which contrains read counts for 25453 genes of 2209 cells. The read counts EMTABesethealthy.rds are available on the this page, in the form of an ExpressionSet. Please download this file.

# Download EMTAB single cell dataset from Github
EMTAB.eset = readRDSFromWeb('https://xuranw.github.io/MuSiC/data/EMTABesethealthy.rds')
EMTAB.eset
#ExpressionSet (storageMode: lockedEnvironment)
#assayData: 25453 features, 1097 samples 
#  element names: exprs 
#protocolData: none
#phenoData
#  sampleNames: AZ_A10 AZ_A11 ... HP1509101_P9 (1097 total)
#  varLabels: sampleID SubjectName cellTypeID cellType
#  varMetadata: labelDescription
#featureData: none
#experimentData: use 'experimentData(object)'
#Annotation:

Another single cell data are from GEO entry (GSE81608). There are 39849 genes and 1492 cells. The read counts XinT2Deset.rds are available on this page, in the form of an ExpressionSet. Please download this file.

# Download Xin et al. single cell dataset from Github
XinT2D.eset = readRDSFromWeb('https://xuranw.github.io/MuSiC/data/XinT2Deset.rds')
XinT2D.eset
#ExpressionSet (storageMode: lockedEnvironment)
#assayData: 39849 features, 1492 samples 
#  element names: exprs 
#protocolData: none
#phenoData
#  sampleNames: Sample_1 Sample_2 ... Sample_1492 (1492 total)
#  varLabels: sampleID SubjectName ... Disease (5 total)
#  varMetadata: labelDescription
#featureData: none
#experimentData: use 'experimentData(object)'
#Annotation:

Estimation of cell type proportions

In stead of selecting marker genes, MuSiC gives weights to each gene. The weight scheme is based on cross-subject variation: up-weight gene with low variation and down-weight gene with high variation. The composition of bulk tisue samples from there total gene expression is inferred by both cross-subject mean and variance by all genes.

Benchmark Evaluation

Benchmark dataset is constructed by summing up single cell data from XinT2D.eset. The artificial bulk data is constructed through function bulk_construct. The inputs are single cell dataset, cluster name (clusters), sample name (samples) and selected cell type (select.ct). bulk_construct returns a ExpressionSet of artificial bulk dataset Bulk.counts and a matrix of real cell type counts num.real.

# Construct artificial bulk dataset. Use all 4 cell types: alpha, beta, gamma, delta
XinT2D.construct.full = bulk_construct(XinT2D.eset, clusters = 'cellType', samples = 'SubjectName')

# calculate cell type proportions
XinT2D.construct.full$prop.real = relative.ab(XinT2D.construct.full$num.real, by.col = FALSE)

The cell type proportions is estimated by function music_prop. We can specify the way of grouping scRNA-seq data by cluster. samples refers to different subjects and select.ct refers to cell type included (use all cell types in the single cell dataset if not specified). The estimated proportions are normalized based on included cell type.

library(xbioc)

# Estimate cell type proportions of artificial bulk data
Est.prop.Xin = music_prop(bulk.eset = XinT2D.construct.full$Bulk.counts, sc.eset = EMTAB.eset,
                          clusters = 'cellType', samples = 'sampleID', 
                          select.ct = c('alpha', 'beta', 'delta', 'gamma'))
names(Est.prop.Xin)
#[1] "Est.prop.weighted" "Est.prop.allgene"  "Weight.gene"       "r.squared.full"    "Var.prop"

music_prop not only provides MuSiC estimated cell type proportions (Est.prop.weighted), but also Non-negative least square estimated proportions (Est.prop.allgene). It also returns the weights (Weight.gene), R square (r.squared.full) and variance of estimated cell type proportions (Var.prop).

Compare the real and estimated cell type proportions via different methods.

# Estimate via BSEQ-sc
B.EMTAB.full = bseqsc_basis(EMTAB.eset, pancreasMarkers[c('alpha', 'beta', 'delta', 'gamma')], clusters = 'cellType', samples = 'sampleID', ct.scale = TRUE)
fit.EMTAB <- bseqsc_proportions(XinT2D.construct.full$Bulk.counts, B.EMTAB.full, verbose = TRUE)
Est.prop.bseq.Xin = t(coef(fit.EMTAB))

# Estimation evaluation

Eval_multi(prop.real = data.matrix(XinT2D.construct.full$prop.real), 
           prop.est = list(data.matrix(Est.prop.Xin$Est.prop.weighted), data.matrix(Est.prop.Xin$Est.prop.allgene), data.matrix(Est.prop.bseq.Xin)), 
           method.name = c('MuSiC', 'NNLS', 'BSEQ-sc'))

#           RMSD     mAD      R
#MuSiC   0.09881 0.06357 0.9378
#NNLS    0.17161 0.11749 0.8159
#BSEQ-sc 0.21741 0.15134 0.7880

Prop_comp_multi(prop.real = data.matrix(XinT2D.construct.full$prop.real), 
                prop.est = list(data.matrix(Est.prop.Xin$Est.prop.weighted), data.matrix(Est.prop.Xin$Est.prop.allgene), 
                data.matrix(Est.prop.bseq.Xin)), eval = FALSE, 
                method.name = c('MuSiC', 'NNLS', 'BSEQ-sc'), title = 'Heatmap of Real and Est. Proportion from Xin et al.\n Ref: EMTAB healthy' )

Real Bulk Cell Type Estimation

The deconvolution of 89 subjects from Fadista et al. can be realized by bulk data GSE50244.bulk.eset and single cell data EMTAB.eset. We constrained our estimation on 6 major cell type: alpha, beta, delta, gamma, acinar and ductal, which take up over 90% of the whole islet.

# Estimate cell type proportions
Est.prop.GSE50244 = music_prop(bulk.eset = GSE50244.bulk.eset, sc.eset = EMTAB.eset, clusters = 'cellType', 
                               samples = 'sampleID', select.ct = c('alpha', 'beta', 'delta', 'gamma', 'acinar', 'ductal'))

# Estimate via BSEQ-sc
B.EMTAB = bseqsc_basis(EMTAB.eset, pancreasMarkers, clusters = 'cellType', samples = 'sampleID', ct.scale = TRUE)
fit.EMTAB <- bseqsc_proportions(exprs(GSE50244.bulk.eset), B.EMTAB, verbose = TRUE)
Est.prop.bseq = t(coef(fit.EMTAB))

CIBERSORT estimation is abtained by using the design matrix provided by function music_Design.matrix and can be downloaded here.

load(gzcon(url('https://xuranw.github.io/MuSiC/data/GSE50244CIBERSORT.RData')))

# Jitter plot of estimated cell type proportions
m.prop.GSE50244 = rbind(melt(GSE50244.EMTAB.prop$Est.prop.weighted), 
                        melt(GSE50244.EMTAB.prop$Est.prop.allgene), melt(Est.prop.bseq),
                        melt(data.matrix(Est.prop.cibersort)))

colnames(m.prop.GSE50244) = c('Sub', 'CellType', 'Prop')
m.prop.GSE50244$CellType = factor(m.prop.GSE50244$CellType, levels = c('alpha', 'beta', 'delta', 'gamma', 'acinar', 'ductal'))
m.prop.GSE50244$Method = factor(rep(c('MuSiC', 'NNLS', 'BSEQ-sc', 'CIBERSORT'), each = 89*6), levels = c('MuSiC', 'NNLS', 'BSEQ-sc', 'CIBERSORT'))
m.prop.GSE50244$HbA1c = rep(GSE50244.bulk.eset$hba1c, 4*6)
m.prop.GSE50244 = m.prop.GSE50244[!is.na(m.prop.GSE50244$HbA1c), ]
m.prop.GSE50244$Disease = factor(c('H', 'NH')[(m.prop.GSE50244$HbA1c > 6.5)+1], levels = c('H', 'NH'))
m.prop.GSE50244$D = (m.prop.GSE50244$Disease == 'NH')/5
m.prop.GSE50244 = rbind(subset(m.prop.GSE50244, Disease == 'H'), subset(m.prop.GSE50244, Disease != 'H'))

ggplot(m.prop.GSE50244, aes(Method, Prop)) + 
  geom_point(aes(fill = Method, color = Disease, stroke = D, shape = Disease), size = 2, alpha = 0.7, position = position_jitter(width=0.25, height=0)) +
  facet_wrap(~ CellType, scales = 'free') + scale_colour_manual( values = c('white', "gray20")) +
  scale_shape_manual(values = c(21, 24))+ theme_minimal()

Constrain our analysis on beta cells

# Create dataframe for beta cell proportions and HbA1c levels
m.prop.ana = data.frame(Age = rep(pData(GSE50244.bulk.eset)$age, 4), BMI = rep(pData(GSE50244.bulk.eset)$bmi, 4), HbA1c = rep(pData(GSE50244.bulk.eset)$hba1c, 4),
                        Gender = rep(pData(GSE50244.bulk.eset)$gender, 4),
                        ct.prop = c(data.matrix(GSE50244.EMTAB.prop$Est.prop.weighted[, 2]), GSE50244.EMTAB.prop$Est.prop.allgene[, 2], Est.prop.bseq[, 2], Est.prop.cibersort[, 2]), 
                        Method = factor(rep( c('MuSiC', 'NNLS', 'BSEQ-sc', 'CIBERSORT'), each = 89), levels =  c('MuSiC', 'NNLS', 'BSEQ-sc', 'CIBERSORT')) )
m.prop.ana.h = subset(m.prop.ana, !is.na(HbA1c))
m.prop.ana.h$Disease = factor( c('H', 'NH')[(m.prop.ana.h$HbA1c > 6.5) + 1], c('H', 'NH') )
m.prop.ana.h$D = (m.prop.ana.h$Disease == 'NH')/5

ggplot(m.prop.ana.h, aes(HbA1c, ct.prop)) + geom_smooth(method = 'lm',  se = FALSE, col = 'black', lwd = 0.25) +
  geom_point(aes(fill = Method, color = Disease, stroke = D, shape = Disease), size = 2, alpha = 0.7) +  facet_wrap(~ Method) + 
  ggtitle('HbA1c vs Beta Cell Type Proportion') + theme_minimal() + scale_colour_manual( values = c('white', "gray20")) + 
  scale_shape_manual(values = c(21, 24))

MuSiC Estimation Example: Mouse/Rat Kidney

We reproduce here the analysis detailed in MuSiC manuscript:

Bulk RNA-seq data gene expression data were obtained from GEO dataset GSE81492, which contains read counts data on bulk mRNA-seq of mouse kidney from 10 mouses, 6 healthy and 4 APOL1 mouses (Beckerman et al. 2017).
Single cell RNA-seq gene expression data were obtained from GEO dataset GSE107585, which contains read counts data on single cell mRNA-seq of mouse kidney from 7 male healthy mouse (Park et al. 2018). We used a subset of single cell data as reference in deconvolution.

Data Preparation

Bulk data

The dataset GEO entry (GSE81492) contains raw RNA-seq and sample annotation data. For the purpose of this vignette, we will use the read counts data Mousebulkeset.rds on the this page. Please download this file.

# Download Mouse bulk dataset from Github
Mouse.bulk.eset = readRDSFromWeb('https://xuranw.github.io/MuSiC/data/Mousebulkeset.rds')
Mouse.bulk.eset

#ExpressionSet (storageMode: lockedEnvironment)
#assayData: 19033 features, 10 samples 
#  element names: exprs 
#protocolData: none
#phenoData
#  sampleNames: control.NA.27 control.NA.30 ... APOL1.GNA78M (10 total)
#  varLabels: sampleID SubjectName
#  varMetadata: labelDescription
#featureData: none
#experimentData: use 'experimentData(object)'
#Annotation:

Single cell data

The single cell data are from GEO entry (GSE107585), which contrains read counts for 16273 genes of 43745 cells. A subset of the read counts Mousesubeset.rds are available on the this page, in the form of an ExpressionSet. This subset contains 16273 genes and 10000 cells. Please download this file.

# Download EMTAB single cell dataset from Github
Mousesub.eset = readRDSFromWeb('https://xuranw.github.io/MuSiC/data/Mousesubeset.rds')
Mousesub.eset

#ExpressionSet (storageMode: lockedEnvironment)
#assayData: 16273 features, 10000 samples 
#  element names: exprs 
#protocolData: none
#phenoData
#  sampleNames: TGGTTCCGTCGGCTCA-2 CGAGCCAAGCGTCAAG-4 ... GAGCAGAGTCAACATC-1 (10000 total)
#  varLabels: sampleID SubjectName cellTypeID cellType
#  varMetadata: labelDescription
#featureData: none
#experimentData: use 'experimentData(object)'
#Annotation:

Notice that the single cell dataset has 16 cell types, including 2 novel cell types and a transition cell type. We exclude those 3

Estimation of cell type proportions

Cluster of Single Cell Data

In the estimation procedure, the first step is to produce design matrix, cross-subject mean of relative abundance, cross-subject variance of relative abundance and average library size from single cell reference. music_basis is the function to produce the design matrix (Disgn.mtx), cross-subject mean of relative abundance (M.theta), cross-subject variance of relative abundance (Sigma), and average library size (M.S).

MuSiC utilized all genes in the next estimation steps and this will lead to collineary of design matrix.

# Produce the first step information
Mousesub.basis = music_basis(Mousesub.eset, clusters = 'cellType', samples = 'sampleID', 
                             select.ct = c('Endo', 'Podo', 'PT', 'LOH', 'DCT', 'CD-PC', 'CD-IC', 'Fib', 'Macro', 'Neutro',
                                           'B lymph', 'T lymph', 'NK'))

# Plot the dendrogram of design matrix and cross-subject mean of realtive abundance
par(mfrow = c(1, 2))
d <- dist(t(log(Mousesub.basis$Disgn.mtx + 1e-6)), method = "euclidean")
# Hierarchical clustering using Complete Linkage
hc1 <- hclust(d, method = "complete" )
# Plot the obtained dendrogram
plot(hc1, cex = 0.6, hang = -1, main = 'Cluster log(Design Matrix)')
d <- dist(t(log(Mousesub.basis$M.theta + 1e-8)), method = "euclidean")
# Hierarchical clustering using Complete Linkage
# hc2 <- hclust(d, method = "complete" )
hc2 <- hclust(d, method = "complete")
# Plot the obtained dendrogram
plot(hc2, cex = 0.6, hang = -1, main = 'Cluster log(Mean of RA)')

The immune cells are clustered together and so are the epithelial cells. Notice that DCT and PT are close to each other. The cut-off is user determined. Here we cut 13 cell types into 4 groups:

Group 1: Neutro
Group 2: Podo
Group 3: Endo, CD-PC, CD-IC, LOH, DCT, PT
Group 4: Fib, Macro, NK, B lymph, T lymph

MuSiC.cluster is an application of MuSiC to a hierarchical structure, which include 2 steps:

Step 1. Estimate proportions of each cluster; $(p_1,p_2,p_3,p_4)$
Step 2. Estimate cell type proportions within each clusters; $(p_{31},p_{32},.,p_{36},p_{41},.,p_{45})$

We manually specify the cluster and annotated single cell data with cluster information.

clusters.type = list(C1 = 'Neutro', C2 = 'Podo', C3 = c('Endo', 'CD-PC', 'LOH', 'CD-IC', 'DCT', 'PT'), C4 = c('Macro', 'Fib', 'B lymph', 'NK', 'T lymph'))

cl.type = as.character(Mousesub.eset$cellType)

for(cl in 1:length(clusters.type)){
  cl.type[cl.type %in% clusters.type[[cl]]] = names(clusters.type)[cl]
}
pData(Mousesub.eset)$clusterType = factor(cl.type, levels = c(names(clusters.type), 'CD-Trans', 'Novel1', 'Novel2'))

# 13 selected cell types
s.mouse = unlist(clusters.type)

We then select genes that differential expressed genes within cluster C3 (Epithelial cells) and C4 (Immune cells). Function music_prop.cluster is used for estimation with cluster.

load(gzcon(url('https://xuranw.github.io/MuSiC/data//IEmarkers.RData')))

Est.mouse.bulk = music_prop.cluster(bulk.eset = Mouse.bulk.eset, sc.eset = Mousesub.eset, group.markers = IEmarkers, clusters = 'cellType', groups = 'clusterType', samples = 'sampleID', clusters.type = clusters.type)

The results might be different from the one presented in the manuscript due to incomplete reference single cell dataset.

Reference

Segerstolpe, ?. et al. Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell metabolism 24, 593-607 (2016).
Xin, Y. et al. RNA sequencing of single human islet cells reveals type 2 diabetes genes. Cell metabolism 24, 608-615 (2016).
Fadista, J. et al. Global genomic and transcriptomic analysis of human pancreatic islets reveals novel genes influencing glucose metabolism. Proceedings of the National Academy of Sciences 111, 13924-13929 (2014).
Newman, A.M. et al. Robust enumeration of cell subsets from tissue expression profiles. Nature methods 12, 453 (2015).
Baron, M. et al. A Single-Cell Transcriptomic Map of the Human and Mouse Pancreas Reveals Inter- and Intra-cell Population Structure. Cell Syst 3, 346-360 e344 (2016).
Park, J. et al. Single-cell transcriptomics of the mouse kidney reveals potential cellular targets of kidney disease. Science, eaar2131 (2018).
Beckerman, P. et al. Transgenic expression of human APOL1 risk variants in podocytes induces kidney disease in mice. Nat Med 23, 429-438 (2017).

xuranw/MuSiC documentation built on March 7, 2024, 11:45 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

xuranw/MuSiC
Multi-subject single cell deconvolution

In xuranw/MuSiC: Multi-subject single cell deconvolution

%\VignetteEngine{knitr::rmarkdown}

Installation

Examples

Data Preparation

Bulk data

Single cell data

Estimation of cell type proportions

Benchmark Evaluation

Real Bulk Cell Type Estimation

MuSiC Estimation Example: Mouse/Rat Kidney

Data Preparation

Bulk data

Single cell data

Estimation of cell type proportions

Cluster of Single Cell Data

Reference

R Package Documentation

Browse R Packages

We want your feedback!

xuranw/MuSiC Multi-subject single cell deconvolution

In xuranw/MuSiC: Multi-subject single cell deconvolution

%\VignetteEngine{knitr::rmarkdown}

Installation

Examples

Data Preparation

Bulk data

Single cell data

Estimation of cell type proportions

Benchmark Evaluation

Real Bulk Cell Type Estimation

MuSiC Estimation Example: Mouse/Rat Kidney

Data Preparation

Bulk data

Single cell data

Estimation of cell type proportions

Cluster of Single Cell Data

Reference

R Package Documentation

Browse R Packages

We want your feedback!

xuranw/MuSiC
Multi-subject single cell deconvolution