%\VignetteEngine{knitr::rmarkdown}

title: "MuSiC: Sample Analysis" output: html_document


Installation

# install devtools if necessary
if (!"devtools" %in% rownames(installed.packages())) {
  install.packages('devtools')
}

# install the MuSiC package
if (!"MuSiC" %in% rownames(installed.packages())) {
  devtools::install_github('xuranw/MuSiC')
}


# load
library(MuSiC)

library(bseqsc)

Examples

We reproduce here the analysis detailed in MuSiC manuscript:

Data Preparation

Bioconduct base package provides ExpressionSet class, which is a convenient data structure to hold expression data along with sample/feature annotation. Here we use two ExpressionSet objects to handle the bulk and single cell data respectively. The details of constructing ExpressionSet can be found on this page.

Bulk data

The dataset's GEO entry (GSE50244) contains raw RNA-seq and sample annotation data. For the purpose of this vignette, we will use the read counts data GSE50244bulkeset.rds on the this page. Please download this file.

# Download GSE50244 dataset from Github
readRDSFromWeb <- function(ref) {
  readRDS(gzcon(url(ref)))
}
GSE50244.bulk.eset = readRDSFromWeb("https://xuranw.github.io/MuSiC/data/GSE50244bulkeset.rds")
GSE50244.bulk.eset
#ExpressionSet (storageMode: lockedEnvironment)
#assayData: 32581 features, 89 samples 
#  element names: exprs 
#protocolData: none
#phenoData
#  sampleNames: Sub1 Sub2 ... Sub89 (89 total)
#  varLabels: sampleID SubjectName ... tissue (7 total)
#  varMetadata: labelDescription
#featureData: none
#experimentData: use 'experimentData(object)'
#Annotation:  

Single cell data

The single cell data are from ArrayExpression (E-MTAB-5061), which contrains read counts for 25453 genes of 2209 cells. The read counts EMTABesethealthy.rds are available on the this page, in the form of an ExpressionSet. Please download this file.

# Download EMTAB single cell dataset from Github
EMTAB.eset = readRDSFromWeb('https://xuranw.github.io/MuSiC/data/EMTABesethealthy.rds')
EMTAB.eset
#ExpressionSet (storageMode: lockedEnvironment)
#assayData: 25453 features, 1097 samples 
#  element names: exprs 
#protocolData: none
#phenoData
#  sampleNames: AZ_A10 AZ_A11 ... HP1509101_P9 (1097 total)
#  varLabels: sampleID SubjectName cellTypeID cellType
#  varMetadata: labelDescription
#featureData: none
#experimentData: use 'experimentData(object)'
#Annotation:  

Another single cell data are from GEO entry (GSE81608). There are 39849 genes and 1492 cells. The read counts XinT2Deset.rds are available on this page, in the form of an ExpressionSet. Please download this file.

# Download Xin et al. single cell dataset from Github
XinT2D.eset = readRDSFromWeb('https://xuranw.github.io/MuSiC/data/XinT2Deset.rds')
XinT2D.eset
#ExpressionSet (storageMode: lockedEnvironment)
#assayData: 39849 features, 1492 samples 
#  element names: exprs 
#protocolData: none
#phenoData
#  sampleNames: Sample_1 Sample_2 ... Sample_1492 (1492 total)
#  varLabels: sampleID SubjectName ... Disease (5 total)
#  varMetadata: labelDescription
#featureData: none
#experimentData: use 'experimentData(object)'
#Annotation:  

Estimation of cell type proportions

In stead of selecting marker genes, MuSiC gives weights to each gene. The weight scheme is based on cross-subject variation: up-weight gene with low variation and down-weight gene with high variation. The composition of bulk tisue samples from there total gene expression is inferred by both cross-subject mean and variance by all genes.

Benchmark Evaluation

Benchmark dataset is constructed by summing up single cell data from XinT2D.eset. The artificial bulk data is constructed through function bulk_construct. The inputs are single cell dataset, cluster name (clusters), sample name (samples) and selected cell type (select.ct). bulk_construct returns a ExpressionSet of artificial bulk dataset Bulk.counts and a matrix of real cell type counts num.real.

# Construct artificial bulk dataset. Use all 4 cell types: alpha, beta, gamma, delta
XinT2D.construct.full = bulk_construct(XinT2D.eset, clusters = 'cellType', samples = 'SubjectName')

# calculate cell type proportions
XinT2D.construct.full$prop.real = relative.ab(XinT2D.construct.full$num.real, by.col = FALSE)

The cell type proportions is estimated by function music_prop. We can specify the way of grouping scRNA-seq data by cluster. samples refers to different subjects and select.ct refers to cell type included (use all cell types in the single cell dataset if not specified). The estimated proportions are normalized based on included cell type.

library(xbioc)

# Estimate cell type proportions of artificial bulk data
Est.prop.Xin = music_prop(bulk.eset = XinT2D.construct.full$Bulk.counts, sc.eset = EMTAB.eset,
                          clusters = 'cellType', samples = 'sampleID', 
                          select.ct = c('alpha', 'beta', 'delta', 'gamma'))
names(Est.prop.Xin)
#[1] "Est.prop.weighted" "Est.prop.allgene"  "Weight.gene"       "r.squared.full"    "Var.prop"  

music_prop not only provides MuSiC estimated cell type proportions (Est.prop.weighted), but also Non-negative least square estimated proportions (Est.prop.allgene). It also returns the weights (Weight.gene), R square (r.squared.full) and variance of estimated cell type proportions (Var.prop).

Compare the real and estimated cell type proportions via different methods.

# Estimate via BSEQ-sc
B.EMTAB.full = bseqsc_basis(EMTAB.eset, pancreasMarkers[c('alpha', 'beta', 'delta', 'gamma')], clusters = 'cellType', samples = 'sampleID', ct.scale = TRUE)
fit.EMTAB <- bseqsc_proportions(XinT2D.construct.full$Bulk.counts, B.EMTAB.full, verbose = TRUE)
Est.prop.bseq.Xin = t(coef(fit.EMTAB))

# Estimation evaluation

Eval_multi(prop.real = data.matrix(XinT2D.construct.full$prop.real), 
           prop.est = list(data.matrix(Est.prop.Xin$Est.prop.weighted), data.matrix(Est.prop.Xin$Est.prop.allgene), data.matrix(Est.prop.bseq.Xin)), 
           method.name = c('MuSiC', 'NNLS', 'BSEQ-sc'))

#           RMSD     mAD      R
#MuSiC   0.09881 0.06357 0.9378
#NNLS    0.17161 0.11749 0.8159
#BSEQ-sc 0.21741 0.15134 0.7880

Prop_comp_multi(prop.real = data.matrix(XinT2D.construct.full$prop.real), 
                prop.est = list(data.matrix(Est.prop.Xin$Est.prop.weighted), data.matrix(Est.prop.Xin$Est.prop.allgene), 
                data.matrix(Est.prop.bseq.Xin)), eval = FALSE, 
                method.name = c('MuSiC', 'NNLS', 'BSEQ-sc'), title = 'Heatmap of Real and Est. Proportion from Xin et al.\n Ref: EMTAB healthy' )

Real Bulk Cell Type Estimation

The deconvolution of 89 subjects from Fadista et al. can be realized by bulk data GSE50244.bulk.eset and single cell data EMTAB.eset. We constrained our estimation on 6 major cell type: alpha, beta, delta, gamma, acinar and ductal, which take up over 90% of the whole islet.

# Estimate cell type proportions
Est.prop.GSE50244 = music_prop(bulk.eset = GSE50244.bulk.eset, sc.eset = EMTAB.eset, clusters = 'cellType', 
                               samples = 'sampleID', select.ct = c('alpha', 'beta', 'delta', 'gamma', 'acinar', 'ductal'))

# Estimate via BSEQ-sc
B.EMTAB = bseqsc_basis(EMTAB.eset, pancreasMarkers, clusters = 'cellType', samples = 'sampleID', ct.scale = TRUE)
fit.EMTAB <- bseqsc_proportions(exprs(GSE50244.bulk.eset), B.EMTAB, verbose = TRUE)
Est.prop.bseq = t(coef(fit.EMTAB))

CIBERSORT estimation is abtained by using the design matrix provided by function music_Design.matrix and can be downloaded here.

load(gzcon(url('https://xuranw.github.io/MuSiC/data/GSE50244CIBERSORT.RData')))

# Jitter plot of estimated cell type proportions
m.prop.GSE50244 = rbind(melt(GSE50244.EMTAB.prop$Est.prop.weighted), 
                        melt(GSE50244.EMTAB.prop$Est.prop.allgene), melt(Est.prop.bseq),
                        melt(data.matrix(Est.prop.cibersort)))

colnames(m.prop.GSE50244) = c('Sub', 'CellType', 'Prop')
m.prop.GSE50244$CellType = factor(m.prop.GSE50244$CellType, levels = c('alpha', 'beta', 'delta', 'gamma', 'acinar', 'ductal'))
m.prop.GSE50244$Method = factor(rep(c('MuSiC', 'NNLS', 'BSEQ-sc', 'CIBERSORT'), each = 89*6), levels = c('MuSiC', 'NNLS', 'BSEQ-sc', 'CIBERSORT'))
m.prop.GSE50244$HbA1c = rep(GSE50244.bulk.eset$hba1c, 4*6)
m.prop.GSE50244 = m.prop.GSE50244[!is.na(m.prop.GSE50244$HbA1c), ]
m.prop.GSE50244$Disease = factor(c('H', 'NH')[(m.prop.GSE50244$HbA1c > 6.5)+1], levels = c('H', 'NH'))
m.prop.GSE50244$D = (m.prop.GSE50244$Disease == 'NH')/5
m.prop.GSE50244 = rbind(subset(m.prop.GSE50244, Disease == 'H'), subset(m.prop.GSE50244, Disease != 'H'))

ggplot(m.prop.GSE50244, aes(Method, Prop)) + 
  geom_point(aes(fill = Method, color = Disease, stroke = D, shape = Disease), size = 2, alpha = 0.7, position = position_jitter(width=0.25, height=0)) +
  facet_wrap(~ CellType, scales = 'free') + scale_colour_manual( values = c('white', "gray20")) +
  scale_shape_manual(values = c(21, 24))+ theme_minimal()

Constrain our analysis on beta cells

# Create dataframe for beta cell proportions and HbA1c levels
m.prop.ana = data.frame(Age = rep(pData(GSE50244.bulk.eset)$age, 4), BMI = rep(pData(GSE50244.bulk.eset)$bmi, 4), HbA1c = rep(pData(GSE50244.bulk.eset)$hba1c, 4),
                        Gender = rep(pData(GSE50244.bulk.eset)$gender, 4),
                        ct.prop = c(data.matrix(GSE50244.EMTAB.prop$Est.prop.weighted[, 2]), GSE50244.EMTAB.prop$Est.prop.allgene[, 2], Est.prop.bseq[, 2], Est.prop.cibersort[, 2]), 
                        Method = factor(rep( c('MuSiC', 'NNLS', 'BSEQ-sc', 'CIBERSORT'), each = 89), levels =  c('MuSiC', 'NNLS', 'BSEQ-sc', 'CIBERSORT')) )
m.prop.ana.h = subset(m.prop.ana, !is.na(HbA1c))
m.prop.ana.h$Disease = factor( c('H', 'NH')[(m.prop.ana.h$HbA1c > 6.5) + 1], c('H', 'NH') )
m.prop.ana.h$D = (m.prop.ana.h$Disease == 'NH')/5

ggplot(m.prop.ana.h, aes(HbA1c, ct.prop)) + geom_smooth(method = 'lm',  se = FALSE, col = 'black', lwd = 0.25) +
  geom_point(aes(fill = Method, color = Disease, stroke = D, shape = Disease), size = 2, alpha = 0.7) +  facet_wrap(~ Method) + 
  ggtitle('HbA1c vs Beta Cell Type Proportion') + theme_minimal() + scale_colour_manual( values = c('white', "gray20")) + 
  scale_shape_manual(values = c(21, 24))

MuSiC Estimation Example: Mouse/Rat Kidney

We reproduce here the analysis detailed in MuSiC manuscript:

Data Preparation

Bulk data

The dataset GEO entry (GSE81492) contains raw RNA-seq and sample annotation data. For the purpose of this vignette, we will use the read counts data Mousebulkeset.rds on the this page. Please download this file.

# Download Mouse bulk dataset from Github
Mouse.bulk.eset = readRDSFromWeb('https://xuranw.github.io/MuSiC/data/Mousebulkeset.rds')
Mouse.bulk.eset

#ExpressionSet (storageMode: lockedEnvironment)
#assayData: 19033 features, 10 samples 
#  element names: exprs 
#protocolData: none
#phenoData
#  sampleNames: control.NA.27 control.NA.30 ... APOL1.GNA78M (10 total)
#  varLabels: sampleID SubjectName
#  varMetadata: labelDescription
#featureData: none
#experimentData: use 'experimentData(object)'
#Annotation:

Single cell data

The single cell data are from GEO entry (GSE107585), which contrains read counts for 16273 genes of 43745 cells. A subset of the read counts Mousesubeset.rds are available on the this page, in the form of an ExpressionSet. This subset contains 16273 genes and 10000 cells. Please download this file.

# Download EMTAB single cell dataset from Github
Mousesub.eset = readRDSFromWeb('https://xuranw.github.io/MuSiC/data/Mousesubeset.rds')
Mousesub.eset

#ExpressionSet (storageMode: lockedEnvironment)
#assayData: 16273 features, 10000 samples 
#  element names: exprs 
#protocolData: none
#phenoData
#  sampleNames: TGGTTCCGTCGGCTCA-2 CGAGCCAAGCGTCAAG-4 ... GAGCAGAGTCAACATC-1 (10000 total)
#  varLabels: sampleID SubjectName cellTypeID cellType
#  varMetadata: labelDescription
#featureData: none
#experimentData: use 'experimentData(object)'
#Annotation:  

Notice that the single cell dataset has 16 cell types, including 2 novel cell types and a transition cell type. We exclude those 3

Estimation of cell type proportions

Cluster of Single Cell Data

In the estimation procedure, the first step is to produce design matrix, cross-subject mean of relative abundance, cross-subject variance of relative abundance and average library size from single cell reference. music_basis is the function to produce the design matrix (Disgn.mtx), cross-subject mean of relative abundance (M.theta), cross-subject variance of relative abundance (Sigma), and average library size (M.S).

MuSiC utilized all genes in the next estimation steps and this will lead to collineary of design matrix.

# Produce the first step information
Mousesub.basis = music_basis(Mousesub.eset, clusters = 'cellType', samples = 'sampleID', 
                             select.ct = c('Endo', 'Podo', 'PT', 'LOH', 'DCT', 'CD-PC', 'CD-IC', 'Fib', 'Macro', 'Neutro',
                                           'B lymph', 'T lymph', 'NK'))

# Plot the dendrogram of design matrix and cross-subject mean of realtive abundance
par(mfrow = c(1, 2))
d <- dist(t(log(Mousesub.basis$Disgn.mtx + 1e-6)), method = "euclidean")
# Hierarchical clustering using Complete Linkage
hc1 <- hclust(d, method = "complete" )
# Plot the obtained dendrogram
plot(hc1, cex = 0.6, hang = -1, main = 'Cluster log(Design Matrix)')
d <- dist(t(log(Mousesub.basis$M.theta + 1e-8)), method = "euclidean")
# Hierarchical clustering using Complete Linkage
# hc2 <- hclust(d, method = "complete" )
hc2 <- hclust(d, method = "complete")
# Plot the obtained dendrogram
plot(hc2, cex = 0.6, hang = -1, main = 'Cluster log(Mean of RA)')

The immune cells are clustered together and so are the epithelial cells. Notice that DCT and PT are close to each other. The cut-off is user determined. Here we cut 13 cell types into 4 groups:

MuSiC.cluster is an application of MuSiC to a hierarchical structure, which include 2 steps:

We manually specify the cluster and annotated single cell data with cluster information.

clusters.type = list(C1 = 'Neutro', C2 = 'Podo', C3 = c('Endo', 'CD-PC', 'LOH', 'CD-IC', 'DCT', 'PT'), C4 = c('Macro', 'Fib', 'B lymph', 'NK', 'T lymph'))

cl.type = as.character(Mousesub.eset$cellType)

for(cl in 1:length(clusters.type)){
  cl.type[cl.type %in% clusters.type[[cl]]] = names(clusters.type)[cl]
}
pData(Mousesub.eset)$clusterType = factor(cl.type, levels = c(names(clusters.type), 'CD-Trans', 'Novel1', 'Novel2'))

# 13 selected cell types
s.mouse = unlist(clusters.type)

We then select genes that differential expressed genes within cluster C3 (Epithelial cells) and C4 (Immune cells). Function music_prop.cluster is used for estimation with cluster.

load(gzcon(url('https://xuranw.github.io/MuSiC/data//IEmarkers.RData')))

Est.mouse.bulk = music_prop.cluster(bulk.eset = Mouse.bulk.eset, sc.eset = Mousesub.eset, group.markers = IEmarkers, clusters = 'cellType', groups = 'clusterType', samples = 'sampleID', clusters.type = clusters.type)

The results might be different from the one presented in the manuscript due to incomplete reference single cell dataset.

Reference



xuranw/MuSiC documentation built on March 7, 2024, 11:45 a.m.