In chencxxy28/MultiRD: Multi-robust deconvolution algorithm

knitr::opts_chunk$set(echo = TRUE)

MultiRD unifies multiple deconvolution schemes to infer cell type proportions from the target bulk RNA-seq data. Three unique features are embraced in this algorithm: first, MultiRD is able to incorporate extra biological information from external data sources enables (e.g., scRNA_seq and other bulk RNA_seq data from independent studies); second, MultiRD calibrates the reference-free algorithm by taking into account the proportion estimates from a reference-based approach; third, MultiRD is robust to incorrect information from any of the provided data sources.

Installation

if (!require("devtools")) {
  install.packages("devtools")
}
devtools::install_github("chencxxy28/MultiRD")

In this tutorial, we use SCDC (Dong et al. (2021)) for illustration to conduct reference-based deconvolution. The package can be installed by the code

if (!require("devtools")) {
  install.packages("devtools")
}
devtools::install_github("meichendong/SCDC")

#load
library(MultiRD)
library(SCDC)

Data Input

MultiRD deconvolutes raw read counts data which is in ExpressionSet objects based on the package Biobase. As an illustration of the MultiRD framework, we conduct both simulation studies and real data analysis based on pancreatic islet data from Fadista et al. (2014) as presented in our paper. Three public scRNA-seq datasets from Baron et al. (2016), Segerstolpe et al. (2016) and Xin et al. (2016) are used. For simulations, we have a pre-specified correct and incorrect marker genes; for real data application, we have an expression dataset from Bunt et al. (2015) and pre-selected list of marker genes. All processed data are available at the data download.

Simulation Studies

First, read in scRNA-seq data from our website

readRDSFromWeb <- function(ref) {
  readRDS(gzcon(url(ref)))
}
seger <- readRDSFromWeb("https://github.com/chencxxy28/MultiRD/raw/master/vignettes/data/segerstolpe.rds")
baron <- readRDSFromWeb("https://github.com/chencxxy28/MultiRD/raw/master/vignettes/data/baron.rds")
xin<-readRDSFromWeb("https://github.com/chencxxy28/MultiRD/raw/master/vignettes/data/Xin_nonD.rds")

Second, use function generateBulk() to create the pseudo bulk samples, where ct.varname specifies the name of cell type clustering result variable. sample specifies the name of subjects information variable. ct.sub specifies the names of cell types used to construct pseudo bulk samples. Here we provide an example where the pseudo bulk samples are generated from Segerstolpe et al. (2016).

set.seed(1234567)
pseudo.seger<-generateBulk(seger[["sc.eset.qc"]], ct.varname = "cluster", sample = "sample", ct.sub = c("alpha","beta","delta","gamma"), nbulk = 50, low_s = 0.3, upp_s = 0.7)

The generated pseudo bulk object contains a matrix of true cell type proportions (pseudo.seger$truep) and the ExpressionSet object (pseudo.seger$pseudo_eset).

MultiRD algorithm takes as input the target bulk data, subject-level cell type proportions from reference-based approach, and population-level cell type proportions. For illustration, we use the function SCDC_prop() from SCDC package (Dong et al. (2021)). We regard the estimated cell type proportions from Segerstolpe et al. (2016) as the correct subject-level cell type information (c_refbase_est) and the estimated ones with subject-label shuffle from Xin et al. (2016) as the incorrect information (w_refbase_est).

set.seed(1234567)
#correct subject-level proportions
correct_refest<-SCDC_prop(bulk.eset = pseudo.seger[["pseudo_eset"]], sc.eset = seger[["sc.eset.qc"]], ct.varname = "cluster", sample = "sample", weight.basis = T, ct.sub = c("alpha","beta","delta","gamma"))
c_refbase_est = correct_refest$prop.est.mvw

#incorrect subject-level proportions
incorrect_refest<-SCDC_prop(bulk.eset = pseudo.seger[["pseudo_eset"]], sc.eset = xin, ct.varname = "cluster",sample = "sample", weight.basis = T, ct.sub = c("alpha","beta","delta","gamma"))
w_refbase_est = incorrect_refest$prop.est.mvw
w_refbase_est = w_refbase_est[sample(1:nrow(w_refbase_est)),]

We provide the function evaluate to assess the performance of estimated proportions versus the true proportions based on mean absolute deviance (MAD, the smaller the better), concordance correlation coefficient (CCC, the larger the better), and Pearson correlation coefficient (Pearson, the larger the better).

evaluate(c_refbase_est,pseudo.seger$true_p)$all.eva
evaluate(w_refbase_est,pseudo.seger$true_p)$all.eva

We can obtain population-level mean proportions, with the one using Segerstolpe et al. (2016) as correct (c_ave_est) and the one using Xin et al. (2016) as incorrect (w_ave_est). The function pop.ct.prop.subj() is used to extract such information given subject-level proportions. We also provide the true mean proportions (true_ave) from the target pseudo bulk samples. Besides, MultiRD also requires a list of prespecified marker genes, which could be selected from external scRNA-seq data based on pre-defined expression threshold, prior biological knowledge, or information reported in literature. The list of correct markers (cmarkers.rds) and the list of wrong markers (wmarkers.rds) for this simulation study are available in files cmarkers.rds and wmarkers.rds at data download, respectively.

c_list_marker = readRDSFromWeb("https://github.com/chencxxy28/MultiRD/raw/master/vignettes/data/cmarkers.rds") #may write a function for selecting marker genes
w_list_marker = readRDSFromWeb("https://github.com/chencxxy28/MultiRD/raw/master/vignettes/data/wmarkers.rds") #may write a function for selecting marker genes
c_ave_est = pop.ct.prop.subj(c_refbase_est)
w_ave_est = pop.ct.prop.subj(w_refbase_est)
true_ave = pop.ct.prop.subj(pseudo.seger$true_p)

true_ave
c_ave_est
w_ave_est

Finally, we integrate all information into MultiRD algorithm to conduct deconvolution for the pseudo bulk data. For illustration, we focus on the case with wrong marker genes (W), wrong subject-level proportions (W), and correct population-level mean proportions (R). The function MultiRD.onegroup() will return a list containing estimated cell type proportions corresponding to each tuning value (est.prop) and a sequence of goodness-of-fit values corresponding to each tuning value (metrics). The smaller the better.

MultiRD.object<-MultiRD.onegroup(bulk.data=pseudo.seger[["pseudo_eset"]],list.marker=w_list_marker,celltype.unique=c("alpha","beta","delta","gamma"),subject.level.proportion=w_refbase_est,population.level.proportion=c_ave_est)

The function MultiRD.predict.props() helps extract the recovered cell type proportions via MultiRD; the function reffree.predict.prop() helps extract the estimated proportions via reference-free approach. We use the function evaluate to assess the performance of estimated proportions versus the true proportions based on MAD, CCC, and Pearson. We also provide the visualization via a dot plot using MAD and CCC. The following results show that our method (named MultiRD) has smaller MAD and higher CCC and Pearson than other methods even when misleading information is used.

MultiRD.est = MultiRD.predict.prop(MultiRD.output=MultiRD.object)
reffree.est = reffree.predict.prop(MultiRD.output=MultiRD.object)

evaluate(MultiRD.est,pseudo.seger$true_p)$all.eva
evaluate(reffree.est,pseudo.seger$true_p)$all.eva
evaluate(w_refbase_est,pseudo.seger$true_p)$all.eva

library(reshape2)
library(ggplot2)
data = as.data.frame(cbind(Methods= c("MultiRD","Ref-free","SCDC","MultiRD","Ref-free","SCDC","MultiRD","Ref-free","SCDC","MultiRD","Ref-free","SCDC"), Cell_Types=c("Alpha","Alpha","Alpha","Beta","Beta","Beta", "Delta","Delta","Delta","Gamma","Gamma","Gamma")))
MultiRD_results = evaluate(MultiRD.est,pseudo.seger$true_p)$cell.type.eva
reffree_results = evaluate(reffree.est,pseudo.seger$true_p)$cell.type.eva
SCDC_results = evaluate(w_refbase_est,pseudo.seger$true_p)$cell.type.eva
data$MAD = c(rbind(MultiRD_results$ct.mad,reffree_results$ct.mad,SCDC_results$ct.mad))
data$CCC = c(rbind(MultiRD_results$ct.ccc,reffree_results$ct.ccc,SCDC_results$ct.ccc))
ggplot(data, aes(x= Cell_Types, y=Methods, size=MAD, color=CCC, group= Cell_Types)) + geom_point(alpha = 0.8) + theme_classic() + scale_color_gradient(low = "yellow2",  high = "blue", space = "Lab", limit = c(-0.001, 0.5)) + scale_size(range = c(2, 10))+ ggtitle("WWR")+ theme(text = element_text(size=15),plot.title = element_text(hjust = 0.5))

Real Bulk RNA-seq Data

Now we move forward to analyze a bulk RNA-seq data generated from pancreatic islets by Fadista et al. (2014). We focus on 77 samples with 51 considered as healthy (hbA1C level no larger than 6) and 26 considered as having type 2 diabetes (T2D; hbA1C larger than 6). To obtain the subject-level cell type proportions, we adopted SCDC-ENSAMBLE approach, a SCDC-based method aimed to implement deconvolution via integrating multiple single-cell reference sets (Dong et al. (2021)). scRNA-seq data from Baron et al. (2016) and Segerstolpe et al. (2016) are considered. To allow the potentially different gene expression patterns between the cases and controls, we implemented ENSEMBLE procedures for the samples from the two classes separately. (may take 5-10 minutes to run)

fadista_77 <- readRDSFromWeb("https://github.com/chencxxy28/MultiRD/raw/master/vignettes/data/fadista_77.rds")
#obtain subject-level proportions
#for normal samples
index_normal<-which(fadista_77@phenoData@data$hba1c_class2=="Normal")
fadista.healthy.ens <- SCDC_ENSEMBLE(bulk.eset = fadista_77[,index_normal], sc.eset.list = list(baronh = baron$sc.eset.qc, segerh = seger$sc.eset.qc), ct.varname = "cluster", sample = "sample", truep = NULL, ct.sub =  c("alpha","beta","delta","gamma","acinar","ductal"), search.length = 0.01, grid.search = T)
prop_est_normal<-wt_prop(fadista.healthy.ens$w_table[7,1:2], proplist = fadista.healthy.ens$prop.only)

#for diseased samples
index_nnormal<-which(fadista_77@phenoData@data$hba1c_class2!="Normal")
fadista.t2d.ens <- SCDC_ENSEMBLE(bulk.eset = fadista_77[,index_nnormal], sc.eset.list = list(baronh = baron$sc.eset.qc, segerh = seger$sc.eset.qc), ct.varname = "cluster", sample = "sample", truep = NULL, ct.sub =  c("alpha","beta","delta","gamma","acinar","ductal"), search.length = 0.01, grid.search = T)
prop_est_t2d <- wt_prop(fadista.t2d.ens$w_table[7,1:2], proplist=fadista.t2d.ens$prop.only)

#combine two groups 
prop_est = rbind(prop_est_normal, prop_est_t2d)
prop_est = prop_est[match(colnames(fadista_77),rownames(prop_est)),]
comb_sample = prop_est

Next, to obtain the population-level mean proportions, we applied SCDC-ENSEMBLE again to the external bulk expression data from Bunt et al. (2015) at data download and then averaged subject-level proportions for each cell type (Not run).

Bunt <- readRDSFromWeb("https://github.com/chencxxy28/MultiRD/raw/master/vignettes/data/Bunt.rds")
colnames(Bunt@phenoData@data)[1]<-"sample"
#obtain population-level mean proportions
bunt.ens <- SCDC_ENSEMBLE(bulk.eset = Bunt, sc.eset.list = list(baronh = baron$sc.eset.qc, segerh = seger$sc.eset.qc), ct.varname = "cluster", sample = "sample", truep = NULL, ct.sub =  c("alpha","beta","delta","gamma","acinar","ductal"), search.length = 0.01, grid.search = T)
prop_est_bunt<-wt_prop(bunt.ens$w_table[7,1:2], proplist = bunt.ens$prop.only)
ave_est = pop.ct.prop.subj(prop_est_bunt)

The code above may take a long time to run. We have provided the estimated cell-type proportions, named prop_est_bunt.rds at data download. We can directly use the estimates to calculate population-level mean proportions.

prop_est_bunt = readRDSFromWeb("https://github.com/chencxxy28/MultiRD/raw/master/vignettes/data/prop_est_bunt.rds")
ave_est = pop.ct.prop.subj(prop_est_bunt)

Finally, given a list of pre-specified marker genes, named fad_list_marker.rds available at data download, we applied MultiRD algorithm to conduct deconvolution. To allow the potentially different gene expression patterns between the cases and controls, we implemented ENSEMBLE procedures for the samples from the two classes separately (may take 5-10 minutes to run).

#read the list of marker genes
list_marker = readRDSFromWeb('https://github.com/chencxxy28/MultiRD/raw/master/vignettes/data/fad_list_marker.rds')

#normal samples
MultiRD.object.normal = MultiRD.onegroup(bulk.data=fadista_77[,index_normal],list.marker=list_marker,celltype.unique=c("alpha","beta","delta","gamma","acinar","ductal"),subject.level.proportion=comb_sample[index_normal,],population.level.proportion=ave_est, tol=0.0001)
prop.est.normal.MultiRD = MultiRD.predict.prop(MultiRD.output=MultiRD.object.normal)
prop.est.normal.reffree = reffree.predict.prop(MultiRD.output=MultiRD.object.normal)

#diseased samples
MultiRD.object.nnormal = MultiRD.onegroup(bulk.data=fadista_77[,index_nnormal],list.marker=list_marker,celltype.unique=c("alpha","beta","delta","gamma","acinar","ductal"),subject.level.proportion=comb_sample[index_nnormal,],population.level.proportion=ave_est, tol=0.0001)
prop.est.nnormal.MultiRD = MultiRD.predict.prop(MultiRD.output=MultiRD.object.nnormal)
prop.est.nnormal.reffree = reffree.predict.prop(MultiRD.output=MultiRD.object.nnormal)

#combine two groups for MultiRD
MultiRD.est<-comb_sample
MultiRD.est[index_normal,]<-prop.est.normal.MultiRD
MultiRD.est[index_nnormal,]<-prop.est.nnormal.MultiRD

#combine two groups for reffree
reffree.est<-comb_sample
reffree.est[index_normal,]<-prop.est.normal.reffree
reffree.est[index_nnormal,]<-prop.est.nnormal.reffree

To compare the results from MultiRD to other methods, we visualize the estimated proportions as follows:

MultiRD.est = as.data.frame(MultiRD.est)
MultiRD.est$id = rownames(MultiRD.est)
reffree.est = as.data.frame(reffree.est)
reffree.est$id = rownames(reffree.est)
SCDC.est = comb_sample
SCDC.est = as.data.frame(SCDC.est)
SCDC.est$id = rownames(SCDC.est)
Groups = fadista_77@phenoData@data$hba1c_class2
Groups = as.data.frame(Groups)
Groups$id = rownames(MultiRD.est)

est_MultiRD<-melt(data = MultiRD.est, id.vars = c("id"))
est_scdc<-melt(data = SCDC.est, id.vars = c("id"))
est_reffree<-melt(data = reffree.est, id.vars = c("id"))
colnames(est_MultiRD)<-colnames(est_scdc)<- colnames(est_reffree)<-c("id", "Cell", "value" )
plot.data<-rbind(est_MultiRD,est_scdc,est_reffree)

plot.data$Methods<-c(rep("MultiRD",462),rep("SCDC",462), rep("Ref-free",462))
plot.data<-merge(plot.data,Groups,by="id")
plot.data$order<-ifelse(plot.data$Groups=="Normal",1,2)
plot.data<-plot.data[order(plot.data$order),]

ggplot(data=plot.data,aes(x=Methods, y=value)) + 
  geom_point(aes(fill = Methods, shape=Groups, color=Groups), size = 2, alpha = 0.8,position = position_jitter(width=0.25, height=0))+ 
  scale_colour_manual( values = c( 'grey80',"Black")) +
  scale_shape_manual(values = c(21,24))+
  facet_wrap(~ Cell, ncol = 2) +ylim(0,1)+
  ylab("Cell-type Proportions")+ theme_bw()+ 
  theme(text = element_text(size=15),legend.position="bottom")+ 
  scale_fill_discrete(name="Methods", guide = FALSE)

To evaluate the negative correlation between HbA1c levels and the beta cell functions, we constructed a generalized additive model using the estimated beta cell-type proportions as responses, some covariates (age, BMI, gender) as linear effects, and HbA1c as non-linear effect. The results from MultiRD are presented as follows.

library(mgcv)
library(reshape2)
library(ggplot2)
age<-fadista_77@phenoData@data[["age"]]
bmi<-fadista_77@phenoData@data[["bmi"]]
gender<-fadista_77@phenoData@data[["gender"]]
hba1c<-fadista_77@phenoData@data[["hba1c"]]
fit_gam<-gam(MultiRD.est[,'beta']~age+bmi+gender+s(hba1c),family=betar(link="logit"))
data_predict<-data.frame(age=0,bmi=0,gender="Female",hba1c=hba1c)

#plot
fit_plot<-gam(MultiRD.est[,'beta']~s(hba1c),family=betar(link="logit"))
Groups<-ifelse(hba1c<=6,"Normal","T2D")
data_plot<-data.frame(Cell_proportions=MultiRD.est[,'beta'],hbaic=fadista_77@phenoData@data[["hba1c"]],fitted=fitted(fit_plot))
nice_plot<-ggplot(data=data_plot,aes(x = hbaic, y = Cell_proportions))+ylim(0,0.7)+geom_point(size=1.8,aes(shape = Groups,color=Groups))+geom_line(aes(y = fitted), size = 1, color="darkgreen")+xlab("HbA1c")+ylab("Beta Cell Proportions")+ theme(text = element_text(size=15),legend.position = "none")
nice_plot