knitr::opts_chunk$set(echo = TRUE)
This guide will demonstrate the usage of bindSC to align scRNA-seq and protein data. The test dataset is from bone marrow (BM) samples measured using the CITE-seq technology(Stuart et al., Cell 2019 [https://www.cell.com/cell/fulltext/S0092-8674(19)30559-8]).
Remove cells that have no variation in initilized gene score matrix Z
library(bindSC) bm <- readRDS(file="../../data/cite_seq.RDS") summary(bm) dim(bm$RNA) dim(bm$Protein) dim(bm$Z) tp <- colSums(as.matrix(bm$Z)) cell_sel <- which(tp>2) N <- length(cell_sel) y <- bm$RNA[, cell_sel] x <- bm$Protein[, cell_sel] z0 <- bm$Z[,cell_sel] dim(x) dim(y) dim(z0)
Given the feature dimension of X and Z0 is very low (=24), we use orignal profiles for integration in this task.
There are two key parameters that may influence the integraton results: 1) lambda
the modality weighting factor; 2) alpha
the weighting factor of initilized gene score matrix. The parameter optimization
step will run bindS alignment with 0<lambda<1
and 0<alpha<1
. This may take a long time if the cell number is large. You can use use the downsampling procedure to run this step. We found that the defualt settings with lambda = 0.5 and alpha = 0.5 work well on most integration tasks. The parameter optimization
step could be skipped if you want to use the default settings.
To run the parameter optimization
step, you need to prepare for the cluster annotaions for each dataset.
x.clst <- bm$meta$celltype2[cell_sel] y.clst <- bm$meta$celltype2[cell_sel] run = FALSE if(run){ down_sample <- seq(1,N,10) paraSel <- BiCCA_para_opt( X = x[, down_sample] , Y = y$dt1[, down_sample], Z0 =z0[, down_sample], tp="out", X.clst = x.clst[down_sample], Y.clst = y.clst[down_sample], alpha.lst = seq(0,1,0.1), K.lst = c(10), lambda.lst = seq(0,1,0.1), num.iteration = 50, tolerance = 0.01, save = TRUE, block.size = 1000 ) p1 <- paraSel_plot(paraSel) p1 }
We use lambda = 0.7 and alpha = 0.1 (alpah<0.3 has the best peformance). The iteration process will be done in ~5 mins .
res <- BiCCA( X = x , Y = y, Z0 =z0, X.clst = x.clst, Y.clst = y.clst, alpha = 0.1, lambda = 0.7, K = 15, temp.path = "out", num.iteration = 50, tolerance = 0.01, save = TRUE, parameter.optimize = FALSE, block.size = 0)
You can check the iteration process using plotIteration
function (optional). The first row shows objective function costs for three terms separately. The fourth figure shows the total objective function cost over the iteration time. The fifth figure shows the relative change of gene score matrix Z over the iteration time.
p2 <- plotIteration(res) p2 tiff("iteration.cost.tiff", width=12, height =6, res =300, units = "in") print(p2) dev.off()
res$u
and res$r
denote coordinates of two modalites in co-embeddings.celltype <- as.character(bm$meta$celltype2[cell_sel]) umap_plt <- umap(rbind(res$u, res$r)) umap_plt <- data.frame("UMAP1"=umap_plt$layout[,1], "UMAP2"=umap_plt$layout[,2], "celltype" = c(celltype, celltype), "data" = c(rep("protein",length(celltype)), rep("RNA",length(celltype)))) xlim <- c(min(umap_plt$UMAP1), max(umap_plt$UMAP1)) ylim <- c(min(umap_plt$UMAP2), max(umap_plt$UMAP2)) p11 <- UMAP_plot(meta = umap_plt[umap_plt$data=="protein",], color = "celltype", xlim = xlim, ylim = ylim, mylabel = paletteDiscrete(umap_plt$celltype) ) p12 <- UMAP_plot(meta = umap_plt[umap_plt$data=="RNA",], color = "celltype", xlim = xlim, ylim = ylim, mylabel=paletteDiscrete(umap_plt$celltype)) p11 <- p11 + ggtitle("Protein") p12 <- p12 + ggtitle("RNA") p3 <- ggarrange(p11,p12, ncol=2) print(p3)
Given the cell correspondence between two datasets is known, we can validate the imputation accuracy of the gene score matrix Z derived from bindSC.
p5 <- plot_geneScoreChange(X=x, Z0 = z0, Z_impu = res$Z_est) p5 tiff("matZ.improve.tiff", width=9, height =4, res =300, units = "in", compression = "lzw") print(p5) dev.off() p5
Next, we shos the imputation accuracy after bindSC for each protein marker. bindSC increases the accuracy of expression profile for all protein markers. The imputed gene expresion profile could be used to explore interaction between gene expresion and protein level.
cor1 <- diag(cor(as.matrix(t(x)),as.matrix(t(z0)), method = "pearson")) cor2 <- diag(cor(as.matrix(t(x)),as.matrix(t(res$Z_est)), method = "pearson")) plt_dt <- data.frame("protein"=names(cor1), "cor1"=cor1, "cor2"=cor2) p51 <- ggscatter(plt_dt, x="cor1", y = "cor2", color = "protein", label="protein", repel =TRUE) + xlab("Accuracy of initilized Z") + ylab("Accuracy of imputed Z") + theme(legend.position = "none") + xlim(0,1) + ylim(0,1) + geom_abline(intercept = 0, slope = 1, linetype="dashed") p51 tiff("imputAccuracy.improve.tiff", width=4, height =4, res =300, units = "in",compression = "lzw") print(p51) dev.off()
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.