knitr::opts_chunk$set(echo = TRUE)

This guide will demonstrate the usage of bindSC to align scRNA-seq and protein data. The test dataset is from bone marrow (BM) samples measured using the CITE-seq technology(Stuart et al., Cell 2019 [https://www.cell.com/cell/fulltext/S0092-8674(19)30559-8]).

Stage 1: Load the scRNA-seq and protein data

Remove cells that have no variation in initilized gene score matrix Z

library(bindSC)

bm <- readRDS(file="../../data/cite_seq.RDS")
summary(bm)
dim(bm$RNA)
dim(bm$Protein)
dim(bm$Z)
tp <- colSums(as.matrix(bm$Z))
cell_sel <- which(tp>2)
N <- length(cell_sel)
y <- bm$RNA[, cell_sel]
x <- bm$Protein[, cell_sel]
z0 <- bm$Z[,cell_sel]
dim(x)
dim(y)
dim(z0)

Given the feature dimension of X and Z0 is very low (=24), we use orignal profiles for integration in this task.

Stage 2: Parameter optimization [optional]

There are two key parameters that may influence the integraton results: 1) lambda the modality weighting factor; 2) alpha the weighting factor of initilized gene score matrix. The parameter optimization step will run bindS alignment with 0<lambda<1 and 0<alpha<1. This may take a long time if the cell number is large. You can use use the downsampling procedure to run this step. We found that the defualt settings with lambda = 0.5 and alpha = 0.5 work well on most integration tasks. The parameter optimization step could be skipped if you want to use the default settings.

To run the parameter optimization step, you need to prepare for the cluster annotaions for each dataset.

x.clst <- bm$meta$celltype2[cell_sel]
y.clst <- bm$meta$celltype2[cell_sel]
run = FALSE
if(run){
  down_sample <- seq(1,N,10)
  paraSel <- BiCCA_para_opt( X = x[, down_sample] ,
               Y = y$dt1[, down_sample], 
               Z0 =z0[, down_sample], 
               tp="out",
               X.clst = x.clst[down_sample],
               Y.clst = y.clst[down_sample],
               alpha.lst = seq(0,1,0.1), 
               K.lst = c(10),
               lambda.lst = seq(0,1,0.1),
               num.iteration = 50,
               tolerance = 0.01,
               save = TRUE,
               block.size = 1000
  )
  p1 <- paraSel_plot(paraSel)
  p1
}

Stage 4: Run bindSC

We use lambda = 0.7 and alpha = 0.1 (alpah<0.3 has the best peformance). The iteration process will be done in ~5 mins .

res <- BiCCA( X = x ,
             Y =  y, 
             Z0 =z0, 
             X.clst = x.clst,
             Y.clst = y.clst,
             alpha = 0.1, 
             lambda = 0.7,
             K = 15,
             temp.path  = "out",
             num.iteration = 50,
             tolerance = 0.01,
             save = TRUE,
             parameter.optimize = FALSE,
             block.size = 0)

You can check the iteration process using plotIteration function (optional). The first row shows objective function costs for three terms separately. The fourth figure shows the total objective function cost over the iteration time. The fifth figure shows the relative change of gene score matrix Z over the iteration time.

p2 <- plotIteration(res)
p2

tiff("iteration.cost.tiff", width=12, height =6, res =300, units = "in")
print(p2)
dev.off()

Stage 4: Project cells from two modalites in the co-embedding space. res$u and res$r denote coordinates of two modalites in co-embeddings.

celltype <- as.character(bm$meta$celltype2[cell_sel])
umap_plt <- umap(rbind(res$u, res$r))
umap_plt  <- data.frame("UMAP1"=umap_plt$layout[,1],
                        "UMAP2"=umap_plt$layout[,2],
                        "celltype" = c(celltype, celltype),
                        "data" = c(rep("protein",length(celltype)),
                                   rep("RNA",length(celltype))))

xlim <- c(min(umap_plt$UMAP1), max(umap_plt$UMAP1))
ylim <- c(min(umap_plt$UMAP2), max(umap_plt$UMAP2))

p11 <- UMAP_plot(meta = umap_plt[umap_plt$data=="protein",], 
                color = "celltype", xlim = xlim, ylim = ylim, mylabel = paletteDiscrete(umap_plt$celltype) )
p12 <- UMAP_plot(meta = umap_plt[umap_plt$data=="RNA",], 
              color = "celltype", xlim = xlim, ylim = ylim, mylabel=paletteDiscrete(umap_plt$celltype))

p11 <- p11 + ggtitle("Protein")
p12 <- p12 + ggtitle("RNA")

p3 <- ggarrange(p11,p12, ncol=2)
print(p3)

Given the cell correspondence between two datasets is known, we can validate the imputation accuracy of the gene score matrix Z derived from bindSC.

p5 <- plot_geneScoreChange(X=x, Z0 = z0, Z_impu = res$Z_est)

p5


tiff("matZ.improve.tiff", width=9, height =4, res =300, units = "in", compression = "lzw")
print(p5)
dev.off()
p5

Next, we shos the imputation accuracy after bindSC for each protein marker. bindSC increases the accuracy of expression profile for all protein markers. The imputed gene expresion profile could be used to explore interaction between gene expresion and protein level.

cor1 <- diag(cor(as.matrix(t(x)),as.matrix(t(z0)), method = "pearson"))
cor2 <- diag(cor(as.matrix(t(x)),as.matrix(t(res$Z_est)), method = "pearson"))
plt_dt <- data.frame("protein"=names(cor1),
                     "cor1"=cor1, "cor2"=cor2)
p51 <- ggscatter(plt_dt, x="cor1", y = "cor2", color = "protein", 
                label="protein", repel =TRUE)  + 
  xlab("Accuracy of initilized Z") + ylab("Accuracy of imputed Z")  + 
  theme(legend.position = "none") + xlim(0,1) + ylim(0,1) + 
  geom_abline(intercept = 0, slope = 1, linetype="dashed")
 p51



tiff("imputAccuracy.improve.tiff", width=4, height =4, res =300, units = "in",compression = "lzw")
print(p51)
dev.off()
sessionInfo()


KChen-lab/bindSC documentation built on Sept. 29, 2022, 4:24 a.m.