knitr::opts_chunk$set(echo = TRUE, cache = TRUE)
This guide will demonstrate the usage of bindSC to jointly define cell types by leveraging mulitple single-cell modalities. Here we define cell types that incorporate both gene expression and chromatin accessibility data. The dataset is generated from the sci-CAR technology (Cao et al., 2018, science).
For convenience, we have prepared the pre-processed data which are ready to use.
User can refer to A549_preprocess.ATAC.html
and A549_preprocess.RNA.html
for the details of running the pre-processing workflow (It will take 10 ~ 20 mins). For RNA data, there are 6,005 cells measured on 4,759 genes (A549_RNA$X
). For ATAC data, there are 3,628 cells measured on 13,515 genes (gene activity estimated from peak; A549_ATAC$Z0
) and 24,953 peaks (A549_ATAC$Y
). User can download the preprocessed sci-RNA data from
https://drive.google.com/file/d/1tjMA9yERXYBx-GKKrGORKF1-4Zs9xScM/view?usp=sharing
The preprocessed sci-ATAC data from https://drive.google.com/file/d/1REKXV-TTm1rkWEyX82Q1R6dOtE-dPsza/view?usp=sharing
library(bindSC) A549_RNA <- readRDS("../../data/A549_rna.rds") A549_ATAC <- readRDS("../../data/A549_atac.rds") summary(A549_RNA) summary(A549_ATAC) dim(A549_RNA$X) dim(A549_ATAC$Z0) dim(A549_ATAC$Y)
We then visualize each cell, colored by cell type, from two technologies. Left is from scRNA-seq and right is from scATAC-seq. For both technologies, cells from 0 h and 1/3 h can be well separated in 2D-UMAP.
library(ggpubr) p1<-ggscatter(A549_RNA$RNA_meta, x = "UMAP_1", y = "UMAP_2", color = "cell_type", palette = c("darkseagreen4","lightpink","darkorchid1"), repel = FALSE,size=0.5, alpha=0.5,legend.title = "", title="scRNA-seq", font.title=16) p2<-ggscatter(A549_ATAC$ATAC_meta, x = "UMAP_1", y = "UMAP_2", color = "group", palette = c("darkseagreen4","lightpink","darkorchid1"), repel = FALSE,size=0.5, alpha=0.5,legend.title = "", title="scATAC-seq", font.title=16) p1<-p1+rremove("axis") + rremove("ticks") + rremove("xylab")+ rremove("axis.text") p2<-p2+rremove("axis") + rremove("ticks") + rremove("xylab")+ rremove("axis.text") p<-ggarrange(p1, p2, nrow = 1, common.legend = TRUE, legend = "right") p
We next perform bindSC alignment on A549 data. In generall, bindSC requires three matrices as input:
X
: gene expression matrix from scRNA-seq data Y
: chromatin accessibility matrix from scATAC-seq data Z0
: initilized gene activity matrix from scATAC-seq data The gene activity matrix Z0
is estimated by counting the total number of ATAC-seq reads within the gene body/promoter region of each gene in each cell.
X <- A549_RNA$X Y <- A549_ATAC$Y Z0 <- A549_ATAC$Z0 treatmentTime <- c(A549_RNA$RNA_meta$treatment_time, A549_ATAC$ATAC_meta$group) treatmentTime[treatmentTime=="A549_0h"] <- 0 treatmentTime[treatmentTime=="A549_1h"] <- 1 treatmentTime[treatmentTime=="A549_3h"] <- 3 type <- c(rep("scRNA-seq",nrow(A549_RNA$RNA_meta)), rep("scATAC-seq",nrow(A549_ATAC$ATAC_meta)))
Make sure X
and Z0
have matched gene features, Y
and Z0
have matched cell names.
gene.overlap <- intersect(rownames(X), rownames(Z0)) cell.overlap <- intersect(colnames(Y), colnames(Z0)) X <- as.matrix(X[gene.overlap,]) Z0 <- as.matrix(Z0[gene.overlap, cell.overlap]) Y <- as.matrix(Y[,cell.overlap])
Generally, you can run bindSC using above inputs. However, the feature dimension is high for single cell epigenetic profiles (for example, >100k peaks for scATAC-seq data). Here we use low-dimension representations rather than the orignal matrixs for bindSC alignment. You should perform dimension reductions (SVD) on X
and Z0
jointly.
out <- dimReduce( dt1 = X, dt2 = Z0, K = 30) x <- out$dt1 z0 <- out$dt2 y <- dimReduce(dt1 = Y, K=30)
There are two key parameters that may influence the integraton results: 1) lambda
modality weighting factor; 2) alpha
weighting factor of initilized gene score matrix. The parameter optimization
step will run bindS alignment with 0<lambda<1
and 0<alpha<1
. This may take a long time if the sampe size is large.We found the defualt settings with lambda = 0.5 and alpha = 0.5 work well on most integration tasks. You can skip the parameter optimization
step if you want to use the default settings.
To run the parameter optimization
step, you need to prepare for the cluster annotaion information for each dataset.
X.clst <- treatmentTime[type=="scRNA-seq"] Y.clst <- treatmentTime[type=="scATAC-seq"] # Given cells from 1/3hs are not separatable in A549 dataset, we only consider pre/post treatments x.clst <- X.clst y.clst <- Y.clst x.clst[x.clst>0] <- 1 y.clst[y.clst>0] <- 1 run <- FALSE if(run){ paraSel <- BiCCA_para_opt( X = t(x) , Y = t(y), Z0 =t(z0), X.clst = x.clst, Y.clst = y.clst, alpha.lst= seq(0,1,0.1), K.lst = c(5), lambda.lst = seq(0,1,0.1), num.iteration = 50, tolerance = 0.01, save = TRUE, block.size = 1000 ) }
Use paraSel_plot
function to show three integration metrics change with alpha
and lambda
.
Overall, the mixing alignment metric (<0.02 difference) is robust to alpha
and lambda
and silhouette score is very low when alpha=1
and lambda=1
.
run <- FALSE if(run){ p1 <- paraSel_plot(paraSel) p1 tiff("paraSel.tiff", width=12, height =3, res =300, units = "in") print(p1) dev.off() }
We use defualt settings with lambda = 0.5 and alpha = 0.1. The iteration process will be done in ~10s.
res <- BiCCA( X = t(x) , Y = t(y), Z0 =t(z0), X.clst = x.clst, Y.clst = y.clst, alpha = 0.1, lambda = 0.5, K = 5, temp.path = "out", num.iteration = 50, tolerance = 0.01, save = TRUE, parameter.optimize = FALSE, block.size = 0)
Check the iteration process using plotIteration
function (optional). The first row shows the objective function costs for 3 terms separately. The fourth figure shows the total objective function over iteration time. The fifth figure figure shows the relative change of gene score matrix Z over iteration time.
p2 <- plotIteration(res) p2 tiff("iteration.cost.tiff", width=12, height =6, res =300, units = "in") print(p2) dev.off()
res$u
and res$r
denote the coordinates of two modalites in co-embeddings.library(umap) dim(res$u) dim(res$r) umap_plt <- umap(rbind(res$u, res$r)) umap_plt <- data.frame("UMAP1"=umap_plt$layout[,1], "UMAP2"=umap_plt$layout[,2], "celltype" = treatmentTime, "data" = c(rep("scRNA-seq",ncol(X)), rep("scATAC-seq",ncol(Y)))) xlim <- c(min(umap_plt$UMAP1), max(umap_plt$UMAP1)) ylim <- c(min(umap_plt$UMAP2), max(umap_plt$UMAP2)) p11 <- UMAP_plot(meta = umap_plt[umap_plt$data=="scRNA-seq",], alpha=1, color = "celltype", xlim = xlim, ylim = ylim, mylabel = paletteDiscrete(umap_plt$celltype) ) + ggtitle("scRNA-seq") p12 <- UMAP_plot(meta = umap_plt[umap_plt$data=="scATAC-seq",], alpha=1, color = "celltype", xlim = xlim, ylim = ylim,mylabel = paletteDiscrete(umap_plt$celltype) ) + ggtitle("scATAC-seq") p3 <- ggarrange(p11,p12, ncol=2) print(p3)
We can also show how bindSC improves the gene score matrix after iteration process using co-assayed cells as the validation. Given most of genes are not associated with DEX-treatment, the overall correlation is stilllower than 0.1.
Z_impu <- impuZ(X=X, bicca = res) colnames(Z_impu) <- colnames(Z0) coassay_cell <- intersect(colnames(X), colnames(Y)) p5 <- plot_geneScoreChange(X=X[,coassay_cell], Z0 = Z0[,coassay_cell], Z_impu = Z_impu[,coassay_cell]) tiff("matZ.improve.tiff", width=9, height =4, res =300, units = "in", compression = "lzw") print(p5) dev.off() p5
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.