knitr::opts_chunk$set(echo = TRUE, cache = TRUE)

This guide will demonstrate the usage of bindSC to jointly define cell types by leveraging mulitple single-cell modalities. Here we define cell types that incorporate both gene expression and chromatin accessibility data. The dataset is generated from the sci-CAR technology (Cao et al., 2018, science).

Stage 1: Load the sciRNA-seq and sciATAC-seq data

For convenience, we have prepared the pre-processed data which are ready to use. User can refer to A549_preprocess.ATAC.html and A549_preprocess.RNA.html for the details of running the pre-processing workflow (It will take 10 ~ 20 mins). For RNA data, there are 6,005 cells measured on 4,759 genes (A549_RNA$X). For ATAC data, there are 3,628 cells measured on 13,515 genes (gene activity estimated from peak; A549_ATAC$Z0) and 24,953 peaks (A549_ATAC$Y). User can download the preprocessed sci-RNA data from https://drive.google.com/file/d/1tjMA9yERXYBx-GKKrGORKF1-4Zs9xScM/view?usp=sharing

The preprocessed sci-ATAC data from https://drive.google.com/file/d/1REKXV-TTm1rkWEyX82Q1R6dOtE-dPsza/view?usp=sharing

library(bindSC)

A549_RNA <- readRDS("../../data/A549_rna.rds")
A549_ATAC <- readRDS("../../data/A549_atac.rds")

summary(A549_RNA)
summary(A549_ATAC)
dim(A549_RNA$X)
dim(A549_ATAC$Z0)
dim(A549_ATAC$Y)

We then visualize each cell, colored by cell type, from two technologies. Left is from scRNA-seq and right is from scATAC-seq. For both technologies, cells from 0 h and 1/3 h can be well separated in 2D-UMAP.

library(ggpubr)
p1<-ggscatter(A549_RNA$RNA_meta, x = "UMAP_1", y = "UMAP_2",
   color = "cell_type", palette = c("darkseagreen4","lightpink","darkorchid1"), 
   repel = FALSE,size=0.5, alpha=0.5,legend.title = "", title="scRNA-seq", font.title=16) 

p2<-ggscatter(A549_ATAC$ATAC_meta, x = "UMAP_1", y = "UMAP_2",
   color = "group", palette = c("darkseagreen4","lightpink","darkorchid1"), 
   repel = FALSE,size=0.5, alpha=0.5,legend.title = "", title="scATAC-seq", font.title=16) 

p1<-p1+rremove("axis") + rremove("ticks") + rremove("xylab")+ rremove("axis.text") 
p2<-p2+rremove("axis") + rremove("ticks") + rremove("xylab")+ rremove("axis.text")

p<-ggarrange(p1, p2, nrow = 1, common.legend = TRUE, legend = "right")
p

Stage 2: Preparation of bindSC input

We next perform bindSC alignment on A549 data. In generall, bindSC requires three matrices as input:

The gene activity matrix Z0 is estimated by counting the total number of ATAC-seq reads within the gene body/promoter region of each gene in each cell.

X <- A549_RNA$X
Y <- A549_ATAC$Y
Z0 <- A549_ATAC$Z0


treatmentTime <- c(A549_RNA$RNA_meta$treatment_time, A549_ATAC$ATAC_meta$group)
treatmentTime[treatmentTime=="A549_0h"] <- 0
treatmentTime[treatmentTime=="A549_1h"] <- 1
treatmentTime[treatmentTime=="A549_3h"] <- 3

type <- c(rep("scRNA-seq",nrow(A549_RNA$RNA_meta)), rep("scATAC-seq",nrow(A549_ATAC$ATAC_meta)))

Make sure X and Z0 have matched gene features, Y and Z0 have matched cell names.

gene.overlap <- intersect(rownames(X), rownames(Z0))
cell.overlap <- intersect(colnames(Y), colnames(Z0))

X <- as.matrix(X[gene.overlap,])
Z0 <- as.matrix(Z0[gene.overlap, cell.overlap])
Y <- as.matrix(Y[,cell.overlap])

Generally, you can run bindSC using above inputs. However, the feature dimension is high for single cell epigenetic profiles (for example, >100k peaks for scATAC-seq data). Here we use low-dimension representations rather than the orignal matrixs for bindSC alignment. You should perform dimension reductions (SVD) on X and Z0 jointly.

out <- dimReduce( dt1 =  X, dt2 = Z0,  K = 30)
x <- out$dt1
z0 <- out$dt2
y  <- dimReduce(dt1 = Y, K=30)

Stage 3: Parameter optimization [optional]

There are two key parameters that may influence the integraton results: 1) lambda modality weighting factor; 2) alpha weighting factor of initilized gene score matrix. The parameter optimization step will run bindS alignment with 0<lambda<1 and 0<alpha<1. This may take a long time if the sampe size is large.We found the defualt settings with lambda = 0.5 and alpha = 0.5 work well on most integration tasks. You can skip the parameter optimization step if you want to use the default settings.

To run the parameter optimization step, you need to prepare for the cluster annotaion information for each dataset.

X.clst <- treatmentTime[type=="scRNA-seq"]
Y.clst <- treatmentTime[type=="scATAC-seq"]

# Given cells from 1/3hs are not separatable in A549 dataset, we only consider pre/post treatments
x.clst <- X.clst 
y.clst <- Y.clst
x.clst[x.clst>0] <- 1
y.clst[y.clst>0] <- 1
run <- FALSE
if(run){
  paraSel <- BiCCA_para_opt( X = t(x) ,
               Y = t(y), 
               Z0 =t(z0), 
               X.clst = x.clst,
               Y.clst = y.clst,
               alpha.lst= seq(0,1,0.1), 
               K.lst = c(5),
               lambda.lst = seq(0,1,0.1),
               num.iteration = 50,
               tolerance = 0.01,
               save = TRUE,
               block.size = 1000
  )
}

Use paraSel_plot function to show three integration metrics change with alpha and lambda. Overall, the mixing alignment metric (<0.02 difference) is robust to alpha and lambda and silhouette score is very low when alpha=1 and lambda=1.

run <- FALSE
if(run){
  p1 <- paraSel_plot(paraSel)
  p1

  tiff("paraSel.tiff", width=12, height =3, res =300, units = "in")
  print(p1)
  dev.off()
}

Stage 4: Run bindSC

We use defualt settings with lambda = 0.5 and alpha = 0.1. The iteration process will be done in ~10s.

res <- BiCCA( X = t(x) ,
             Y = t(y), 
             Z0 =t(z0), 
             X.clst = x.clst,
             Y.clst = y.clst,
             alpha = 0.1, 
             lambda = 0.5,
             K = 5,
             temp.path  = "out",
             num.iteration = 50,
             tolerance = 0.01,
             save = TRUE,
             parameter.optimize = FALSE,
             block.size = 0)

Check the iteration process using plotIteration function (optional). The first row shows the objective function costs for 3 terms separately. The fourth figure shows the total objective function over iteration time. The fifth figure figure shows the relative change of gene score matrix Z over iteration time.

p2 <- plotIteration(res)
p2

tiff("iteration.cost.tiff", width=12, height =6, res =300, units = "in")
print(p2)
dev.off()

Stage 4: Project cells from two modalites in co-embedding spaces. res$u and res$r denote the coordinates of two modalites in co-embeddings.

library(umap)
dim(res$u)
dim(res$r)
umap_plt <- umap(rbind(res$u, res$r))
umap_plt  <- data.frame("UMAP1"=umap_plt$layout[,1],
                        "UMAP2"=umap_plt$layout[,2],
                        "celltype" = treatmentTime,
                        "data" = c(rep("scRNA-seq",ncol(X)),
                                   rep("scATAC-seq",ncol(Y))))

xlim <- c(min(umap_plt$UMAP1), max(umap_plt$UMAP1))
ylim <- c(min(umap_plt$UMAP2), max(umap_plt$UMAP2))

p11 <- UMAP_plot(meta = umap_plt[umap_plt$data=="scRNA-seq",],  alpha=1,
                color = "celltype", xlim = xlim, ylim = ylim, mylabel = paletteDiscrete(umap_plt$celltype) ) + ggtitle("scRNA-seq")
p12 <- UMAP_plot(meta = umap_plt[umap_plt$data=="scATAC-seq",],  alpha=1,
              color = "celltype", xlim = xlim, ylim = ylim,mylabel = paletteDiscrete(umap_plt$celltype) )  + ggtitle("scATAC-seq")

p3 <- ggarrange(p11,p12, ncol=2)
print(p3)

Stage 5: Impute gene expression profiles for cells from scATAC.

We can also show how bindSC improves the gene score matrix after iteration process using co-assayed cells as the validation. Given most of genes are not associated with DEX-treatment, the overall correlation is stilllower than 0.1.

Z_impu <- impuZ(X=X, bicca = res)
colnames(Z_impu) <- colnames(Z0)

coassay_cell <- intersect(colnames(X), colnames(Y))
p5 <- plot_geneScoreChange(X=X[,coassay_cell], Z0 = Z0[,coassay_cell], Z_impu = Z_impu[,coassay_cell])


tiff("matZ.improve.tiff", width=9, height =4, res =300, units = "in", compression = "lzw")
print(p5)
dev.off()
p5
sessionInfo()


KChen-lab/bindSC documentation built on Sept. 29, 2022, 4:24 a.m.