knitr::opts_chunk$set(
  echo = TRUE,
  collapse = TRUE,
  comment = "#>"
)

This guide will demonstrate the usage of DCCA to jointly define cell types by leveraging mulitple single-cell modalities. Here we integrated scRNA-seq and scATAC-seq to define cell types that incorporate both gene expression and chromatin accessibility data.

Stage 1: Load the scRNA-seq and scATAC-seq data

For convenience, we have prepared the pre-processed data which are ready to use. User can refer to A549_preprocess.ATAC.html and A549_preprocess.RNA.html for the details of running the pre-processing workflow (It will take 10 ~ 20 mins).

library("DCCA")
library("bigstatsr")
library("progress")
library("irlba")
library("ggpubr")
library("ggplot2")
library("umap")

A549_RNA <- readRDS("./A549_rna.rds")
A549_ATAC <- readRDS("./A549_atac.rds")

summary(A549_RNA)
summary(A549_ATAC)

We then visualize each cell, colored by cell type, from two technologies. Left is from scRNA-seq and right is from scATAC-seq. For both technologies, cells from 0 h and 1/3 h can be well separated in 2d-UMAP.

p1<-ggscatter(A549_RNA$RNA_meta, x = "UMAP_1", y = "UMAP_2",
   color = "cell_type", palette = c("darkseagreen4","lightpink","darkorchid1"), 
   repel = FALSE,size=0.5, alpha=0.5,legend.title = "", title="scRNA-seq", font.title=16) 

p2<-ggscatter(A549_ATAC$ATAC_meta, x = "UMAP_1", y = "UMAP_2",
   color = "group", palette = c("darkseagreen4","lightpink","darkorchid1"), 
   repel = FALSE,size=0.5, alpha=0.5,legend.title = "", title="scATAC-seq", font.title=16) 

p1<-p1+rremove("axis") + rremove("ticks") + rremove("xylab")+ rremove("axis.text") 
p2<-p2+rremove("axis") + rremove("ticks") + rremove("xylab")+ rremove("axis.text")

p<-ggarrange(p1, p2, nrow = 1, common.legend = TRUE, legend = "right")
p

Stage 2: Run DCCA

We next perform DCCA on A549 data. DCCA requires three matrices as input: - X : gene expression matrix from scRNA-seq data - Y : peak matrix from scATAC-seq data - Z0 : gene activity matrix from scATAC-seq data The gene activity matrix Z0 can be estimated by counting the total number of ATAC-seq reads within the gene body/promoter region of each gene in each cell. More details can be seen in A549_preprocess.ATAC.html

X <- A549_RNA$X
Y <- A549_ATAC$Y
Z0 <- A549_ATAC$Z0
dim(X)
dim(Y)
dim(Z0)

Make sure X and Z0 have matched gene features, Y and Z0 have matched cell names.

gene.overlap <- intersect(rownames(X), rownames(Z0))
cell.overlap <- intersect(colnames(Y), colnames(Z0))

X <- as.matrix(X[gene.overlap,])
Z0 <- as.matrix(Z0[gene.overlap, cell.overlap])
Y <- as.matrix(Y[,cell.overlap])

Key parameters to run DCCA:

This process will take 30 mins with maximal memory usage being 15G.

source("/home/jdou1/project/integraModel/dev_5/src/DCCA/R/generics.R")
source("/home/jdou1/project/integraModel/dev_5/src/DCCA/R/preProcessing.R")
source("/home/jdou1/project/integraModel/dev_5/src/DCCA/R/DCCA.R")

out <- DCCA(X=X, Z0=Z0, Y=Y, 
       num.D = 5, num.E = 5, 
       num.iteration = 100, 
       temp.path= "./tp",
       tolerance = 0.05, 
       save = TRUE, 
       bigMemory = TRUE, block.size = 1000)

Stage 3: Visualization of cells from two data types in latent space

DCCA returns the list with five matrices:

DCCA also returns the relative change for Z, - delta: relative change ratio for Z during iteration

Cell coordinates for two datasets in latent space are often sufficient to define joint clusters that correpond across datasets. We used UMAP to visualize the cells in latent space colored by cell type (left figure) and by technology (right figure).

source("./eval_plot.r")
tp <- readRDS(file="./tp/iteration5.rds")
cell_type <- c(A549_RNA$RNA_meta$cell_type, A549_ATAC$ATAC_meta$group)
data_type <- c(rep("scRNA", dim(out$u)[1]), rep("scATAC",dim(out$r)[1]))
result <- umap_plot(tp, cell_type, data_type)
result$plt

Stage 4: Comparsion among DCCA, Seurat, and Liger methods

Three metrics are used to measure method performance since we have the cell correspondence as the gold standard. For convenience, we have prepared the pre-processed results for Seurat and Liger which are ready to use.

eval <-  method_compare(result$plt_dt)
eval$coembeding

# Gray curve in anchor accuracy metric denotes results from random guess. 
eval$eval


jinzhuangdou/SCCAT documentation built on July 8, 2020, 2:36 p.m.