In cuhklinlab/RA3: The R-based implementation of "A reference-guided approach for epigenetic characterization of single cells".

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(RA3)

Installation

You can install the released version of package Ra3 from Github:

devtools::install_github("cuhklinlab/RA3")

What is RA3?

RA3 is a R/Bioconductor package for the integrative analysis of scATAC-seq data, which could be used to extract effective latent features of single cells for downstream analyses such as visualization, clustering, and trajectory inference. The name RA3 refers to reference-guided analysis of scATAC-seq data.

RA3 characterizes the high-dimensional sparse scATAC-seq data as three components, including shared biological variation in single-cell and reference data, unique biological variation in single cells that separates distinct or rare cell types from the other cells, and other variations such as technical variation. It could use reference built from bulk ATAC-seq data, bulk DNase-seq data, an accessibility annotation tool (Chen et al., 2019), aggregated scATAC-seq data, etc.

Borrowing the general framework of probabilistic PCA, RA3 models the observed $p$ features/regions for the $j-$th cell $\mathbf{y}j \in \mathbb{R}^{p \times 1}$ as folows: $$ \begin{aligned} \mathbf{y}{j} | \boldsymbol{\lambda}{j} & \sim \mathcal{N}{p}\left(\boldsymbol{\lambda}{j}, \sigma^{2} \mathbf{I}{p}\right) ,\ \boldsymbol{\lambda}{j}=& \boldsymbol{\beta} \mathbf{x}{j}+\mathbf{W h}{j}, \boldsymbol{\lambda}{j} \in \mathbb{R}^{p \times 1}. \end{aligned} $$ The columns in $\mathbf{W}$ have similar interpretatiin as the projection vectors in PCA, and $\mathbf{h}j$ can be interpreted as the low-dimensional representation of $\mathbf{y}_j$. We further decommposes the term $\mathbf{W}\mathbf{h}_j$ into three parts: $$ \mathbf{W h}{j}=\mathbf{W}{1} \mathbf{h}{j_{1}}+\mathbf{W}{2} \mathbf{h}{j_{2}}+\mathbf{W}{3} \mathbf{h}{j_{3}}, $$ where the first part $\mathbf{W}1\mathbf{h}{j1} \in \mathbb{R}^{p\times K_1}$ utilizes prior information from the reference data and capture the shared biological variation in scATAC-sec data and reference data; the second part $\mathbf{W}2\mathbf{h}{j2} \in \mathbb{R}^{p \times K_2}$ captures the variation unique in scATAC-sec that seperates distinct and rare cell types from the other cells; the third part $\mathbf{W}3\mathbf{h}{j3} \in \mathbb{R}^{p \times K_3}$ models other variation such as technical variation.

Running RA3

In this vignette, we will show two example using RA3 for integrative analysis. Running RA3 uses main function runRA3, which consists of 2 main steps:

load data and calculate initializatioin for the model building function RA3_EM,
run model estimation function RA3_EM.

A. RA3 on scATAC data guided by abundant bulk data

In this example we want to show RA3 could take advantages of abundant bulk data. Here we use scATAC-seq count matrix donorBM0828_sc_mat and reference data donorBM0828_bulk_mat included by package RA3. The scATAC-seq data are collected human hematopoietic cells with donor code “BM0828” from the Bone marrow dataset presented in Buenrostro et al., 2018, and bulk data refers to the bulk hematopoietic ATAC-seq samples of four parent types in differentiation tree, including HSC, MPP, LMPP and CMP.

Input data

First, laod package RA3:

library(RA3)

We would use main function runRA3() to estimate a RA3 model, the input data consists of two parts: sc_data refering to single cell ATAC-seq data and ref_data as reference data. The input data should be count matrix. Load data:

sc_data <- donorBM0828_sc_mat
ref_data <- donorBM0828_bulk_mat

Building the models

The main parameters for running function runRA3:

K2, K3 are the number of compomnents contained in the second part $\mathbf{W}_2\mathbf{H}_2$ and third part $\mathbf{W}_3\mathbf{H}_3$ of RA3 model, the default value is set as K2 = 5, K3 = 5.
normalize is a logical indicator for whether the output estimated $\mathbf{W}$ should be normalized through each components, corresponding scale would be mupltiplied to $\mathbf{H}$.
ttest is a logcial indicator deciding whether a one sample t-test should be done for the output estimated $\mathbf{H}_2$, which would select most significantly informative components contained in $\mathbf{H}$.

Interpreting the models

To run RA3 model, you only need to run the function runRA3:

result <- runRA3(sc_data, ref_data)

runRA3 returns a list with estimated parameters and latent vatiables contained in RA3 model:

H: the extracted latent features of single cells for downstream analyses such as visualization, clustering, and trajectory inference.
Other Parameters: W, Beta, Gamma, A, sigma_s are estimated essential parameters in RA3 model, lgp refers to the largest log posterior value when EM algorithm converges.

Downstream analysis: Visualization

Visualization of latent features obtained by PCA and Bulk-projected scores

First we normalize data with TF-IDF by the data-preprocess function Dataprep of RA3:

data_nml <- Dataprep(sc_data, ref_data)
sc_data_nml <- data_nml$sc_data
ref_data_nml <- data_nml$ref_data

To illustrate RA3's effectiveness, we first use PCA to obtain the latent features on the normalized single-cell data, which shows unsatisfactory performance that only slightly separates cells of CLP and MEP from the other cells.

library(ggplot2)
res_pca <- irlba::irlba(t(sc_data_nml), nv=10)
score_pca <- res_pca$u %*% diag(res_pca$d)
pca_tsne <- Rtsne::Rtsne(score_pca)$Y
pca_tsne <- as.data.frame(pca_tsne)
colnames(pca_tsne) <- c('tsne1','tsne2')
cell_label <- donorBM0828_label_mat
ggplot(pca_tsne) + geom_point(aes(tsne1, tsne2, colour = cell_label), shape = 10, position = ggplot2::position_jitter()) + labs(title = "pca tsne")

Then we scores each cell by the identified PCs of variation in bulk ATAC-seq samples, and deploy t-SNE directly onto the obtained scores. However, this reference guided approach still can not even distinguish CLP from the other cells while simple PCA can.

res_bulk <- prcomp(ref_data_nml, center = T, retx = T)
coeff_bulk <- res_bulk$rotation
bulk_tsne <- Rtsne::Rtsne(t(sc_data_nml) %*% coeff_bulk)$Y
bulk_tsne <- as.data.frame(bulk_tsne)
colnames(bulk_tsne) <- c('tsne1','tsne2')
ggplot(bulk_tsne) + geom_point(aes(tsne1, tsne2, colour = cell_label), shape = 10, position = ggplot2::position_jitter()) + labs(title="bulk projection tsne")

Visualization of RA3

We use a t-SNE for visualization of the RA3-obtained latent features H.

H_tsne <- Rtsne::Rtsne(t(result$H[1: (nrow(result$H)-5), ]))$Y
H_tsne <- as.data.frame(H_tsne)
colnames(H_tsne) <- c('tsne1', 'tsne2')
ggplot(H_tsne) + geom_point(aes(tsne1, tsne2, colour = cell_label), shape = 10, position = ggplot2::position_jitter()) + labs(title = "RA3 tsne")

RA3 introduces a spike-and-slab setting which detects directions that lead to good separation of the cells but not the direction with large variation considering the technical variation is strong. The first sparse component in spike-and-slab setting successfully detects cells of CLP, and RA3 thus achieves superior performance than other reference-guided approaches. An intuitive understanding is that the direction which separates a small number of cells from the rest more likely captures biological variation, given that the rare cell types in single-cell data are likely missing in bulk data.

B. RA3 on scATAC data guided by aggregating single-cell data

Besides using abundant bulk data as reference, we also display an example which established a pseudo-bulk reference data by averaging single cells of the same ground-truth biological cell type.

Input the data

Here we load the already established pseudo-bulkk reference data forebrain_bulk_mat as reference data for RA3, and forebrain_sc_mat as scATAC-seq count matrix.

sc_data2 <- forebrain_sc_mat
ref_data2 <- forebrain_bulk_mat

Building and interpreting the models

We use runRA3() to establish the model with input data: forebrain_sc_mat, forebrain_bulk_mat and defaut parameters: K2 = 5, K3 = 5, normalize = TRUE, ttest = TRUE.

result2 <- runRA3(sc_data2, ref_data2)

runRA3 returns a list with estimated parameters and latent vatiables contained in RA3 model:

H: the extracted latent features of single cells for downstream analyses such as visualization, clustering, and trajectory inference.
Other Parameters: W, Beta, Gamma, A, sigma_s are estimated essential parameters in RA3 model, lgp refers to the largest log posterior value when EM algorithm converges.

Downstream analysis: Visualization

Visualization of latent features obtained by PCA and Bulk-projected scores

First we normalize data with TF-IDF by the data-preprocess function Dataprep of RA3:

data2 <- Dataprep(sc_data2, ref_data2)
sc_data2_nml <- data2$sc_data
ref_data2_nml <- data2$ref_data

We use PCA to obtain the latent features on the normalized single-cell data, the t-SNE visualization is displayed by following:

library(ggplot2)
res2_pca <- irlba::irlba(t(sc_data2_nml), nv=10)
score2_pca <- res2_pca$u %*% diag(res2_pca$d)
pca2_tsne <- Rtsne::Rtsne(score2_pca)$Y
pca2_tsne <- as.data.frame(pca2_tsne)
colnames(pca2_tsne) <- c('tsne1','tsne2')
cell_label2 <- forebrain_label_mat
ggplot(pca2_tsne) + geom_point(aes(tsne1, tsne2, colour = cell_label2), shape = 10, position = ggplot2::position_jitter()) + labs(title = "pca tsne")

Then we scores each cell by the identified PCs of variation in pseudo-bulk ATAC-seq samples, and deploy t-SNE directly onto the obtaned scores.

res2_bulk <- prcomp(ref_data2_nml, center = T, retx = T)
coeff2_bulk <- res2_bulk$rotation
bulk2_tsne <- Rtsne::Rtsne(t(sc_data2_nml) %*% coeff2_bulk)$Y
bulk2_tsne <- as.data.frame(bulk2_tsne)
colnames(bulk2_tsne) <- c('tsne1','tsne2')
ggplot(bulk2_tsne) + geom_point(aes(tsne1, tsne2, colour = cell_label2), shape = 10, position = ggplot2::position_jitter()) + labs(title="bulk projection tsne")

Visualization of RA3

Finally, we present the t-SNE visualization of latent features extracted by RA3, guided by the aggregatting single cell data.

H2_tsne <- Rtsne::Rtsne(t(result2$H[1: (nrow(result2$H)-5), ]))$Y
H2_tsne <- as.data.frame(H2_tsne)
colnames(H2_tsne) <- c('tsne1', 'tsne2')
ggplot(H2_tsne) + geom_point(aes(tsne1, tsne2, colour = cell_label2), shape = 10, position = ggplot2::position_jitter()) + labs(title = "RA3 tsne")

References

Chen, S., Wang, Y. & Jiang, R. Openanno: annotating genomic regions with chromatin accessibility. bioRxiv 596627 (2019).
Buenrostro, J. D. et al. Integrated single-cell analysis maps the continuous regulatory landscape of human hematopoietic differentiation. Cell 173, 1535–1548 e16 (2018).

SessionInfo

cuhklinlab/RA3 documentation built on March 18, 2021, 4:38 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

cuhklinlab/RA3
The R-based implementation of "A reference-guided approach for epigenetic characterization of single cells".

In cuhklinlab/RA3: The R-based implementation of "A reference-guided approach for epigenetic characterization of single cells".

Installation

What is RA3?

Running RA3

A. RA3 on scATAC data guided by abundant bulk data

Input data

Building the models

Interpreting the models

Downstream analysis: Visualization

Visualization of latent features obtained by PCA and Bulk-projected scores

Visualization of RA3

B. RA3 on scATAC data guided by aggregating single-cell data

Input the data

Building and interpreting the models

Downstream analysis: Visualization

Visualization of latent features obtained by PCA and Bulk-projected scores

Visualization of RA3

References

SessionInfo

R Package Documentation

Browse R Packages

We want your feedback!

cuhklinlab/RA3 The R-based implementation of "A reference-guided approach for epigenetic characterization of single cells".

In cuhklinlab/RA3: The R-based implementation of "A reference-guided approach for epigenetic characterization of single cells".

Installation

What is RA3?

Running RA3

A. RA3 on scATAC data guided by abundant bulk data

Input data

Building the models

Interpreting the models

Downstream analysis: Visualization

Visualization of latent features obtained by PCA and Bulk-projected scores

Visualization of RA3

B. RA3 on scATAC data guided by aggregating single-cell data

Input the data

Building and interpreting the models

Downstream analysis: Visualization

Visualization of latent features obtained by PCA and Bulk-projected scores

Visualization of RA3

References

SessionInfo

R Package Documentation

Browse R Packages

We want your feedback!

cuhklinlab/RA3
The R-based implementation of "A reference-guided approach for epigenetic characterization of single cells".