knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

```r

library(RA3)

```

Installation

You can install the released version of package Ra3 from Github:

devtools::install_github("cuhklinlab/RA3")

What is RA3?

RA3 is a R/Bioconductor package for the integrative analysis of scATAC-seq data, which could be used to extract effective latent features of single cells for downstream analyses such as visualization, clustering, and trajectory inference. The name RA3 refers to reference-guided analysis of scATAC-seq data.

RA3 characterizes the high-dimensional sparse scATAC-seq data as three components, including shared biological variation in single-cell and reference data, unique biological variation in single cells that separates distinct or rare cell types from the other cells, and other variations such as technical variation. It could use reference built from bulk ATAC-seq data, bulk DNase-seq data, an accessibility annotation tool (Chen et al., 2019), aggregated scATAC-seq data, etc.

Borrowing the general framework of probabilistic PCA, RA3 models the observed $p$ features/regions for the $j-$th cell $\mathbf{y}j \in \mathbb{R}^{p \times 1}$ as folows: $$ \begin{aligned} \mathbf{y}{j} | \boldsymbol{\lambda}{j} & \sim \mathcal{N}{p}\left(\boldsymbol{\lambda}{j}, \sigma^{2} \mathbf{I}{p}\right) ,\ \boldsymbol{\lambda}{j}=& \boldsymbol{\beta} \mathbf{x}{j}+\mathbf{W h}{j}, \boldsymbol{\lambda}{j} \in \mathbb{R}^{p \times 1}. \end{aligned} $$ The columns in $\mathbf{W}$ have similar interpretatiin as the projection vectors in PCA, and $\mathbf{h}j$ can be interpreted as the low-dimensional representation of $\mathbf{y}_j$. We further decommposes the term $\mathbf{W}\mathbf{h}_j$ into three parts: $$ \mathbf{W h}{j}=\mathbf{W}{1} \mathbf{h}{j_{1}}+\mathbf{W}{2} \mathbf{h}{j_{2}}+\mathbf{W}{3} \mathbf{h}{j_{3}}, $$ where the first part $\mathbf{W}1\mathbf{h}{j1} \in \mathbb{R}^{p\times K_1}$ utilizes prior information from the reference data and capture the shared biological variation in scATAC-sec data and reference data; the second part $\mathbf{W}2\mathbf{h}{j2} \in \mathbb{R}^{p \times K_2}$ captures the variation unique in scATAC-sec that seperates distinct and rare cell types from the other cells; the third part $\mathbf{W}3\mathbf{h}{j3} \in \mathbb{R}^{p \times K_3}$ models other variation such as technical variation.

Running RA3

In this vignette, we will show two example using RA3 for integrative analysis. Running RA3 uses main function runRA3, which consists of 2 main steps:

  1. load data and calculate initializatioin for the model building function RA3_EM,

  2. run model estimation function RA3_EM.

A. RA3 on scATAC data guided by aggregating single-cell data

First, load package RA3:

library(RA3)

This example established a pseudo-bulk reference data by averaging single cells of the same ground-truth biological cell type.

Input the data

Here we load the already established pseudo-bulk reference data forebrain_bulk_mat as reference data for RA3, and forebrain_sc_mat as scATAC-seq count matrix.

sc_data <- forebrain_sc_mat
ref_data <- forebrain_bulk_mat

Building and interpreting the models

We use runRA3() to establish the model with input data: forebrain_sc_mat, forebrain_bulk_mat and defaut parameters: K2 = 5, K3 = 5, normalize = TRUE, ttest = TRUE.

set.seed(2020)
result <- runRA3(sc_data, ref_data)

runRA3 returns a list with estimated parameters and latent vatiables contained in RA3 model:

Visualization

Visualization of latent features obtained by PCA and Bulk-projected scores

First we normalize data with TF-IDF by the data-preprocess function Dataprep of RA3:

data_nml <- Dataprep(sc_data, ref_data)
sc_data_nml <- data_nml$sc_data
ref_data_nml <- data_nml$ref_data
cell_label <- forebrain_label_mat

We use PCA to obtain the latent features on the normalized single-cell data, the t-SNE visualization is displayed by following:

library(ggplot2)
theme_set(theme_gray() +
              theme(
                axis.line = element_line(size=0.5),
                panel.background = element_rect(fill=NA,size=rel(20)),
                panel.grid.minor = element_line(colour = NA),
                axis.text = element_text(size=16),
                axis.title = element_text(size=18)
              )
  )
pca_tsne <- RA3::RA3_pcatsne(sc_data_nml, rand_seed = 2020)
ggplot(pca_tsne) + geom_point(aes(tsne1, tsne2, colour = cell_label), shape = 20, position = ggplot2::position_jitter()) + labs(title = "pca tsne")

Then we scores each cell by the identified PCs of variation in pseudo-bulk ATAC-seq samples, and deploy t-SNE directly onto the obtaned scores.

bulk_tsne <- RA3_RefProj(ref_data_nml,sc_data_nml,rand_seed = 2020)

ggplot(bulk_tsne) + geom_point(aes(tsne1, tsne2, colour = cell_label), shape = 20, position = ggplot2::position_jitter()) + labs(title="bulk projection tsne")

Visualization of RA3

Finally, we present the t-SNE visualization of latent features extracted by RA3, guided by the aggregatting single cell data.

set.seed(2020)
H2_tsne <- Rtsne::Rtsne(t(result$H))$Y
H2_tsne <- as.data.frame(H2_tsne)
colnames(H2_tsne) <- c('tsne1', 'tsne2')
ggplot(H2_tsne) + geom_point(aes(tsne1, tsne2, colour = cell_label), shape = 20, position = ggplot2::position_jitter()) + labs(title = "RA3 tsne")

B. RA3 on scATAC data guided by abundant bulk data

In this example we want to show RA3 could take advantages of abundant bulk data. Here we use scATAC-seq count matrix donorBM0828_sc_mat and reference data donorBM0828_bulk_mat included by package RA3. The scATAC-seq data are collected human hematopoietic cells with donor code “BM0828” from the Bone marrow dataset presented in Buenrostro et al., 2018, and bulk data refers to the bulk hematopoietic ATAC-seq samples of four parent types in differentiation tree, including HSC, MPP, LMPP and CMP.

Input data

First, load package RA3:

library(RA3)

We would use main function runRA3() to estimate a RA3 model, the input data consists of two parts: sc_data refering to single cell ATAC-seq data and ref_data as reference data. The input data should be count matrix. Load data:

sc_data <- donorBM0828_sc_mat
ref_data <- donorBM0828_bulk_mat
cell_label <- donorBM0828_label_mat

Building the models

The main parameters for running function runRA3:

Interpreting the models

To run RA3 model, you only need to run the function runRA3:

set.seed(2020)
result <- runRA3(sc_data, ref_data)

runRA3 returns a list with estimated parameters and latent vatiables contained in RA3 model:

Visualization

Visualization of latent features obtained by PCA and Bulk-projected scores

First we normalize data with TF-IDF by the data-preprocess function Dataprep of RA3:

data_nml <- Dataprep(sc_data, ref_data)
sc_data_nml <- data_nml$sc_data
ref_data_nml <- data_nml$ref_data

To illustrate RA3's effectiveness, we first use PCA to obtain the latent features on the normalized single-cell data, which shows unsatisfactory performance that only slightly separates cells of CLP and MEP from the other cells.

library(ggplot2)

# Set color pallete same as the original study of human hematopoietic cells
cellTypeAll_U <- c('CLP','CMP','GMP','HSC','LMPP','MEP','MPP')
legend_cellType_color_rgb <- c("#98D9E9", "#FFC179", "#FFA300", "#00441B", "#00AF99", "#F6313E", "#46A040")
color_map <- hash::hash(cellTypeAll_U, legend_cellType_color_rgb)
cells_colors_rgb <- hash::values(color_map, keys=cell_label )

# Set color palette, using default palletes like 'Dark2' from RColorBrewer. 
# cellTypeAll_U <- c('CLP','CMP','GMP','HSC','LMPP','MEP','MPP')
# legend_cellType_color_rgb <- brewer.pal(length(unique(cell_label)), name="Dark2")
# color_map <- hash(cellTypeAll_U, legend_cellType_color_rgb)
# cells_colors_rgb <- hash::values(color_map, keys=cell_label)

# Configuration for the plot
theme_set(theme_gray() +
            theme(
              axis.line = element_line(size=0.5),
              panel.background = element_rect(fill=NA,size=rel(20)),
              panel.grid.minor = element_line(colour = NA),
              axis.text = element_text(size=16),
              axis.title = element_text(size=18)
            )
)
pca_tsne <- RA3_pcatsne(sc_data_nml, rand_seed=2020)

ggplot(pca_tsne) + geom_point(aes(tsne1, tsne2, colour = cell_label), shape = 20, position = ggplot2::position_jitter()) + labs(title = "pca tsne") + scale_color_manual(values = legend_cellType_color_rgb)

Then we scores each cell by the identified PCs of variation in bulk ATAC-seq samples, and deploy t-SNE directly onto the obtained scores. However, this reference guided approach still can not even distinguish CLP from the other cells while simple PCA can.

bulk_tsne <- RA3_RefProj(ref_data_nml,sc_data_nml)

ggplot(bulk_tsne) + geom_point(aes(tsne1, tsne2, colour = cell_label), shape = 20, position = ggplot2::position_jitter()) + labs(title="bulk projection tsne") + scale_color_manual(values = legend_cellType_color_rgb)

Visualization of RA3

We use a t-SNE for visualization of the RA3-obtained latent features H.

set.seed(2020)
H_tsne <- Rtsne::Rtsne(t(result$H))$Y
H_tsne <- as.data.frame(H_tsne)
colnames(H_tsne) <- c('tsne1', 'tsne2')

ggplot(H_tsne) + geom_point(aes(tsne1, tsne2, colour = cell_label), shape = 20, position = ggplot2::position_jitter()) + labs(title = "RA3 tsne") + scale_color_manual(values = legend_cellType_color_rgb)

RA3 introduces a spike-and-slab setting which detects directions that lead to good separation of the cells but not the direction with large variation considering the technical variation is strong. The first sparse component in spike-and-slab setting successfully detects cells of CLP, and RA3 thus achieves superior performance than other reference-guided approaches. An intuitive understanding is that the direction which separates a small number of cells from the rest more likely captures biological variation, given that the rare cell types in single-cell data are likely missing in bulk data.

Cell clustering

Cell clustering could be deployed on the low dimensional representation output by RA3. Function RA3_clustering uses package Seurat's louvain clustering method.

Seurat_louvain <- RA3_clustering(result$H, length(unique(cell_label)))

Downstream analysis

Other than data visualization and cell clustering, the output of RA3 can also be implemented in downstream analyses including trajectory inference and motif enrichment analysis.

Trajectory inference

We implemented Slingshot for the trajectory inference. The inputs for Slingshot include the low-dimensional representation provided by RA3 and the cluster labels obtained from RA3 + Louvain clustering, and the output for Slingshot are the smooth curves representing the estimated cell lineages.

To do trajectory inference, one can simpily use function RA3_TrajInf. For documentation of this function, try ? RA3_TrajInf.

trajinf <- RA3_TrajInf(result$H, Seurat_louvain, rand_seed = 2020)

Based on the estimated curve, one can visualize the estimated cell lineages using following chunk.

# Load necessary packages
suppressPackageStartupMessages({
  library(RColorBrewer)
  library(hash)
  library(ggplot2)
  })



# Plot cell lineage
slingshot::plot(trajinf$tsne,pch=20,col=cells_colors_rgb ,xlab="tsne1",ylab="tsne2")
slingshot::lines(trajinf$sds.new, lwd = 3)
par(xpd=TRUE)
legend( "bottomleft",legend=cellTypeAll_U , pch=20,
       col=legend_cellType_color_rgb, box.lty=0)

Motif enrichment analysis

For the motif enrichment analysis, we first obtained the cluster labels by RA3 + Louvain clustering, and then identified the cluster-specific peaks by the hypothesis testing procedure in scABC. We selected the top 1000 peaks with the smallest p-values for each cluster, and then applied chromVAR to infer the enriched transcription factor (TF) binding motifs within these peaks. Visualization of the top 50 most variable TF binding motifs is shown.

To do motif analysis, use function RA3_motif. This function needs peak information for scCAS data. We have donorBM0828_peak_vec in RA3 package as a demo, which contains peak information for scATAC-seq data donorBM0828.

peaks <- donorBM0828_peak_vec
louvain <- as.numeric(Seurat_louvain) 

result_motif <- RA3_motif(peaks, cell_label, louvain, sc_data, cluster_peak_num=1000, motif_num=50)

Based on the chromVAR deviations, one can make the heatmap using following chunks.

# Set color for the heatmap
suppressPackageStartupMessages({
  library(gplots) 
  library(RColorBrewer)
  library(hash)
  library(vegan)

})
scalebluered <- colorRampPalette(c("blue", "white", "red"), space = "rgb")(256)
cols = brewer.pal(7, "Dark2")[1:7]
rowcols_louvain = cols[as.numeric(Seurat_louvain) ]  
rowcols_louvain = rowcols_louvain[result_motif$metaDataInd]


cellTypeAll_U <- c('CLP','CMP','GMP','HSC','LMPP','MEP','MPP')
legend_cellType_color_rgb <- c("#98D9E9", "#FFC179", "#FFA300", "#00441B", "#00AF99", "#F6313E", "#46A040")
color_map <- hash(cellTypeAll_U, legend_cellType_color_rgb)
cells_colors_rgb <- hash::values(color_map, keys=cell_label)
cells_colors_rgb = cells_colors_rgb[result_motif$metaDataInd]
label = cellTypeAll_U

d = vegdist(result_motif$top_devs, method = "euclidean", na.rm = TRUE)
col.clus = hclust(d, "centroid")

Plot the heatmap using true labels.

heatmap.2(t(result_motif$top_devs), dendrogram='column', Rowv=FALSE, Colv=as.dendrogram(col.clus), trace='none', col = scalebluered, density.info = "none", RowSideColors = cells_colors_rgb, margin = c(10, 1))
legend("bottomleft", legend = label, 
       col = legend_cellType_color_rgb, border=FALSE, bty="n", y.intersp = 0.7, cex=0.7, pch = 15)

Plot the heatmap using estimated labels.

# Plot using true labels
heatmap.2(t(result_motif$top_devs), dendrogram='column', Rowv=FALSE, Colv=as.dendrogram(col.clus), trace='none', col = scalebluered, density.info = "none", RowSideColors = rowcols_louvain, margin = c(10, 1))
legend("bottomleft", legend = c(paste0("cluster ", 1:7)), 
       col = cols, border=FALSE, bty="n", y.intersp = 0.7, cex=0.7, pch = 15)

References

  1. Chen, S., Wang, Y. & Jiang, R. Openanno: annotating genomic regions with chromatin accessibility. bioRxiv 596627 (2019).

  2. Buenrostro, J. D. et al. Integrated single-cell analysis maps the continuous regulatory landscape of human hematopoietic differentiation. Cell 173, 1535–1548 e16 (2018).

SessionInfo

sessionInfo()
# R version 4.0.2 (2020-06-22)
# Platform: x86_64-apple-darwin17.0 (64-bit)
# Running under: macOS Catalina 10.15.4
# 
# Matrix products: default
# BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
# LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
# 
# locale:
# [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
# 
# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
# [1] vegan_2.5-6        lattice_0.20-41    permute_0.9-5      gplots_3.0.4       RA3_0.1.0          hash_2.2.6.1      
# [7] RColorBrewer_1.1-2 ggplot2_3.3.2     
# 
# loaded via a namespace (and not attached):
#   [1] backports_1.1.8                   chromVAR_1.10.0                   plyr_1.8.6                       
#   [4] igraph_1.2.5                      lazyeval_0.2.2                    splines_4.0.2                    
#   [7] BiocParallel_1.22.0               listenv_0.8.0                     usethis_1.6.1                    
#  [10] GenomeInfoDb_1.24.2               TFBSTools_1.26.0                  digest_0.6.25                    
#  [13] htmltools_0.5.0                   GO.db_3.11.4                      gdata_2.18.0                     
#  [16] fansi_0.4.1                       JASPAR2016_1.16.0                 magrittr_1.5                     
#  [19] memoise_1.1.0                     BSgenome_1.56.0                   cluster_2.1.0                    
#  [22] ROCR_1.0-11                       remotes_2.1.1                     globals_0.12.5                   
#  [25] Biostrings_2.56.0                 readr_1.3.1                       annotate_1.66.0                  
#  [28] matrixStats_0.56.0                R.utils_2.9.2                     prettyunits_1.1.1                
#  [31] princurve_2.1.4                   colorspace_1.4-1                  blob_1.2.1                       
#  [34] ggrepel_0.8.2                     xfun_0.15                         dplyr_1.0.0                      
#  [37] callr_3.4.3                       crayon_1.3.4                      RCurl_1.98-1.2                   
#  [40] jsonlite_1.7.0                    TFMPvalue_0.0.8                   survival_3.2-3                   
#  [43] zoo_1.8-8                         ape_5.4                           glue_1.4.1                       
#  [46] gtable_0.3.0                      zlibbioc_1.34.0                   XVector_0.28.0                   
#  [49] leiden_0.3.3                      DelayedArray_0.14.0               pkgbuild_1.0.8                   
#  [52] future.apply_1.6.0                SingleCellExperiment_1.10.1       BiocGenerics_0.34.0              
#  [55] scales_1.1.1                      DBI_1.1.0                         miniUI_0.1.1.1                   
#  [58] Rcpp_1.0.5                        viridisLite_0.3.0                 xtable_1.8-4                     
#  [61] reticulate_1.16                   rsvd_1.0.3                        bit_1.1-15.2                     
#  [64] tsne_0.1-3                        stats4_4.0.2                      DT_0.14                          
#  [67] htmlwidgets_1.5.1                 httr_1.4.1                        nabor_0.5.0                      
#  [70] ellipsis_0.3.1                    Seurat_3.1.5                      ica_1.0-2                        
#  [73] pkgconfig_2.0.3                   XML_3.99-0.4                      R.methodsS3_1.8.0                
#  [76] farver_2.0.3                      uwot_0.1.8                        tidyselect_1.1.0                 
#  [79] labeling_0.3                      rlang_0.4.7                       reshape2_1.4.4                   
#  [82] later_1.1.0.1                     AnnotationDbi_1.50.1              munsell_0.5.0                    
#  [85] tools_4.0.2                       cli_2.0.2                         DirichletMultinomial_1.30.0      
#  [88] generics_0.0.2                    RSQLite_2.2.0                     devtools_2.3.0                   
#  [91] ggridges_0.5.2                    stringr_1.4.0                     fastmap_1.0.1                    
#  [94] fs_1.4.2                          processx_3.4.3                    knitr_1.29                       
#  [97] bit64_0.9-7                       fitdistrplus_1.1-1                caTools_1.18.0                   
# [100] purrr_0.3.4                       RANN_2.6.1                        KEGGREST_1.28.0                  
# [103] pbapply_1.4-2                     future_1.18.0                     nlme_3.1-148                     
# [106] mime_0.9                          R.oo_1.23.0                       poweRlaw_0.70.6                  
# [109] pracma_2.2.9                      compiler_4.0.2                    rstudioapi_0.11                  
# [112] plotly_4.9.2.1                    png_0.1-7                         testthat_2.3.2                   
# [115] tibble_3.0.2                      stringi_1.4.6                     ps_1.3.3                         
# [118] desc_1.2.0                        CNEr_1.24.0                       Matrix_1.2-18                    
# [121] vctrs_0.3.1                       pillar_1.4.6                      lifecycle_0.2.0                  
# [124] lmtest_0.9-37                     RcppAnnoy_0.0.16                  data.table_1.12.8                
# [127] cowplot_1.0.0                     bitops_1.0-6                      irlba_2.3.3                      
# [130] httpuv_1.5.4                      patchwork_1.0.1                   rtracklayer_1.48.0               
# [133] GenomicRanges_1.40.0              R6_2.4.1                          promises_1.1.1                   
# [136] gridExtra_2.3                     KernSmooth_2.23-17                IRanges_2.22.2                   
# [139] sessioninfo_1.1.1                 codetools_0.2-16                  pkgload_1.1.0                    
# [142] assertthat_0.2.1                  MASS_7.3-51.6                     gtools_3.8.2                     
# [145] seqLogo_1.54.3                    SummarizedExperiment_1.18.1       rprojroot_1.3-2                  
# [148] withr_2.2.0                       sctransform_0.2.1                 GenomicAlignments_1.24.0         
# [151] Rsamtools_2.4.0                   S4Vectors_0.26.1                  GenomeInfoDbData_1.2.3           
# [154] mgcv_1.8-31                       parallel_4.0.2                    hms_0.5.3                        
# [157] motifmatchr_1.10.0                grid_4.0.2                        tidyr_1.1.0                      
# [160] BSgenome.Hsapiens.UCSC.hg19_1.4.3 slingshot_1.6.1                   Rtsne_0.15                       
# [163] Biobase_2.48.0                    shiny_1.5.0            


cuhklinlab/RA3 documentation built on March 18, 2021, 4:38 p.m.