In plantformatics/Socrates: Analysis of scATAC-seq data

knitr::opts_chunk$set(echo = TRUE)

Pre0processing

Clean, normalize and cluster the Socrates object that is automatically loaded into the namespace with the package. In this example, we will use tfidf for a bit of a speed-up over other available options.

# load Socrates
library(Socrates)

# filter, normalize, and reduce dimensions
obj <- cleanData(obj, min.c=1000, min.t=0.001, max.t=0.005)
obj <- tfidf(obj)
obj <- reduceDims(obj, n.pcs=50, cor.max=0.7)
obj <- projectUMAP(obj)

# cluster
obj <- callClusters(obj, res=0.6)

Identify co-accessible ACRs across all cells

To find co-accessible ACRs, Socrates needs a path to the reference genome. A chromosome lengths file for hg19 is provided in Socrates.

# get genome annotation
genome <- system.file("extdata", "hg19.txt", package="Socrates")

Socrates relies on the R package cicero, developed by the Trapenll and Shendure labs to identify co-accessible ACRs. Please cite their work here if using Socrates to identify co-accessible regions. For more information and documentation for cicero, visit the software website site here.

In the example below, we will identify co-accessible ACRs using all cells by aggregating k=30 nearest neighbors from the UMAP embedding. Users can adjust various parameters that are provided by cicero, in addition to added functionalities explored later in this tutorial.

# find co-accessible ACRs in the entire data set (not subset by cluster)
obj <- coAccess(obj, genome=genome)

Identify co-accessible ACRs per cluster

In addition to standard co-accessible ACR calling across all cells, Socrates facilitate cluster/cell type-level resolution for co-accessible ACR identification. Essentially, Socrates splits cells into their cognate groups, and runs cicero on each of these groups, independtly. Running Socrates on each cluster/cell type is simple as specifying an additional parameter in the function coAccess. Co-accessible ACRs are identified for each group in the groupID column of the meta data table obj$Clusters slot. The default groupID is "LouvainClusters", which is the default column name for the function callClusters. See ?callClusters and ?coAccess for more details. An example command is illustrated below.

# find co-accessible ACRs by cluster
obj <- coAccess(obj, genome=genome, byGroup=T)

Empirical FDR to control false positives

Removing false positive links based solely on the degree of co-accessibility is a challenge because different clusters/cell types are afflicted by varying degrees of technical variation, such as read depth. A simple approach to estimate the false discovery rate is to shuffle the ACR x barcode matrix, keeping the number of accessible ACRs per cell and the number of accessible cells per ACR identical to the original matrix. Co-accessible ACR identification is repeated and heuristic thresholds removing 95% (or other FDRs of the users choice) of co-accessible ACRs from the shuffled matrix (for positive and negatively associated ACRs) are identified. These thresholds are then applied to the original, non-shuffled matrix, resembling FDR controlled co-accessible ACRs. Users can apply the eFDR method to co-accessible ACRs identification on a per cluster/cell type basis by specifying the byGroup and groupID arguments, as described above. An example of co-accessible ACR calling with eFDR filtering is illustrated below.

# find co-accessible ACRs with eFDR
obj <- coAccess(obj, genome=genome, eFDR=TRUE, fdr_thresh=0.05, verbose=T)

# co-accessible ACRs are stored in the 'obj$coACR' slot. 
head(obj$coACRs)