knitr::opts_chunk$set(echo = TRUE)

Preprocessing and modeling chromatin accessibility with regularized quasibinomial logistic regression

scATAC-seq data is highly sparse and essentially binary for diploid cells, providing a significant challenge for downstream analyses. To mitigate these technical effects, we developed a model-based approach that we term Regularized quasibinomial logistic Regression for Single-Cell Chromatin Accessibility (Socrates). Inspired by innotation in scRNA-seq methods, namely the SCTransform function implemented in Seurat, Socrates removes technical variation confounding the accessibility signal from each cell and peak by fitting a generalized linear model between binary counts of chromatin accessibility and per-cell log10 read depths for each peak independently (y~x, where y represents a binary numeric vector of accessibility states across cells at a given peak, and x represents a vector of log10[sum of accessible peaks] per cell). Socrates explicitly avoids overfitting by sampling representative peaks with kernel regression, learning global quasibinomial model parameters (including a term for over-dispersion), and projecting the learned parameters to all peaks (regularization). The models are then used to extract Pearson's residuals that represent read depth-normalized chromatin accessibility profiles for each cell and peak. Technical Note: You will need access to 8G memory to run this example.


Load Socrates and raw data

Here, we demonstrate how to load the following inputs to create an Socrates object:

  1. binary sparse peak x cell matrix (a gzipped text file in triplet tsv format).
  2. meta-data saved as a tsv document containing various per-cell metrics.

Meta-data is not necessarily required, but useful for evaluating technical effects in clustering, as well as other downstream steps. An example of input data formats for 1,500 PBMC cells and how to construct an Socrates object from scratch is shown below.

# load library
suppressWarnings(library("Socrates"))

# specify paths to raw data in Socrates package
input <- system.file("extdata", "pbmc_atac_10x.1.5K_cells.sparse.gz", package="Socrates")
meta <- system.file("extdata", "pbmc_atac_10x.1.5K_cells.metadata.txt", package="Socrates")

# load raw data for viewing
input.format <- read.table(input)
meta.format <- read.table(meta)

# view
head(input.format)
head(meta.format)

# load data into Socrates object
Socrates.object <- loadSparseData(input=input, meta=meta, verbose=T)

Filter peaks and cells

For the remainder of this tutorial, we will work from a 5K PBMC data set publically available from the 10X Genomics website. A precompiled Socrates binary object containing this data is automatically available after loading the Socrates package.

str(obj)

To reduce the effects of outlier peaks and cells on clustering, it is often helpful to remove cells with few accessible peaks, and peaks with extreme accessibility profiles (i.e. peaks that are accessible in all, or very few cells). Specifically, the cell x peak matrix can be filtered by adjusting heurstic frequency thresholds after visual inspection of cell and peak accessibility distributions. Let's investigate these distributions further to determine reasonable thresholds for this particular data set.

# estimate log10 number of accessible regions per cell
cell.counts <- Matrix::colSums(obj$counts)

# estimate peak accessibility frequency across cells
site.freq <- Matrix::rowMeans(obj$counts)

# plot distributions
layout(matrix(c(1:2), ncol=2))
par(mar=c(3,3,1,1))
plot(density(cell.counts), main="log10 cell counts", log="x")
abline(v=1000, col="red")
plot(density(site.freq), main="peak accessibility frequency", log="x")

It appears that most cells have a median over around 7,000 accessible peaks. The cells on the lower tail of the distribution may reflect broken nuclei, so we'll remove cells with less than 1,000 open peaks from the analysis. The distribution of average peak accessibilities doesnt show any clear (lower-tail) cut-offs, therefore, we will use the default thresholds (min.t=0.001, max.t=0.005). The arguments min.t and max.t set the minimum peak accessibility frequency and upper (99.5%) quantile cut-offs, respectively.

# filter matrix 
obj <- cleanData(obj, min.c=1000, min.t=0.001, max.t=0.005, verbose=T)

Filtering peaks enriched in pre-computed clusters

NOTE If users have generated crude clusters, such as in silico sorting described by Cusanovich et al. 2018, the parameters min.p and preclusterID allow users to set minimum peak accessibility frequencies for pre-specied groups. Below, we constrain peaks to be accessible in at least 5% of cells in at least one crude_cluster. See ?cleanData for more details.

# simulate 10 random clusters, 
obj$meta$crude_clusters <- sample(factor(seq(1:10)), nrow(obj$meta), replace=T)
obj <- cleanData(obj, min.p=0.05, preclusterID="crude_clusters")

Normalization

With a filtered cell x peak matrix in hand, we can now calculate normalized accessibility scores (Pearson's residuals) across all cells and peaks using regularized quasibinomial logistic regression. Note that the function regModel can be parallelized by setting nthreads to a number greater than 1. Parallel implementations depend on the doSNOW library. In the example below, we set the number of threads to 4 to speed-up the analysis. We will run multiple normalization schemes to compare their performance.

# run regularized regression
obj <- regModel(obj, nthreads=4)

Reducing dimensions

Singular Value Decomposition (SVD)

After normalizing peak x cell chromatin accessibility profiles using one of the aforementioned methods, we can reduce the dimensionality of residual matrix to remove noise and better model cell-cell relationships in a reduced space. Dimensionality reduction is implemented via Singular Value Decomposition (SVD) from the irlba package. We will reduce the dimensions of all the normalization methods by specifically selecting the slots from each approach one-by-one.

# reduce dimensionality of the residual matrix
obj <- reduceDims(obj, n.pcs=50, cor.max=0.7, verbose=T)

The above commands estimate the first 50 singular values, and removes singular values correlated with technical variation (read depth) above a Spearman Correlation Coefficient of 0.7.

Projecting into a reduced dimensionality with UMAP

Similarity between cells is most easily visualized on two dimensions. Uniform Manifold Approximation Projection (UMAP) has gained popularity in single-cell approaches owing to its scalability and capacity for maintaining both global and local data structure, greatly aiding data interpretations. To generate UMAP embeddings, we can run the projectUMAP function which relies on uwot::umap:

# run projectUMAP
obj <- projectUMAP(obj, verbose=T)

We can quickly visualize the reduced embedding. As you can see below, they are quite similar. In cases where speed and memory usage are central factors, it may be more advisable to use TF-IDF normalization in place of Socrates. One benefit of using model-based approaches is that the normalized values can be readily interpretted for down-stream analyses.

# plot UMAP results
plot(obj$UMAP, pch=16, cex=0.2)

Graph-based clustering

Visualization of the UMAP embeddings suggests several groups of cells with distinct identities. We can cluster cells into groups using graph-based clustering by leaning on Louvain and Leiden clustering approaches provided by the popular Seurat package. Below is a wrapper for running graph-based clustering in the SVD space via Seurat. Cluster resolution can be adjusted from the default (0.4) using the res argument. Reasonable res values range from 0 to 2 (high values yield higher cluster numbers). Cluster membership is appended to the meta data.frame under the column 'LouvainClusters' by default. For a list of tuneable parameters, run ?callClusters in the R console.

# run clustering
obj <- callClusters(obj, res=0.6, verbose=T)

Plotting results

Having run the clustering algorithm, we can now visualize the different groupings on the UMAP embedding.

plotUMAP(obj)

Saving results

The Socrates object is updated iteratively after each processing step. To save results for sharing or future exploration, you can use the following command to save a snapshot of current stage of analysis:

saveRDS(obj, file="Socrates_object.rds")

Accessing results

Results are appended to the Socrates object after each function, as described above. Below describes the location of different data sets up to this point.


Session Information

sessionInfo()


plantformatics/Socrates documentation built on April 3, 2025, 1:02 p.m.