In plantformatics/Socrates: Analysis of scATAC-seq data

knitr::opts_chunk$set(echo = TRUE)

Loading data

After aligning scATAC-seq reads to a reference genome, its important to evaluate data quality. Socrates allows facile loading and analysis of scATAC-seq data, requiring only a BED file with the Tn5 integration sites (analagous to the ends of the fragments.tsv file from 10X) with 5 columns (chr start end barcode strand), a GFF file with gene coordinates, and a chromosome sizes file with one chromosome per line (chr chr_length). The output is obj with several slots (1) obj$bed which contains the single-bp Tn5 integration sites for each barcode; (2) obj$gff genome GFF annotation as TxDb object; (3) obj$chr a data.frame with the chromosome sizes of the genome. If MACS2 is not in your path, you will receive an error and will be booted from the session. Make sure MACS2 is available!! We will use some freely available data from 10X Genomics that is provided with this package. The Tn5 integration site bed file has been subset to 2 million Tn5 sites for quick processing and to circumvent limitations on file size uploads on GitHub.

# load Socrates
library(Socrates)

# file paths
bed <- system.file("extdata", "test.tn5.bed.gz", package = "Socrates")
ann <- system.file("extdata", "gencode.v19.annotation.gff3.gz", package = "Socrates")
chr <- system.file("extdata", "hg19.txt", package = "Socrates")

# load data
obj <- loadBEDandGenomeData(bed, ann, chr)

Identify ACRs

A particularly useful metric of data quality can be estimated by counting the proportion of Tn5 integration sites that are localized to significantly enriched regions of the genome. To estimate the fraction of reads in peaks (FRiP), Socrates can run MACS2 on the bulk Tn5 integration sites and count the number of integration sites per barcode. Code for identifying ACRs is shown below:

# call ACRs
obj <- callACRs(obj, genomesize=2.5e9, 
                shift= -50, 
                extsize=100,
                fdr=0.05,
                output="bulk_peaks", 
                tempdir="./macs2_temp", 
                verbose=T)

Here, we specify the total mappable sequences for the target genome (for maize, this value is ~1.6e9), shift the 1bp Tn5 integration sites -50 bp, and extend the pseudoframent 100bp. This results in a pseudofragment centered on the Tn5 integration site 100 bp in length that is used for identifying genomic regions with significant enrichment at an FDR at 0.05.

Building the meta data files

Once we have identified ACRs, we can now start to build the metadata file. Here we use a 2-kb window surrounding TSS for estimating the proportion of Tn5 integration sites near genes.

# build metadata
obj <- buildMetaData(obj, tss.window=2000, verbose=TRUE)

# view object
head(obj$meta)

Filtering low-quality cells

The metadata slot contains valuable information for removing low-quality barcodes. Filtering cells based on total read depth (at least 1000 Tn5 integration sites), maximum number of cells (16000), proportion of reads mapping to TSSs (defaults to 0.2 and z-score of 3), and barcode FRiP scores (defaults to 0.1 and z-score of 2) is shown below. You can save the images to a PDF by setting the prefix argument to a non-NULL character string.

# filter cells
obj <- findCells(obj, 
                 doplot=T,
                 max.cells=16000,
                 min.tn5=1000,
                 filt.tss=TRUE, 
                 filt.frip=TRUE)

Generating a `Socrates` object for downstream analysis

Finally, we can now build a sparse matrix using on the highest quality cells using either ACRs or the genome split into non-overlapping window. The filtered parameter species whether to use the filtered set of cells (TRUE) or the unfiltered cells (FALSE). The windows parameter denotes the tile size when using the entire genome to build the sparse matrix. If peaks is set to TRUE, ACRs are used to build the sparse matrix (which overrides the genomic windows function). Here will build the binary sparse matrix using 1-kb windows tiled over the whole genome.

# generate sparse matrix
obj <- generateMatrix(obj, filtered=T, windows=1000, peaks=F, verbose=T)

Finally, we can convert this object into one that can be used by Socrates for processing, clustering, and other downstream analysis (see here for more details).

# convert to Socrates format for downstream analysis. 
soc.obj <- convertSparseData(obj, verbose=T)

# save QC object
saveRDS(obj, file="QC_object.rds")

If you have multiple soc.obj objects (say from different replicates or experimental samples), you can merge the results into a single object. Just make sure that the different samples/objects will have a shared genome space - the same 1-kb windows or peaks when building the matrices with generateMatrix. Below is an example of merging two replicates which we name as rep1 and rep2. You can also merge socrates objects that have been saved as RDS files - run mergeSocratesRDS for more information on arguments.

# merge Socrates objects
object.list <- list(rep1=soc.obj1, rep2=soc.obj2)
merge.obj <- mergeSocratesRDS(obj.list=object.list)

# check object
str(merge.obj)

### Session Information

```r
sessionInfo()