knitr::opts_chunk$set(echo = TRUE)
After aligning scATAC-seq reads to a reference genome, its important to evaluate data quality. Socrates
allows facile loading and analysis of scATAC-seq data, requiring only a BED file with the Tn5 integration sites (analagous to the ends of the fragments.tsv file from 10X) with 5 columns (chr start end barcode strand), a GFF file with gene coordinates, and a chromosome sizes file with one chromosome per line (chr chr_length). The output is obj
with several slots (1) obj$bed
which contains the single-bp Tn5 integration sites for each barcode; (2) obj$gff
genome GFF annotation as TxDb object; (3) obj$chr
a data.frame with the chromosome sizes of the genome. If MACS2 is not in your path, you will receive an error and will be booted from the session. Make sure MACS2 is available!! We will use some freely available data from 10X Genomics that is provided with this package. The Tn5 integration site bed file has been subset to 2 million Tn5 sites for quick processing and to circumvent limitations on file size uploads on GitHub.
# load Socrates library(Socrates) # file paths bed <- system.file("extdata", "test.tn5.bed.gz", package = "Socrates") ann <- system.file("extdata", "gencode.v19.annotation.gff3.gz", package = "Socrates") chr <- system.file("extdata", "hg19.txt", package = "Socrates") # load data obj <- loadBEDandGenomeData(bed, ann, chr)
A particularly useful metric of data quality can be estimated by counting the proportion of Tn5 integration sites that are localized to significantly enriched regions of the genome. To estimate the fraction of reads in peaks (FRiP), Socrates
can run MACS2 on the bulk Tn5 integration sites and count the number of integration sites per barcode. Code for identifying ACRs is shown below:
# call ACRs obj <- callACRs(obj, genomesize=2.5e9, shift= -50, extsize=100, fdr=0.05, output="bulk_peaks", tempdir="./macs2_temp", verbose=T)
Here, we specify the total mappable sequences for the target genome (for maize, this value is ~1.6e9), shift the 1bp Tn5 integration sites -50 bp, and extend the pseudoframent 100bp. This results in a pseudofragment centered on the Tn5 integration site 100 bp in length that is used for identifying genomic regions with significant enrichment at an FDR at 0.05.
Once we have identified ACRs, we can now start to build the metadata file. Here we use a 2-kb window surrounding TSS for estimating the proportion of Tn5 integration sites near genes.
# build metadata obj <- buildMetaData(obj, tss.window=2000, verbose=TRUE) # view object head(obj$meta)
The metadata slot contains valuable information for removing low-quality barcodes. Filtering cells based on total read depth (at least 1000 Tn5 integration sites), maximum number of cells (16000), proportion of reads mapping to TSSs (defaults to 0.2 and z-score of 3), and barcode FRiP scores (defaults to 0.1 and z-score of 2) is shown below. You can save the images to a PDF by setting the prefix
argument to a non-NULL character string.
# filter cells obj <- findCells(obj, doplot=T, max.cells=16000, min.tn5=1000, filt.tss=TRUE, filt.frip=TRUE)
Socrates
object for downstream analysisFinally, we can now build a sparse matrix using on the highest quality cells using either ACRs or the genome split into non-overlapping window. The filtered
parameter species whether to use the filtered set of cells (TRUE
) or the unfiltered cells (FALSE
). The windows
parameter denotes the tile size when using the entire genome to build the sparse matrix. If peaks
is set to TRUE
, ACRs are used to build the sparse matrix (which overrides the genomic windows function). Here will build the binary sparse matrix using 1-kb windows tiled over the whole genome.
# generate sparse matrix obj <- generateMatrix(obj, filtered=T, windows=1000, peaks=F, verbose=T)
Finally, we can convert this object into one that can be used by Socrates
for processing, clustering, and other downstream analysis (see here for more details).
# convert to Socrates format for downstream analysis. soc.obj <- convertSparseData(obj, verbose=T) # save QC object saveRDS(obj, file="QC_object.rds")
If you have multiple soc.obj
objects (say from different replicates or experimental samples), you can merge the results into a single object. Just make sure that the different samples/objects will have a shared genome space - the same 1-kb windows or peaks when building the matrices with generateMatrix
. Below is an example of merging two replicates which we name as rep1 and rep2. You can also merge socrates objects that have been saved as RDS files - run mergeSocratesRDS
for more information on arguments.
# merge Socrates objects object.list <- list(rep1=soc.obj1, rep2=soc.obj2) merge.obj <- mergeSocratesRDS(obj.list=object.list) # check object str(merge.obj) ### Session Information ```r sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.