knitr::opts_chunk$set(echo = TRUE)
For this tutorial, we will be analyzing a single-cell ATAC-seq dataset of adult mouse brain cells provided by 10x Genomics. The following files are used in this vignette, all available through the 10x Genomics website.
View data download code
To download the required data, run the following lines in a shell:
```{bash eval=FALSE} wget http://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_v1_adult_brain_fresh_5k/atac_v1_adult_brain_fresh_5k_filtered_peak_bc_matrix.h5 wget http://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_v1_adult_brain_fresh_5k/atac_v1_adult_brain_fresh_5k_singlecell.csv wget http://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_v1_adult_brain_fresh_5k/atac_v1_adult_brain_fresh_5k_fragments.tsv.gz wget http://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_v1_adult_brain_fresh_5k/atac_v1_adult_brain_fresh_5k_fragments.tsv.gz.tbi
</details> This vignette echoes the commands run in the introductory Signac vignette on [human PBMC](pbmc_vignette.html). We provide the same analysis in a different system to demonstrate performance and applicability to other tissue types, and to provide an example from another species. First load in Signac, Seurat, and some other packages we will be using for analyzing mouse data. ```r if (!requireNamespace("EnsDb.Mmusculus.v79", quietly = TRUE)) BiocManager::install("EnsDb.Mmusculus.v79")
library(Signac) library(Seurat) library(EnsDb.Mmusculus.v79) library(ggplot2) library(patchwork)
counts <- Read10X_h5("../vignette_data/atac_v1_adult_brain_fresh_5k_filtered_peak_bc_matrix.h5") metadata <- read.csv( file = "../vignette_data/atac_v1_adult_brain_fresh_5k_singlecell.csv", header = TRUE, row.names = 1 ) brain_assay <- CreateChromatinAssay( counts = counts, sep = c(":", "-"), genome = "mm10", fragments = '../vignette_data/atac_v1_adult_brain_fresh_5k_fragments.tsv.gz', min.cells = 1 ) brain <- CreateSeuratObject( counts = brain_assay, assay = 'peaks', project = 'ATAC', meta.data = metadata )
We can also add gene annotations to the brain
object for the mouse genome.
This will allow downstream functions to pull the gene annotation information
directly from the object.
# extract gene annotations from EnsDb annotations <- GetGRangesFromEnsDb(ensdb = EnsDb.Mmusculus.v79) # change to UCSC style since the data was mapped to hg19 seqlevels(annotations) <- paste0('chr', seqlevels(annotations)) genome(annotations) <- "mm10" # add the gene information to the object Annotation(brain) <- annotations
Next we compute some useful per-cell QC metrics.
brain <- NucleosomeSignal(object = brain)
We can look at the fragment length periodicity for all the cells, and group by cells with high or low nucleosomal signal strength. You can see that cells which are outliers for the mononucleosomal/ nucleosome-free ratio have different banding patterns. The remaining cells exhibit a pattern that is typical for a successful ATAC-seq experiment.
brain$nucleosome_group <- ifelse(brain$nucleosome_signal > 4, 'NS > 4', 'NS < 4') FragmentHistogram(object = brain, group.by = 'nucleosome_group', region = 'chr1-1-10000000')
The enrichment of Tn5 integration events at transcriptional start sites (TSSs)
can also be an important quality control metric to assess the targeting of Tn5
in ATAC-seq experiments. The ENCODE consortium defined a TSS enrichment score as
the number of Tn5 integration site around the TSS normalized to the number of
Tn5 integration sites in flanking regions. See the ENCODE documentation for more
information about the TSS enrichment score
(https://www.encodeproject.org/data-standards/terms/). We can calculate the TSS
enrichment score for each cell using the TSSEnrichment()
function in Signac.
brain <- TSSEnrichment(brain, fast = FALSE)
brain$high.tss <- ifelse(brain$TSS.enrichment > 2, 'High', 'Low') TSSPlot(brain, group.by = 'high.tss') + NoLegend()
brain$pct_reads_in_peaks <- brain$peak_region_fragments / brain$passed_filters * 100 brain$blacklist_ratio <- brain$blacklist_region_fragments / brain$peak_region_fragments VlnPlot( object = brain, features = c('pct_reads_in_peaks', 'peak_region_fragments', 'TSS.enrichment', 'blacklist_ratio', 'nucleosome_signal'), pt.size = 0.1, ncol = 5 )
We remove cells that are outliers for these QC metrics.
brain <- subset( x = brain, subset = peak_region_fragments > 3000 & peak_region_fragments < 100000 & pct_reads_in_peaks > 40 & blacklist_ratio < 0.025 & nucleosome_signal < 4 & TSS.enrichment > 2 ) brain
brain <- RunTFIDF(brain) brain <- FindTopFeatures(brain, min.cutoff = 'q0') brain <- RunSVD(object = brain)
The first LSI component often captures sequencing depth (technical variation)
rather than biological variation. If this is the case, the component should be
removed from downstream analysis. We can assess the correlation between each LSI
component and sequencing depth using the DepthCor()
function:
DepthCor(brain)
Here we see there is a very strong correlation between the first LSI component and the total number of counts for the cell, so we will perform downstream steps without this component.
Now that the cells are embedded in a low-dimensional space, we can use methods
commonly applied for the analysis of scRNA-seq data to perform graph-based
clustering, and non-linear dimension reduction for visualization. The functions
RunUMAP()
, FindNeighbors()
, and FindClusters()
all come from the Seurat
package.
brain <- RunUMAP( object = brain, reduction = 'lsi', dims = 2:30 ) brain <- FindNeighbors( object = brain, reduction = 'lsi', dims = 2:30 ) brain <- FindClusters( object = brain, algorithm = 3, resolution = 1.2, verbose = FALSE ) DimPlot(object = brain, label = TRUE) + NoLegend()
# compute gene activities gene.activities <- GeneActivity(brain) # add the gene activity matrix to the Seurat object as a new assay brain[['RNA']] <- CreateAssayObject(counts = gene.activities) brain <- NormalizeData( object = brain, assay = 'RNA', normalization.method = 'LogNormalize', scale.factor = median(brain$nCount_RNA) )
DefaultAssay(brain) <- 'RNA' FeaturePlot( object = brain, features = c('Sst','Pvalb',"Gad2","Neurod6","Rorb","Syt6"), pt.size = 0.1, max.cutoff = 'q95', ncol = 3 )
To help interpret the scATAC-seq data, we can classify cells based on an scRNA-seq experiment from the same biological system (the adult mouse brain). We utilize methods for cross-modality integration and label transfer, described here, with a more in-depth tutorial here.
You can download the raw data for this experiment from the Allen Institute website, and view the code used to construct this object on GitHub. Alternatively, you can download the pre-processed Seurat object here.
# Load the pre-processed scRNA-seq data allen_rna <- readRDS("../vignette_data/allen_brain.rds") allen_rna <- UpdateSeuratObject(allen_rna) allen_rna <- FindVariableFeatures( object = allen_rna, nfeatures = 5000 ) transfer.anchors <- FindTransferAnchors( reference = allen_rna, query = brain, reduction = 'cca', dims = 1:30 ) predicted.labels <- TransferData( anchorset = transfer.anchors, refdata = allen_rna$subclass, weight.reduction = brain[['lsi']], dims = 2:30 ) brain <- AddMetaData(object = brain, metadata = predicted.labels)
plot1 <- DimPlot(allen_rna, group.by = 'subclass', label = TRUE, repel = TRUE) + NoLegend() + ggtitle('scRNA-seq') plot2 <- DimPlot(brain, group.by = 'predicted.id', label = TRUE, repel = TRUE) + NoLegend() + ggtitle('scATAC-seq') plot1 + plot2
You can see that the RNA-based classifications are consistent with the UMAP visualization, computed only on the ATAC-seq data.
Here, we find differentially accessible regions between excitatory neurons in different layers of the cortex.
#switch back to working with peaks instead of gene activities DefaultAssay(brain) <- 'peaks' Idents(brain) <- "predicted.id" da_peaks <- FindMarkers( object = brain, ident.1 = c("L2/3 IT"), ident.2 = c("L4", "L5 IT", "L6 IT"), test.use = 'LR', latent.vars = 'nCount_peaks' ) head(da_peaks)
plot1 <- VlnPlot( object = brain, features = rownames(da_peaks)[1], pt.size = 0.1, idents = c("L4","L5 IT","L2/3 IT") ) plot2 <- FeaturePlot( object = brain, features = rownames(da_peaks)[1], pt.size = 0.1, max.cutoff = 'q95' ) plot1 | plot2
open_l23 <- rownames(da_peaks[da_peaks$avg_log2FC > 3, ]) open_l456 <- rownames(da_peaks[da_peaks$avg_log2FC < 3, ]) closest_l23 <- ClosestFeature(brain, open_l23) closest_l456 <- ClosestFeature(brain, open_l456) head(closest_l23)
head(closest_l456)
We can also create coverage plots grouped by cluster, cell type, or any other
metadata stored in the object for any genomic region using the CoveragePlot()
function. These represent pseudo-bulk accessibility tracks, where signal from
all cells within a group have been averaged together to visualize the DNA
accessibility in a region.
# show cell types with at least 50 cells idents.plot <- names(which(table(Idents(brain)) > 50)) CoveragePlot( object = brain, region = c("Neurod6", "Gad2"), idents = idents.plot, extend.upstream = 1000, extend.downstream = 1000, ncol = 1 )
saveRDS(object = brain, file = "../vignette_data/adult_mouse_brain.rds")
Session Info
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.