knitr::opts_chunk$set( collapse = FALSE, comment = ">" )
htmltools::img(src = knitr::image_uri(system.file(file.path("extdata", "chraccr_logo.png"), package="ChrAccR")), alt = 'logo', style = 'position:absolute; top:0; right:0; padding:10px; height:96px;')
ChrAccR
is an R package that provides tools for the comprehensive analysis chromatin accessibility data. The package implements methods for data quality control, exploratory analyses (including unsupervised methods for dimension reduction, clustering and quantifying transcription factor activities) and the identification and characterization of differentially accessible regions. It can be used for the analysis of large bulk datasets comprising hundreds of samples as well as for single-cell datasets with 10s to 100s of thousands of cells.
Requiring only a limited set of R commands, ChrAccR
generates analysis reports that can be interactively explored, facilitate a comprehensive analysis overview of a given dataset and are easily shared with collaborators. The package is therefore particularly useful for users with limited bioinformatic expertise, researchers new to chromatin analysis or institutional core facilities providing ATAC-seq as a service. Additionally, the package provides numerous utility functions for custom R scripting that allow more in-depth analyses of chromatin accessibility datasets.
This vignette focuses on the analysis of ATAC-seq data. This data can be directly imported from aligned sequencing reads into a DsATAC
object, the main data structure for working with ATAC-seq data. Using a small example dataset of chromatin accessibility in human immune cells, this vignette covers filtering, normalizing and aggregating count data across genomic regions of interest and introduces convenient utility functions for quality control (QC), data exploration and differential analysis. Additionally, ChrAccR
facilitates the automatic generation of analysis reports using several preconfigured workflows.
We first provide instructions for running the entire analysis in a ''vanilla'' workflow and then dive a little deeper into the underlying data structures and custom analysis methods that ChrAccR
provides for highly individualized analyses.
To install ChrAccR
and its dependencies, use the devtools
installation routine:
# install devtools if not previously installed if (!is.element('devtools', installed.packages()[,"Package"])) install.packages('devtools') # install dependencies devtools::install_github("demuellae/muLogR") devtools::install_github("demuellae/muRtools") devtools::install_github("demuellae/muReportR") # install ChrAccR devtools::install_github("GreenleafLab/ChrAccR")
We also provide complementary packages that contain genome-specific annotations. We highly recommend installing these annotation packages in order to speed up certain steps of the analysis:
# hg38 annotation package install.packages("https://muellerf.s3.amazonaws.com/data/ChrAccR/data/annotation/ChrAccRAnnotationHg38_0.0.1.tar.gz")
Loading ChrAccR
works just like loading any other R package
library(ChrAccR)
The ChrAccRex
data package contains a small example ATAC-seq datasets for ChrAccR
. If you haven't already installed ChrAccRex
use
devtools::install_github("GreenleafLab/ChrAccRex")
Note that many of the plots created in this vignette use the excellent ggplot2
package. The following code loads the package and sets a different plotting theme:
library(ggplot2) # use a grid-less theme theme_set(muRtools::theme_nogrid())
We start by running the entire analysis pipeline on an example dataset. After annotating the samples to be used in the analysis and specifying a couple of analysis parameters, this can essentially be done using just one command (run_atac(...)
).
We provide an example dataset of ATAC-seq profiles of a subset of 33 human T cell samples taken from [@Calderon:2019cb]. This dataset contains BAM files as well as a sample annotation table. If you have not already done so, please download and unzip the dataset:
# download and unzip the dataset datasetUrl <- "https://s3.amazonaws.com/muellerf/data/ChrAccR/data/tutorial/tcells.zip" downFn <- "tcells.zip" download.file(datasetUrl, downFn) unzip(downFn, exdir=".")
A properly curated sample annotation table is essential for the entire analysis. Such an annotation at the minimum should contain sample identifiers, but ideally contains all sorts of available information for each sample, such as phenotypical and technical metadata. This especially includes information relevant for sample gouping and differential analysis. In the example dataset, we provide such an annotation table:
# prepare the sample annotation table sampleAnnotFn <- file.path("tcells", "samples.tsv") bamDir <- file.path("tcells", "bam") sampleAnnot <- read.table(sampleAnnotFn, sep="\t", header=TRUE, stringsAsFactors=FALSE) # add a column that ChrAccR can use to find the correct bam file for each sample sampleAnnot[,"bamFilenameFull"] <- file.path(bamDir, sampleAnnot[,"bamFilename"])
Let us now also set a couple of analysis parameters and reporting preferences. These options can be set using the setConfigElement
function and can be queried using getConfigElement
. A list of all available options is contained in the documentation of setConfigElement
(e.g. use ?setConfigElement
). Here, we tell ChrAccR which annotation columns to focus on, which color schemes to use for some of these annotation, set filtering thresholds for low-coverage regions, the normalization method to be used and specify what comparisons to use for differential analysis:
setConfigElement("annotationColumns", c("cellType", "stimulus", "donor")) setConfigElement("colorSchemes", c( getConfigElement("colorSchemes"), list( "stimulus"=c("U"="royalblue", "S"="tomato"), "cellType"=c("TCD8naive"="purple3", "TCD8EM"="purple4", "TeffNaive"="seagreen3", "TeffMem"="seagreen4") ) )) setConfigElement("filteringCovgCount", 1L) setConfigElement("filteringCovgReqSamples", 0.25) setConfigElement("filteringSexChroms", TRUE) setConfigElement("normalizationMethod", "quantile") diffCompNames <- c( "TeffNaive_U vs TeffMem_U [sampleGroup]", "TCD8naive_U vs TCD8EM_U [sampleGroup]" ) setConfigElement("differentialCompNames", diffCompNames)
ChrAccR
uses sets of predefined genomic regions to summarize count data. By default, this includes 200bp-tiling windows. The following code generates a list of custom region sets that we can optionally use in the analysis. Here, we use 500-bp tiling regions a consensus set of ATAC-seq peaks that were identified in a large cancer dataset [@Corces:2018cp].
# Download an ATAC peak set from a Pan-cancer dataset [doi:10.1126/science.aav1898] cancerPeaks <- read.table("https://api.gdc.cancer.gov/data/116ebba2-d284-485b-9121-faf73ce0a4ec", sep="\t", header=TRUE, comment.char="", quote="", stringsAsFactors=FALSE) cancerPeaks <- GenomicRanges::GRanges(cancerPeaks[,"seqnames"], IRanges::IRanges(start=cancerPeaks[,"start"], end=cancerPeaks[,"end"], names=cancerPeaks[,"name"])) cancerPeaks <- sort(cancerPeaks) # sort # a list of custom region sets regionSetList <- list( tiling500bp = muRtools::getTilingRegions("hg38", width=500L, onlyMainChrs=TRUE), cancerPeaks = cancerPeaks )
After these preparatory steps, we can run through the default ChrAccR
workflow with a single command using the run_atac
function. Supplied with the sample annotation table and a reference to BAM files (or BED files containing fragments), it will automatically run the entire pipeline including importing the dataset, quality control analysis, filtering, count normalization, exploratory analysis and (optionally) differential accessibility analysis.
run_atac("ChrAccR_analysis", "bamFilenameFull", sampleAnnot, genome="hg38", sampleIdCol="sampleId", regionSets=regionSetList)
ChrAccR
data structuresChrAccR
implements data structures for the convenient handling of chromatin accessibility data. Its main class is the DsAcc
which represents the prototype for more specialized S4
classes for different types of accessibility protocols. Here, we focus on ATAC-seq data and the corresponding data structure is called DsATAC
. The example data loaded above is such a DsATAC
object.
Each DsATAC
object contains
For playing around with the ChrAccR
datastructures, we use a reduced version of the [@Calderon:2019cb] dataset which focuses on only a subset of the human genome (approximately 1%). It is provided in the ChrAccRex
package and can be loaded using:
dsa <- ChrAccRex::loadExample("dsAtac_ia_example")
To get an overview of what information is stored in the example dataset, you can just type the object name:
dsa
DsATAC
datasetsHere are some utility functions that summarize the samples contained in the dataset:
# get the number of samples length(dsa) # get the sample names head(getSamples(dsa)) # sample annotation table str(getSampleAnnot(dsa))
To obtain information on the region sets that are contained in the dataset, use:
# the names of the different region types getRegionTypes(dsa) # number of regions for a particular region type getNRegions(dsa, "promoters_gc_protein_coding") # genomic coordinates of a particular region type getCoord(dsa, "promoters_gc_protein_coding")
You can obtain a region-by-sample matrix containing Tn5 insertion counts using the getCounts
function:
cm <- getCounts(dsa, "promoters_gc_protein_coding") str(cm)
If the dataset contains fragment information, you can obtain the coordinates of sequencing fragments for each sample:
# number of fragments for the first 10 samples getFragmentNum(dsa, sampleIds=getSamples(dsa)[1:10]) # start and end coordinates of all fragments for a given sample getFragmentGr(dsa, "TCD8EM_U_1002") # all Tn5 insertion sites (i.e. the concatenation of fragment start and end coordinates) # for a set of samples getInsertionSites(dsa, getSamples(dsa)[1:3])
DsATAC
datasetsChrAccR
provides a number of functions for filtering a dataset and for adding additional data.
For instance, if you want to work with only a subset of the data, you can remove samples and genomic regions from the dataset. The example contains data for resting (unstimulated) and stimulated T cells. The following code removes the stimulated cells and retains only the resting state cells:
isStim <- getSampleAnnot(dsa)[,"stimulus"] == "S" dsa_resting <- removeSamples(dsa, isStim) dsa_resting
Sometimes you will want to focus on certain regions. For instance the following code filters the dataset, such that only the most variably accessible peaks are retained
peakCounts <- getCounts(dsa, "IA_prog_peaks") # identify the 1000 most variable peaks isVar <- rank(-matrixStats::rowVars(peakCounts)) <= 1000 # remove the less-variable peaks from the dataset dsa_varPeaks <- removeRegions(dsa, !isVar, "IA_prog_peaks") getNRegions(dsa_varPeaks, "IA_prog_peaks")
There are also shortcuts for filtering certain regions from the dataset, for example:
# remove those regions that do not have at least 10 insertions in at least 75% of the samples dsa_filtered <- filterLowCovg(dsa, thresh=10L, reqSamples=0.75) # remove regions on chromosome 21 dsa_filtered <- filterChroms(dsa, exclChrom=c("chr21"))
To remove a region type entirely, use removeRegionType
:
dsa_not10k <- removeRegionType(dsa, "t10k") getRegionTypes(dsa_not10k)
ChrAccR
provides several methods for normalizing and adjusting the region count data. These methods can be accessed through the transformCounts
function:
# quantile normalization for all region types dsa_qnorm <- transformCounts(dsa, method="quantile") # RPKM normalization of peak counts dsa_rpkm <- transformCounts(dsa, method="RPKM", regionTypes=c("IA_prog_peaks")) # log2(RPKM) dsa_l2rpkm <- transformCounts(dsa_rpkm, method="log2", regionTypes=c("IA_prog_peaks")) # DESeq2 variance stabilizing transformation (vst) dsa_vst <- transformCounts(dsa, method="vst", regionTypes=c("IA_prog_peaks")) # correct for potential batch effects based on the annotated donor dsa_vst <- transformCounts(dsa_vst, method="batchCorrect", batch=getSampleAnnot(dsa_qnorm)[,"donor"], regionTypes=c("IA_prog_peaks"))
Sometimes you want to aggregate count data over a region set after the DsATAC
object has been created. Maybe you came up with a better peak set or your collaborator provided you with a highly interesting list of putative enhancer elements. These types of region sets can be easily added to DsATAC
objects using the regionAggregation
function. It is most useful, if the dataset contains insertion/fragment data. The following example downloads a set of putative regulatory elements defined by the Ensembl Regulatory Build [@Zerbino:2015bx] of BLUEPRINT epigenome data (http://www.blueprint-epigenome.eu/) and aggregates the ATAC-seq insertions within these regions.
# download the file and prepare a GRanges object regionFileUrl <- "http://medical-epigenomics.org/software/rnbeads/materials//data/regiondb/annotation_hg38_ensembleRegBuildBPall.RData" downFn <- tempfile(pattern="regions", fileext="RData") download.file(regionFileUrl, downFn) regionEnv <- new.env() load(downFn, envir=regionEnv) regionGr <- unlist(regionEnv$regions) # aggregate insertions across the new region type dsa_erb <- regionAggregation(dsa, regionGr, "ERB", signal="insertions") dsa_erb
You can also merge samples, e.g. from different technical or biological replicates. In the following example, we merge all biological replicates (originating from different donors) for the same cell type and stimulation conditions. This grouping information is contained in the sampleGroup
annotation column. By default, counts will be added up for each aggregated region, but you can specify other aggregation methods such as the mean or median in the countAggrFun
argument.
dsa_merged <- mergeSamples(dsa, "sampleGroup", countAggrFun="sum") getSamples(dsa_merged)
In many analysis workflow, you want to save an R dataset to disk in order to use it in downstream analysis using custom R scripts. Note that R's workflow of saveRDS
readRDS
is generally not applicable for saving and loading DsATAC
datasets, because ChrAccR
optionally uses disk-backed data structures in order to be more memory efficient in the analysis. Instead, please use ChrAccR
's functions:
dest <- file.path(tempdir(), "myDsAtacDataset") saveDsAcc(dsa, dest) dsa_reloaded <- loadDsAcc(dest) dsa_reloaded
If you ran the vanilla analysis using ` as described above, you automatically generated and saved
DsATACdatasets. They are stored in the
datasubdirectory of your analysis directory (
ChrAccR_analysis`):
dsa_vanilla <- loadDsAcc(file.path("ChrAccR_analysis", "data", "dsATAC_processed")) dsa_vanilla
bam
filesATAC-seq data can be read from BAM files that contain aligned reads. These reads originate from fragments whose endpoints resemble Tn5 insertion sites. All ChrAccR
needs to read the input files is a sample annotation table and pointers to the BAM files. The following code reads the sample annotation table from the supplied text file and adds a column that contains the BAM file locations on your disk for each sample. Additionally, we prepare a list of region sets (genome-wide tiling regions with window sizes of 500bp and 5kb) that we want to aggregate the insertion counts over. Using the dataset downloaded and prepared in the 'Vanilla analysis' section of this vignette, we can easily create a DsATAC
dataset using the DsATAC.bam
function. Supplied with the sample annotation table and a reference to the BAM files, it will automatically determine Tn5 insertion sites from the BAM files and aggregate counts over the optionally supplied list of region sets (if no such list is provided, 200bp-tiling windows will be used as the default):
dsa_fromBam <- DsATAC.bam(sampleAnnot, "bamFilenameFull", "hg38", regionSets=regionSetList, sampleIdCol="sampleId")
ChrAccR
provides a way to bundle certain types of analyses into workflows that automatically create analysis reports. These HTML-based analysis reports can be viewed in any web browser and provide a convenient overview of the analysis results that can easily shared with collaborators. ChrAccR
provides the following workflows:
summary
: dataset characterization and quality controlexploratory
: exploratory analyses such as dimension reduction, clustering, TF motif enrichment, etc.differential
: differential accessibility analysis between groups of samplesProvided with a DsATAC
dataset, the workflows can be started using the createReport_X()
functions (where X
is the name of the workflow). Workflow options can be specified using the setConfigElement()
function for setting ChrAccR
's options. Here are some option examples:
# exclude the promoter and 't10k' (10kb tiling windows) region types from the analysis setConfigElement("regionTypes", setdiff(getRegionTypes(dsa), c("promoters_gc_protein_coding", "t10k"))) # see the current option setting getConfigElement("regionTypes") # use the following sample annotation columns for exploratory analyses setConfigElement("annotationColumns", c("cellType", "donor", "stimulus")) # add a custom color scheme for the 'stimulus' and 'cellType' annotation columns setConfigElement("colorSchemes", c( getConfigElement("colorSchemes"), list( "stimulus"=c("U"="#7fbc41", "S"="#de77ae"), "cellType"=c("TCD8naive"="#9ecae1", "TCD8EM"="#08519c", "TeffNaive"="#016c59", "TeffMem"="#014636") ) ))
To see which options are available use the help function for setConfigElement
?setConfigElement
The createReport_X()
take a DsATAC
dataset as input and write their results to a report directory. Not that this directory can only contain one report for each analysis workflow and the function will result in an error if the directory already contains a report of that type.
To get an overview of an ATAC-seq dataset use createReport_summary()
. The resulting report will contain a summary of the sample annotation, genomic coverage by sequencing reads and QC metrics such as a summary of the fragment size distribution and enrichment of Tn5 insertions at transcription start sites (TSS enrichment).
reportDir <- file.path(".", "ChrAccR_reports") # create the report (takes ~10 min on the example dataset) createReport_summary(dsa, reportDir)
The exploratory analysis reports comprises unsupervised analysis results, such as dimension reduction plots and sample clustering. Additionally, TF motif activity for peak region sets (those region sets that contain 'peak'
in their name) is assessed using the chromVAR
package [@Schep:2017je].
# create the report (takes ~1 min on the example dataset) createReport_exploratory(dsa_qnorm, reportDir)
The differential analysis workflow focuses on the pairwise comparison between groups of samples. The resulting report contains summary tables and plot for each comparison (e.g. volcano plots). Differentially accessible regions are identified using DESeq2
and multiple p-value and rank cutoffs.
# differential analysis option settings setConfigElement("differentialColumns", c("stimulus", "cellType")) # adjust for the donor annotation in the differential test setConfigElement("differentialAdjColumns", c("donor")) # create the report (takes ~18 min on the example dataset) createReport_differential(dsa, reportDir)
To quickly obtain a dimension reduction plot based on a specific set of regions in a DsATAC
dataset, you can use the getDimRedCoords.X
functions from the muRtools
package:
cm <- getCounts(dsa_qnorm, "IA_prog_peaks") pcaCoord <- muRtools::getDimRedCoords.pca(t(cm)) muRtools::getDimRedPlot(pcaCoord, annot=getSampleAnnot(dsa_qnorm), colorCol="cellType", shapeCol="stimulus")
The fragment size distribution of an ATAC-seq can provide valuable quality control information. Typically, we expect a large number of fragments with sub-nucleosomal length (length < 200bp) and further peaks corresponding to 1,2, ... nucleosomes (roughly multiples of 200bp in length). Here is how you can plot a corresponding density estimation of the fragment length distribution for a specific sample in the dataset.
plotInsertSizeDistribution(dsa, "TCD8EM_U_1002")
The enrichment of the number of fragments at transcription start sites (TSS) over genomic background is a valuable indicator for the signal-to-noise ratio in ATAC-seq data. In ChrAccR
, the TSS enrichment can be computed for each sample in a dataset using the getTssEnrichment
function. You will need the coordinates of TSSs in the genome as input and the function will return an object comprising a ggplot
object as well as a numeric score that quantifies the enrichment at the promoter (default definition: TSS +/- 2kb) vs a background (default: 100bp outside the promoter window). Absolute and smoothed counts will be used for the computation.
# prepare a GRanges object of TSS coordinates tssGr <- muRtools::getAnnotGrl.gencode("gencode.v27")[["gene"]] tssGr <- tssGr[elementMetadata(tssGr)[,"gene_type"]=="protein_coding"] tssGr <- promoters(tssGr, upstream=0, downstream=1) tssGr # compute TSS enrichment tsse <- getTssEnrichment(dsa, "TCD8EM_U_1002", tssGr) # enrichment score: number of insertions at the TSS # over number of insertion in background regions tsse$tssEnrichment # plot tsse$plot
daTab <- getDiffAcc(dsa, "IA_prog_peaks", "stimulus", grp1Name="S", grp2Name="U", adjustCols=c("cellType", "donor"))
ChromVAR
Genome-wide aggregates across potential binding sites for transcription factors can be extremely useful in the interpretation of chromatin accessibility. The chromVAR
package is designed to quantify the overall TF motif accessibility for each sample in a dataset [@Schep:2017je]. Using the getChromVarDev
function, we can obtain these deviation scores from DsATAC
datasets:
cvRes <- getChromVarDev(dsa, "IA_prog_peaks", motifs="jaspar") devZ <- chromVAR::deviations(cvRes) str(devZ) # plot a clustered heatmap of the first 20 motifs pheatmap::pheatmap(devZ[1:20,])
To get an overview of the activity of certain transcription factors, you can generate footprint plots which are aggreagtions of insertion counts in windows surrounding all occurrences of a motif genome-wide. ChrAccR normalizes these counts using kmer frequencies and also plots the background distribution using all insertion sites (not just the ones occurring near motifs). You can compute the data used to generate these plots using the getMotifFootprints
function. Note that currently the computation of the kmer biases is quite compute intense. We therefor recommend to limit the analysis to a few samples and motifs.
motifNames <- c("MA1419.1_IRF4", "MA0139.1_CTCF", "MA0037.3_GATA3") # motifNames <- grep("(IRF4|CTCF|GATA3)$", names(prepareMotifmatchr("hg38", "jaspar")$motifs), value=TRUE) # alternative by searching for patterns samples <- c("TeffNaive_U_1001", "TeffNaive_U_1002", "TeffMem_U_1001", "TeffMem_U_1002") fps <- getMotifFootprints(dsa, motifNames, samples) # the results of the footprinting as data.frame for later plotting str(fps[["MA0139.1_CTCF"]]$footprintDf) # the result also contains a plot object, which can directly be drawn fps[["MA0139.1_CTCF"]]$plot
ChrAccR
offers utility functions to convert accessibility data to other commonly used data structures for Bioconductor-based workflows. For instance, to export count data aggregated over a certain region type as a SummarizedExperiments
object you can use:
se <- getCountsSE(dsa, "IA_prog_peaks") se
To export data as a DESeq2
dataset for differential accessibility analysis:
dds <- getDESeq2Dataset(dsa, "IA_prog_peaks", designCols=c("donor", "stimulus", "cellType")) dds
Sys.Date()
sessionInfo()
link-citations: yes references: - id: Zerbino:2015bx title: The Ensembl Regulatory Build author: - family: Zerbino given: Daniel R - family: Wilder given: Steven P - family: Johnson given: Nathan - family: Juettemann given: Thomas - family: Flicek given: Paul R container-title: Genome Biology DOI: 10.1186/s13059-015-0621-5 volume: 16 number: 1 page: 56 issued: year: 2015 type: article-journal - id: Schep:2017je title: "chromVAR: inferring transcription-factor-associated accessibility from single-cell epigenomic data" author: - family: Schep given: Alicia N - family: Wu given: Beijing - family: Buenrostro given: Jason D - family: Greenleaf given: William J container-title: Nature methods DOI: 10.1038/nmeth.4401 volume: 14 number: 10 page: 975-978 issued: year: 2017 type: article-journal - id: Calderon:2019cb title: Landscape of stimulation-responsive chromatin across diverse human immune cells author: - family: Calderon - family: Nguyen - family: Mezger - family: Kathiria - family: Müller - family: Nguyen - family: Lescano - family: Wu - family: Trombetta - family: Ribado - family: Knowles - family: Gao - family: Blaeschke - family: Parent - family: Burt - family: Anderson - family: Criswell - family: Greenleaf - family: Marson - family: Pritchard container-title: Nature Genetics DOI: 10.1038/s41588-019-0505-9 volume: 51 number: 10 page: 1494-1505 issued: year: 2019 type: article-journal - id: Corces:2018cp title: The chromatin accessibility landscape of primary human cancers author: - family: Corces - family: Granja - family: Shams - family: Louie - family: Seoane - family: Zhou - family: Silva - family: Groeneveld - family: Wong - family: Cho - family: Satpathy - family: Mumbach - family: Hoadley - family: Robertson - family: Sheffield - family: Felau - family: Castro - family: Berman - family: Staudt - family: Zenklusen - family: Laird - family: Curtis - family: Cancer Genome Atlas Network - family: Greenleaf - family: Chang container-title: Science DOI: 10.1126/science.aav1898 volume: 362 number: 6413 page: eaav1898 issued: year: 2018 type: article-journal
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.