suppressPackageStartupMessages({
  library(ATACseqQC)
  library(ChIPpeakAnno)
  library(BSgenome.Hsapiens.UCSC.hg19)
  library(TxDb.Hsapiens.UCSC.hg19.knownGene)
  library(MotifDb)
  library(GenomicAlignments)
})
knitr::opts_chunk$set(warning=FALSE, message=FALSE)

Introduction

Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq) is a method to identify open chromatin regions in a genome-wide manner. This method involves using a transposase to cut sections of accessable DNA selectively [@buenostro2013transposition]. ATAC-seq identifies regions highly corrrelated with similar methods used to identify these regions, DNase-seq [@boyle2008high] and FAIRE-seq [@giresi2007faire]. ATAC-seq has become a popular method for this task because it is faster and easier to perform, has a higher signal to noise ration, and does not require as large of an amount of biological samples when comapred with other methods [@buenostro2013transposition].

Hidden Markov ModeleR for ATAC-seq (HMMRATAC) is a semi-supervised machine learning appraoch for identifying open chromatin regions from ATAC-seq data. The idea behind HMMRATAC is conceptualized upon the idea of "decomposition and integration. The ATAC-seq data is decomposed into different layers of coverage signals corresponding to the sequnced DNA fragments originated from nucleosomal regions. The relationship between the layers of signals at open chromatin regions is then learned using a Hidden Markov Model and utilized for predicting open chromatin. HMMRATAC is capable of outperforming similar methods in identifying chromatin structure and transcription factor binding sites. [@hmmratac2019tarbell]

A typical analysis pipeline begins with aligning sequencing reads to a reference genome, then using HMMRATAC to identify of accessible regions or "peaks" in the chromatin. Next motif enrichment can be performed with MEME, footprint identificaiton with CENTIPEDE, or differential analysis using Diffbind. Quality control measurments can take place after each of these steps using ATACseqQC.

This Bioconductor package has been made to utilize the original Java program for HMMRATAC and make it availiable for use within Bioconductor.

Installation

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install('HMMRATAC')

Example Pipeline

The rest of this vignette will detail how to use this package and others for the purpose of performing several types analysis of on an ATACSeq sample to confirm some experimental results.

Our example stems from the article "β-Glucan Reverses the Epigenetic State of LPS-Induced Immunological Tolerance" [@novakovic2016β-glucan]. In this article, it can be understand that monocytes undergo functional reprogramming as apart of the innate immune response after exposure to microbial lipopolysaccharide (LPS). LPS-treated monocytes fail to accumulate active histone markers in promoters ad enhancers of lipid metabolism and phagocytic pathways. In contrast, β-glucan reverses the LPS-induced tolerance and reinduces their capacity for cytokine production.

An integrated epigenomic apprach is envoked to characterize the molecular events involved with LPS-induced tolerance in a time-dependant manner. Time-dependant ATACseq was performed on naive macrophages, LPS-exposed tolerized macrophages, and BG-exposed trained macrophages. Measurements were taken at one hour, 4 hour, 24 hour, and 6 day time frames.

This ATACseq data is avaible from the Short Reads Archieve (SRA)

In this example pipeline, we seek to validate the findings of the article by performing transcription factor and functional analysis with the following steps: - Sequence alignment of reads using Bowtie2 - Use of Rsamtools to generate bam files and file necessary for peak-calling - Quality control using ATACseqQC - Using HMMRATAC for peak-calling - ChIPSeekeR - Using diffbind to perform differential binding ananlysis between the treatments - A pipeline to identify significant transcription factors using MotifDb and TFBSTools - A pipeline for functional enrichment analysis using rGREAT - gViz

Due to the size of the files and the complexity of computations being performed, most of the functionality of the vignette is not meant to be run by the user in the context of an example. Instead, the results of the computations will be included at the end of their section.

Pleae load the data from the HMMRATACData package for these computations

library(HMMRATACData)
data(HMMRATACData)

Sequence Alignment Using Bowtie

We begin with a BAM file containing pair-end ATAC-seq reads. Normally, this alignment is performed by a sequence aligner such as BOWTIE2 or BWA, but for the scope of this workflow, we will simply give an example of how to run bowtie on pairend count data like what can be obtained from the SRA link above. The vignette does not require that the reader performs this step. The ${SRR} in this example bach command corresponds to the name of a sample. Information on the use of bowtie can be found here

bowtie --chunkmbs 8000 -S -t hg19 -1 ${SRR}.1.fastq -2 ${SRR}.2.fastq ${SRR}.sam

Generate necessary files using RSamtools

The Rsamtools package provides a range of functionality to read and manipulate bam files. Rsamtools is the Bioconductor implementation of Samtools.

Using the BAM file that we just loaded, we will generate the remaining two files needed to run HMMRATAC using Rsamtools. Our next step is to create an index file. Rsamtools can do this with a sorted BAM file. We generate an index file as follow:

library(Rsamtools)

## Generate sorted bam file and save to a temporary location.
bam <- sortBam(file = bam, destination = tempfile())

## Create the index file.
index <- indexBam(file = bam)

The last file that we will generate is the genome.info file. This files shows

## Get list of headers.
genome <- scanBamHeader(file = bam)

## Extract targets and filter list
targets <- genome[[bam]]$targets
idx <- grepl("\\.[0-9]+$", names(targets))
genome_info <- targets[!idx]
genome_info <- data.frame(genome_info)

## Write table.
genome <- tempfile()
write.table(genome_info, genome, col.names = FALSE, quote = FALSE)

ATACseqQC

ATACseqQC is a package used for quickly assesing the quality of ATAC-seq data. It specializes in the use of

After performing the initial alignment, the quality of the alignment can be assesed using ATACseqQC

library(ATACseqQC)
estimateLibComplexity(readsDupFreq(bam))
fragSize <- fragSizeDist(bam, "Frag Sizes")

Footprint identification can also be performed with ATACseqQC.

library(BSgenome.Hsapiens.UCSC.hg19)
library(MotifDb)
CTCF <- query(MotifDb, c("CTCF"))
CTCF <- as.list(CTCF)
seqlev <- "chr1"

sigs <- factorFootprints(bam, pfm=CTCF[[1]], 
                         genome=Hsapiens,
                         min.score="90%", seqlev=seqlev,
                         upstream=100, downstream=100)

featureAlignedHeatmap(sigs$signal, 
                      feature.gr=reCenterPeaks(sigs$bindingSites,
                                               width=200+width(sigs$bindingSites[1])), 
                      annoMcols="score",
                      sortBy="score",
                      n.tile=ncol(sigs$signal[[1]]))

Using HMMRATAC for Peak Calling

HMMRATAC requires a large allocation of memory to run that rJava is not able to initially allocate to it. The amount of memory needed varies by the size of the data being modeled. For this example, we allocate 8GB of memory before loading the HMMRATAC package:

options(java.parameters = "-Xmx8000m")
library(HMMRATAC)

Using the original bam file as well as the index and genome info files that we generated, we can run the HMMRATAC package's main function.

HMMRATAC(bam = bam, index = index, genome = genome)

HMMRATAC output files

The following files are generated in an output folder from a run of HMMRATAC:

Name.log

This is a log file generated as HMMRATAC runs. It contains all inputted arguments, the results of the fragment distribution EM algorithm, and various status updates as HMMRATAC progresses, as well as a textual description of the generated model. When finished, it also reports the total amount of actual (real-world) time it took for HMMRATAC to run.

Name.bedgraph

This is the genome-wide state annotation file created if the --bedgraph option is set to true. It is a four column, tab-delimited file, with the following fields:

Name.model

This file contains the model that was generated by HMMRATAC and was used to decode the genome. It is always recommended to check this model after running HMMRATAC. Please note that this file is a binary file that can only be read by HMMRATAC. To see the actual textual description of the model, see the log file. If HMMRATAC runs to completion but produces poor results, it is usually the result of a faulty model. For instance, if the model shows “NaN” for all of the parameters, such as transition or emission probabilities, this usually means that something went wrong with either the choice of training regions or generating the signal tracks. This can occur with too low or too high of a fold-change range for choosing training regions, if the BED file of training regions was faulty or if certain signals need to be trimmed off (due to small numbers of larger fragments). Generally speaking, this poor model occurs when there is not enough separation in the signal tracks between the states. Common fixes include, trimming the signal tracks (option --trim), changing the number of states (option -k), choosing a different fold change range for training site identification (option -u and -l) or changing the list of previously annotated training regions (option -t). Additionally, checking the model ensures that the predicted model is created. Although rare, it is possible for a different state to be better indicative of the open state, than is normally the case (HMMRATAC sorts the model so the last or highest numbered state should be the open state). If this happens, then the resulting peak file may be incorrect. In such cases, it is recommended to report a bedgraph file and extract the proper state (of certain lengths and with the accompanying flanking regions) and manually create a peak BED file. We have tested numerous ATAC-seq datasets, using our default parameters and we generally don’t encounter problems, unless we radically change those settings.

Name_peaks.gappedPeak

This is the standard peak file reported by HMMRATAC by default (when the -p , --peaks option is set to true). The first line in the file is a track line for genome browsers. After that, it is a 15 column, tab-delimited file containg the following fields:

Name_summits.bed

This is a BED file containing the summits of each HMMRATAC peak. Summits are determined by finding the position within the open state region, whose Gaussian-smoothed read coverage is the maximum over the entire region. This is only outputted if a peak file is also outputted. If motif analysis is being conducted, it is recommended to use these summits as the points of interest, as the summits tend to be better indicative of TF binding as opposed to peak centers. This is a four column, tab-delimited BED file containing the following fields:

DiffBind

The DiffBind package specializes in identifying sites that are differentially bound between two sample groups. It also contains functionality for processing peak sets, counting sequencing reads overlapping interals in peak sets, and identifying statistically significant differentially bound sites based on evidence of binding affinity.

Here, we create a sampleSheet that describes the treatments and provide it as input.

library(DiffBind)

dba <- dba(sampleSheet='sample_sheet.csv')
count <- dba.count(dba, summits=500)
contrast <- dba.contrast(count, categories=DBA_CONDITION)
analyze <- dba.analyze(contrast)
bed <- dba.report(analyze)

TFBS analysis

In this section, we analysis the enrichment of transcrption factors in our dataset. First, the JASPAR2018 database is accessed to find pwm information for humans. Then a hypergeometric test if performed.

library(MotifDb)
library(TFBSTools)
library(JASPAR2018)
library("BSgenome.Hsapiens.UCSC.hg19")
library("TxDb.Hsapiens.UCSC.hg19.knownGene")
library(GenomicFeatures)

## Step 1: Match JASPAR TFs to chromosomes
opts <- list()
opts['species'] <- '9606'
pfm <- getMatrixSet(JASPAR2018, opts)
pwm <- toPWM(pfm)
#pwm <- pwm[1:10]

bs <- BSgenome.Hsapiens.UCSC.hg19
chrs <- lapply(1:22, function(i) {
    bs[[paste0('chr', i)]]
})

pro <- promoters(TxDb.Hsapiens.UCSC.hg19.knownGene)

## Step 2: Perform ORA to find signficant TFBS's
run_seq3 <- function(seqname, motif_id, pwms, bsgenome, peaks, promoters) {
    message("Motif ID: ", motif_id, " -- Seqname: ", seqname)
    ## Find binding sites on chromosome
    siteseq <- searchSeq(pwms[motif_id], bsgenome[[seqname]], seqname = seqname, min.score="80%", strand = "*")
    tfbs <- as(siteseq, "GRanges")
    promoters_chr <- promoters[seqnames(promoters) == seqname]
    peaks_chr <- peaks[seqnames(peaks) == seqname]

    ## Get promoters with TF
    idx <- promoters_chr %over% tfbs
    promoters_with_tf <- promoters_chr[idx]
    promoters_without_tf <- promoters_chr[!idx]

    ## Summary statistics for phyper
    npromoters_with_tf <- sum(countOverlaps(promoters_chr, tfbs) > 0)
    npromoters_without_tf <- length(promoters_chr) - npromoters_with_tf

    npromoters_with_tf_in_open_chromatin <- sum(countOverlaps(peaks_chr, promoters_with_tf) > 0)
    npromoters_without_tf_in_open_chromatin <- sum(countOverlaps(peaks_chr, promoters_without_tf) > 0)

    ## Return values
    c(q = npromoters_with_tf_in_open_chromatin,
      m = npromoters_with_tf,
      n = npromoters_without_tf,
      k = npromoters_with_tf_in_open_chromatin + npromoters_without_tf_in_open_chromatin)
}

perform_run_by_motif <- function(motif_id, pwm, bs, peaks, pro) {
    message('Running Motif: ', motif_id)
    seqnames <- paste0('chr', 1:22)
    motif_result <- lapply(seqnames, run_seq3, motif_id = motif_id, pwms = pwm, bsgenome = bs, peaks = peaks, promoters = pro)
    red <- Reduce(`+`, motif_result)

perform_run_by_motif <- function(motif_id, pwm, bs, peaks, pro) {
    message('Running Motif: ', motif_id)
    seqnames <- paste0('chr', 1:22)
    motif_result <- lapply(seqnames, run_seq3, motif_id = motif_id, pwms = pwm, bsgenome = bs, peaks = peaks, promoters = pro)
    red <- Reduce(`+`, motif_result)
    do.call(phyper, as.list(red))
}

run_motif_analysis <- function(peak_file_name, output_file_name) {
    motif_names <- names(pwm)
    motif_symbols <- vapply(pwm, function(motif) {motif@name}, character(1))
    total_results <- lapply(motif_names, perform_run_by_motif, pwm = pwm, bs = bs, peaks = peaks, pro = pro)
    total_results <- matrix(total_results)
    rownames(total_results) <- motif_symbols
    write.table(total_results, file = output_file_name, col.names = TRUE, quote=FALSE)
}

peak_LPS <- system.file("extdata", "sigs_LPS.bed", package="HMMRATAC")
peak_BG <- system.file("extdata", "sigs_BG.bed", package="HMMRATAC")
run_motif_analysis(peaks_LPS, "sig_motifs_LPS.txt")
run_motif_analysis(peals_BG, "sig_motifs_BG.txt")

The pre-computed results from the the HMMRATACData package are available. The significant transcription factors can be found as follows.

head(bg_motifs)
head(lps_motifs)

rGREAT functional analysis

rGREAT is a package that can read peak files and infer which pathways are enriched using functional enrichment analysis

library(rGREAT)
job <- submitGreatJob(bed)
tb <- getEnrichmentTables(job)
look <- tb[[1]]
res <- look[look$Hyper_Adjp_BH < 0.05,]

The pre-computed results from the the HMMRATACData package are available. The significant pathways can be found as follows.

head(bg_pathways)
head(lps_pathways)

ChIPseeker

The ChIPseeker package is meant to perform statistical testing of significant overlap amoung ChIP seq data sets. It also incorporates open access database GEO to allow users to compare their dataset to those availiable on GEO.

peak <- readPeakFile('65283_chr22_peaks.gappedPeak')

Display peak locations across entire genome

covplot(peak, weightCol="weights")

Perform peak annotation

peakAnno <- annotatePeak(files[[4]], tssRegion=c(-3000, 3000),
                         TxDb=txdb, annoDb="org.Hs.eg.db")

plotAnnoPie(peakAnno)

upsetplot(peakAnno, vennpie=TRUE)

plotDistToTSS(peakAnno,
              title="Distribution of transcription factor-binding loci relative to TSS")

Session Info

sessionInfo()


Bioconductor/HMMRATAC documentation built on July 31, 2020, 3:07 p.m.