require(knitr)
opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE)

Introduction

This package contains several ChIP-seq datasets for use in differential binding (DB) analyses:

These datasets are mainly used in the r Biocpkg("chipseqDB") workflow [@lun2015reads] and the r Biocpkg("csaw") user's guide [@lun2016csaw]. This vignette will briefly demonstrate how to obtain each dataset and investigate some of the processing statistics.

Obtaining each dataset

We obtain the H3K9ac dataset from r Biocpkg("ExperimentHub") using the H3K9acData() function. This downloads sorted and indexed BAM files to a local cache, along with the associated index files. The function returns a DataFrame of file paths and sample descriptions to further use in workflows.

library(chipseqDBData)
h3k9ac.paths <- H3K9acData()
h3k9ac.paths

Note that the time-consuming download only occurs upon the first use of the function. Later uses will simply re-use the same files, thus avoiding the need to re-download these large files. (Some readers may notice that the paths point to the temporary directory, which is destroyed at the end of each R session. Here, the temporary directory contains only soft-links to the persistent BAM files in the local cache. This is a low-cost illusion to ensure that the index files have the same prefixes as the BAM files.)

The same approach is used for all of the other datasets, e.g., CBPData(), NFYAData(). Be aware that the initial download time will depend on the size and number of the BAM files in each dataset.

Investigating mapping statistics

We use functions from the r Biocpkg("Rsamtools") package to examine the mapping statistics. This includes the number of mapped reads, the number of marked reads (i.e., potential PCR duplicates) and the number of high-quality alignments with high mapping scores.

library(Rsamtools)
diagnostics <- list()
for (i in seq_len(nrow(h3k9ac.paths))) {
    stats <- scanBam(h3k9ac.paths$Path[[i]], 
        param=ScanBamParam(what=c("mapq", "flag")))
    flag <- stats[[1]]$flag
    mapq <- stats[[1]]$mapq

    mapped <- bitwAnd(flag, 0x4)==0
    diagnostics[[h3k9ac.paths$Name[i]]] <- c(
        Total=length(flag), 
        Mapped=sum(mapped),
        HighQual=sum(mapq >= 10 & mapped),
        DupMarked=sum(bitwAnd(flag, 0x400)!=0)
    )
}
diag.stats <- data.frame(do.call(rbind, diagnostics))
diag.stats$Prop.mapped <- diag.stats$Mapped/diag.stats$Total*100
diag.stats$Prop.marked <- diag.stats$DupMarked/diag.stats$Mapped*100
diag.stats

More comprehensive quality checks are beyond the scope of this document, but can be performed with other packages such as r Biocpkg("ChIPQC").

Session information

sessionInfo()

References



LTLA/chipseqDBData documentation built on July 3, 2021, 9:09 a.m.