require(knitr) opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE)
This package contains several ChIP-seq datasets for use in differential binding (DB) analyses:
These datasets are mainly used in the r Biocpkg("chipseqDB")
workflow [@lun2015reads] and the r Biocpkg("csaw")
user's guide [@lun2016csaw].
This vignette will briefly demonstrate how to obtain each dataset and investigate some of the processing statistics.
We obtain the H3K9ac dataset from r Biocpkg("ExperimentHub")
using the H3K9acData()
function.
This downloads sorted and indexed BAM files to a local cache, along with the associated index files.
The function returns a DataFrame
of file paths and sample descriptions to further use in workflows.
library(chipseqDBData) h3k9ac.paths <- H3K9acData() h3k9ac.paths
Note that the time-consuming download only occurs upon the first use of the function. Later uses will simply re-use the same files, thus avoiding the need to re-download these large files. (Some readers may notice that the paths point to the temporary directory, which is destroyed at the end of each R session. Here, the temporary directory contains only soft-links to the persistent BAM files in the local cache. This is a low-cost illusion to ensure that the index files have the same prefixes as the BAM files.)
The same approach is used for all of the other datasets, e.g., CBPData()
, NFYAData()
.
Be aware that the initial download time will depend on the size and number of the BAM files in each dataset.
We use functions from the r Biocpkg("Rsamtools")
package to examine the mapping statistics.
This includes the number of mapped reads, the number of marked reads (i.e., potential PCR duplicates) and the number of high-quality alignments with high mapping scores.
library(Rsamtools) diagnostics <- list() for (i in seq_len(nrow(h3k9ac.paths))) { stats <- scanBam(h3k9ac.paths$Path[[i]], param=ScanBamParam(what=c("mapq", "flag"))) flag <- stats[[1]]$flag mapq <- stats[[1]]$mapq mapped <- bitwAnd(flag, 0x4)==0 diagnostics[[h3k9ac.paths$Name[i]]] <- c( Total=length(flag), Mapped=sum(mapped), HighQual=sum(mapq >= 10 & mapped), DupMarked=sum(bitwAnd(flag, 0x400)!=0) ) } diag.stats <- data.frame(do.call(rbind, diagnostics)) diag.stats$Prop.mapped <- diag.stats$Mapped/diag.stats$Total*100 diag.stats$Prop.marked <- diag.stats$DupMarked/diag.stats$Mapped*100 diag.stats
More comprehensive quality checks are beyond the scope of this document, but can be performed with other packages such as r Biocpkg("ChIPQC")
.
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.