require(knitr) opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE)
This package contains several ChIP-seq datasets for use in differential binding (DB) analyses:
These datasets are mainly used in the r Biocpkg("chipseqDB") workflow [@lun2015reads] and the r Biocpkg("csaw") user's guide [@lun2016csaw].
This vignette will briefly demonstrate how to obtain each dataset and investigate some of the processing statistics.
We obtain the H3K9ac dataset from r Biocpkg("ExperimentHub") using the H3K9acData() function.
This downloads sorted and indexed BAM files to a local cache, along with the associated index files.
The function returns a DataFrame of file paths and sample descriptions to further use in workflows.
library(chipseqDBData) h3k9ac.paths <- H3K9acData() h3k9ac.paths
Note that the time-consuming download only occurs upon the first use of the function. Later uses will simply re-use the same files, thus avoiding the need to re-download these large files. (Some readers may notice that the paths point to the temporary directory, which is destroyed at the end of each R session. Here, the temporary directory contains only soft-links to the persistent BAM files in the local cache. This is a low-cost illusion to ensure that the index files have the same prefixes as the BAM files.)
The same approach is used for all of the other datasets, e.g., CBPData(), NFYAData().
Be aware that the initial download time will depend on the size and number of the BAM files in each dataset.
We use functions from the r Biocpkg("Rsamtools") package to examine the mapping statistics.
This includes the number of mapped reads, the number of marked reads (i.e., potential PCR duplicates) and the number of high-quality alignments with high mapping scores.
library(Rsamtools) diagnostics <- list() for (i in seq_len(nrow(h3k9ac.paths))) { stats <- scanBam(h3k9ac.paths$Path[[i]], param=ScanBamParam(what=c("mapq", "flag"))) flag <- stats[[1]]$flag mapq <- stats[[1]]$mapq mapped <- bitwAnd(flag, 0x4)==0 diagnostics[[h3k9ac.paths$Name[i]]] <- c( Total=length(flag), Mapped=sum(mapped), HighQual=sum(mapq >= 10 & mapped), DupMarked=sum(bitwAnd(flag, 0x400)!=0) ) } diag.stats <- data.frame(do.call(rbind, diagnostics)) diag.stats$Prop.mapped <- diag.stats$Mapped/diag.stats$Total*100 diag.stats$Prop.marked <- diag.stats$DupMarked/diag.stats$Mapped*100 diag.stats
More comprehensive quality checks are beyond the scope of this document, but can be performed with other packages such as r Biocpkg("ChIPQC").
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.