knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 8, fig.height = 6 )
A common workflow in bioinformatics involves taking a set of genomic coordinates referred to as "peaks" (e.g. areas of protein enrichment detected by ChIP-seq) and annotating these peaks by mapping them onto nearby genomic features (e.g. genes or enhancer elements). PeakMapper allows users to map their peaks onto any set of genomic features that they provide, rather than relying on built-in sets of features. PeakMapper is also capable of mapping a single peak onto multiple different features in cases where one peak overlaps with multiple features. By accounting for the strandedness of provided features, PeakMapper is able to annotate peaks as being upstream or downstream of their closest feature(s). PeakMapper also provides several visualizations of generated peak-feature mappings.
The latest version of PeakMapper can be installed using the following commands:
require("devtools") install_github("fuscada2/PeakMapper", build_vignettes = TRUE) library(PeakMapper)
As an overview of what PeakMapper is capable of and how it's used, let's go through an example. In our example, we've done ChIP-seq to determine locations in the C. elegans genome that carry the histone modification H3K27me3. We also have a list of all the protein-coding genes in the C. elegans genome, which we've obtained from release WS263 of the WormBase database. We'd like to know which of these genes are likely to be regulated by H3K27me3, so that we can perform downstream analyses using this set of genes. To do this, we can map each of our H3K27me3 peaks onto its closest gene, or the genes that are overlapped by the peak. We will use PeakMapper to carry out this mapping.
As input, PeakMapper takes in files in standard BED format. The first 6 columns of these files must contain, in the following order:
Any columns after these 6 will be ignored by PeakMapper. BED files can be loaded
as dataframes into R using the importBED()
function, which takes in a file
path name corresponding to the BED file. Here, we use the system.file()
function to retrieve the file paths to our BED files of H3K27me3 peaks and C.
elegans genes, which can be found in PeakMapper's extdata directory:
library(PeakMapper) pathToPeaks <- system.file("extdata", "H3K27me3PeaksSmall.bed", package = "PeakMapper") pathToGenes <- system.file("extdata", "WS263Genes.bed", package = "PeakMapper") pathToPeaks pathToGenes
We can then pass these file paths to importBED()
to create dataframes for later use:
H3K27me3Peaks <- importBED(pathToPeaks) WS263Genes <- importBED(pathToGenes) head(H3K27me3Peaks) head(WS263Genes)
After importing our BED files as dataframes, we can use the mapPeaks()
function to map each peak onto its closest feature (i.e. the feature that's the
lowest number of base pairs away from the peak). If a peak overlaps with
multiple features at the same time, all of those features are considered to be
the closest features. mapPeaks()
takes, as input, a dataframe of peak
coordinates followed by a dataframe of feature coordinates. These coordinates
must be in the format returned by importBED()
.
mappingResults <- mapPeaks(H3K27me3Peaks, WS263Genes) head(mappingResults)
Each row in the resulting dataframe is a mapping between a single peak and a single feature closest to that peak. A peak will appear in this dataframe multiple times if it has multiple closest features (i.e. if it overlaps with more than 1 feature). The first 6 columns in the dataframe describe the peak, and the following 6 columns describe the closest feature. After that are 3 additional columns:
After mapping peaks onto closest features, the resulting dataframe of peak-feature mappings can be used with PeakMapper's plotting functions to visualize aspects of the mapping. PeakMapper comes with 4 different visualization functions:
plotNumMappingsPerPeak()
plotPeakLocations()
plotOverlapDistribution()
plotPeakOrientations()
The plotNumMappingsPerPeak()
function counts, for every peak in a supplied
peak dataframe, how many closest features each peak was mapped onto (peaks will
map onto multiple features in cases of multiple overlaps). These counts are
returned by the function, and the function additionally plots the distribution
of closest features per peak as a histogram. As input,
plotNumMappingsPerPeak()
requires a dataframe of peaks as returned by
importBED()
, and a dataframe of mapping results as returned by mapPeaks()
.
The peak dataframe must be the same dataframe used to produce the mapping
results.
numClosestFeatures <- plotNumMappingsPerPeak(H3K27me3Peaks, mappingResults) head(numClosestFeatures)
The plotPeakLocations()
function takes as input a dataframe of mapping results
as returned by mapPeaks()
, and plots the distribution of peak locations (in
base pairs) relative to the starting point (i.e. the first base at the upstream
end, accounting for strandedness) of its closest feature(s). On the resulting
plot, the position of each dot gives how many peaks overlap with the interval
found that distance upstream or downstream of their closest feature's start
point. By default, 2500 base pairs are plotted both upstream and downstream of
the feature start point, each using 250 bins that are 10 base pairs wide each.
plotPeakLocations(mappingResults)
The user can manually change the width of the plotted interval by setting the number of bins upstream and downstream of the feature start point, as well as the width of each bin. The number of base pairs plotted upstream/downstream of the feature start point will be the number of bins times the width of each bin. Fewer bases per bin gives a higher resolution view of the distribution of peaks. For example, the following command creates a plot showing 5000 bp upstream and 3000 bp downstream of the feature start point.
plotPeakLocations(mappingResults, upstreamBins = 5000, downstreamBins = 3000, basesPerBin = 1)
The plotOverlapDistribution()
function takes as input a dataframe of mapping
results as returned by mapPeaks()
, and plots the "Percent Overlap" column as a
histogram. This gives the distribution of overlap percentages among all mapped
peaks and their closest feature(s). If a peak has multiple closest features
(i.e. it overlaps with multiple features), the overlap with each of these
features will be plotted separately.
plotOverlapDistribution(mappingResults)
Finally, the plotPeakOrientations()
function takes as input a dataframe of mapping
results as returned by mapPeaks()
, and plots the "Peak Position" column as a
bar graph of different categories. This gives the distribution of the positions of
mapped peaks, relative to their closest feature(s). The 6 qualitative position
categories used here are:
If a peak has multiple closest features (i.e. it overlaps with multiple features), its position relative to each of these features will be counted separately.
plotPeakOrientations(mappingResults)
All 4 of PeakMapper's plotting functions can take in additional optional parameters to customize the appearance of the resulting plots. The 5 parameters than can be set in all of these plotting functions are:
mainTitle
- the title to appear at the top of the plotxTitle
- the title to appear on the x-axisyTitle
- the title to appear on the y-axisplotColor
- the color of the points or bars on the resulting plotbackgroundStyle
- the appearance of the plot background - this must be one
of either "grey" (for a grey-panelled background), "blackAndWhite" (for a
white-panelled background), or "minimal" (for a background with no panelling).An example of using these parameters to change the appearance of a plot produced by PeakMapper is shown below:
plotPeakLocations(mappingResults, mainTitle = "Relative peak positions of H3K27me3 peaks", xTitle = "Relative position (bp)", yTitle = "# of peaks", plotColor = "blue", backgroundStyle = "grey")
PeakMapper also comes with a Shiny app that can be used to interactively run
all of the functions outlined here. To launch the PeakMapper app, simply use
the runPeakMapperApp()
function:
runPeakMapperApp()
The Shiny app contains 3 different sections, each of which corresponds to one of the steps outlined here:
importBED()
function to let
users import their own data. Users can also preview their imported datasets.mapPeaks()
function to
let users map imported peaks onto imported features. Users can view the resulting
peak-feature mapping as a table, which can be exported as a CSV fileThe peaks in the H3K27me3PeaksSmall.bed dataset were generated by Daniel Fusca using MACS2 (Yong Zhang et al. Model-based Analysis of ChIP-Seq (MACS). Genome Biol (2008) vol. 9 (9) pp. R137) with C. elegans ChIP-seq data provided by Arneet Saltzman (unpublished data). The genes in WS263Genes.bed were downloaded from the WS263 release of the WormBase database (https://www.wormbase.org) and filtered by Daniel Fusca to include only protein-coding genes.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.