In fuscada2/PeakMapper: Flexible Mapping of Genomic Coordinates to Nearest Features

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>", 
  fig.width = 8, 
  fig.height = 6
)

Introduction

A common workflow in bioinformatics involves taking a set of genomic coordinates referred to as "peaks" (e.g. areas of protein enrichment detected by ChIP-seq) and annotating these peaks by mapping them onto nearby genomic features (e.g. genes or enhancer elements). PeakMapper allows users to map their peaks onto any set of genomic features that they provide, rather than relying on built-in sets of features. PeakMapper is also capable of mapping a single peak onto multiple different features in cases where one peak overlaps with multiple features. By accounting for the strandedness of provided features, PeakMapper is able to annotate peaks as being upstream or downstream of their closest feature(s). PeakMapper also provides several visualizations of generated peak-feature mappings.

Installation

The latest version of PeakMapper can be installed using the following commands:

require("devtools")
install_github("fuscada2/PeakMapper", build_vignettes = TRUE)
library(PeakMapper)

Motivating example

As an overview of what PeakMapper is capable of and how it's used, let's go through an example. In our example, we've done ChIP-seq to determine locations in the C. elegans genome that carry the histone modification H3K27me3. We also have a list of all the protein-coding genes in the C. elegans genome, which we've obtained from release WS263 of the WormBase database. We'd like to know which of these genes are likely to be regulated by H3K27me3, so that we can perform downstream analyses using this set of genes. To do this, we can map each of our H3K27me3 peaks onto its closest gene, or the genes that are overlapped by the peak. We will use PeakMapper to carry out this mapping.

Step 1: Importing data

As input, PeakMapper takes in files in standard BED format. The first 6 columns of these files must contain, in the following order:

Coordinate chromosome
Coordinate left endpoint
Coordinate right endpoint
Coordinate name
Coordinate score (e.g. quality score, or 0 if coordinates are not scored)
Coordinate strand (+ for forward strand, - for reverse strand, or . for unstranded)

Any columns after these 6 will be ignored by PeakMapper. BED files can be loaded as dataframes into R using the importBED() function, which takes in a file path name corresponding to the BED file. Here, we use the system.file() function to retrieve the file paths to our BED files of H3K27me3 peaks and C. elegans genes, which can be found in PeakMapper's extdata directory:

library(PeakMapper)
pathToPeaks <- system.file("extdata", "H3K27me3PeaksSmall.bed", package = "PeakMapper")
pathToGenes <- system.file("extdata", "WS263Genes.bed", package = "PeakMapper")

pathToPeaks
pathToGenes

We can then pass these file paths to importBED() to create dataframes for later use:

H3K27me3Peaks <- importBED(pathToPeaks)
WS263Genes <- importBED(pathToGenes)

head(H3K27me3Peaks)

head(WS263Genes)

Step 2: Mapping peaks to features

After importing our BED files as dataframes, we can use the mapPeaks() function to map each peak onto its closest feature (i.e. the feature that's the lowest number of base pairs away from the peak). If a peak overlaps with multiple features at the same time, all of those features are considered to be the closest features. mapPeaks() takes, as input, a dataframe of peak coordinates followed by a dataframe of feature coordinates. These coordinates must be in the format returned by importBED().

mappingResults <- mapPeaks(H3K27me3Peaks, WS263Genes)

head(mappingResults)

Each row in the resulting dataframe is a mapping between a single peak and a single feature closest to that peak. A peak will appear in this dataframe multiple times if it has multiple closest features (i.e. if it overlaps with more than 1 feature). The first 6 columns in the dataframe describe the peak, and the following 6 columns describe the closest feature. After that are 3 additional columns:

Peak Position: Whether the peak is upstream or downstream of the feature, and whether or not they overlap (features without strand information are assumed to be on the forward strand)
Peak Distance: The number of base pairs separating the closest endpoints of the peak and the feature. If the peak and feature overlap with each other, this distance is 0
Percent Overlap: The percentage of the feature's length that is overlapped by the peak (i.e. the percentage of feature base pairs contained within both coordinates)

Step 3: Visualizing results

After mapping peaks onto closest features, the resulting dataframe of peak-feature mappings can be used with PeakMapper's plotting functions to visualize aspects of the mapping. PeakMapper comes with 4 different visualization functions:

plotNumMappingsPerPeak()
plotPeakLocations()
plotOverlapDistribution()
plotPeakOrientations()

The plotNumMappingsPerPeak() function counts, for every peak in a supplied peak dataframe, how many closest features each peak was mapped onto (peaks will map onto multiple features in cases of multiple overlaps). These counts are returned by the function, and the function additionally plots the distribution of closest features per peak as a histogram. As input, plotNumMappingsPerPeak() requires a dataframe of peaks as returned by importBED(), and a dataframe of mapping results as returned by mapPeaks(). The peak dataframe must be the same dataframe used to produce the mapping results.

numClosestFeatures <- plotNumMappingsPerPeak(H3K27me3Peaks, mappingResults)
head(numClosestFeatures)

The plotPeakLocations() function takes as input a dataframe of mapping results as returned by mapPeaks(), and plots the distribution of peak locations (in base pairs) relative to the starting point (i.e. the first base at the upstream end, accounting for strandedness) of its closest feature(s). On the resulting plot, the position of each dot gives how many peaks overlap with the interval found that distance upstream or downstream of their closest feature's start point. By default, 2500 base pairs are plotted both upstream and downstream of the feature start point, each using 250 bins that are 10 base pairs wide each.

plotPeakLocations(mappingResults)

The user can manually change the width of the plotted interval by setting the number of bins upstream and downstream of the feature start point, as well as the width of each bin. The number of base pairs plotted upstream/downstream of the feature start point will be the number of bins times the width of each bin. Fewer bases per bin gives a higher resolution view of the distribution of peaks. For example, the following command creates a plot showing 5000 bp upstream and 3000 bp downstream of the feature start point.

plotPeakLocations(mappingResults, upstreamBins = 5000, downstreamBins = 3000, 
                  basesPerBin = 1)

The plotOverlapDistribution() function takes as input a dataframe of mapping results as returned by mapPeaks(), and plots the "Percent Overlap" column as a histogram. This gives the distribution of overlap percentages among all mapped peaks and their closest feature(s). If a peak has multiple closest features (i.e. it overlaps with multiple features), the overlap with each of these features will be plotted separately.

plotOverlapDistribution(mappingResults)

Finally, the plotPeakOrientations() function takes as input a dataframe of mapping results as returned by mapPeaks(), and plots the "Peak Position" column as a bar graph of different categories. This gives the distribution of the positions of mapped peaks, relative to their closest feature(s). The 6 qualitative position categories used here are:

"Downstream (no overlap)" - the peak is downstream of the feature but does not overlap it
"Overlap (downstream)" - the peak overlaps with the feature and also extends downstream of the feature, but not upstream
"Overlap (upstream and downstream)" - the feature is contained entirely within the peak
"Overlap (upstream)" - the peak overlaps with the feature and also extends upstream of the feature, but not downstream
"Overlap (within feature only)" - the peak is contained entirely with the feature
"Upstream (no overlap)" - the peak is upstream of the feature but does not overlap it

If a peak has multiple closest features (i.e. it overlaps with multiple features), its position relative to each of these features will be counted separately.

plotPeakOrientations(mappingResults)

All 4 of PeakMapper's plotting functions can take in additional optional parameters to customize the appearance of the resulting plots. The 5 parameters than can be set in all of these plotting functions are:

mainTitle - the title to appear at the top of the plot
xTitle - the title to appear on the x-axis
yTitle - the title to appear on the y-axis
plotColor - the color of the points or bars on the resulting plot
backgroundStyle - the appearance of the plot background - this must be one of either "grey" (for a grey-panelled background), "blackAndWhite" (for a white-panelled background), or "minimal" (for a background with no panelling).

An example of using these parameters to change the appearance of a plot produced by PeakMapper is shown below:

plotPeakLocations(mappingResults, 
                  mainTitle = "Relative peak positions of H3K27me3 peaks",
                  xTitle = "Relative position (bp)", 
                  yTitle = "# of peaks", 
                  plotColor = "blue", 
                  backgroundStyle = "grey")

Shiny app

PeakMapper also comes with a Shiny app that can be used to interactively run all of the functions outlined here. To launch the PeakMapper app, simply use the runPeakMapperApp() function:

runPeakMapperApp()

The Shiny app contains 3 different sections, each of which corresponds to one of the steps outlined here:

"Step 1: Import datasets" - this section uses the importBED() function to let users import their own data. Users can also preview their imported datasets.
"Step 2: Map peaks to features" - this section uses the mapPeaks() function to let users map imported peaks onto imported features. Users can view the resulting peak-feature mapping as a table, which can be exported as a CSV file
"Step 3: Visualize mapping results" - this section allows users to use PeakMapper's 4 plotting functions in order to visualize mapping results created in the previous step. These plots can be exported as PNGs

References

The peaks in the H3K27me3PeaksSmall.bed dataset were generated by Daniel Fusca using MACS2 (Yong Zhang et al. Model-based Analysis of ChIP-Seq (MACS). Genome Biol (2008) vol. 9 (9) pp. R137) with C. elegans ChIP-seq data provided by Arneet Saltzman (unpublished data). The genes in WS263Genes.bed were downloaded from the WS263 release of the WormBase database (https://www.wormbase.org) and filtered by Daniel Fusca to include only protein-coding genes.

fuscada2/PeakMapper documentation built on Dec. 8, 2019, 12:35 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com