knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>", 
  fig.width = 8, 
  fig.height = 6
)

Introduction

A common workflow in bioinformatics involves taking a set of genomic coordinates referred to as "peaks" (e.g. areas of protein enrichment detected by ChIP-seq) and annotating these peaks by mapping them onto nearby genomic features (e.g. genes or enhancer elements). PeakMapper allows users to map their peaks onto any set of genomic features that they provide, rather than relying on built-in sets of features. PeakMapper is also capable of mapping a single peak onto multiple different features in cases where one peak overlaps with multiple features. By accounting for the strandedness of provided features, PeakMapper is able to annotate peaks as being upstream or downstream of their closest feature(s). PeakMapper also provides several visualizations of generated peak-feature mappings.

Installation

The latest version of PeakMapper can be installed using the following commands:

require("devtools")
install_github("fuscada2/PeakMapper", build_vignettes = TRUE)
library(PeakMapper)

Motivating example

As an overview of what PeakMapper is capable of and how it's used, let's go through an example. In our example, we've done ChIP-seq to determine locations in the C. elegans genome that carry the histone modification H3K27me3. We also have a list of all the protein-coding genes in the C. elegans genome, which we've obtained from release WS263 of the WormBase database. We'd like to know which of these genes are likely to be regulated by H3K27me3, so that we can perform downstream analyses using this set of genes. To do this, we can map each of our H3K27me3 peaks onto its closest gene, or the genes that are overlapped by the peak. We will use PeakMapper to carry out this mapping.

Step 1: Importing data

As input, PeakMapper takes in files in standard BED format. The first 6 columns of these files must contain, in the following order:

  1. Coordinate chromosome
  2. Coordinate left endpoint
  3. Coordinate right endpoint
  4. Coordinate name
  5. Coordinate score (e.g. quality score, or 0 if coordinates are not scored)
  6. Coordinate strand (+ for forward strand, - for reverse strand, or . for unstranded)

Any columns after these 6 will be ignored by PeakMapper. BED files can be loaded as dataframes into R using the importBED() function, which takes in a file path name corresponding to the BED file. Here, we use the system.file() function to retrieve the file paths to our BED files of H3K27me3 peaks and C. elegans genes, which can be found in PeakMapper's extdata directory:

library(PeakMapper)
pathToPeaks <- system.file("extdata", "H3K27me3PeaksSmall.bed", package = "PeakMapper")
pathToGenes <- system.file("extdata", "WS263Genes.bed", package = "PeakMapper")

pathToPeaks
pathToGenes

We can then pass these file paths to importBED() to create dataframes for later use:

H3K27me3Peaks <- importBED(pathToPeaks)
WS263Genes <- importBED(pathToGenes)

head(H3K27me3Peaks)

head(WS263Genes)

Step 2: Mapping peaks to features

After importing our BED files as dataframes, we can use the mapPeaks() function to map each peak onto its closest feature (i.e. the feature that's the lowest number of base pairs away from the peak). If a peak overlaps with multiple features at the same time, all of those features are considered to be the closest features. mapPeaks() takes, as input, a dataframe of peak coordinates followed by a dataframe of feature coordinates. These coordinates must be in the format returned by importBED().

mappingResults <- mapPeaks(H3K27me3Peaks, WS263Genes)

head(mappingResults)

Each row in the resulting dataframe is a mapping between a single peak and a single feature closest to that peak. A peak will appear in this dataframe multiple times if it has multiple closest features (i.e. if it overlaps with more than 1 feature). The first 6 columns in the dataframe describe the peak, and the following 6 columns describe the closest feature. After that are 3 additional columns:

Step 3: Visualizing results

After mapping peaks onto closest features, the resulting dataframe of peak-feature mappings can be used with PeakMapper's plotting functions to visualize aspects of the mapping. PeakMapper comes with 4 different visualization functions:

The plotNumMappingsPerPeak() function counts, for every peak in a supplied peak dataframe, how many closest features each peak was mapped onto (peaks will map onto multiple features in cases of multiple overlaps). These counts are returned by the function, and the function additionally plots the distribution of closest features per peak as a histogram. As input, plotNumMappingsPerPeak() requires a dataframe of peaks as returned by importBED(), and a dataframe of mapping results as returned by mapPeaks(). The peak dataframe must be the same dataframe used to produce the mapping results.

numClosestFeatures <- plotNumMappingsPerPeak(H3K27me3Peaks, mappingResults)
head(numClosestFeatures)

The plotPeakLocations() function takes as input a dataframe of mapping results as returned by mapPeaks(), and plots the distribution of peak locations (in base pairs) relative to the starting point (i.e. the first base at the upstream end, accounting for strandedness) of its closest feature(s). On the resulting plot, the position of each dot gives how many peaks overlap with the interval found that distance upstream or downstream of their closest feature's start point. By default, 2500 base pairs are plotted both upstream and downstream of the feature start point, each using 250 bins that are 10 base pairs wide each.

plotPeakLocations(mappingResults)

The user can manually change the width of the plotted interval by setting the number of bins upstream and downstream of the feature start point, as well as the width of each bin. The number of base pairs plotted upstream/downstream of the feature start point will be the number of bins times the width of each bin. Fewer bases per bin gives a higher resolution view of the distribution of peaks. For example, the following command creates a plot showing 5000 bp upstream and 3000 bp downstream of the feature start point.

plotPeakLocations(mappingResults, upstreamBins = 5000, downstreamBins = 3000, 
                  basesPerBin = 1)

The plotOverlapDistribution() function takes as input a dataframe of mapping results as returned by mapPeaks(), and plots the "Percent Overlap" column as a histogram. This gives the distribution of overlap percentages among all mapped peaks and their closest feature(s). If a peak has multiple closest features (i.e. it overlaps with multiple features), the overlap with each of these features will be plotted separately.

plotOverlapDistribution(mappingResults)

Finally, the plotPeakOrientations() function takes as input a dataframe of mapping results as returned by mapPeaks(), and plots the "Peak Position" column as a bar graph of different categories. This gives the distribution of the positions of mapped peaks, relative to their closest feature(s). The 6 qualitative position categories used here are:

If a peak has multiple closest features (i.e. it overlaps with multiple features), its position relative to each of these features will be counted separately.

plotPeakOrientations(mappingResults)

All 4 of PeakMapper's plotting functions can take in additional optional parameters to customize the appearance of the resulting plots. The 5 parameters than can be set in all of these plotting functions are:

An example of using these parameters to change the appearance of a plot produced by PeakMapper is shown below:

plotPeakLocations(mappingResults, 
                  mainTitle = "Relative peak positions of H3K27me3 peaks",
                  xTitle = "Relative position (bp)", 
                  yTitle = "# of peaks", 
                  plotColor = "blue", 
                  backgroundStyle = "grey")

Shiny app

PeakMapper also comes with a Shiny app that can be used to interactively run all of the functions outlined here. To launch the PeakMapper app, simply use the runPeakMapperApp() function:

runPeakMapperApp()

The Shiny app contains 3 different sections, each of which corresponds to one of the steps outlined here:

  1. "Step 1: Import datasets" - this section uses the importBED() function to let users import their own data. Users can also preview their imported datasets.
  2. "Step 2: Map peaks to features" - this section uses the mapPeaks() function to let users map imported peaks onto imported features. Users can view the resulting peak-feature mapping as a table, which can be exported as a CSV file
  3. "Step 3: Visualize mapping results" - this section allows users to use PeakMapper's 4 plotting functions in order to visualize mapping results created in the previous step. These plots can be exported as PNGs

References

The peaks in the H3K27me3PeaksSmall.bed dataset were generated by Daniel Fusca using MACS2 (Yong Zhang et al. Model-based Analysis of ChIP-Seq (MACS). Genome Biol (2008) vol. 9 (9) pp. R137) with C. elegans ChIP-seq data provided by Arneet Saltzman (unpublished data). The genes in WS263Genes.bed were downloaded from the WS263 release of the WormBase database (https://www.wormbase.org) and filtered by Daniel Fusca to include only protein-coding genes.



fuscada2/PeakMapper documentation built on Dec. 8, 2019, 12:35 p.m.