megadepth quick start guide

knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    crop = NULL ## Related to https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html
)
## Track time spent on making the vignette
startTime <- Sys.time()

## Bib setup
library("RefManageR")

## Write bibliography information
bib <- c(
    R = citation(),
    BiocStyle = citation("BiocStyle")[1],
    knitr = citation("knitr")[1],
    rmarkdown = citation("rmarkdown")[1],
    sessioninfo = citation("sessioninfo")[1],
    testthat = citation("testthat")[1],
    megadepth = citation("megadepth")[1],
    megadepthpaper = citation("megadepth")[2],
    bioconductor2015 = RefManageR::BibEntry(
        bibtype = "Article",
        key = "bioconductor2015",
        author = "Wolfgang Huber and Vincent J Carey and Robert Gentleman and Simon Anders and Marc Carlson and Benilton S Carvalho and Hector Corrada Bravo and Sean Davis and Laurent Gatto and Thomas Girke and Raphael Gottardo and Florian Hahne and Kasper D Hansen and Rafael A Irizarry and Michael Lawrence and Michael I Love and James MacDonald and Valerie Obenchain and Andrzej K Oleś and Hervé Pagès and Alejandro Reyes and Paul Shannon and Gordon K Smyth and Dan Tenenbaum and Levi Waldron and Martin Morgan",
        title = "Orchestrating high-throughput genomic analysis with Bioconductor",
        year = 2015, doi = "10.1038/nmeth.3252",
        journal = "Nature Methods",
        journaltitle = "Nat Methods"
    ),
    RefManageR = citation("RefManageR")[1],
    rtracklayer = citation("rtracklayer")
)

The goal of r Biocpkg("megadepth") is to provide an R interface to the command line tool Megadepth for BigWig and BAM related utilities created by Christopher Wilks r Citep(bib[["megadepthpaper"]]). This R package enables fast processing of BigWig files on downstream packages such as dasper and recount3. The Megadepth software also provides utilities for processing BAM files and extracting coverage information from them.

Here is an illustration on how fast r Biocpkg("megadepth") is compared to other tools for processing local and remote BigWig files.

knitr::include_graphics("https://raw.githubusercontent.com/LieberInstitute/megadepth/master/analysis/md_rt_pybw_runtime.png")

Throughout the documentation we use a capital M to refer to the software by Christopher Wilks and a lower case m to refer to this R/Bioconductor package.

Basics

Install megadepth

R is an open-source statistical environment which can be easily modified to enhance its functionality via packages. r Biocpkg("megadepth") is a R package available via the Bioconductor repository for packages. R can be installed on any operating system from CRAN after which you can install r Biocpkg("megadepth") by using the following commands in your R session:

if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

BiocManager::install("megadepth")

## Check that you have a valid Bioconductor installation
BiocManager::valid()

Required knowledge

r Biocpkg("megadepth") is based on many other packages and in particular in those that have implemented the infrastructure needed for dealing with RNA-seq data and high-throughput sequencing data in general. You might benefit from being familiar with the BigWig file format and the r Biocpkg("rtracklayer") for importing those files into R as well as exporting BED files r Citep(bib[["rtracklayer"]]). If you are working with annoation files, r Biocpkg("GenomicFeatures") and r Biocpkg("GenomicRanges") will also be useful to you.

If you are asking yourself the question "Where do I start using Bioconductor?" you might be interested in this blog post.

Asking for help

As package developers, we try to explain clearly how to use our packages and in which order to use the functions. But R and Bioconductor have a steep learning curve so it is critical to learn where to ask for help. The blog post quoted above mentions some but we would like to highlight the Bioconductor support site as the main resource for getting help: remember to use the megadepth tag and check the older posts. Other alternatives are available such as creating GitHub issues and tweeting. However, please note that if you want to receive help you should adhere to the posting guidelines. It is particularly critical that you provide a small reproducible example and your session information so package developers can track down the source of the error.

Citing megadepth

We hope that r Biocpkg("megadepth") will be useful for your research. Please use the following information to cite the package and the overall approach. Thank you!

## Citation info
citation("megadepth")

Quick start

To get started, we need to load the r Biocpkg("megadepth") package into our R session. This will load all the required dependencies.

library("megadepth")

Once we have the R package loaded, we need to install the Megadepth software. We can do so with install_megadepth(), which downloads a binary for your OS (Linux, Windows or macOS) ^[Please check Megadepth for instructions on how to compile the software from source if the binary version doesn't work for you.]. We can then use with an example BigWig file to compute the coverage at a set of regions.

## Install the latest version of Megadepth
install_megadepth(force = TRUE)

Next, we might want to use r Biocpkg("megadepth") to quantify the coverage at a set of regions of the genome of interest to us. Here we will use two example files that are include in the package for illustration and testing purposes. One of them is a bigWig file that contains the base-pair coverage information for a sample of interest and the second one is BED file which contains the genomic region coordinates of interest. So we first locate them.

## Next, we locate the example BigWig and annotation files
example_bw <- system.file("tests", "test.bam.all.bw",
    package = "megadepth", mustWork = TRUE
)
annotation_file <- system.file("tests", "testbw2.bed",
    package = "megadepth", mustWork = TRUE
)

## Where are they?
example_bw
annotation_file

Once we have located the example files we can proceed to calculating the base-pair coverage for our genomic regions of interest. There are several ways to do this with r Biocpkg("megadepth"), but here we use the user-friendly function get_coverage(). This function will perform a given operation op on the bigWig file for a given set of regions of interest (annotation). One of those operations is to compute the mean base-pair coverage for each input region. This is what we'll do with our example bigWig file.

## We can then use megadepth to compute the coverage
bw_cov <- get_coverage(
    example_bw,
    op = "mean",
    annotation = annotation_file
)
bw_cov

get_coverage() returns an object that is familiar to r Biocpkg("GenomicRanges") users, that is, a GRanges object that can be used with other Bioconductor software packages r Citep(bib[["bioconductor2015"]]).

This example is just the tip of the iceberg, as Megadepth and thus r Biocpkg("megadepth") can do a lot of useful processing operations on BAM and bigWig files.

Users guide

Command interface

Megadepth is very powerful and can do a lot of different things. The R/Bioconductor package provides two functions for interfacing with Megadepth, megadepth_cmd() and megadepth_shell(). For the first one, megadepth_cmd(), you need to know the actual command syntax you want to use and format it accordingly. If you are more comfortable with R functions, megadepth_shell() uses r BiocStyle::CRANpkg("cmdfun") to power this interface and capture the standard output stream into R.

To make it easier to use, megadepth includes functions that simplify the number of arguments, read in the output files, and converts them into R/Bioconductor friendly objects, such as get_coverage() illustrated previously in the quick start section.

We hope that you'll find megadepth and Megadepth useful for your work. If you are interested in checking how fast megadepth is, check out the speed analysis comparison against other tools. Note that the size of the files used and the number of genomic regions queried will affect the speed comparisons.

## R-like interface
## that captures the standard output into R
head(megadepth_shell(help = TRUE))

## Command-like interface
megadepth_cmd("--help")
x <- megadepth_shell(help = TRUE)
cat(paste0(x, "\n"))

BAM to bigWig

One use case of Megadepth is to convert BAM files to bigWig coverage files. To simplify this process and verify that you are not accidentally overwriting valuable files, r Biocpkg("megadepth") provides the function bam_to_bigwig(). To illustrate this functionality, we first locate an example BAM file. We then generate the output bigWig files

## Find the example BAM file
example_bam <- system.file("tests", "test.bam",
    package = "megadepth", mustWork = TRUE
)

## Create the BigWig file
## Currently Megadepth does not support this on Windows
example_bw <- bam_to_bigwig(example_bam, overwrite = TRUE)

## Path to the output files generated by bam_to_bigwig()
example_bw

Currently this functionality does not work on Windows. In which case, you can continue the vignette with the following example bigWig file.

## On Windows, use the example bigWig file that is already included in
## the R package
example_bw <- system.file("tests", "test.bam.all.bw",
    package = "megadepth", mustWork = TRUE
)

Summarize coverage over regions

Once you have a biWig file, you might want to quantify the mean or total expression across a set of genomic coordinates. bigWig files are typically used by genome browsers, and as part of the r Biocpkg('recount') and r Biocpkg('recount3') projects we have released thousands of them. Before we expand more complex uses cases, you might be interested in get_coverage(). This function will use Megadepth to create a tab-separated value (TSV) file containing the coverage summary information ^[Sum, mean, min or max base-pair coverage for each region.] for a given input file that can be read into R using read_coverage() ^[If you prefer a tibble, use read_coverage_table().] as shown below with an example set of genomic regions of interest (annotation).

## Next, we locate the example annotation BED file
annotation_file <- system.file("tests", "testbw2.bed",
    package = "megadepth", mustWork = TRUE
)

## Now we can compute the coverage
bw_cov <- get_coverage(example_bw, op = "mean", annotation = annotation_file)
bw_cov

If you are familiar with r Biocpkg("rtracklayer"), you'll notice that the coverage summaries are basically the same to the one that can be generated with rtracklayer::import.bw(), which is what r Biocpkg("derfinder") uses internally.

## Checking with derfinder and rtracklayer
bed <- rtracklayer::import(annotation_file)

## The file needs a name
names(example_bw) <- "example"

## Read in the base-pair coverage data
regionCov <- derfinder::getRegionCoverage(
    regions = bed,
    files = example_bw,
    verbose = FALSE
)

## Summarize the base-pair coverage data.
## Note that we have to round the mean to make them comparable.
testthat::expect_equivalent(
    round(sapply(regionCov[c(1, 3:4, 2)], function(x) mean(x$value)), 2),
    bw_cov$score,
)

## If we compute the sum, there's no need to round
testthat::expect_equivalent(
    sapply(regionCov[c(1, 3:4, 2)], function(x) sum(x$value)),
    get_coverage(example_bw, op = "sum", annotation = annotation_file)$score,
)

BAM to junctions

Megadepth provides utilities that might be of use for future work or that were developed for building r Biocpkg('recount3'). One of these features is the possibility to extract locally co-ocurring junctions from a BAM file as described in the Megadepth documentation. This feature works only for junctions for which a read or (read pair) has 2 or more junctions.

To illustrate this functionality, we will use an example BAM file and generate the locally co-occurring junction table with bam_to_junctions(). We'll then read in the data using read_junction_table() ^[The strand columns have been switched from 0s and 1s to + and - for the forard and reverse strands, to match frequently used Bioconductor packages.].

## Find the example BAM file
example_bam <- system.file("tests", "test.bam",
    package = "megadepth", mustWork = TRUE
)

## Run bam_to_junctions()
example_jxs <- bam_to_junctions(example_bam, overwrite = TRUE)

## Path to the output file generated by bam_to_junctions()
example_jxs

## Read the data as a tibble using the format specified at
## https://github.com/ChristopherWilks/megadepth#megadepth-pathtobamfile---junctions
read_junction_table(example_jxs)

Teams involved

r Biocpkg('megadepth') was made possible to David Zhang, the author of r Biocpkg("dasper"), and a member of the Mina Ryten's lab at UCL.

The ReCount family involves the following teams:

Other related tools

The ReCount team has worked on several software solutions and projects that complement each other and enable you to re-use public RNA-seq data. Another Bioconductor package that you might be highly interested in is r Biocpkg("snapcount"), which allows you to use the Snaptron web services. In particular, r Biocpkg("snapcount") is best for queries over a particular subset of genes or intervals across all or most of the samples in recount2/Snaptron.

We remind you that the main documentation website for all the recount3-related projects is available at recount.bio. Please check that website for more information about how this R/Bioconductor package and other tools are related to each other.

Thank you!

P.S. An alternative version of this vignette is available that was made using r CRANpkg("pkgdown").

Reproducibility

The r Biocpkg("megadepth") package r Citep(bib[["megadepth"]]) was made possible thanks to:

This package was developed using r BiocStyle::Biocpkg("biocthis").

Code for creating the vignette

## Create the vignette
library("rmarkdown")
system.time(render("megadepth.Rmd", "BiocStyle::html_document"))

## Extract the R code
library("knitr")
knit("megadepth.Rmd", tangle = TRUE)

Date the vignette was generated.

## Date the vignette was generated
Sys.time()

Wallclock time spent generating the vignette.

## Processing time in seconds
totalTime <- diff(c(startTime, Sys.time()))
round(totalTime, digits = 3)

R session information.

## Session info
library("sessioninfo")
options(width = 120)
session_info()

Bibliography

This vignette was generated using r Biocpkg("BiocStyle") r Citep(bib[["BiocStyle"]]) with r CRANpkg("knitr") r Citep(bib[["knitr"]]) and r CRANpkg("rmarkdown") r Citep(bib[["rmarkdown"]]) running behind the scenes.

Citations made with r CRANpkg('RefManageR') r Citep(bib[['RefManageR']]).

## Print bibliography
PrintBibliography(bib, .opts = list(hyperlink = "to.doc", style = "html"))


Try the megadepth package in your browser

Any scripts or data that you put into this service are public.

megadepth documentation built on Feb. 5, 2021, 2:03 a.m.