knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)

bedtorch

License: MIT R-CMD-check Codecov test coverage Lifecycle: experimental CodeFactor

For full documentation, refer to the package documentation site.

Motivation

The goal of bedtorch is to provide a fast BED file manipulation tool suite native in R.

In bioinformatics studies, an important type of jobs is related to BED and BED-like file manipulation. To name a few example:

Many of such tasks can be done by some highly-optimized tools, such as bedtools and bedops. However, these are binary tools which only works in shell. Sometimes, especially for developers using R, it's desirable to invoke these tools in R codes.

One solution is using functions that invokes system commands (system2 or system). bedr, a popular package, is one good example. It's a bridge linking between R codes and shell tools. Under the hood, it writes data into the temporary directory, calls certain command line tools such as tabix, bedtools, or bedops, parses the output and re-generate a data frame in R space.

Generally this approach is very helpful. However, it's worth noting some disadvantages.

Another approach is using native R data types to represent BED-like data. R's native data.frame seems to be a good model, but in many scenarios, it's performance is poor. More importantly, it lacks many advanced features compared with GenomicRanges, dplyr and data.table.

GenomicRanges is widely used in bioinformatics community. Although it provides essential building blocks for many tasks, such as findOverlaps, reduce, etc., it doesn't provide high-level features as bedtools does. Using findOverlaps, reduce, and other functions provided by GenomicRanges, users may implement operations such as map, merge, unique intersect, but this still requires lots of coding.

It would be desirable to have an R package, which represent BED-like data by GenomicRanges or data.table objects, and provide native support for common BED manipulations, such as merge, intersect, filter, etc.

Details

bedtorch is an attempt for this. Under the hood, every BED dataset is represented as a GenomicRanges object (it also can represent the dataset as data.table, if preferred. But this is not recommended). Users can apply various operations on it, such as intersect, merge, etc.

In terms of the core computation, bedtorch's performance is comparable to bedtools. However, since no disk IO and data conversion is needed, users can directly manipulate the dataset in memory, therefore in practice, via bedtorch, many tasks can be done about one magnitude faster.

Additionally, by linking to htslib, bedtorch can write data frames directly to BGZIP-format files, and optionally create the tabix index. It can also directly load either local or remote BGZIP-format files at any particular genomic regions. This feature is done by htslib, therefore not requiring external command line tools such as tabix or bgzip. In the case of working with remote files, being capable of loading any arbitrary portion of the file will be very convenient.

BGZIP is a variant of gzip and is widely used in bioinformatics. It's compatible with gzip, with an important additional feature: allows a compressed file to be indexed for fast query. However, R is unaware of BGZIP. It considers all .gz files as gzip-compressed, thus doesn't take advantage of BED index, and cannot output BGZIP files.

For more details, refer to the package reference page.

Installation

# install.packages("devtools")
devtools::install_github("haizi-zh/bedtorch")

Example

Here are several basic examples which shows you how to solve a common problem:

Read a BED file from disk:

library(bedtorch)

## Load BED data
file_path <-
  system.file("extdata", "example2.bed.gz", package = "bedtorch")

bedtbl <- read_bed(file_path, range = "1:3001-4000")
bedtbl

Read a remote BGZIP BED file for a certain region:

# Load remote BGZIP files with tabix index specified
# Here we need to explicitly indicate `compression` as `bgzip` since we are
# using short URL for `tabix_index`, so that the function cannot guess
# compression type using the URL
read_bed(
  "https://git.io/JYATB",
  range = "22:20000001-30000001",
  tabix_index = "https://git.io/JYAkT",
  compression = "bgzip"
)

Write the previous data table to the temporary directory, and create the tabix index alongside:

write_bed(bedtbl,
          file_path = tempfile(fileext = ".bed.gz"),
          tabix_index = TRUE)

Merge intervals, and take the mean of score in each merged group:

Image by bedtools

Image by bedtools

operation = list(mean_score = list(on = "score1", func = mean))
merged <- merge_bed(bedtbl, operation = operation)
head(merged)

Find intersections between two datasets:

Image by bedtools

Image by bedtools

file_path1 <- system.file("extdata", "example_merge.bed", package = "bedtorch")
file_path2 <- system.file("extdata", "example_intersect_y.bed", package = "bedtorch")

tbl_x <- read_bed(file_path1, genome = "hs37-1kg")
tbl_y <- read_bed(file_path2, genome = "hs37-1kg")
intersect_bed(tbl_x, tbl_y)

Shuffle a BED data table across the genome:

Image by bedtools

Image by bedtools

shuffle_bed(tbl_x)

Calculate Jaccard statistics between the two BED data tables:

Image by bedtools

Image by bedtools

jaccard_bed(tbl_x, tbl_y)

Performance

The following is the result of a simple benchmark for three common BED manipulations: merge, intersect and map.

Benchmark platform information:

From the benchmark result, we can see that for all three tasks, bedtorch uses much less time than bedtools.

Note: this does not mean bedtools is indeed slower than bedtorch at performing the actual computation. Don't forget bedtools requires large amount of disk IO, while bedtorch does not.

See also

Many features are inspired by bedtools. Thus, it's helpful to get familiar with bedtool's documentation: https://bedtools.readthedocs.io/en/latest/index.html



haizi-zh/bedtorch documentation built on July 1, 2022, 10:40 a.m.