The most dramatic change to programming in R in recent years has been the development of the tidyverse by Hadley Wickham et al., which, combined with the ingenious %>% pipe from magrittr, provides a uniform philosophy for handling data.
The genomics community has an alternative set of approaches, for which Bioconductor and the GenomicRanges package provide the basis. GenomicRanges and the underlying IRanges package provide a great set of methods for dealing with intervals as they are typically encountered in genomics.
Unfortunately, it is not always easy to combine those two worlds: many common operations in GenomicRanges focus solely on the ranges and lose the additional metadata columns. On the other hand, the tidyverse does not provide a unified set of methods for common set operations on intervals.
That changed recently, when the fuzzyjoin package was extended with the genome_join method for combining genomic data stored in a data.frame. It demonstrated that genomic data can be handled appropriately with the tidy philosophy.
The tidygenomics package extends the limited set of methods provided by the fuzzyjoin package for dealing with genomic data. Its API is inspired by the very popular bedtools:
genome_intersect
genome_subtract
genome_join_closest
genome_cluster
genome_complement
genome_join (provided by the fuzzyjoin package)

    library(dplyr)
    library(tidygenomics)

genome_intersect joins two data frames based on their genomic overlap. Unlike the genome_join function, it updates the boundaries to reflect the overlap of the regions.

    x1 <- data.frame(id = 1:4,
                     chromosome = c("chr1", "chr1", "chr2", "chr2"),
                     start = c(100, 200, 300, 400),
                     end = c(150, 250, 350, 450))

    x2 <- data.frame(id = 1:4,
                     chromosome = c("chr1", "chr2", "chr2", "chr1"),
                     start = c(140, 210, 400, 300),
                     end = c(160, 240, 415, 320))

    # Keep only the overlapping part of each matching pair of regions
    genome_intersect(x1, x2, by = c("chromosome", "start", "end"), mode = "both")

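The boundary update can be illustrated with a minimal base R sketch. The helper overlap_bounds is hypothetical and not part of tidygenomics or fuzzyjoin. It assumes closed intervals on the same chromosome and only shows the arithmetic behind an intersection, not how the package implements it.

```r
# Hypothetical helper (not part of tidygenomics): overlap of two closed
# intervals that are assumed to lie on the same chromosome.
overlap_bounds <- function(start1, end1, start2, end2) {
  start <- max(start1, start2)  # the overlap begins at the later start
  end <- min(end1, end2)        # and stops at the earlier end
  if (start > end) {
    return(NULL)                # the intervals do not overlap
  }
  c(start = start, end = end)
}

# First row of x1 (chr1:100-150) against first row of x2 (chr1:140-160)
overlap_bounds(100, 150, 140, 160)
```

Under these assumptions the overlap of the two regions runs from 140 to 150, which is the kind of shrunken boundary the description above refers to.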
genome_subtract subtracts one data frame from the other. This can be used to split the x data frame into smaller areas.

    x1 <- data.frame(id = 1:4,
                     chromosome = c("chr1", "chr1", "chr2", "chr1"),
                     start = c(100, 200, 300, 400),
                     end = c(150, 250, 350, 450))

    x2 <- data.frame(id = 1:4,
                     chromosome = c("chr1", "chr2", "chr1", "chr1"),
                     start = c(120, 210, 300, 400),
                     end = c(125, 240, 320, 415))

    # Remove the regions of x2 from the regions of x1
    genome_subtract(x1, x2, by = c("chromosome", "start", "end"))

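To show why subtraction can split a region into smaller areas, here is a minimal base R sketch. The helper subtract_interval is hypothetical and not part of tidygenomics. It assumes closed integer coordinates, so the remaining pieces end one position before and start one position after the removed interval; the package's own boundary convention may differ.

```r
# Hypothetical helper (not part of tidygenomics): remove one closed integer
# interval from another and return the remaining pieces as a list.
subtract_interval <- function(start, end, sub_start, sub_end) {
  # No overlap: the original interval is returned unchanged
  if (sub_end < start || sub_start > end) {
    return(list(c(start = start, end = end)))
  }
  pieces <- list()
  if (sub_start > start) {
    # Part of the interval survives to the left of the removed range
    pieces <- c(pieces, list(c(start = start, end = sub_start - 1)))
  }
  if (sub_end < end) {
    # Part of the interval survives to the right of the removed range
    pieces <- c(pieces, list(c(start = sub_end + 1, end = end)))
  }
  pieces
}

# First row of x1 (chr1:100-150) minus first row of x2 (chr1:120-125)
subtract_interval(100, 150, 120, 125)
```

In this sketch the interval 100-150 is split into two pieces, 100-119 and 126-150, because the removed range lies strictly inside it.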
genome_join_closest joins two data frames based on their genomic location. If no exact overlap is found, the next closest interval is used.

    x1 <- tibble(id = 1:4,
                 chr = c("chr1", "chr1", "chr2", "chr3"),
                 start = c(100, 200, 300, 400),
                 end = c(150, 250, 350, 450))

    x2 <- tibble(id = 1:4,
                 chr = c("chr1", "chr1", "chr1", "chr2"),
                 start = c(220, 210, 300, 400),
                 end = c(225, 240, 320, 415))

    # Left join: every row of x1 is kept and paired with its closest region in x2
    genome_join_closest(x1, x2, by = c("chr", "start", "end"),
                        distance_column_name = "distance", mode = "left")

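The idea of a "closest" interval can be sketched in base R. The helper interval_gap is hypothetical and not part of tidygenomics. It assumes the distance between two intervals on the same chromosome is the size of the gap between them, with 0 for overlapping intervals. The values the package reports in its distance column may follow a different convention.

```r
# Hypothetical helper (not part of tidygenomics): gap between one interval
# and a vector of candidate intervals, all assumed to be on one chromosome.
interval_gap <- function(start1, end1, start2, end2) {
  # A positive difference means the intervals are separated by a gap;
  # overlapping intervals are clamped to a distance of 0.
  pmax(0, pmax(start1, start2) - pmin(end1, end2))
}

# First row of x1 (chr1:100-150) against the three chr1 rows of x2
gaps <- interval_gap(100, 150, c(220, 210, 300), c(225, 240, 320))
gaps             # 70 60 150
which.min(gaps)  # 2, i.e. the candidate chr1:210-240 is the closest
```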
genome_cluster adds a new column with the cluster when two intervals overlap or lie within max_distance of each other.

    x1 <- data.frame(id = 1:4,
                     bla = letters[1:4],
                     chromosome = c("chr1", "chr1", "chr2", "chr1"),
                     start = c(100, 120, 300, 260),
                     end = c(150, 250, 350, 450))

    # Cluster overlapping intervals only
    genome_cluster(x1, by = c("chromosome", "start", "end"))

    # Also cluster intervals that are at most 10 positions apart
    genome_cluster(x1, by = c("chromosome", "start", "end"), max_distance = 10)

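The clustering rule can be sketched in base R as a single sweep over the intervals sorted by start. The helper cluster_intervals is hypothetical and not part of tidygenomics. It handles one chromosome only and numbers clusters from 1, which may differ from the numbering the package uses.

```r
# Hypothetical helper (not part of tidygenomics): assign cluster ids to
# intervals on ONE chromosome. Ids start at 1 in this sketch.
cluster_intervals <- function(start, end, max_distance = 0) {
  ids <- integer(length(start))
  current_id <- 0L
  reach <- -Inf  # furthest end position seen in the current cluster
  for (i in order(start)) {
    # Open a new cluster when the interval starts beyond the current reach
    # plus the allowed gap
    if (start[i] > reach + max_distance) {
      current_id <- current_id + 1L
    }
    ids[i] <- current_id
    reach <- max(reach, end[i])
  }
  ids
}

# The chr1 rows of x1: 100-150, 120-250 and 260-450
cluster_intervals(c(100, 120, 260), c(150, 250, 450))                     # 1 1 2
cluster_intervals(c(100, 120, 260), c(150, 250, 450), max_distance = 10)  # 1 1 1
```

With max_distance = 10 the third interval starts exactly 10 positions after the previous cluster ends, so in this sketch it is merged into the same cluster.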
genome_complement calculates the complement of a genomic region.

    x1 <- data.frame(id = 1:4,
                     chromosome = c("chr1", "chr1", "chr2", "chr1"),
                     start = c(100, 200, 300, 400),
                     end = c(150, 250, 350, 450))

    genome_complement(x1, by = c("chromosome", "start", "end"))

genome_join is the classical join function based on the overlap of the intervals. It is implemented and maintained in the fuzzyjoin package and documented here only for completeness.

    x1 <- tibble(id = 1:4,
                 chr = c("chr1", "chr1", "chr2", "chr3"),
                 start = c(100, 200, 300, 400),
                 end = c(150, 250, 350, 450))

    x2 <- tibble(id = 1:4,
                 chr = c("chr1", "chr1", "chr1", "chr2"),
                 start = c(220, 210, 300, 400),
                 end = c(225, 240, 320, 415))

    # The same join with three different modes
    fuzzyjoin::genome_join(x1, x2, by = c("chr", "start", "end"), mode = "inner")
    fuzzyjoin::genome_join(x1, x2, by = c("chr", "start", "end"), mode = "left")
    fuzzyjoin::genome_join(x1, x2, by = c("chr", "start", "end"), mode = "anti")