#| label: knitr-opts #| echo: false knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.align = "center" )
#| label: init #| echo: false #| message: false library(valr) library(dplyr) library(ggplot2) library(tibble)
The functions in valr
have similar names to their BEDtools
counterparts, and so will be familiar to users coming from the BEDtools
suite. Similar to pybedtools
, valr
has a terse syntax:
#| label: valr-demo #| message: false library(valr) library(dplyr) snps <- read_bed(valr_example("hg19.snps147.chr22.bed.gz")) genes <- read_bed(valr_example("genes.hg19.chr22.bed.gz")) # find snps in intergenic regions intergenic <- bed_subtract(snps, genes) # distance from intergenic snps to nearest gene nearby <- bed_closest(intergenic, genes) nearby |> select(starts_with("name"), .overlap, .dist) |> filter(abs(.dist) < 1000)
valr
assigns common column names to facilitate comparisons between tbls. All tbls will have chrom
, start
, and end
columns, and some tbls from multi-column formats will have additional pre-determined column names. See the read_bed()
documentation for details.
#| label: file-io bed_file <- valr_example("3fields.bed.gz") read_bed(bed_file) # accepts filepaths or URLs
valr
can also operate on BED-like data.frames already constructed in R, provided that columns named chrom
, start
and end
are present. New tbls can also be constructed as either tibbles
or base R data.frames
.
#| label: trbl-ivls bed <- tribble( ~chrom, ~start, ~end, "chr1", 1657492, 2657492, "chr2", 2501324, 3094650 ) bed
valr
adheres to the BED format which specifies that the start position for an interval is zero based and the end position is one-based. The first position in a chromosome is 0. The end position for a chromosome is one position passed the last base, and is not included in the interval. For example:
#| label: zero-based # a chromosome 100 basepairs in length chrom <- tribble( ~chrom, ~start, ~end, "chr1", 0, 100 ) chrom # single base-pair intervals bases <- tribble( ~chrom, ~start, ~end, "chr1", 0, 1, # first base of chromosome "chr1", 1, 2, # second base of chromosome "chr1", 99, 100 # last base of chromosome ) bases
Remote databases can be accessed with db_ucsc()
(to access the UCSC Browser) and db_ensembl()
(to access Ensembl databases).
#| label: db #| eval: false # access the `refGene` tbl on the `hg38` assembly. if (require(RMariaDB)) { ucsc <- db_ucsc("hg38") tbl(ucsc, "refGene") }
The bed_glyph()
tool illustrates the results of operations in valr
, similar to those found in the BEDtools
documentation. This glyph shows the result of intersecting x
and y
intervals with bed_intersect()
:
#| label: intersect-glyph x <- tribble( ~chrom, ~start, ~end, "chr1", 25, 50, "chr1", 100, 125 ) y <- tribble( ~chrom, ~start, ~end, "chr1", 30, 75 ) bed_glyph(bed_intersect(x, y))
And this glyph illustrates bed_merge()
:
#| label: merge-glyph x <- tribble( ~chrom, ~start, ~end, "chr1", 1, 50, "chr1", 10, 75, "chr1", 100, 120 ) bed_glyph(bed_merge(x))
The group_by
function in dplyr can be used to perform functions on subsets of single and multiple data_frame
s. Functions in valr
leverage grouping to enable a variety of comparisons. For example, intervals can be grouped by strand
to perform comparisons among intervals on the same strand.
#| label: group-strand x <- tribble( ~chrom, ~start, ~end, ~strand, "chr1", 1, 100, "+", "chr1", 50, 150, "+", "chr2", 100, 200, "-" ) y <- tribble( ~chrom, ~start, ~end, ~strand, "chr1", 50, 125, "+", "chr1", 50, 150, "-", "chr2", 50, 150, "+" ) # intersect tbls by strand x <- group_by(x, strand) y <- group_by(y, strand) bed_intersect(x, y)
Comparisons between intervals on opposite strands are done using the flip_strands()
function:
#| label: strand-opp x <- group_by(x, strand) y <- flip_strands(y) y <- group_by(y, strand) bed_intersect(x, y)
Both single set (e.g. bed_merge()
) and multi set operations will respect groupings in the input intervals.
Columns in BEDtools
are referred to by position:
# calculate the mean of column 6 for intervals in `b` that overlap with `a` bedtools map -a a.bed -b b.bed -c 6 -o mean
In valr
, columns are referred to by name and can be used in multiple name/value expressions for summaries.
#| label: tidy-eval #| eval: false # calculate the mean and variance for a `value` column bed_map(a, b, .mean = mean(value), .var = var(value)) # report concatenated and max values for merged intervals bed_merge(a, .concat = concat(value), .max = max(value))
This demonstration illustrates how to use valr
tools to perform a "meta-analysis" of signals relative to genomic features. Here we to analyze the distribution of histone marks surrounding transcription start sites.
First we load libraries and relevant data.
#| label: tss-demo #| warning: false #| message: false # `valr_example()` identifies the path of example files bedfile <- valr_example("genes.hg19.chr22.bed.gz") genomefile <- valr_example("hg19.chrom.sizes.gz") bgfile <- valr_example("hela.h3k4.chip.bg.gz") genes <- read_bed(bedfile) genome <- read_genome(genomefile) y <- read_bedgraph(bgfile)
Then we generate 1 bp intervals to represent transcription start sites (TSSs). We focus on +
strand genes, but -
genes are easily accommodated by filtering them and using bed_makewindows()
with reversed
window numbers.
#| label: make-tss # generate 1 bp TSS intervals, `+` strand only tss <- genes |> filter(strand == "+") |> mutate(end = start + 1) # 1000 bp up and downstream region_size <- 1000 # 50 bp windows win_size <- 50 # add slop to the TSS, break into windows and add a group x <- tss |> bed_slop(genome, both = region_size) |> bed_makewindows(win_size) x
Now we use the .win_id
group with bed_map()
to calculate a sum by mapping y
signals onto the intervals in x
. These data are regrouped by .win_id
and a summary with mean
and sd
values is calculated.
#| label: bed-map # map signals to TSS regions and calculate summary statistics. res <- bed_map(x, y, win_sum = sum(value, na.rm = TRUE)) |> group_by(.win_id) |> summarize( win_mean = mean(win_sum, na.rm = TRUE), win_sd = sd(win_sum, na.rm = TRUE) ) res
Finally, these summary statistics are used to construct a plot that illustrates histone density surrounding TSSs.
#| lable: plot-tss #| warning: false #| message: false x_labels <- seq( -region_size, region_size, by = win_size * 5 ) x_breaks <- seq(1, 41, by = 5) sd_limits <- aes( ymax = win_mean + win_sd, ymin = win_mean - win_sd ) ggplot( res, aes( x = .win_id, y = win_mean ) ) + geom_point() + geom_pointrange(sd_limits) + scale_x_continuous( labels = x_labels, breaks = x_breaks ) + labs( x = "Position (bp from TSS)", y = "Signal", title = "Human H3K4me3 signal near transcription start sites" ) + theme_classic()
The Python library pybedtools wraps BEDtools.
The R packages GenomicRanges, bedr, IRanges and GenometriCorr provide similar capability with a different philosophy.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.