extract_sites: Extract subset of sites in a set of intervals
In jokergoo/epik: Integrative Analysis and Visualization of Epigenomic Sequencing Data

Description Usage Arguments Details Value Author(s) Examples

Extract subset of sites in a set of intervals

1	extract_sites(start, end, site, return_index = FALSE, min_sites = 0)

`start`	start positions, a single integer or a numeric vector
`end`	end positions, a single integer or a numeric vector
`site`	positions of all sites, a numeric vector, should be sorted increasingly.
`return_index`	whether return the index in the position vector or just the position itself?
`min_sites`	minimal number of sites in an interval, regions which contain sites less than this value will be filtered out.

Providing a huge vector of genomic positions, we want to extract subset of positions which locate in a specific group of regions (e.g. extract CpG sites in DMRs). The simplest way is to use:

  site = sort(sample(10000000, 1000000))
  start = 123456
  end = 654321
  subsite = site[site >= start & site <= end]

Unfortunately, in above code, the whole vector site will be scanned four times (>=, <=, & and [). If you want to look for sites in more than one regions (e.g. 1000 regions), in every loop, the whole site vector will be re-scanned again and again which is very time-consuming.

Here we have extract_sites function which uses binary search to do subsetting. Of course, site should be sorted non-decreasing beforehand.

1	subsite = extract_sites(start, end, site, index = FALSE)

Not only for single interval, you can also extract sites in multiple genomic regins, by setting start and end as vectors. E.g. extracting CpG sites in exons in every gene.

1
2
3

  start = c(123456, 234567, 345678)
  end = c(133456, 244567, 355678)
  subsite = extract_sites(start, end, site)

You can choose to return index only or positions.

  subsite = extract_sites(start, end, site, return_index = FALSE)
  head(subsite)
  subsite_index = extract_sites(start, end, site, return_index = TRUE)
  head(subsite_index)
  head(site[subsite_index])

Regions that include sites less than min_site will be filtered out.

This kind of overlapping can also be done by defining an IRanges or GRanges object and calling findOverlaps function. Following compares the running time for using findOverlaps and extract_sites:

  set.seed(123)
  site = sort(sample(1000, 100))
  pos = do.call("rbind", lapply(1:10, function(i) sort(sample(max(site), 2))))
  pos_ir = IRanges(pos[, 1], pos[, 2])
  site_ir = IRanges(site, site)
  library(microbenchmark)
  microbenchmark(f1 = {r1 = extract_sites(pos[, 1], pos[, 2], site)},
      f2 = {
      mtch = findOverlaps(pos_ir, site_ir)
      r2 = start(site_ir[unique(subjectHits(mtch))])
  })
  # Unit: microseconds
  #  expr      min        lq       mean    median       uq      max neval cld
  #    f1    6.052    8.3925   13.99696   15.5855   18.132   29.366   100  a
  #    f2 1105.678 1175.9220 1237.87162 1192.6070 1271.972 2054.595   100   b

A vector of positions or index.

Zuguang Gu <z.gu@dkfz.de>

1
2
3

site = sort(sample(1000, 100))
pos = do.call("rbind", lapply(1:10, function(i) sort(sample(max(site), 2))))
extract_sites(pos[, 1], pos[, 2], site)

jokergoo/epik documentation built on Sept. 28, 2019, 9:20 a.m.

jokergoo/epik index

README.md

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

jokergoo/epik
Integrative Analysis and Visualization of Epigenomic Sequencing Data

extract_sites: Extract subset of sites in a set of intervals
In jokergoo/epik: Integrative Analysis and Visualization of Epigenomic Sequencing Data

Description

Usage

Arguments

Details

Value

Author(s)

Examples

Related to extract_sites in jokergoo/epik...

R Package Documentation

Browse R Packages

We want your feedback!

jokergoo/epik Integrative Analysis and Visualization of Epigenomic Sequencing Data

extract_sites: Extract subset of sites in a set of intervals In jokergoo/epik: Integrative Analysis and Visualization of Epigenomic Sequencing Data

Description

Usage

Arguments

Details

Value

Author(s)

Examples

Related to extract_sites in jokergoo/epik...

R Package Documentation

Browse R Packages

We want your feedback!

jokergoo/epik
Integrative Analysis and Visualization of Epigenomic Sequencing Data

extract_sites: Extract subset of sites in a set of intervals
In jokergoo/epik: Integrative Analysis and Visualization of Epigenomic Sequencing Data