extract_sites: Extract subset of sites in a set of intervals

Description Usage Arguments Details Value Author(s) Examples

Description

Extract subset of sites in a set of intervals

Usage

1
extract_sites(start, end, site, return_index = FALSE, min_sites = 0)

Arguments

start

start positions, a numeric vector

end

end positions, a numeric vector.

site

positions of all sites, should be sorted increasingly.

return_index

whether return the index in the position vector or just the position itself?

min_sites

minimal number of sites in an interval, regions which contain sites less than this value will be filtered out.

Details

Providing a huge vector of genomic positions, we want to extract subset of positions which locate in a specific group of regions (e.g. extract CpG sites in DMRs). Normally, we will use:

1
2
3
4
	site = sort(sample(10000000, 1000000))
	start = 123456
	end = 654321
	subsite = site[site >= start & site <= end]  

Unfortunately, in above code, the whole vector site will be scanned four times (>=, <=, & and [). If you want to look for sites in more than one regions (e.g. 1000 regions), in every loop, the whole site vector will be re-scanned again and again which is very time-consuming.

Here we have extract_sites function which uses binary search to do subsetting. Of course, site should be sorted non-decreasing beforehand.

1
  subsite = extract_sites(start, end, site, index = FALSE)  

Not only for single interval, you can also extract sites in multiple genomic regins, by setting start and end as vectors.

1
2
3
  start = c(123456, 234567, 345678)
  end = c(133456, 244567, 355678)
  subsite = extract_sites(start, end, site)  

You can choose to return index only or positions.

1
2
3
4
5
  subsite = extract_sites(start, end, site, return_index = FALSE)
  head(subsite)
  subsite_index = extract_sites(start, end, site, return_index = TRUE)
  head(subsite_index)
  head(site[subsite_index])  

Regions that include sites less than min_site will be filtered out.

Value

A vector of positions or index.

Author(s)

Zuguang Gu <z.gu@dkfz.de>

Examples

1
2
3
site = sort(sample(1000, 100))
pos = do.call("rbind", lapply(1:10, function(i) sort(sample(max(site), 2))))
extract_sites(pos[, 1], pos[, 2], site)

eilslabs/epic documentation built on May 16, 2019, 1:24 a.m.