Description Usage Arguments Details Value Author(s) Examples
View source: R/Rcpp_interface.R
Extract subset of sites in a set of intervals
1 | extract_sites(start, end, site, return_index = FALSE, min_sites = 0)
|
start |
start positions, a single integer or a numeric vector |
end |
end positions, a single integer or a numeric vector |
site |
positions of all sites, a numeric vector, should be sorted increasingly. |
return_index |
whether return the index in the position vector or just the position itself? |
min_sites |
minimal number of sites in an interval, regions which contain sites less than this value will be filtered out. |
Providing a huge vector of genomic positions, we want to extract subset of positions which locate in a specific group of regions (e.g. extract CpG sites in DMRs). The simplest way is to use:
1 2 3 4 |
Unfortunately, in above code, the whole vector site
will be scanned four times
(>=
, <=
, &
and [
).
If you want to look for sites in more than one regions (e.g. 1000 regions), in every
loop, the whole site
vector will be re-scanned again and again which is very time-consuming.
Here we have extract_sites
function which uses binary search to do subsetting.
Of course, site
should be sorted non-decreasing beforehand.
1 | subsite = extract_sites(start, end, site, index = FALSE)
|
Not only for single interval, you can also extract sites in multiple genomic regins,
by setting start
and end
as vectors. E.g. extracting CpG sites in exons in every gene.
1 2 3 |
You can choose to return index only or positions.
1 2 3 4 5 |
Regions that include sites less than min_site
will be filtered out.
This kind of overlapping can also be done by defining an IRanges
or GRanges
object and calling findOverlaps
function. Following compares the running time for using findOverlaps
and extract_sites
:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 | set.seed(123)
site = sort(sample(1000, 100))
pos = do.call("rbind", lapply(1:10, function(i) sort(sample(max(site), 2))))
pos_ir = IRanges(pos[, 1], pos[, 2])
site_ir = IRanges(site, site)
library(microbenchmark)
microbenchmark(f1 = {r1 = extract_sites(pos[, 1], pos[, 2], site)},
f2 = {
mtch = findOverlaps(pos_ir, site_ir)
r2 = start(site_ir[unique(subjectHits(mtch))])
})
# Unit: microseconds
# expr min lq mean median uq max neval cld
# f1 6.052 8.3925 13.99696 15.5855 18.132 29.366 100 a
# f2 1105.678 1175.9220 1237.87162 1192.6070 1271.972 2054.595 100 b
|
A vector of positions or index.
Zuguang Gu <z.gu@dkfz.de>
1 2 3 |
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.