Description Usage Arguments Details Value Examples
Extract subset of sites which are in a set of intervals
1 | extract_sites(start, end, site, index = TRUE, filter_fun = NULL)
|
start |
start position, a vector |
end |
end position, a vector. Note there should be no overlap between all |
site |
positions of all sites, should be sorted. |
index |
whether return the index in the position vector or just the position itself? |
filter_fun |
filter sites in the interval. If there are more than one intervals, sites in each interval will be filtered by |
Providing a huge vector of genomic positions, we want to extract subset of sites which locate in a specific region. Normally, we will use:
1 2 3 4 |
Unfortunately, in above code, the whole vector site
will be scaned for four times
(>=
, <=
, &
and [
).
If you want to look for sites in more than one regions (e.g. 1000 regions), in every
loop, the whole site
vector will be re-scanned again and again which is very time-consuming.
Here we have extract_sites
function which uses binary search to do subsetting.
Of course, site
should be sorted non-decreasing in the first place.
1 | subsite = extract_sites(start, end, site, index = FALSE)
|
Not only for single interval, you can also extract sites in a list of genomic regins,
by setting start
and end
as a vector.
1 2 3 |
filter_fun
can filter sites in each interval by looking at the number of sites in it.
Following code filters out intervals that have less than 10 sites.
1 | subsite = extract_sites(start, end, site, filter_fun = function(x) length(x) >= 10)
|
You can choose to return index only or positions.
1 2 3 4 5 |
Follows compare the normal subsetting and binary search.
We first extract all the index that we want to test so that it would not affect the comparison.
In following code, we explicitly transform site
and pos
to double
mode because binary search
is implemented in C-level, and it expects parameter stored in double
mode.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 | pos = do.call("rbind", lapply(1:1000, function(i) sort(sample(max(site), 2))))
head(pos)
t1 = system.time(for(i in 1:1000) {
which(site >= pos[i, 1] & site <= pos[i, 2])
})
t1
site2 = as.double(site)
pos2 = as.double(pos)
dim(pos2) = dim(pos)
t2 = system.time(for(i in 1:1000) {
extract_sites(pos2[i, 1], pos2[i, 2], site2)
})
t2
|
a vector of positions or index. If there is no sites in the interval, it will return NULL
.
1 2 | # There is no example
NULL
|
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.