Fast overlap joins
Description
A fast binarysearch based overlap join of two data.table
s.
This is very much inspired by findOverlaps
function from the Bioconductor
package IRanges
(see link below under See Also
).
Usually, x
is a very large data.table with small interval ranges, and
y
is much smaller keyed data.table
with relatively larger
interval spans. For a usage in genomics
, see the examples section.
NOTE: This is still under development, meaning it's stable, but some features are yet to be implemented. Also, some arguments and/or the function name itself could be changed.
Usage
1 2 3 4 5 6 
Arguments
x, y 

by.x, by.y 
A vector of column names (or numbers) to compute the overlap
joins. The last two columns in both 
maxgap 
It should be a nonnegative integer value, >= 0. Default is 0 (no
gap). For intervals 
minoverlap 
It should be a positive integer value, > 0. Default is 1. For
intervals 
type 
Default value is The types shown here are identical in functionality to the function
NB: 
mult 
When multiple rows in 
nomatch 
Same as 
which 
When 
verbose 

Details
Very briefly, foverlaps()
collapses the twocolumn interval in y
to onecolumn of unique values to generate a lookup
table, and
then performs the join depending on the type of overlap
, using the
already available binary search
feature of data.table
. The time
(and space) required to generate the lookup
is therefore proportional
to the number of unique values present in the interval columns of y
when combined together.
Overlap joins takes advantage of the fact that y
is sorted to speedup
finding overlaps. Therefore y
has to be keyed (see ?setkey
)
prior to running foverlaps()
. A key on x
is not necessary,
although it might speed things further. The columns in by.x
argument should correspond to the columns specified in by.y
. The last
two columns should be the interval columns in both by.x
and
by.y
. The first interval column in by.x
should always be <= the
second interval column in by.x
, and likewise for by.y
. The
storage.mode
of the interval columns must be either double
or integer
. It therefore works with bit64::integer64
type as well.
The lookup
generation step could be quite time consuming if the number
of unique values in y
are too large (ex: in the order of tens of millions).
There might be improvements possible by constructing lookup using RLE, which is
a pending feature request. However most scenarios will not have too many unique
values for y
.
Value
A new data.table
by joining over the interval columns (along with other
additional identifier columns) specified in by.x
and by.y
.
NB: When which=TRUE
: a)
mult="first" or "last"
returns a
vector
of matching row numbers in y
, and b)
when
mult="all"
returns a data.table with two columns with the first
containing row numbers of x
and the second column with corresponding
row numbers of y
.
nomatch=NA or 0
also influences whether nonmatching rows are returned
or not, as explained above.
See Also
data.table
,
http://www.bioconductor.org/packages/release/bioc/html/IRanges.html,
setNumericRounding
Examples
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31  require(data.table)
## simple example:
x = data.table(start=c(5,31,22,16), end=c(8,50,25,18), val2 = 7:10)
y = data.table(start=c(10, 20, 30), end=c(15, 35, 45), val1 = 1:3)
setkey(y, start, end)
foverlaps(x, y, type="any", which=TRUE) ## return overlap indices
foverlaps(x, y, type="any") ## return overlap join
foverlaps(x, y, type="any", mult="first") ## returns only first match
foverlaps(x, y, type="within") ## matches iff 'x' is within 'y'
## with extra identifiers (ex: in genomics)
x = data.table(chr=c("Chr1", "Chr1", "Chr2", "Chr2", "Chr2"),
start=c(5,10, 1, 25, 50), end=c(11,20,4,52,60))
y = data.table(chr=c("Chr1", "Chr1", "Chr2"), start=c(1, 15,1),
end=c(4, 18, 55), geneid=letters[1:3])
setkey(y, chr, start, end)
foverlaps(x, y, type="any", which=TRUE)
foverlaps(x, y, type="any")
foverlaps(x, y, type="any", nomatch=0L)
foverlaps(x, y, type="within", which=TRUE)
foverlaps(x, y, type="within")
foverlaps(x, y, type="start")
## x and y have different column names  specify by.x
x = data.table(seq=c("Chr1", "Chr1", "Chr2", "Chr2", "Chr2"),
start=c(5,10, 1, 25, 50), end=c(11,20,4,52,60))
y = data.table(chr=c("Chr1", "Chr1", "Chr2"), start=c(1, 15,1),
end=c(4, 18, 55), geneid=letters[1:3])
setkey(y, chr, start, end)
foverlaps(x, y, by.x=c("seq", "start", "end"),
type="any", which=TRUE)
