require(knitr) opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE) library(BiocStyle) self <- Biocpkg("BiocNeighbors") set.seed(99999)
The r self
package implements a variety of nearest-neighbor (NN) search algorithms with a consistent interface.
This includes exact search methods like
vantage point trees,
k-means k-nearest neighbors,
and an exhaustive brute-force search,
as well as approximate methods like
Approximate Nearest Neighbors Oh Yeah (Annoy)
and hierarchical navigable small worlds (HNSW).
We provides methods to search nearest neighbors within a dataset, query across datasets,
and (for some algorithms) find all neighbors within a certain distance.
The findKNN()
function will find the k
-nearest neighbors within a dataset,
returning one matrix of neighbor identities and another matrix of (sorted) distances to each neighbor.
# Mocking up a matrix with 'nobs' points in the rows # and 'ndim' dimensions in the columns. nobs <- 1000 ndim <- 20 data <- matrix(runif(nobs*ndim), ncol=ndim) library(BiocNeighbors) fout <- findKNN(data, k=10) str(fout)
Each row of the index
matrix corresponds to a row in data
and contains the row indices in data
that are its nearest neighbors.
For example, the 3rd point in data
has the following nearest neighbors with their (sorted) distances:
fout$index[3,] fout$distance[3,]
Users can easily change the algorithm by setting the BNPARAM=
argument.
For example, the code below switches to the Annoy algorithm for a faster, less accurate search.
aout <- findKNN(data, k=10, BNPARAM=AnnoyParam()) str(aout)
The same mechanism can be used to change the distance metric from the Euclidean default.
vout <- findKNN(data, k=10, BNPARAM=VptreeParam(distance="Manhattan")) str(vout)
If the number of neighbors differs for each point, we can supply an integer vector to k=
instead.
This yields a list of vectors containing the neighbors and their distances to each point.
var.k <- sample(10, nrow(data), replace=TRUE) # use I() to distinguish between scalar and length-1 vectors. var.out <- findKNN(data, k=I(var.k)) head(var.out$index) head(var.out$distance)
queryKNN()
is a related function that will find the k
-nearest neighbors in one dataset based on query points in another dataset.
Here, the rows of the output matrices correspond to rows of our query
matrix.
nquery <- 100 ndim <- 20 query <- matrix(runif(nquery*ndim), ncol=ndim) qout <- queryKNN(data, query, k=5) str(qout)
When performing multiple calls to findKNN()
or queryKNN()
on the same data,
advanced users may prefer to build the index first with buildIndex()
.
This can be efficiently reused without repeating the index construction in each call.
prebuilt <- buildIndex(data) out1 <- findKNN(prebuilt, k=5) out2 <- queryKNN(prebuilt, query, k=5)
In some applications, we need to find all neighboring points within a certain distance threshold.
This is achieved using the findNeighbors()
function:
nobs <- 8000 ndim <- 20 data <- matrix(runif(nobs*ndim), ncol=ndim) fout <- findNeighbors(data, threshold=1) head(fout$index) head(fout$distance)
Each entry of the index
list corresponds to a point in data
and contains the row indices in data
that are within threshold
.
For example, the 3rd point in data
has the following neighbors with the associated distances:
fout$index[[3]] fout$distance[[3]]
Again, we can switch algorithms by specifying BNPARAM=
.
However, keep in mind that not all algorithms (particularly the approximate methods) support this range-based search.
vparam <- VptreeParam(distance="Manhattan") vout <- findNeighbors(data, threshold=1, BNPARAM=vparam)
queryNeighbors()
is a related function that will identify all points within a certain distance of a query point.
Each entry of the returned index
and distance
corresponds to a row of query
and describes its neighbors in data
.
nquery <- 100 ndim <- 20 query <- matrix(runif(nquery*ndim), ncol=ndim) qout <- queryNeighbors(data, query, threshold=1) head(qout$index)
As described above, advanced users can built the index first before repeated calls to findNeighbors()
and queryNeighbors()
.
prebuilt <- buildIndex(data) out1 <- findNeighbors(prebuilt, threshold=1) out2 <- queryNeighbors(prebuilt, query,threshold=1)
If only the number of neighbors is of interest, we can set get.index=FALSE
and get.distance=FALSE
in the findNeighbors()
/queryNeighbors()
call.
This will count the number of neighbors without storing their identities/distances for greater memory efficiency.
num <- findNeighbors(data, threshold=1, get.index=FALSE, get.distance=FALSE) head(num)
R/Bioconductor package developers can use r self
to perform nearest-neighbor searches in their own C++ code.
This uses the interfaces in the knncolle library to abstract away the underlying search methods.
First, we add some linking instructions in the DESCRIPTION
to make the header files available:
LinkingTo: Rcpp, assorthead, BiocNeighbors
This includes the BiocNeighbors.h
file that implements the BiocNeigbors::Prebuilt
class:
system.file("include", "BiocNeighbors.h", package="BiocNeighbors")
When a buildIndex()
method returns an external pointer, that pointer is expected to refer to an instance of a BiocNeighbors::Prebuilt
.
Thus, downstream code just needs to accept a pointer and cast it to a BiocNeighbors::Prebuilt
for use in nearest-neighbor searches.
(The same approach can be applied to obtain a BiocNeighbors::Builder
for construction of new indices.)
SEXP do_something(SEXP raw_ptr, int k) { BiocNeighbors::PrebuiltPointer ptr(raw_ptr); const auto& prebuilt = ptr->index; auto searcher = prebuilt->initialize(); std::vector<int> indices; std::vector<double> distances; searcher->search(0, k, &indices, &distances); return Rcpp::List::create( Rcpp::IntegerVector(indices.begin(), indices.end()), Rcpp::IntegerVector(distances.begin(), distances.end()) ); }
Some extra work is required for cosine-normalized indices, which is indicated by ptr->cosine == true
.
Any query data should also be L2-normalized before use in Searcher::search()
or Searcher::search_all()
.
(This is not required for overloads that accept an index for searching within a dataset, only for queries between datasets.)
Developers can extend the r self
framework to more algorithms through the creation of new BiocNeighborParam
classes.
Users can then switch between algorithms by simply changing the BNPARAM=
argument in their calls to findKNN()
, etc.
In general, we recommend putting these extensions in a new R/Bioconductor package that adds more methods to the r self
generics.
Developers can choose to write pure R extensions or they can write them partially in C++.
Each extension package should:
SomeNewParam
S4 class that inherits from BiocNeighborParam
.SomeNewIndex
S4 class that inherits from BiocNeighborIndex
and contains an arbitrary index structure.buildIndex()
S4 method, dispatching on a matrix
type for X=
and a SomeNewParam
type for BNPARAM=
.
This should return an instance of SomeNewIndex
.findKNN()
and queryKNN()
(and optionally findNeighbors()
queryNeighbors()
).
One method should dispatch on a matrix
type for X=
and a SomeNewParam
type for BNPARAM=
.
The other method should dispatch on a prebuilt SomeNewIndex
type for X=
and an ANY
type for BNPARAM=
(as the BNPARAM=
is ignored for prebuilt indices).Alternatively, each extension package should:
Builder
, Prebuilt
and Searcher
interfaces in the knncolle library.
This requires LinkingTo: assorthead, BiocNeighbors
in the DESCRIPTION
.SomeNewParam
S4 class that inherits from BiocNeighborParam
.SomeNewIndex
S4 class that inherits from BiocNeighborGenericIndex
.
This should contain a ptr
slot for an external pointer and a names
slot for the observation names.defineBuilder()
method, dispatching on a matrix
type for X=
and a SomeNewParam
type for BNPARAM=
.
This should call into C++ and return a list containing builder
,
an external pointer to an instance of a BiocNeighbors::Builder
(see definition in BiocNeighbors.h
);
and class
, the constructor for the new SomeNewIndex
class.No new methods are required for findKNN()
, queryNeighbors()
, etc. as a BiocNeighborGeneric
method is already available for each generic.
This approach also allows the new method to be used in C++ code of downstream packages that accept a BiocNeighbors::Prebuilt
instance.
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.