require(knitr) opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE) library(BiocNeighbors)
The r Biocpkg("BiocNeighbors")
package implements a few algorithms for exact nearest neighbor searching:
Both KMKNN and VP-trees involve a component of randomness during index construction, though the k-nearest neighbors result is fully deterministic^[Except in the presence of ties, see ?"BiocNeighbors-ties"
for details.].
The most obvious application is to perform a k-nearest neighbors search. We'll mock up an example here with a hypercube of points, for which we want to identify the 10 nearest neighbors for each point.
nobs <- 10000 ndim <- 20 data <- matrix(runif(nobs*ndim), ncol=ndim)
The findKNN()
method expects a numeric matrix as input with data points as the rows and variables/dimensions as the columns.
We indicate that we want to use the KMKNN algorithm by setting BNPARAM=KmknnParam()
(which is also the default, so this is not strictly necessary here).
We could use a VP tree instead by setting BNPARAM=VptreeParam()
.
fout <- findKNN(data, k=10, BNPARAM=KmknnParam()) head(fout$index) head(fout$distance)
Each row of the index
matrix corresponds to a point in data
and contains the row indices in data
that are its nearest neighbors.
For example, the 3rd point in data
has the following nearest neighbors:
fout$index[3,]
... with the following distances to those neighbors:
fout$distance[3,]
Note that the reported neighbors are sorted by distance.
Another application is to identify the k-nearest neighbors in one dataset based on query points in another dataset. Again, we mock up a small data set:
nquery <- 1000 ndim <- 20 query <- matrix(runif(nquery*ndim), ncol=ndim)
We then use the queryKNN()
function to identify the 5 nearest neighbors in data
for each point in query
.
qout <- queryKNN(data, query, k=5, BNPARAM=KmknnParam()) head(qout$index) head(qout$distance)
Each row of the index
matrix contains the row indices in data
that are the nearest neighbors of a point in query
.
For example, the 3rd point in query
has the following nearest neighbors in data
:
qout$index[3,]
... with the following distances to those neighbors:
qout$distance[3,]
Again, the reported neighbors are sorted by distance.
Users can perform the search for a subset of query points using the subset=
argument.
This yields the same result as but is more efficient than performing the search for all points and subsetting the output.
findKNN(data, k=5, subset=3:5)
If only the indices are of interest, users can set get.distance=FALSE
to avoid returning the matrix of distances.
This will save some time and memory.
names(findKNN(data, k=2, get.distance=FALSE))
It is also simple to speed up functions by parallelizing the calculations with the r Biocpkg("BiocParallel")
framework.
library(BiocParallel) out <- findKNN(data, k=10, BPPARAM=MulticoreParam(3))
For multiple queries to a constant data
, the pre-clustering can be performed in a separate step with buildIndex()
.
The result can then be passed to multiple calls, avoiding the overhead of repeated clustering^[The algorithm type is automatically determined when BNINDEX
is specified, so there is no need to also specify BNPARAM
in the later functions.].
pre <- buildIndex(data, BNPARAM=KmknnParam()) out1 <- findKNN(BNINDEX=pre, k=5) out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
The default setting is to search on the Euclidean distance.
Alternatively, we can use the Manhattan distance by setting distance="Manhattan"
in the BiocNeighborParam
object.
out.m <- findKNN(data, k=5, BNPARAM=KmknnParam(distance="Manhattan"))
Advanced users may also be interested in the raw.index=
argument, which returns indices directly to the precomputed object rather than to data
.
This may be useful inside package functions where it may be more convenient to work on a common precomputed object.
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.