knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
rnndescent
is an R package for finding approximate nearest neighbors, based
heavily on the Python package PyNNDescent
by Leland McInnes, but is a fully independent
reimplementation written in C++. It uses the following techniques:
The easiest way to find k-nearest neighbors and query new data is to use the
rnnd_knn
function, which combine several of the available techniques into
sensible defaults use the rnnd_build
and rnnd_query
functions. For greater
flexibility, the underlying functions used by rnnd_build
and rnnd_query
can
be used directly. The other vignettes in this package describe their use and go
into more detail about the how the methods work.
library(rnndescent)
If you just want the k-nearest neighbors of some data, use rnnd_knn
:
iris_knn <- rnnd_knn(data = iris, k = 5)
The nearest neighbor graph format returned by all functions in this package is a list of two matrices:
idx
-- a matrix of indices of the nearest neighbors. As usual in R, these
are 1-indexed.dist
-- the equivalent distances.lapply(iris_knn, function(x) { head(x, 3) })
rnnd_knn
returns the k-nearest neighbors, but does not return any "index" that
you can use to query new data. To do that, use rnnd_build
. Normally you would
query the index with different from that which you used to build the index, so
let's split iris
up:
iris_even <- iris[seq_len(nrow(iris)) %% 2 == 0, ] iris_odd <- iris[seq_len(nrow(iris)) %% 2 == 1, ]
iris_index <- rnnd_build(iris_even, k = 5)
The index is also a list but with a lot more components (none of which are
intended for manual examination), apart from the the neighbor graph which can
be found under the graph
component in the same format as the return value of
rnnd_knn
:
lapply(iris_index$graph, function(x) { head(x, 3) })
Be aware that for large and high-dimensional data, the returned index can get
very large, especially if you set n_search_trees
to a large value.
To query new data, use rnnd_query
:
iris_odd_nn <- rnnd_query( index = iris_index, query = iris_odd, k = 5 ) lapply(iris_odd_nn, function(x) { head(x, 3) })
You don't need to keep the data that was used to build the index around, because internally, the index stores that (that's one of the reasons the index can get large).
Another use for rnnd_query
is to improve the quality of a k-nearest neighbor
graph. We are using for a query
the same data we used to build iris_index
and specifying via the init
parameter the knn graph we already generated:
iris_knn_improved <- rnnd_query( index = iris_index, query = iris_even, init = iris_index$graph, k = 5 )
If the k-nearest neighbor graph in index$graph
isn't sufficiently high
quality, then result of running rnnd_query
on the same data should be an
improvement. Exactly how much better is hard to say, but you can always compare
the sum of the distances:
c( sum(iris_index$graph$dist), sum(iris_knn_improved$dist) )
In this case, the initial knn has not been improved, which is hardly surprising
due to the size of the dataset. Another function that might be of use is the
neighbor_overlap
function to see how many neighbors are shared
between the two graphs:
neighbor_overlap(iris_index$graph, iris_knn_improved)
As there was no change to the graph, the overlap is 100%. More details on this can be found in the hubness vignette and a more ambitious dataset is covered in the FMNIST article.
rnndescent
is multi-threaded, but by default is single-threaded. Set
n_threads
to set the number of threads you want to use:
iris_index <- rnnd_build(data = iris_even, k = 5, n_threads = 2)
Several different distances are available in rnndescent
beyond the
typically-supported Euclidean and Cosine-based distances in other nearest
neighbor packages. See the metrics vignette for more details.
dgCMatrix
. All the same distances are supported as
for dense matrices.logical
matrix,
then for certain distances intended for binary data, specialized functions will
be used to speed up the computation.There are several options that rnnd_build
and rnnd_query
expose that can
be modified to change the behavior of the different stages of the algorithm.
See the documentation for those functions (e.g. ?rnnd_build
) or the
Random Partition Forests, Nearest Neighbor Descent and Querying Data
vignettes for more details.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.