In jlmelville/rnndescent: Nearest Neighbor Descent Method for Approximate Nearest Neighbors

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

rnndescent is an R package for finding approximate nearest neighbors, based heavily on the Python package PyNNDescent by Leland McInnes, but is a fully independent reimplementation written in C++. It uses the following techniques:

Initialization by creating a forest of random project trees [@dasgupta2008random].
Optimization by using nearest neighbor descent [@dong2011efficient].
For building a search graph, graph diversification techniques from FANNG [@harwood2016fanng].
For querying new data, the back-tracking search from NGT [@iwasaki2018optimization] (without dynamic degree-adjustment).

The easiest way to find k-nearest neighbors and query new data is to use the rnnd_knn function, which combine several of the available techniques into sensible defaults use the rnnd_build and rnnd_query functions. For greater flexibility, the underlying functions used by rnnd_build and rnnd_query can be used directly. The other vignettes in this package describe their use and go into more detail about the how the methods work.

library(rnndescent)

Find the k-nearest neighbors

If you just want the k-nearest neighbors of some data, use rnnd_knn:

iris_knn <- rnnd_knn(data = iris, k = 5)

The Neighbor Graph Format

The nearest neighbor graph format returned by all functions in this package is a list of two matrices:

idx -- a matrix of indices of the nearest neighbors. As usual in R, these are 1-indexed.
dist -- the equivalent distances.

lapply(iris_knn, function(x) {
  head(x, 3)
})

Build an Index

rnnd_knn returns the k-nearest neighbors, but does not return any "index" that you can use to query new data. To do that, use rnnd_build. Normally you would query the index with different from that which you used to build the index, so let's split iris up:

iris_even <- iris[seq_len(nrow(iris)) %% 2 == 0, ]
iris_odd <- iris[seq_len(nrow(iris)) %% 2 == 1, ]

iris_index <- rnnd_build(iris_even, k = 5)

The index is also a list but with a lot more components (none of which are intended for manual examination), apart from the the neighbor graph which can be found under the graph component in the same format as the return value of rnnd_knn:

lapply(iris_index$graph, function(x) {
  head(x, 3)
})

Be aware that for large and high-dimensional data, the returned index can get very large, especially if you set n_search_trees to a large value.

Querying Data

To query new data, use rnnd_query:

iris_odd_nn <- rnnd_query(
  index = iris_index,
  query = iris_odd,
  k = 5
)
lapply(iris_odd_nn, function(x) {
  head(x, 3)
})

You don't need to keep the data that was used to build the index around, because internally, the index stores that (that's one of the reasons the index can get large).

Another use for rnnd_query is to improve the quality of a k-nearest neighbor graph. We are using for a query the same data we used to build iris_index and specifying via the init parameter the knn graph we already generated:

iris_knn_improved <- rnnd_query(
  index = iris_index,
  query = iris_even,
  init = iris_index$graph,
  k = 5
)

If the k-nearest neighbor graph in index$graph isn't sufficiently high quality, then result of running rnnd_query on the same data should be an improvement. Exactly how much better is hard to say, but you can always compare the sum of the distances:

c(
  sum(iris_index$graph$dist),
  sum(iris_knn_improved$dist)
)

In this case, the initial knn has not been improved, which is hardly surprising due to the size of the dataset. Another function that might be of use is the neighbor_overlap function to see how many neighbors are shared between the two graphs:

neighbor_overlap(iris_index$graph, iris_knn_improved)

As there was no change to the graph, the overlap is 100%. More details on this can be found in the hubness vignette and a more ambitious dataset is covered in the FMNIST article.

Parallelism

rnndescent is multi-threaded, but by default is single-threaded. Set n_threads to set the number of threads you want to use:

iris_index <- rnnd_build(data = iris_even, k = 5, n_threads = 2)

Available Metrics

Several different distances are available in rnndescent beyond the typically-supported Euclidean and Cosine-based distances in other nearest neighbor packages. See the metrics vignette for more details.

Supported Data Types

Dense matrices and data frames.
Sparse matrices, in the dgCMatrix. All the same distances are supported as for dense matrices.
Additionally, for dense binary data, if you supply it as a logical matrix, then for certain distances intended for binary data, specialized functions will be used to speed up the computation.

Parameters

There are several options that rnnd_build and rnnd_query expose that can be modified to change the behavior of the different stages of the algorithm. See the documentation for those functions (e.g. ?rnnd_build) or the Random Partition Forests, Nearest Neighbor Descent and Querying Data vignettes for more details.