README.md

rnndescent

R-CMD-check AppVeyor Build Status Coverage Status CRAN Status Badge Dependencies CRAN Monthly Downloads CRAN Downloads Last Commit

An R package for finding approximate nearest neighbors, translated from the Python package PyNNDescent written by the great Leland McInnes. As the name suggests, it uses the Nearest Neighbor Descent method (Dong et al., 2011), but also makes use of Random Partition Trees (Dasgupta and Freund, 2008) as well as ideas from FANNG and NGT.

You can use rnndescent for:

Documentation

See the Get Started article for the basics. The other vignettes go into more detail.

Current Status

18 March 2024: Version 0.1.4 has been released to CRAN. This is a bug fix release. Most notably, it fixes an issue where rnnd_build would fail with metric = "cosine".

Installation

CRAN

install.packages("rnndescent")

Development Version

# install.packages("pak")
pak::pkg_install("jlmelville/rnndescent")

This packages makes use of C++ code which must be compiled. You may have to carry out a few extra steps before being able to build:

Windows: install Rtools and ensure C:\Rtools\bin is on your path.

Mac OS X: using a custom ~/.R/Makevars may cause linking errors. This sort of thing is a potential problem on all platforms but seems to bite Mac owners more. The R for Mac OS X FAQ may be helpful here to work out what you can get away with. To be on the safe side, I would advise building without a custom Makevars.

rnndescent uses C++17. This shouldn't be too big a problem but not all R platforms support it (sorry if this affects you).

Example

library(rnndescent)

# Find 15-knn
iris_knn <- rnnd_knn(iris, k = 15)

# Build an index
iris_even <- iris[seq_len(nrow(iris)) %% 2 == 0, ]
# Specify the number of neighbors you are likely to want to query for
iris_index <- rnnd_build(iris_index, k = 15)

# Query then index
iris_odd <- iris[seq_len(nrow(iris)) %% 2 == 1, ]
iris_query_nn <- rnnd_query(index = iris_index, query = iris_odd, k = 15)

For more details, please see the documentation.

Supported Metrics

Many. See the metrics article for a list.

Missing Features

Compared to PyNNDescent, rnndescent is currently lacking, in decreasing order of likelihood of implementation:

The issues around index serialization and parallel behavior make rnndescent currently unsuitable for streaming applications where you are querying one item at a time. If you are doing batch queries, where you are querying multiple items at once, then rnndescent should be fine: for example, generating nearest neighbors for UMAP (maybe for use with uwot). Dimensionality reduction is my personal use case for nearest neighbors calculation and I would like to get rnndescent onto CRAN in a useful-for-something state. As a result I am not targeting an initial release to support the streaming case. I would like to fix this for a subsequent release.

Also there is no specialized distance code to take advantage of AVX etc., so rnndescent is going to run slower than other packages. This wouldn't be allowed on CRAN anyway, but might be a nice-to-have for installing from github.

Citation

Dong, W., Moses, C., & Li, K. (2011, March). Efficient k-nearest neighbor graph construction for generic similarity measures. In Proceedings of the 20th international conference on World wide web (pp. 577-586). ACM. doi.org/10.1145/1963405.1963487.

License

The R package as a whole is licensed under GPLv3 or later. The following files are licensed differently:

As far as I know, these licenses are all compatible with re-licensing under GPLv3 or later.

Missing Metrics

The following metrics are in PyNNDescent but are not supported in rnndescent:

These require passing extra information as part of the metric definition, which is not currently supported.

See Also



jlmelville/rnndescent documentation built on April 14, 2024, 4:33 p.m.