Metrics"
In rnndescent: Nearest Neighbor Descent Method for Approximate Nearest Neighbors

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(rnndescent)

A lot of distance functions are implemented in rnndescent, which you can specify in every function which needs them with the metric parameter. Technically not all of these are metrics, but let's just let that slide. Typical are "euclidean" or "cosine" the latter being more common for document-based data. For binary data, "hamming" or "jaccard" might be a good place to start.

The metrics here are a subset of those offered by the PyNNDescent Python package which in turn reproduces those in the scipy.spatial.distance module of SciPy. Many of the binary distances seem to have definitions shared with [@choi2010survey] so you may want to look in that reference for an exact definition.

"braycurtis": Bray-Curtis.
"canberra": Canberra.
"chebyshev": Chebyshev, also known as the L-infinity norm ($L_\infty$).
"correlation": 1 minus the Pearson correlation.
"cosine": 1 minus the cosine similarity.
"dice": the Dice coefficient, also known as the Sørensen–Dice coefficient. Intended for binary data.
"euclidean": the Euclidean distance, also known as the L2 norm.
"hamming": the Hamming distance. Intended for binary data.
"hellinger": the Hellinger distance. This is intended to be used with a probability distribution, so ensure that each row of your input data contains non-negative values which sum to 1.
"jaccard": the Jaccard index, also known as the Tanimoto coefficient. Intended for binary data.
"jensenshannon": the Jensen-Shannon divergence. Like "hellinger", this is intended to be used with a probability distribution.
"kulsinski": the Kulsinski dissimilarity as defined in the Python package scipy.spatial.distance.kulsinski (this function is deprecated in scipy). Intended for binary data.
"sqeuclidean" (squared Euclidean)
"manhattan": the Manhattan distance, also known as the L1 norm or Taxicab distance.
"rogerstanimoto": the Rogers-Tanimoto coefficient.
"russellrao": the Russell-Rao coefficient.
"sokalmichener". Intended for binary data.
"sokalsneath": the Sokal-Sneath coefficient. Intended for binary data.
"spearmanr": 1 minus the Spearman rank correlation
"symmetrickl" symmetrized version of the Kullback-Leibler divergence. The symmetrization is calculated as $D_{KL}(P||Q) + D_{KL}(Q||P)$.
"tsss" the Triangle Area Similarity-Sector Area Similarity or TS-SS metric as described in [@7474366]. Compared to results in PyNNDescent (as of version 0.5.11), distances are smaller by a factor of 2 in this package. This does not affect the returned nearest neighbors, only the distances. Multiply them by 2 if you need to get closer to the PyNNDescent results.
"yule" the Yule dissimilarity. Intended for binary data.

For non-sparse data, the following variants are available with preprocessing: this trades memory for a potential speed up during the distance calculation. Some minor numerical differences should be expected compared to the non-preprocessed versions:

"cosine-preprocess": cosine with preprocessing.
"correlation-preprocess": correlation with preprocessing.

Specialized Binary Metrics

Some metrics are intended for use with binary data. This means that:

Your numeric data should consist of only two distinct values, typically 0 and 1. You will get unpredictable results otherwise.
If you provide the data as a logical matrix, a much faster implementation is used.

The metrics you can use with binary data are:

"dice"
"hamming"
"jaccard"
"kulsinski"
"matching"
"rogerstanimoto"
"russellrao"
"sokalmichener"
"sokalsneath"
"yule"

Here's an example of using binary data stored as 0s and 1s with the "hamming" metric:

set.seed(42)
binary_data <- matrix(sample(c(0, 1), 100, replace = TRUE), ncol = 10)
head(binary_data)

nn <- brute_force_knn(binary_data, k = 4, metric = "hamming")

Now let's convert it to a logical matrix:

logical_data <- binary_data == 1
head(logical_data)

logical_nn <- brute_force_knn(logical_data, k = 4, metric = "hamming")

The results will be the same:

all.equal(nn, logical_nn)

but on a real-world dataset, the logical version will be much faster.

References

Any scripts or data that you put into this service are public.

rnndescent documentation built on May 29, 2024, 8:38 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

rnndescent
Nearest Neighbor Descent Method for Approximate Nearest Neighbors

Metrics"
In rnndescent: Nearest Neighbor Descent Method for Approximate Nearest Neighbors

Specialized Binary Metrics

References

Try the rnndescent package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

rnndescent Nearest Neighbor Descent Method for Approximate Nearest Neighbors

Metrics" In rnndescent: Nearest Neighbor Descent Method for Approximate Nearest Neighbors

Specialized Binary Metrics

References

Try the rnndescent package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

rnndescent
Nearest Neighbor Descent Method for Approximate Nearest Neighbors

Metrics"
In rnndescent: Nearest Neighbor Descent Method for Approximate Nearest Neighbors