Description
Use the KMKNN (K-means for k-nearest neighbors) algorithm to identify nearest neighbors from a dataset.
Usage

    findKNN(X, k, get.index=TRUE, get.distance=TRUE, BPPARAM=SerialParam(),
        precomputed=NULL, subset=NULL, raw.index=FALSE)
Arguments

X: A numeric matrix where rows correspond to data points and columns correspond to variables (i.e., dimensions).

k: A positive integer scalar specifying the number of nearest neighbors to retrieve.

get.index: A logical scalar indicating whether the indices of the nearest neighbors should be recorded.

get.distance: A logical scalar indicating whether distances to the nearest neighbors should be recorded.

BPPARAM: A BiocParallelParam object indicating how the search should be parallelized.

precomputed: The precomputed output of precluster on X.

subset: A vector indicating the rows of X for which the nearest neighbors should be identified.

raw.index: A logical scalar indicating whether raw column indices to precomputed$data should be returned.
Details

This function uses the method proposed by Wang (2012) to quickly identify k-nearest neighbors in high-dimensional data. Briefly, data points are rapidly clustered into N clusters using k-means clustering in precluster, where N is the square root of the number of points. This clustering is then used to speed up the nearest-neighbor search across X, exploiting the triangle inequality between cluster centers, the query point and each point in the cluster to narrow the search space.
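To illustrate the pruning idea, consider the following conceptual sketch (hypothetical helper names, not the package's internal code). For a query q, a cluster center cen, and a clustered point x, the triangle inequality gives d(q, x) >= |d(q, cen) - d(cen, x)|, so x can be discarded without ever computing d(q, x) once this lower bound exceeds the current k-th smallest distance:

    # Lower-bound test based on the triangle inequality.
    # 'd.cen.x' is the (precomputed) distance from the cluster center to x;
    # 'kth.dist' is the distance to the current k-th nearest neighbor of q.
    euclid <- function(a, b) sqrt(sum((a - b)^2))
    can.skip <- function(q, cen, d.cen.x, kth.dist) {
        abs(euclid(q, cen) - d.cen.x) > kth.dist
    }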
By default, nearest neighbors are identified for all data points within X. If subset is specified, nearest neighbors are only detected for the points in the subset. This yields the same result as (but is more efficient than) subsetting the output matrices after running findKNN with subset=NULL.
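For example (a minimal sketch, assuming the package providing findKNN is attached and that no tied distances are present):

    Y <- matrix(rnorm(20000), ncol=20)
    keep <- 1:50
    sub.out <- findKNN(Y, k=5, subset=keep)
    full.out <- findKNN(Y, k=5)
    identical(sub.out$index, full.out$index[keep,])  # should be TRUE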
Turning off get.index or get.distance will omit the corresponding matrices from the output. This may provide a slight speed boost when these values are not of interest. Using BPPARAM will also split the search across multiple workers, which should increase speed proportionally (in theory) to the number of cores.
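For instance (a sketch, assuming the BiocParallel package is installed):

    library(BiocParallel)
    Y <- matrix(rnorm(20000), ncol=20)
    # Skip the distance matrix and split the search across two workers.
    out <- findKNN(Y, k=10, get.distance=FALSE, BPPARAM=MulticoreParam(2))
    names(out)  # "index" only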
If the function is to be called multiple times with the same X (e.g., with different subset), it may be faster to call precluster once externally, and pass the returned object to findKNN via the precomputed argument. This avoids unnecessary repeated k-means clustering and can provide a substantial speed-up. Note that when precomputed is supplied, the value of X is completely ignored.
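A minimal sketch of this pattern (assuming precluster is exported by the same package):

    Y <- matrix(rnorm(20000), ncol=20)
    pre <- precluster(Y)
    out1 <- findKNN(Y, k=10, precomputed=pre, subset=1:100)
    out2 <- findKNN(Y, k=10, precomputed=pre, subset=101:200)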
Currently, only Euclidean distances are supported, but support may be added for other distance types depending on demand. It remains to be seen whether the speed-up achieved with k-means is still applicable to alternative distance metrics.
Note that the code here was originally derived from an implementation in the cydar package (Lun et al., 2017).
Value

A list is returned containing:

index, if get.index=TRUE: an integer matrix where each row corresponds to a point (denoted here as i) in X. The row for i contains the row indices of X that are the nearest neighbors to point i, sorted by increasing distance from i.

distance, if get.distance=TRUE: a numeric matrix where each row corresponds to a point (as above) and contains the sorted distances of the neighbors from i.

If subset is not NULL, each row of the above matrices refers to a point in the subset, in the same order as supplied in subset.

If raw.index=TRUE, the values in index refer to columns of precomputed$data.
Ties and random seeds

In general, this function is fully deterministic, despite the use of a stochastic kmeans step in precluster. The only exception occurs when there are tied distances to neighbors, at which point the order and/or identity of the k-nearest neighboring points is not well-defined.

A warning will be raised if ties are detected among the k+1 nearest neighbors, as this indicates that the order/identity is arbitrary. Specifically, ties are detected when a larger distance is less than (1 + 1e-10)-fold of the smaller distance. This criterion tends to be somewhat conservative in the sense that it will warn users even if there is no problem (i.e., the distances are truly different). However, more accurate detection is difficult to achieve due to the vagaries of numerical precision across different machines.

In the presence of ties, the output will depend on the ordering of points in the precluster output. Users should set the seed to guarantee consistent (albeit arbitrary) results across different runs of the function. Note, however, that the exact selection of tied points depends on the numerical precision of the system. Thus, even after setting a seed, there is no guarantee that the results will be reproducible across machines (especially Windows)!
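For example, one might fix the seed before each call (a sketch; this makes results consistent across runs on a given machine, though the resolution of any ties remains arbitrary):

    Y <- matrix(rnorm(20000), ncol=20)
    set.seed(100)
    out1 <- findKNN(Y, k=10)
    set.seed(100)
    out2 <- findKNN(Y, k=10)
    identical(out1, out2)  # TRUE on the same machine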
Returning raw indices

Advanced users can also set raw.index=TRUE, which yields results equivalent to running findKNN on t(precomputed$data) directly. With this setting, the indices in the output index matrix refer to columns of precomputed$data. Similarly, the subset argument is assumed to refer to columns of precomputed$data. This may be more convenient when dealing with a common precomputed object across multiple calls to findKNN, as it avoids the need to switch between the original ordering and that from precluster.
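A sketch of this usage (assuming, as implied above, that precomputed$data stores points as columns; X is still supplied here but ignored in favor of precomputed):

    Y <- matrix(rnorm(20000), ncol=20)
    pre <- precluster(Y)
    raw <- findKNN(Y, k=10, precomputed=pre, raw.index=TRUE)
    # Neighbor indices refer to columns of the reordered matrix:
    neighbors.of.first <- pre$data[, raw$index[1,]]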
Author(s)

Aaron Lun
References

Wang X (2012). A fast exact k-nearest neighbors algorithm for high dimensional search using k-means clustering and triangle inequality. Proc Int Jt Conf Neural Netw, 43, 6:2351-2358.

Lun ATL, Richard AC, Marioni JC (2017). Testing for differential abundance in mass cytometry data. Nat. Methods, 14, 7:707-709.
Examples
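A minimal usage sketch, assuming the package providing findKNN is attached:

    Y <- matrix(rnorm(100000), ncol=20)
    out <- findKNN(Y, k=25)
    head(out$index)
    head(out$distance)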