findKNN: Find nearest neighbors


View source: R/findKNN.R

Description

Use the KMKNN (K-means for k-nearest neighbors) algorithm to identify nearest neighbors from a dataset.

Usage

findKNN(X, k, get.index=TRUE, get.distance=TRUE, BPPARAM=SerialParam(), 
    precomputed=NULL, subset=NULL, raw.index=FALSE)

Arguments

X

A numeric matrix where rows correspond to data points and columns correspond to variables (i.e., dimensions).

k

A positive integer scalar specifying the number of nearest neighbors to retrieve.

get.index

A logical scalar indicating whether the indices of the nearest neighbors should be recorded.

get.distance

A logical scalar indicating whether distances to the nearest neighbors should be recorded.

BPPARAM

A BiocParallelParam object indicating how the search should be parallelized.

precomputed

The precomputed output of precluster on X.

subset

A vector indicating the rows of X for which the nearest neighbors should be identified.

raw.index

A logical scalar indicating whether raw column indices to precomputed$data should be returned.

Details

This function uses the method proposed by Wang (2012) to quickly identify k-nearest neighbors in high-dimensional data. Briefly, data points are rapidly clustered into N clusters using k-means clustering in precluster, where N is the square root of the number of points. This clustering is then used to speed up the nearest neighbor search across X, exploiting the triangle inequality between cluster centers, the query point and each point in the cluster to narrow the search space.

By default, nearest neighbors are identified for all data points within X. If subset is specified, nearest neighbors are only detected for the points in the subset. This yields the same result as (but is more efficient than) subsetting the output matrices after running findKNN with subset=NULL.
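As a brief sketch of this usage (assuming the kmknn package is loaded; the data here are arbitrary):

```r
library(kmknn)

set.seed(100)
Y <- matrix(rnorm(10000), ncol=20)  # 500 points in 20 dimensions

# Identify neighbors for the first 10 points only:
sub.out <- findKNN(Y, k=5, subset=1:10)
dim(sub.out$index)  # 10 rows (one per point in 'subset'), 5 columns
```

Each row of the returned matrices corresponds to one entry of subset, in the order supplied.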

Setting get.index=FALSE or get.distance=FALSE will omit the corresponding matrix from the output. This may provide a slight speed boost when those values are not of interest. Using BPPARAM will also split the search across multiple workers, which should, in theory, increase speed in proportion to the number of workers.

If the function is to be called multiple times with the same X (e.g., with different subset), it may be faster to call precluster once externally, and pass the returned object to findKNN via the precomputed argument. This avoids unnecessary repeated k-means clustering and can provide a substantial speed-up. Note that when precomputed is supplied, the value of X is completely ignored.
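This pattern might look as follows (a sketch, assuming the kmknn package is loaded):

```r
library(kmknn)

set.seed(100)
Y <- matrix(rnorm(10000), ncol=20)

pre <- precluster(Y)  # run the k-means clustering once

# Reuse the clustering across multiple calls; X is ignored when
# 'precomputed' is supplied, but must still be provided.
out1 <- findKNN(Y, k=10, precomputed=pre, subset=1:100)
out2 <- findKNN(Y, k=20, precomputed=pre)
```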

Currently, only Euclidean distances are supported, but support may be added for other distance types depending on demand. It remains to be seen whether the speed-up achieved with k-means is still applicable to alternative distance metrics.

Note that the code here was originally derived from an implementation in the cydar package (Lun et al., 2017).

Value

A list is returned containing:

index, if get.index=TRUE: an integer matrix with one row per point and k columns, containing the row indices in X of that point's k nearest neighbors, ordered by increasing distance.

distance, if get.distance=TRUE: a numeric matrix of the same dimensions, containing the distances to each point's k nearest neighbors, ordered by increasing distance.

If subset is not NULL, each row of the above matrices refers to a point in the subset, in the same order as supplied in subset.

If raw.index=TRUE, the values in index refer to columns of precomputed$data.

Ties and random seeds

In general, this function is fully deterministic, despite the use of a stochastic kmeans step in precluster. The only exception occurs when there are tied distances to neighbors, at which point the order and/or identity of the k-nearest neighboring points is not well-defined.

A warning will be raised if ties are detected among the k+1 nearest neighbors, as this indicates that the order/identity is arbitrary. Specifically, ties are detected when a larger distance is less than (1 + 1e-10)-fold of the smaller distance. This criterion tends to be somewhat conservative in the sense that it will warn users even if there is no problem (i.e., the distances are truly different). However, more accurate detection is difficult to achieve due to the vagaries of numerical precision across different machines.

In the presence of ties, the output will depend on the ordering of points in the precluster output. Users should set the seed to guarantee consistent (albeit arbitrary) results across different runs of the function. Note, however, that the exact selection of tied points depends on the numerical precision of the system. Thus, even after setting a seed, there is no guarantee that the results will be reproducible across machines (especially Windows)!
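To illustrate, tied distances arise readily with discrete-valued data; setting the seed before each call gives consistent results on a given machine (a sketch, assuming the kmknn package is loaded):

```r
library(kmknn)

# Integer-valued coordinates produce many tied distances:
set.seed(42)
Z <- matrix(sample(1:4, 500, replace=TRUE), ncol=5)

set.seed(1000)
res1 <- findKNN(Z, k=3)  # may warn about tied distances

set.seed(1000)
res2 <- findKNN(Z, k=3)

identical(res1$index, res2$index)  # TRUE on the same machine
```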

Returning raw indices

Advanced users can also set raw.index=TRUE, which yields results equivalent to running findKNN on t(precomputed$data) directly. With this setting, the indices in the output index matrix refer to columns of precomputed$data. Similarly, the subset argument is assumed to refer to columns of precomputed$data. This may be more convenient when dealing with a common precomputed object across multiple calls to findKNN, as it avoids the need to switch between the original ordering and that from precluster.
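A minimal sketch of this mode (assuming the kmknn package is loaded):

```r
library(kmknn)

set.seed(100)
Y <- matrix(rnorm(10000), ncol=20)
pre <- precluster(Y)

# With raw.index=TRUE, the returned indices (and the 'subset'
# argument) refer to columns of pre$data, i.e., the reordered
# points from preclustering, rather than rows of Y:
raw <- findKNN(Y, k=5, precomputed=pre, raw.index=TRUE, subset=1:10)
head(raw$index)
```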

Author(s)

Aaron Lun

References

Wang X (2012). A fast exact k-nearest neighbors algorithm for high dimensional search using k-means clustering and triangle inequality. Proc Int Jt Conf Neural Netw, 43, 6:2351-2358.

Lun ATL, Richard AC, Marioni JC (2017). Testing for differential abundance in mass cytometry data. Nat. Methods, 14, 7:707-709.

See Also

precluster, queryKNN

Examples

Y <- matrix(rnorm(100000), ncol=20)
out <- findKNN(Y, k=25)
head(out$index)
head(out$distance)

kmknn documentation built on Nov. 1, 2018, 4:21 a.m.