Description
Use the KMKNN (K-means for k-nearest neighbors) algorithm to identify nearest neighbors from a dataset.
Usage

    findKNN(X, k, get.index=TRUE, get.distance=TRUE, BPPARAM=SerialParam(),
        precomputed=NULL, subset=NULL, raw.index=FALSE)
Arguments

X: A numeric matrix where rows correspond to data points and columns correspond to variables (i.e., dimensions).

k: A positive integer scalar specifying the number of nearest neighbors to retrieve.

get.index: A logical scalar indicating whether the indices of the nearest neighbors should be recorded.

get.distance: A logical scalar indicating whether distances to the nearest neighbors should be recorded.

BPPARAM: A BiocParallelParam object indicating how the search should be parallelized.

precomputed: The precomputed output of precluster on X.

subset: A vector indicating the rows of X for which the nearest neighbors should be identified.

raw.index: A logical scalar indicating whether raw column indices to precomputed$data should be returned.
Details

This function uses the method proposed by Wang (2012) to quickly identify k-nearest neighbors in high-dimensional data. Briefly, data points are rapidly clustered into N clusters using k-means clustering in precluster, where N is the square root of the number of points. This clustering is then used to speed up the nearest-neighbor search across X, exploiting the triangle inequality between cluster centers, the query point and each point in the cluster to narrow the search space.
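To illustrate the pruning idea, consider the following conceptual sketch (hypothetical helper names, not the package's internal code). For a query q, a cluster center cen, and a clustered point x, the triangle inequality gives d(q, x) >= |d(q, cen) - d(cen, x)|, so x can be discarded without ever computing d(q, x) once this lower bound exceeds the current k-th smallest distance:

    # Lower-bound test based on the triangle inequality.
    # 'd.cen.x' is the (precomputed) distance from the cluster center to x;
    # 'kth.dist' is the distance to the current k-th nearest neighbor of q.
    euclid <- function(a, b) sqrt(sum((a - b)^2))
    can.skip <- function(q, cen, d.cen.x, kth.dist) {
        abs(euclid(q, cen) - d.cen.x) > kth.dist
    }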
By default, nearest neighbors are identified for all data points within X. If subset is specified, nearest neighbors are only detected for the points in the subset. This yields the same result as (but is more efficient than) subsetting the output matrices after running findKNN with subset=NULL.
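For example (a minimal sketch, assuming the package providing findKNN is attached and that no tied distances are present):

    Y <- matrix(rnorm(20000), ncol=20)
    keep <- 1:50
    sub.out <- findKNN(Y, k=5, subset=keep)
    full.out <- findKNN(Y, k=5)
    identical(sub.out$index, full.out$index[keep,])  # should be TRUE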
Turning off get.index or get.distance will omit the corresponding matrices from the output. This may provide a slight speed boost when these values are not of interest. Using BPPARAM will also split the search across multiple workers, which should increase speed proportionally (in theory) to the number of cores.
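For instance (a sketch, assuming the BiocParallel package is installed):

    library(BiocParallel)
    Y <- matrix(rnorm(20000), ncol=20)
    # Skip the distance matrix and split the search across two workers.
    out <- findKNN(Y, k=10, get.distance=FALSE, BPPARAM=MulticoreParam(2))
    names(out)  # "index" only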
If the function is to be called multiple times with the same X (e.g., with different subset), it may be faster to call precluster once externally, and pass the returned object to findKNN via the precomputed argument. This avoids unnecessary repeated k-means clustering and can provide a substantial speed-up. Note that when precomputed is supplied, the value of X is completely ignored.
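A minimal sketch of this pattern (assuming precluster is exported by the same package):

    Y <- matrix(rnorm(20000), ncol=20)
    pre <- precluster(Y)
    out1 <- findKNN(Y, k=10, precomputed=pre, subset=1:100)
    out2 <- findKNN(Y, k=10, precomputed=pre, subset=101:200)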
Currently, only Euclidean distances are supported, but support may be added for other distance types depending on demand. It remains to be seen whether the speed-up achieved with k-means is still applicable to alternative distance metrics.
Note that the code here was originally derived from an implementation in the cydar package (Lun et al., 2017).
Value

A list is returned containing:

index, if get.index=TRUE: an integer matrix where each row corresponds to a point (denoted here as i) in X. The row for i contains the row indices of X that are the nearest neighbors to point i, sorted by increasing distance from i.

distance, if get.distance=TRUE: a numeric matrix where each row corresponds to a point (as above) and contains the sorted distances of the neighbors from i.

If subset is not NULL, each row of the above matrices refers to a point in the subset, in the same order as supplied in subset.

If raw.index=TRUE, the values in index refer to columns of precomputed$data.
Ties and random seeds

In general, this function is fully deterministic, despite the use of a stochastic kmeans step in precluster. The only exception occurs when there are tied distances to neighbors, at which point the order and/or identity of the k-nearest neighboring points is not well-defined.

A warning will be raised if ties are detected among the k+1 nearest neighbors, as this indicates that the order/identity is arbitrary. Specifically, ties are detected when a larger distance is less than (1 + 1e-10)-fold of the smaller distance. This criterion tends to be somewhat conservative in the sense that it will warn users even if there is no problem (i.e., the distances are truly different). However, more accurate detection is difficult to achieve due to the vagaries of numerical precision across different machines.

In the presence of ties, the output will depend on the ordering of points in the precluster output. Users should set the seed to guarantee consistent (albeit arbitrary) results across different runs of the function. Note, however, that the exact selection of tied points depends on the numerical precision of the system. Thus, even after setting a seed, there is no guarantee that the results will be reproducible across machines (especially Windows)!
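For example, one might fix the seed before each call (a sketch; this makes results consistent across runs on a given machine, though the resolution of any ties remains arbitrary):

    Y <- matrix(rnorm(20000), ncol=20)
    set.seed(100)
    out1 <- findKNN(Y, k=10)
    set.seed(100)
    out2 <- findKNN(Y, k=10)
    identical(out1, out2)  # TRUE on the same machine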
Returning raw indices

Advanced users can also set raw.index=TRUE, which yields results equivalent to running findKNN on t(precomputed$data) directly. With this setting, the indices in the output index matrix refer to columns of precomputed$data. Similarly, the subset argument is assumed to refer to columns of precomputed$data. This may be more convenient when dealing with a common precomputed object across multiple calls to findKNN, as it avoids the need to switch between the original ordering and that from precluster.
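A sketch of this usage (assuming, as implied above, that precomputed$data stores points as columns; X is still supplied here but ignored in favor of precomputed):

    Y <- matrix(rnorm(20000), ncol=20)
    pre <- precluster(Y)
    raw <- findKNN(Y, k=10, precomputed=pre, raw.index=TRUE)
    # Neighbor indices refer to columns of the reordered matrix:
    neighbors.of.first <- pre$data[, raw$index[1,]]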
Author(s)

Aaron Lun
References

Wang X (2012). A fast exact k-nearest neighbors algorithm for high dimensional search using k-means clustering and triangle inequality. Proc Int Jt Conf Neural Netw, 43, 6:2351-2358.

Lun ATL, Richard AC, Marioni JC (2017). Testing for differential abundance in mass cytometry data. Nat. Methods, 14, 7:707-709.
Examples
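A minimal usage sketch, assuming the package providing findKNN is attached:

    Y <- matrix(rnorm(100000), ncol=20)
    out <- findKNN(Y, k=25)
    head(out$index)
    head(out$distance)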