BiocNeighbors-ties: Handling tied distances

BiocNeighbors-tiesR Documentation

Handling tied distances

Description

Interpreting the warnings when distances are tied in an exact nearest neighbor (NN) search.

The problem of ties

The most obvious problem with ties is that it may affect the identity of the reported neighbors. The various NN search functions will return a constant number of neighbors for each data point. If the kth neighbor is tied with the k+1th neighbor, this requires an arbitrary decision about which data point to retain in the NN set. A milder issue is that the order of the neighbors within the set is arbitrary, which may be important for certain algorithms.

As such, a warning will be raised if tied distances are detected among the k+1 NNs for any of the exact NN search methods. We only consider exact ties at double precision - previous versions of this package would account for numerical imprecision, but this is no longer the case. No warning is given for the approximate methods as their use already implies that a certain degree of inaccuracy is acceptable.

Interaction with random seeds

In general, the exact NN search algorithms in this package are fully deterministic despite the use of stochastic steps during index construction. The only exception occurs when there are tied distances to neighbors, at which point the order and/or identity of the k-nearest neighboring points is not well-defined. This is because, in the presence of ties, the output will depend on the ordering of points in the constructed index from buildKmknn or buildVptree.

Users should set the seed to guarantee consistent (albeit arbitrary) results across different runs of the function. However, note that the exact selection of tied points depends on the numerical precision of the system. Thus, even after setting a seed, there is no guarantee that the results will be reproducible across machines (especially Windows)!

Turning off the warnings

It may ocassionally be appropriate to disable the warnings by setting warn.ties=FALSE. The most obvious scenario is when get.index=FALSE, i.e., we are only interested in the distances to the neighbors. In such cases, the presence of ties does not matter as changes to the identity of tied neighbors do not affect the returned distances (which, for ties, are equal by definition). Similarly, if the seed is set prior to the search, the warnings are unnecessary as the output is fully deterministic.

Author(s)

Aaron Lun

See Also

findKmknn and findVptree for examples where tie warnings are produced.

Examples

vals <- matrix(0, nrow=10, ncol=20)
out <- findKmknn(vals, k=5)


LTLA/kmknn documentation built on Feb. 5, 2024, 6:03 p.m.