findKNN-functions: Find nearest neighbors
In LTLA/BiocNeighbors: Nearest Neighbor Detection for Bioconductor Packages

findKNN-functions

R Documentation

Find nearest neighbors

Description

Find the nearest neighbors of each point in a dataset, using a variety of algorithms.

Usage

findAnnoy(
  X,
  k,
  get.index = TRUE,
  get.distance = TRUE,
  last = k,
  BPPARAM = SerialParam(),
  precomputed = NULL,
  subset = NULL,
  raw.index = NA,
  warn.ties = NA,
  ...
)

findHnsw(
  X,
  k,
  get.index = TRUE,
  get.distance = TRUE,
  last = k,
  BPPARAM = SerialParam(),
  precomputed = NULL,
  subset = NULL,
  raw.index = NA,
  warn.ties = NA,
  ...
)

findKmknn(
  X,
  k,
  get.index = TRUE,
  get.distance = TRUE,
  last = k,
  BPPARAM = SerialParam(),
  precomputed = NULL,
  subset = NULL,
  raw.index = FALSE,
  warn.ties = TRUE,
  ...
)

findVptree(
  X,
  k,
  get.index = TRUE,
  get.distance = TRUE,
  last = k,
  BPPARAM = SerialParam(),
  precomputed = NULL,
  subset = NULL,
  raw.index = FALSE,
  warn.ties = TRUE,
  ...
)

findExhaustive(
  X,
  k,
  get.index = TRUE,
  get.distance = TRUE,
  last = k,
  BPPARAM = SerialParam(),
  precomputed = NULL,
  subset = NULL,
  raw.index = FALSE,
  warn.ties = TRUE,
  ...
)

Arguments

`X`	A numeric matrix where rows correspond to data points and columns correspond to variables (i.e., dimensions).
`k`	A positive integer scalar specifying the number of nearest neighbors to retrieve.
`get.index`	A logical scalar indicating whether the indices of the nearest neighbors should be recorded.
`get.distance`	A logical scalar indicating whether distances to the nearest neighbors should be recorded.
`last`	An integer scalar specifying the number of furthest neighbors for which statistics should be returned.
`BPPARAM`	A BiocParallelParam object indicating how the search should be parallelized.
`precomputed`	A BiocNeighborIndex object of the appropriate class, generated from `X`. For `findExhaustive`, this should be a ExhaustiveIndex from `buildExhaustive`; For `findKmknn`, this should be a KmknnIndex from `buildKmknn`; for `findVptree`, this should be a VptreeIndex from `buildVptree`; for `findAnnoy`, this should be a AnnoyIndex from `buildAnnoy`; and for `findHnsw`, this should be a HnswIndex from `buildHnsw`.
`subset`	A vector indicating the rows of `X` for which the nearest neighbors should be identified.
`raw.index`	A logial scalar indicating whether raw column indices should be returned, see `?"BiocNeighbors-raw-index"`. This argument is ignored for `findAnnoy` and `findHnsw`.
`warn.ties`	Logical scalar indicating whether a warning should be raised if any of the `k+1` neighbors have tied distances. This argument is ignored for `findAnnoy` and `findHnsw`.
`...`	Further arguments to pass to the respective `build*` function for each algorithm. This includes `distance`, a string specifying whether `"Euclidean"`, `"Manhattan"` or `"Cosine"` distances are to be used.

Details

All of these functions identify points in X that are the k nearest neighbors of each other point. findAnnoy and findHnsw perform an approximate search, while findKmknn and findVptree are exact. The upper bound for k is set at the number of points in X minus 1.

By default, nearest neighbors are identified for all data points within X. If subset is specified, nearest neighbors are only detected for the points in the subset. This yields the same result as (but is more efficient than) subsetting the output matrices after running findKmknn with subset=NULL.

Turning off get.index or get.distance will not return the corresponding matrices in the output. This may provide a slight speed boost when these returned values are not of interest. Using BPPARAM will also split the search across multiple workers, which should increase speed proportionally (in theory) to the number of cores.

Setting last will return indices and/or distances for the k - last + 1-th closest neighbor to the k-th neighbor. This can be used to improve memory efficiency, e.g., by only returning statistics for the k-th nearest neighbor by setting last=1. Note that this is entirely orthogonal to subset.

If multiple queries are to be performed to the same X, it may be beneficial to build the index from X (e.g., with buildKmknn). The resulting BiocNeighborIndex object can be supplied as precomputed to multiple function calls, avoiding the need to repeat index construction in each call. Note that when precomputed is supplied, the value of X is completely ignored.

For exact methods, see comments in ?"BiocNeighbors-ties" regarding the warnings when tied distances are observed. For approximate methods, see comments in buildAnnoy and buildHnsw about the (lack of) randomness in the search results.

Value

A list is returned containing:

index, if get.index=TRUE. This is an integer matrix where each row corresponds to a point (denoted here as i) in X. The row for i contains the row indices of X that are the nearest neighbors to point i, sorted by increasing distance from i.
distance, if get.distance=TRUE. This is a numeric matrix where each row corresponds to a point (as above) and contains the sorted distances of the neighbors from i.

Each matrix contains last columns. If subset is not NULL, each row of the above matrices refers to a point in the subset, in the same order as supplied in subset.

See ?"BiocNeighbors-raw-index" for an explanation of the output when raw.index=TRUE for the functions that support it.

Author(s)

Aaron Lun

Examples

Y <- matrix(rnorm(100000), ncol=20)

out <- findExhaustive(Y, k=8)
head(out$index)
head(out$distance)

out1 <- findKmknn(Y, k=8)
head(out1$index)
head(out1$distance)

out2 <- findVptree(Y, k=8)
head(out2$index)
head(out2$distance)

out3 <- findAnnoy(Y, k=8)
head(out3$index)
head(out3$distance)

out4 <- findHnsw(Y, k=8)
head(out4$index)
head(out4$distance)

LTLA/BiocNeighbors documentation built on Jan. 14, 2024, 9:46 p.m.