queryKNN-functions: Query nearest neighbors

queryKNN-functionsR Documentation

Query nearest neighbors

Description

Query a dataset for nearest neighbors of points in another dataset, using a variety of algorithms.

Usage

queryAnnoy(
  X,
  query,
  k,
  get.index = TRUE,
  get.distance = TRUE,
  last = k,
  BPPARAM = SerialParam(),
  precomputed = NULL,
  transposed = FALSE,
  subset = NULL,
  raw.index = NA,
  warn.ties = NA,
  ...
)

queryHnsw(
  X,
  query,
  k,
  get.index = TRUE,
  get.distance = TRUE,
  last = k,
  BPPARAM = SerialParam(),
  precomputed = NULL,
  transposed = FALSE,
  subset = NULL,
  raw.index = NA,
  warn.ties = NA,
  ...
)

queryKmknn(
  X,
  query,
  k,
  get.index = TRUE,
  get.distance = TRUE,
  last = k,
  BPPARAM = SerialParam(),
  precomputed = NULL,
  transposed = FALSE,
  subset = NULL,
  raw.index = FALSE,
  warn.ties = TRUE,
  ...
)

queryVptree(
  X,
  query,
  k,
  get.index = TRUE,
  get.distance = TRUE,
  last = k,
  BPPARAM = SerialParam(),
  precomputed = NULL,
  transposed = FALSE,
  subset = NULL,
  raw.index = FALSE,
  warn.ties = TRUE,
  ...
)

queryExhaustive(
  X,
  query,
  k,
  get.index = TRUE,
  get.distance = TRUE,
  last = k,
  BPPARAM = SerialParam(),
  precomputed = NULL,
  transposed = FALSE,
  subset = NULL,
  raw.index = FALSE,
  warn.ties = TRUE,
  ...
)

Arguments

X

A numeric matrix where rows correspond to data points and columns correspond to variables (i.e., dimensions).

query

A numeric matrix of query points, containing different data points in the rows but the same number and ordering of dimensions in the columns.

k

A positive integer scalar specifying the number of nearest neighbors to retrieve.

get.index

A logical scalar indicating whether the indices of the nearest neighbors should be recorded.

get.distance

A logical scalar indicating whether distances to the nearest neighbors should be recorded.

last

An integer scalar specifying the number of furthest neighbors for which statistics should be returned.

BPPARAM

A BiocParallelParam object indicating how the search should be parallelized.

precomputed

A BiocNeighborIndex object of the appropriate class, generated from X. For findExhaustive, this should be a ExhaustiveIndex from buildExhaustive; For findKmknn, this should be a KmknnIndex from buildKmknn; for findVptree, this should be a VptreeIndex from buildVptree; for findAnnoy, this should be a AnnoyIndex from buildAnnoy; and for findHnsw, this should be a HnswIndex from buildHnsw.

transposed

A logical scalar indicating whether the query is transposed, in which case query is assumed to contain dimensions in the rows and data points in the columns.

subset

A vector indicating the rows of query (or columns, if transposed=TRUE) for which the nearest neighbors should be identified.

raw.index

A logial scalar indicating whether raw column indices should be returned, see ?"BiocNeighbors-raw-index". This argument is ignored for findAnnoy and findHnsw.

warn.ties

Logical scalar indicating whether a warning should be raised if any of the k+1 neighbors have tied distances. This argument is ignored for findAnnoy and findHnsw.

...

Further arguments to pass to the respective build* function for each algorithm. This includes distance, a string specifying whether "Euclidean", "Manhattan" or "Cosine" distances are to be used.

Details

All of these functions identify points in X that are the k nearest neighbors of each point in query. queryAnnoy performs an approximate search, while queryExhaustive, queryKmknn and queryVptree are exact. This requires both X and query to have the same number of dimensions. Moreover, the upper bound for k is set at the number of points in X.

By default, nearest neighbors are identified for all data points within query. If subset is specified, nearest neighbors are only detected for the query points in the subset. This yields the same result as (but is more efficient than) subsetting the output matrices after running queryKmknn on the full query.

If transposed=TRUE, this function assumes that query is already transposed, which saves a bit of time by avoiding an unnecessary transposition. Turning off get.index or get.distance may also provide a slight speed boost when these returned values are not of interest. Using BPPARAM will also split the search by query points across multiple processes.

Setting last will return indices and/or distances for the k - last + 1-th closest neighbor to the k-th neighbor. This can be used to improve memory efficiency, e.g., by only returning statistics for the k-th nearest neighbor by setting last=1. Note that this is entirely orthogonal to subset.

If multiple queries are to be performed to the same X, it may be beneficial to build the index from X (e.g., with buildKmknn). The resulting BiocNeighborIndex object can be supplied as precomputed to multiple function calls, avoiding the need to repeat index construction in each call. Note that when precomputed is supplied, the value of X is ignored.

For exact methods, see comments in ?"BiocNeighbors-ties" regarding the warnings when tied distances are observed. For approximate methods, see comments in buildAnnoy and buildHnsw about the (lack of) randomness in the search results.

Value

A list is returned containing:

  • index, if get.index=TRUE. This is an integer matrix where each row corresponds to a point (denoted here as i) in query. The row for i contains the row indices of X that are the nearest neighbors to point i, sorted by increasing distance from i.

  • distance, if get.distance=TRUE. This is a numeric matrix where each row corresponds to a point (as above) and contains the sorted distances of the neighbors from i.

Each matrix contains last columns. If subset is not NULL, each row of the above matrices refers to a point in the subset, in the same order as supplied in subset.

See ?"BiocNeighbors-raw-index" for an explanation of the output when raw.index=TRUE for the functions that support it.

Author(s)

Aaron Lun

See Also

buildExhaustive, buildKmknn, buildVptree, or buildAnnoy to build an index ahead of time.

See ?"BiocNeighbors-algorithms" for an overview of the available algorithms.

Examples

Y <- matrix(rnorm(100000), ncol=20)
Z <- matrix(rnorm(20000), ncol=20)

out <- queryExhaustive(Y, query=Z, k=5)
head(out$index)
head(out$distance)

out1 <- queryKmknn(Y, query=Z, k=5)
head(out1$index)
head(out1$distance)

out2 <- queryVptree(Y, query=Z, k=5)
head(out2$index)
head(out2$distance)

out3 <- queryAnnoy(Y, query=Z, k=5)
head(out3$index)
head(out3$distance)

out4 <- queryHnsw(Y, query=Z, k=5)
head(out4$index)
head(out4$distance)


LTLA/BiocNeighbors documentation built on Jan. 14, 2024, 9:46 p.m.