buildAnnoy: Build an Annoy index

View source: R/buildAnnoy.R

buildAnnoy    R Documentation

Description

Build an Annoy index and save it to file in preparation for a nearest-neighbors search.

Usage

buildAnnoy(
  X,
  transposed = FALSE,
  ntrees = 50,
  directory = tempdir(),
  search.mult = ntrees,
  fname = tempfile(tmpdir = directory, fileext = ".idx"),
  distance = c("Euclidean", "Manhattan", "Cosine")
)

Arguments

X

A numeric matrix where rows correspond to data points and columns correspond to variables (i.e., dimensions).

transposed

Logical scalar indicating whether X is transposed, i.e., rows are variables and columns are data points.

ntrees

Integer scalar specifying the number of trees to build in the index.

directory

String containing the path to the directory in which to save the index file.

search.mult

Numeric scalar specifying the multiplier applied to the number of requested neighbors to obtain the number of points to search (see Details).

fname

String containing the path to the index file.

distance

String specifying the type of distance to use, one of "Euclidean", "Manhattan" or "Cosine".

Details

This function is automatically called by findAnnoy and related functions. However, it can be called directly by the user to save time if multiple queries are to be performed on the same X, as the index then only needs to be built once.
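As a rough sketch of this reuse pattern, the index can be built once and then passed to several searches. This assumes that the package providing buildAnnoy is already attached, and that findAnnoy and queryAnnoy accept the prebuilt index through an argument named precomputed; that argument name is not stated on this page, so check the findAnnoy documentation before relying on it.

```r
# Assumes the package providing buildAnnoy(), findAnnoy() and queryAnnoy()
# is already attached.
Y <- matrix(rnorm(100000), ncol=20)

# Build the index once.
idx <- buildAnnoy(Y, ntrees=50)

# Reuse the same index for several searches on the same data.
# NOTE: 'precomputed' is an assumed argument name, see the lead-in.
nn5 <- findAnnoy(precomputed=idx, k=5)
nn10 <- findAnnoy(precomputed=idx, k=10)

# Query the same index with a new set of points in the same 20 dimensions.
Z <- matrix(rnorm(200), ncol=20)
qout <- queryAnnoy(precomputed=idx, query=Z, k=5)
```

Each call above skips the indexing step, which is where the time saving comes from.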

It is advisable to set the directory argument to a location that is amenable to parallel read operations on HPC file systems. If index files are constructed manually in this way, the user is also responsible for cleaning them up after all calculations are completed.
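A minimal sketch of manual index placement and clean-up is shown below. The index.dir location is a hypothetical placeholder (a subdirectory of tempdir() so that the sketch runs anywhere); on an HPC system it would be replaced by a suitable shared location. Supplying fname explicitly means the path is known for later removal. The sketch assumes that the package providing buildAnnoy is already attached.

```r
# Hypothetical index location; substitute a suitable shared directory on HPC.
index.dir <- file.path(tempdir(), "annoy-indices")
dir.create(index.dir, showWarnings=FALSE)

# Choose the index file path up front so that it can be removed later.
idx.file <- tempfile(tmpdir=index.dir, fileext=".idx")

Y <- matrix(rnorm(100000), ncol=20)
idx <- buildAnnoy(Y, directory=index.dir, fname=idx.file)

# ... run all nearest-neighbor searches that use 'idx' here ...

# Remove the index file once all calculations are completed.
unlink(idx.file)
```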

The ntrees parameter controls the trade-off between accuracy and computational work. More trees provide greater accuracy at the cost of more computational work, both in the indexing time and in the search time of downstream functions.

The search.mult argument controls the parameter known as search_k in the original Annoy documentation. Specifically, search_k is defined as k * search.mult, where k is the number of nearest neighbors to identify in downstream functions. This represents the number of points to search exhaustively and determines the run-time balance between speed and accuracy. The default of search.mult=ntrees is based on the Annoy library defaults. Note that this parameter is not actually used in the index construction itself, and is only included here so that the output index fully parametrizes the search.
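The arithmetic behind search_k can be illustrated in plain R. The values of k and ntrees below are arbitrary choices for illustration, and the variable names search_k and search_k_fast are introduced here only for the example.

```r
k <- 10        # number of nearest neighbors requested downstream
ntrees <- 50   # number of trees in the index

# Default: search.mult = ntrees.
search.mult <- ntrees
search_k <- k * search.mult
search_k       # 500 points searched exhaustively

# Halving the multiplier halves the number of points searched,
# trading accuracy for speed.
search_k_fast <- k * (ntrees / 2)
search_k_fast  # 250
```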

Technically, the index construction algorithm is stochastic but, for various logistical reasons, the seed is hard-coded into the C++ code. This means that the results of the Annoy neighbor searches will be fully deterministic for the same inputs, even though the theory provides no such guarantees.

Value

An AnnoyIndex object containing a path to the index file, plus additional parameters for the search.

Author(s)

Aaron Lun

See Also

AnnoyIndex, for details on the output class.

findAnnoy and queryAnnoy, for dependent functions.

Examples

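# Simulate 5000 points in 20 dimensions.
Y <- matrix(rnorm(100000), ncol=20)

# Build the index with default settings and print the AnnoyIndex object.
out <- buildAnnoy(Y)
out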


LTLA/kmknn documentation built on Feb. 5, 2024, 6:03 p.m.