buildAnnoy | R Documentation |
Build an Annoy index and save it to file in preparation for a nearest-neighbors search.
buildAnnoy(
X,
transposed = FALSE,
ntrees = 50,
directory = tempdir(),
search.mult = ntrees,
fname = tempfile(tmpdir = directory, fileext = ".idx"),
distance = c("Euclidean", "Manhattan", "Cosine")
)
X |
A numeric matrix where rows correspond to data points and columns correspond to variables (i.e., dimensions). |
transposed |
Logical scalar indicating whether X is transposed, i.e., rows are variables and columns are data points. |
ntrees |
Integer scalar specifying the number of trees to build in the index. |
directory |
String containing the path to the directory in which to save the index file. |
search.mult |
Numeric scalar specifying the multiplier for the number of points to search. |
fname |
String containing the path to the index file. |
distance |
String specifying the type of distance to use. |
This function is automatically called by findAnnoy and related functions.
However, it can be called directly by the user to save time if multiple queries are to be performed on the same X.
It is advisable to change directory to a location that is amenable to parallel read operations on HPC file systems.
Of course, if index files are manually constructed, the user is also responsible for their clean-up after all calculations are completed.
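As a minimal sketch of this workflow, the helper below builds the index once, reuses it for several searches and removes the index file afterwards. The function name find_with_shared_index and its k.values argument are hypothetical and exist only for illustration. The sketch also assumes a BiocNeighbors release that exports buildAnnoy, findAnnoy (with its precomputed argument) and the AnnoyIndex_path accessor.

```r
# Sketch only: assumes a BiocNeighbors release that exports
# buildAnnoy(), findAnnoy() and AnnoyIndex_path().
find_with_shared_index <- function(X, k.values, directory = tempdir()) {
    # Build the index once. 'directory' could instead point to a location
    # that handles parallel reads well on an HPC file system.
    idx <- BiocNeighbors::buildAnnoy(X, directory = directory)

    # A manually built index is the caller's responsibility,
    # so delete the index file when this function exits.
    on.exit(unlink(BiocNeighbors::AnnoyIndex_path(idx)), add = TRUE)

    # Reuse the same index for every search instead of rebuilding it.
    lapply(k.values, function(k) {
        BiocNeighbors::findAnnoy(k = k, precomputed = idx)
    })
}
```

A call such as find_with_shared_index(Y, k.values = c(5, 10)) would then run two searches against a single index file.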
The ntrees parameter controls the trade-off between accuracy and computational work.
More trees provide greater accuracy at the cost of more computational work (both in terms of the indexing time and search speed in downstream functions).
The search.mult parameter controls the value known as search_k in the original Annoy documentation.
Specifically, search_k is defined as k * search.mult, where k is the number of nearest neighbors to identify in downstream functions.
This represents the number of points to search exhaustively and determines the run-time balance between speed and accuracy.
The default search.mult=ntrees is based on the Annoy library defaults.
Note that this parameter is not actually used in the index construction itself, and is only included here so that the output index fully parametrizes the search.
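The arithmetic behind search_k can be illustrated with a small base R helper. The function effective_search_k is hypothetical and not part of the package. It only mirrors the definition given above, including the default search.mult=ntrees.

```r
# Hypothetical helper mirroring the definition search_k = k * search.mult.
effective_search_k <- function(k, ntrees = 50, search.mult = ntrees) {
    k * search.mult
}

effective_search_k(k = 10)                     # defaults: 10 * 50 = 500
effective_search_k(k = 10, search.mult = 200)  # wider search: 10 * 200 = 2000
```

Raising search.mult increases the number of points examined per query, which should improve accuracy at the cost of slower searches.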
Technically, the index construction algorithm is stochastic but, for various logistical reasons, the seed is hard-coded into the C++ code. This means that the results of the Annoy neighbor searches will be fully deterministic for the same inputs, even though the theory provides no such guarantees.
An AnnoyIndex object containing a path to the index file, plus additional parameters for the search.
Aaron Lun
AnnoyIndex, for details on the output class.
findAnnoy and queryAnnoy, for dependent functions.
Y <- matrix(rnorm(100000), ncol=20)
out <- buildAnnoy(Y)
out