knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
options(bigANNOY.progress = FALSE)
set.seed(20260326)
bigANNOY is an approximate nearest-neighbour package for
bigmemory::big.matrix data. It builds a persisted Annoy index from a
reference matrix, searches that index with either self-search or external
queries, and returns results in a shape aligned with bigKNN.
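This vignette walks through the first workflow most users need: building an
index from a big.matrix reference, searching it, streaming results into
big.matrix objects when needed, and closing, reopening, and validating the index.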
The examples are intentionally small, but the same API is designed for larger
file-backed big.matrix inputs.
library(bigANNOY)
library(bigmemory)
bigANNOY is built around bigmemory::big.matrix, so we will start from a
dense matrix and convert it into a big.matrix.
ref_dense <- matrix(
  c(
    0.0, 0.1, 0.2, 0.3,
    0.1, 0.0, 0.1, 0.2,
    0.2, 0.1, 0.0, 0.1,
    1.0, 1.1, 1.2, 1.3,
    1.1, 1.0, 1.1, 1.2,
    1.2, 1.1, 1.0, 1.1,
    3.0, 3.1, 3.2, 3.3,
    3.1, 3.0, 3.1, 3.2
  ),
  ncol = 4,
  byrow = TRUE
)
ref_big <- as.big.matrix(ref_dense)
dim(ref_big)
The reference matrix has 8 rows and 4
columns. Each row is a candidate neighbour in the final search results.
annoy_build_bigmatrix() streams the reference rows into a persisted Annoy
index and writes a sidecar metadata file next to it.
index_path <- tempfile(fileext = ".ann")
index <- annoy_build_bigmatrix(
  ref_big,
  path = index_path,
  n_trees = 20L,
  metric = "euclidean",
  seed = 123L,
  load_mode = "lazy"
)
index
A few details are worth noticing:
- index$path is the persisted Annoy index file
- index$metadata_path is the sidecar metadata file written next to it
- load_mode = "lazy" means the object is initially metadata-only

You can check the current loaded state directly.
annoy_is_loaded(index)
With query = NULL, annoy_search_bigmatrix() searches the indexed reference
rows against themselves. In self-search mode, the nearest neighbour for each
row is another row, not the row itself.
self_result <- annoy_search_bigmatrix(
  index,
  k = 2L,
  search_k = 100L
)
self_result$index
round(self_result$distance, 3)
Because the first search loads the lazy index, the handle is now available for reuse.
annoy_is_loaded(index)
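The self-search convention described above (each row's neighbours exclude the row itself) can be cross-checked with a brute-force search in base R. This is a minimal sketch, not bigANNOY's implementation: the helper `exact_self_knn()` and the toy matrix `x` are hypothetical names introduced for illustration, and the search is exact rather than approximate. Setting the diagonal of the distance matrix to `Inf` is what keeps a row out of its own neighbour list.

```r
# Hypothetical helper: exact Euclidean self-search that skips the row itself.
exact_self_knn <- function(x, k) {
  d <- as.matrix(dist(x))  # full pairwise Euclidean distance matrix
  diag(d) <- Inf           # a row must never be its own neighbour
  # For each row, rank all other rows by distance and keep the first k.
  index <- t(apply(d, 1, order))[, seq_len(k), drop = FALSE]
  distance <- t(apply(d, 1, sort))[, seq_len(k), drop = FALSE]
  list(index = unname(index), distance = unname(distance))
}

# Toy data (illustrative only): two tight pairs of points.
x <- matrix(c(0, 0,
              1, 0,
              5, 5,
              5, 7), ncol = 2, byrow = TRUE)

res <- exact_self_knn(x, k = 1L)
res$index[, 1]     # 2 1 4 3
res$distance[, 1]  # 1 1 2 2
```

This computes the full pairwise distance matrix, so it is only practical for small inputs. It is useful as a sanity check on a sample, not as a replacement for the index.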
The result object follows the same high-level shape as bigKNN:
str(self_result, max.level = 1)
In particular:
- index is a 1-based integer matrix
- distance is a double matrix
- k, metric, n_ref, and n_query describe the search
- exact is always FALSE for bigANNOY
- backend is "annoy"

External queries are often the more common workflow in practice. Here we build a small dense query matrix with rows close to the first, middle, and final clusters in the reference data.
query_dense <- matrix(
  c(
    0.05, 0.05, 0.15, 0.25,
    1.05, 1.05, 1.10, 1.25,
    3.05, 3.05, 3.15, 3.25
  ),
  ncol = 4,
  byrow = TRUE
)
query_result <- annoy_search_bigmatrix(
  index,
  query = query_dense,
  k = 3L,
  search_k = 100L
)
query_result$index
round(query_result$distance, 3)
The three query rows each return three approximate neighbours from the indexed reference matrix. For small examples like this one, the results will typically look exact, but the important point is that the API stays the same for larger problems where approximate search is preferable.
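To see what "exact" means for external queries on a small problem, the same search can be sketched as a brute-force scan in base R. The helper `exact_query_knn()` and the toy matrices below are hypothetical and introduced only for illustration; they do not use or describe bigANNOY internals.

```r
# Hypothetical helper: exact Euclidean k-nearest neighbours of each query row
# among the rows of a reference matrix.
exact_query_knn <- function(ref, query, k) {
  # One row of distances per query row, one column per reference row.
  d <- t(apply(query, 1, function(q) sqrt(colSums((t(ref) - q)^2))))
  index <- t(apply(d, 1, order))[, seq_len(k), drop = FALSE]
  distance <- t(apply(d, 1, sort))[, seq_len(k), drop = FALSE]
  list(index = unname(index), distance = unname(distance))
}

# Toy data (illustrative only).
ref <- matrix(c(0, 0,
                1, 0,
                5, 5,
                5, 7), ncol = 2, byrow = TRUE)
query <- matrix(c(0.9, 0.1,
                  5.0, 6.2), ncol = 2, byrow = TRUE)

exact <- exact_query_knn(ref, query, k = 2L)
exact$index
# Row 1: 2 1  (query 1 is closest to reference row 2, then row 1)
# Row 2: 4 3  (query 2 is closest to reference row 4, then row 3)
```

Unlike self-search, nothing is excluded here: a query that coincides with a reference row would report that row at distance zero.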
Two arguments matter most when you begin tuning:
- n_trees controls index quality and index size at build time
- search_k controls search effort at query time

As a starting point:

- increase search_k first if recall looks too low
- increase n_trees when query-time tuning alone is not enough
- use metric = "euclidean" when you want the most direct comparison with bigKNN

The package also supports "angular", "manhattan", and "dot" metrics,
but Euclidean is usually the easiest place to begin.
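Recall, the quantity these tuning rules react to, can be measured by comparing an approximate neighbour index matrix against an exact one. The sketch below uses a hypothetical helper, `recall_at_k()`, and hand-written index matrices. It assumes both matrices have one row per query and k columns of 1-based reference row numbers, which matches the result shape described earlier.

```r
# Hypothetical helper: fraction of exact neighbours that the approximate
# search also returned, averaged over all queries.
recall_at_k <- function(approx_index, exact_index) {
  stopifnot(identical(dim(approx_index), dim(exact_index)))
  hits <- vapply(
    seq_len(nrow(exact_index)),
    # Order within a row does not matter, only set membership.
    function(i) length(intersect(approx_index[i, ], exact_index[i, ])),
    integer(1)
  )
  sum(hits) / length(exact_index)
}

# Illustrative index matrices (not produced by any real search).
approx_index <- rbind(c(2L, 3L), c(1L, 3L))
exact_index  <- rbind(c(2L, 3L), c(1L, 4L))

recall_at_k(approx_index, exact_index)  # 0.75
```

One way to use this is to rerun the search with a larger search_k and recompute recall. If recall stops improving, rebuilding with more trees is the next thing to try.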
For larger workloads, you may not want to keep neighbour matrices in ordinary
R memory. bigANNOY can write directly into destination big.matrix objects.
index_out <- big.matrix(nrow(query_dense), 2L, type = "integer")
distance_out <- big.matrix(nrow(query_dense), 2L, type = "double")
streamed <- annoy_search_bigmatrix(
  index,
  query = query_dense,
  k = 2L,
  xpIndex = index_out,
  xpDistance = distance_out
)
bigmemory::as.matrix(index_out)
round(bigmemory::as.matrix(distance_out), 3)
The returned object still reports the same metadata, but the actual neighbour
matrices live in the destination big.matrix containers.
One of the main v3 improvements is explicit index lifecycle support. You can close a loaded handle, reopen the same index from disk, and validate its metadata before reuse.
annoy_close_index(index)
annoy_is_loaded(index)
reopened <- annoy_open_index(index$path, load_mode = "eager")
annoy_is_loaded(reopened)
Validation checks the recorded metadata against the current Annoy file and can also verify that the index loads successfully.
validation <- annoy_validate_index(reopened, strict = TRUE, load = TRUE)
validation$valid
validation$checks[, c("check", "passed", "severity")]
This is especially helpful when you want to reuse an index across sessions or
share the .ann file and its .meta sidecar with someone else.
For the quick start above we used:
- a big.matrix reference
- a dense matrix of queries
- big.matrix destinations for streamed outputs

The package also accepts:

- big.matrix objects
- big.matrix descriptor objects
- query = NULL for self-search

That broader file-backed workflow is covered in the dedicated vignette on
bigmemory persistence and descriptors.
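For orientation, this is roughly what a file-backed big.matrix and its descriptor look like on the bigmemory side. It is a minimal sketch that uses only bigmemory functions (`filebacked.big.matrix()`, `describe()`, `attach.big.matrix()`); the file names and toy values are illustrative, and how bigANNOY consumes these objects is left to the dedicated vignette.

```r
library(bigmemory)

# Illustrative backing directory; any writable location would do.
backing_dir <- tempfile("bigmatrix-demo-")
dir.create(backing_dir)

# File-backed big.matrix: the values live in a file on disk.
ref_fb <- filebacked.big.matrix(
  nrow = 4, ncol = 2, type = "double",
  backingfile = "ref.bin",
  descriptorfile = "ref.desc",
  backingpath = backing_dir
)
ref_fb[, ] <- matrix(c(0, 0, 1, 0, 5, 5, 5, 7), ncol = 2, byrow = TRUE)

# A descriptor is a lightweight handle that can re-attach the same data.
ref_desc <- describe(ref_fb)
ref_again <- attach.big.matrix(ref_desc)

dim(ref_again)
ref_again[4, ]
```

Because the descriptor points at the backing file rather than copying the data, attaching it gives a second handle to the same on-disk matrix.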
You have now seen the full first-run workflow:
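- create a big.matrix reference
- build a persisted Annoy index from it
- run self-search and external queries
- stream results into big.matrix objects when needed
- close, reopen, and validate the index

From here, the most useful next steps are: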
- benchmarking with benchmark_annoy_bigmatrix() and
  benchmark_annoy_recall_suite()