knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
options(bigANNOY.progress = FALSE)
set.seed(20260326)
One of the main goals of bigANNOY is to work comfortably with bigmemory
data that already lives on disk. Instead of forcing a large reference matrix
through dense in-memory copies, the package can build and query Annoy indexes
directly from file-backed big.matrix objects and their descriptors.
This vignette focuses on the most common disk-oriented workflows:
- big.matrix query layouts

library(bigANNOY)
library(bigmemory)
For reproducibility, we will create all backing files inside a temporary directory. In real work this would usually be a project directory or a shared data location.
workspace_dir <- tempfile("bigannoy-filebacked-")
dir.create(workspace_dir, recursive = TRUE, showWarnings = FALSE)

make_filebacked_matrix <- function(values, type, backingpath, name) {
  bm <- filebacked.big.matrix(
    nrow = nrow(values),
    ncol = ncol(values),
    type = type,
    backingfile = sprintf("%s.bin", name),
    descriptorfile = sprintf("%s.desc", name),
    backingpath = backingpath
  )
  bm[, ] <- values
  bm
}
We will create a reference dataset and store it in a file-backed
big.matrix. The corresponding descriptor file is what lets later R sessions
reattach to the same on-disk data.
ref_dense <- matrix(
  c(
    0.0, 0.0,
    5.0, 0.0,
    0.0, 5.0,
    5.0, 5.0,
    9.0, 9.0
  ),
  ncol = 2,
  byrow = TRUE
)

ref_fb <- make_filebacked_matrix(
  values = ref_dense,
  type = "double",
  backingpath = workspace_dir,
  name = "ref"
)

ref_desc <- describe(ref_fb)
ref_desc_path <- file.path(workspace_dir, "ref.desc")

file.exists(ref_desc_path)
dim(ref_fb)
At this point we have:
- the backing file ref.bin
- the descriptor file ref.desc
- a big.matrix object currently attached in this R session

The simplest persisted workflow is to build directly from the descriptor file
path instead of from the live big.matrix object. That mirrors how later
sessions typically work.
index_path <- file.path(workspace_dir, "ref.ann")

index <- annoy_build_bigmatrix(
  x = ref_desc_path,
  path = index_path,
  n_trees = 25L,
  metric = "euclidean",
  seed = 99L,
  load_mode = "lazy"
)

index
This pattern is useful because the build call no longer depends on a particular in-memory object being alive. As long as the descriptor can be reattached, the reference matrix can be used.
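Because a descriptor-based build only works if the backing file is still where the descriptor expects it, a quick check before a long build can be worthwhile. The sketch below uses only bigmemory and base R. The demo matrix, the file names, and the check itself are illustrative assumptions rather than part of bigANNOY, and the sketch assumes the descriptor file is the dput() output that bigmemory normally writes.

```r
library(bigmemory)

# Illustrative file-backed matrix in a throwaway directory.
demo_dir <- tempfile("descriptor-check-")
dir.create(demo_dir)
demo <- filebacked.big.matrix(
  nrow = 3, ncol = 2, type = "double",
  backingfile = "demo.bin",
  descriptorfile = "demo.desc",
  backingpath = demo_dir
)
demo[, ] <- matrix(as.numeric(1:6), ncol = 2)
desc_path <- file.path(demo_dir, "demo.desc")

# Drop the live handle to mimic starting from a fresh session.
rm(demo)
invisible(gc())

# Read the descriptor and confirm the backing file it names is present
# next to the descriptor file.
desc <- dget(desc_path)
backing_file <- file.path(dirname(desc_path), desc@description$filename)
files_ok <- file.exists(desc_path) && file.exists(backing_file)

# Reattach from the descriptor path alone.
demo_again <- attach.big.matrix(desc_path)
demo_again[, ]
```

If `files_ok` is `FALSE`, the descriptor and its backing file have probably been separated, and reattaching is likely to fail.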
For x, query, xpIndex, and xpDistance, bigANNOY accepts several
bigmemory-oriented forms:
- big.matrix
- big.matrix
- big.matrix.descriptor object

For queries only, a dense numeric matrix is also accepted.
That flexibility matters most in persisted workflows where one part of the pipeline writes descriptors and another part reattaches them later.
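To make the idea of interchangeable input forms concrete, here is a minimal sketch of how a live object, a descriptor object, and a descriptor file path can all be resolved to the same data using bigmemory alone. The helper `resolve_big_matrix()` is hypothetical and written only for illustration. It is not a bigANNOY function and does not claim to show how the package resolves its inputs internally.

```r
library(bigmemory)
library(methods)

# Hypothetical helper (not part of bigANNOY): turn a live big.matrix,
# a descriptor object, or a descriptor file path into an attached big.matrix.
resolve_big_matrix <- function(x) {
  if (is.big.matrix(x)) {
    return(x)
  }
  if (is(x, "big.matrix.descriptor")) {
    return(attach.big.matrix(x))
  }
  if (is.character(x) && length(x) == 1L && file.exists(x)) {
    return(attach.big.matrix(x))
  }
  stop("expected a big.matrix, a descriptor object, or a descriptor file path")
}

# Illustrative file-backed matrix in a throwaway directory.
demo_dir <- tempfile("resolve-demo-")
dir.create(demo_dir)
demo <- filebacked.big.matrix(
  nrow = 2, ncol = 2, type = "double",
  backingfile = "demo.bin",
  descriptorfile = "demo.desc",
  backingpath = demo_dir
)
demo[, ] <- matrix(c(0.2, 0.1, 4.7, 5.1), ncol = 2, byrow = TRUE)

from_object <- resolve_big_matrix(demo)
from_desc <- resolve_big_matrix(describe(demo))
from_path <- resolve_big_matrix(file.path(demo_dir, "demo.desc"))

# All three handles should see the same on-disk values.
all(from_object[, ] == from_desc[, ])
all(from_object[, ] == from_path[, ])
```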
Now we will create a file-backed query matrix and search the persisted Annoy index against it.
query_dense <- matrix(
  c(
    0.2, 0.1,
    4.7, 5.1
  ),
  ncol = 2,
  byrow = TRUE
)

query_fb <- make_filebacked_matrix(
  values = query_dense,
  type = "double",
  backingpath = workspace_dir,
  name = "query"
)

query_result_big <- annoy_search_bigmatrix(
  index,
  query = query_fb,
  k = 2L,
  search_k = 100L
)

query_result_big$index
round(query_result_big$distance, 3)
The query matrix itself is file-backed, but the search call looks the same as
it would for an in-memory big.matrix.
The same persisted query data can be supplied through its descriptor object or through the descriptor file path. This is often the most convenient way to reattach query data across sessions.
query_desc <- describe(query_fb)
query_desc_path <- file.path(workspace_dir, "query.desc")

query_result_desc <- annoy_search_bigmatrix(
  index,
  query = query_desc,
  k = 2L,
  search_k = 100L
)

query_result_path <- annoy_search_bigmatrix(
  index,
  query = query_desc_path,
  k = 2L,
  search_k = 100L
)

query_result_desc$index
query_result_path$index
These should match the result obtained from the live big.matrix query.
identical(query_result_big$index, query_result_desc$index)
identical(query_result_big$index, query_result_path$index)
all.equal(query_result_big$distance, query_result_desc$distance)
Large search results can be expensive to keep in ordinary R memory. To avoid
that, bigANNOY can stream neighbour ids and distances directly into
destination big.matrix objects.
For file-backed workflows, this means you can keep both the inputs and the outputs on disk.
index_store <- filebacked.big.matrix(
  nrow = nrow(query_dense),
  ncol = 2L,
  type = "integer",
  backingfile = "nn_index.bin",
  descriptorfile = "nn_index.desc",
  backingpath = workspace_dir
)

distance_store <- filebacked.big.matrix(
  nrow = nrow(query_dense),
  ncol = 2L,
  type = "double",
  backingfile = "nn_distance.bin",
  descriptorfile = "nn_distance.desc",
  backingpath = workspace_dir
)

streamed_result <- annoy_search_bigmatrix(
  index,
  query = query_desc,
  k = 2L,
  xpIndex = describe(index_store),
  xpDistance = file.path(workspace_dir, "nn_distance.desc")
)

bigmemory::as.matrix(index_store)
round(bigmemory::as.matrix(distance_store), 3)
The important practical details are:
- xpIndex must be integer-compatible
- xpDistance must be double-compatible
- both destination matrices must have dimensions n_query x k
- xpDistance can only be supplied when xpIndex is also supplied

Because the result matrices are file-backed, they can be reattached later in
the same way as any other bigmemory artifact.
index_store_again <- attach.big.matrix(file.path(workspace_dir, "nn_index.desc"))
distance_store_again <- attach.big.matrix(file.path(workspace_dir, "nn_distance.desc"))

bigmemory::as.matrix(index_store_again)
round(bigmemory::as.matrix(distance_store_again), 3)
That is useful in longer pipelines where one step performs ANN search and a later step consumes the neighbour graph or distance matrix.
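The type and shape requirements for streamed destinations can be checked before a long search starts. The sketch below is one possible pre-flight check built only on bigmemory. `check_destinations()` is a hypothetical helper, not a bigANNOY function, and it is deliberately stricter than "integer-compatible" and "double-compatible": it accepts only the exact types "integer" and "double".

```r
library(bigmemory)

# Hypothetical pre-flight check (not part of bigANNOY): confirm that the
# destination matrices have the expected storage types and are n_query x k.
check_destinations <- function(index_store, distance_store, n_query, k) {
  index_type <- describe(index_store)@description$type
  distance_type <- describe(distance_store)@description$type
  stopifnot(
    index_type == "integer",
    distance_type == "double",
    all(dim(index_store) == c(n_query, k)),
    all(dim(distance_store) == c(n_query, k))
  )
  invisible(TRUE)
}

# Illustrative in-memory destinations; file-backed ones are checked the same way.
n_query <- 2L
k <- 2L
index_store <- big.matrix(nrow = n_query, ncol = k, type = "integer")
distance_store <- big.matrix(nrow = n_query, ncol = k, type = "double")

ok <- check_destinations(index_store, distance_store, n_query, k)
```

A mismatch, such as passing a double matrix as the index destination, stops with an error before any search work is done.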
bigANNOY also supports separated-column big.matrix layouts. These are not
necessarily file-backed, but they are common in bigmemory workflows and are
worth knowing about because they use a different memory layout from the usual
contiguous matrix case.
query_sep <- big.matrix(
  nrow = nrow(query_dense),
  ncol = ncol(query_dense),
  type = "double",
  separated = TRUE
)
query_sep[, ] <- query_dense

sep_result <- annoy_search_bigmatrix(
  index,
  query = describe(query_sep),
  k = 2L,
  search_k = 100L
)

sep_result$index
round(sep_result$distance, 3)
For the same query values, the separated-column result should match the ordinary file-backed query result.
identical(sep_result$index, query_result_big$index)
all.equal(sep_result$distance, query_result_big$distance)
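To see the layout difference on its own, independent of any search, the following sketch builds one contiguous and one separated-column big.matrix from the same values and inspects them with bigmemory's `is.separated()`. The object names are illustrative.

```r
library(bigmemory)

values <- matrix(c(0.2, 0.1, 4.7, 5.1), ncol = 2, byrow = TRUE)

# Same values, two storage layouts.
contiguous <- big.matrix(nrow = 2, ncol = 2, type = "double")
sep_cols <- big.matrix(nrow = 2, ncol = 2, type = "double", separated = TRUE)
contiguous[, ] <- values
sep_cols[, ] <- values

# The layout flag differs between the two objects.
is.separated(contiguous)
is.separated(sep_cols)

# The stored values are the same regardless of layout.
same_values <- all(contiguous[, ] == sep_cols[, ])
same_values
```

Only the storage layout differs, which is why the search results above are expected to agree.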
Taken together, the main file-backed pattern looks like this:
- store the reference data in a file-backed big.matrix
- supply queries as a big.matrix, a descriptor object, or a descriptor path

This is often the most practical way to use bigANNOY in large-data settings,
because every major artifact in the workflow can be reopened later.
- Keep the .ann file together with its .meta sidecar file.
- Stream results into destination big.matrix objects when n_query x k is too large to hold comfortably in ordinary R matrices.

This vignette covered the main bigmemory persistence features in bigANNOY.
The natural next vignette after this one is Benchmarking Recall and Latency, which shows how to evaluate these workflows against runtime and quality targets.