knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
options(bigANNOY.progress = FALSE)
set.seed(20260326)
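bigANNOY exposes two kinds of choices that matter in practice: which distance metric to use, and how to set the build-time and search-time controls that trade quality against cost.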
This vignette walks through both with small concrete examples and then ends with a lightweight tuning workflow you can reuse on your own data.
library(bigANNOY)
library(bigmemory)
To make metric behavior easier to see, we will use a tiny reference set with a few deliberately different vector directions and magnitudes.
tune_dir <- tempfile("bigannoy-tuning-")
dir.create(tune_dir, recursive = TRUE, showWarnings = FALSE)

ref_labels <- c(
  "unit_x", "double_x", "unit_y",
  "tilted_x", "unit_z", "diag_xy"
)

ref_dense <- matrix(
  c(
    1.0, 0.0, 0.0,
    2.0, 0.0, 0.0,
    0.0, 1.0, 0.0,
    0.8, 0.2, 0.0,
    0.0, 0.0, 1.0,
    1.0, 1.0, 0.0
  ),
  ncol = 3,
  byrow = TRUE
)

query_dense <- matrix(
  c(
    1.0, 0.0, 0.0,
    0.9, 0.1, 0.0
  ),
  ncol = 3,
  byrow = TRUE
)

ref_big <- as.big.matrix(ref_dense)

data.frame(
  index = seq_along(ref_labels),
  label = ref_labels,
  ref_dense,
  row.names = NULL
)
bigANNOY currently supports:

- "euclidean"
- "angular"
- "manhattan"
- "dot"

The most important rule of thumb is that distances are only directly comparable within the same metric. A Euclidean distance and an angular distance are not on the same scale and should not be interpreted as if they meant the same thing.
Here is the same search performed under all four metrics.
metric_table <- do.call(
  rbind,
  lapply(c("euclidean", "angular", "manhattan", "dot"), function(metric) {
    index_path <- file.path(tune_dir, sprintf("%s.ann", metric))

    idx <- annoy_build_bigmatrix(
      ref_big,
      path = index_path,
      metric = metric,
      n_trees = 20L,
      seed = 123L,
      load_mode = "eager"
    )

    res <- annoy_search_bigmatrix(
      idx,
      query = query_dense,
      k = 2L,
      search_k = 100L
    )

    data.frame(
      metric = metric,
      q1_top1 = ref_labels[res$index[1, 1]],
      q1_distance = round(res$distance[1, 1], 3),
      q2_top1 = ref_labels[res$index[2, 1]],
      q2_distance = round(res$distance[2, 1], 3),
      stringsAsFactors = FALSE
    )
  })
)

metric_table
Even on this toy example, the metric choice changes how rows are ranked.
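On data this small, the rankings can be cross-checked with an exact brute-force search in base R. The sketch below is not part of bigANNOY: exact_top_k() is a hypothetical helper, and it assumes Annoy's documented convention that angular distance is sqrt(2 * (1 - cosine)) and that "dot" ranks rows by largest inner product.

```r
# Toy reference set and first query, copied from the example above.
ref <- matrix(
  c(1.0, 0.0, 0.0,
    2.0, 0.0, 0.0,
    0.0, 1.0, 0.0,
    0.8, 0.2, 0.0,
    0.0, 0.0, 1.0,
    1.0, 1.0, 0.0),
  ncol = 3, byrow = TRUE,
  dimnames = list(
    c("unit_x", "double_x", "unit_y", "tilted_x", "unit_z", "diag_xy"),
    NULL
  )
)
q <- c(1.0, 0.0, 0.0)

# Hypothetical helper: exact top-k labels for one query under one metric.
exact_top_k <- function(ref, q, k,
                        metric = c("euclidean", "manhattan", "angular", "dot")) {
  metric <- match.arg(metric)
  diffs <- sweep(ref, 2, q)          # subtract q from every row
  inner <- as.vector(ref %*% q)      # inner product of each row with q
  score <- switch(
    metric,
    euclidean = sqrt(rowSums(diffs^2)),
    manhattan = rowSums(abs(diffs)),
    angular = {
      cosine <- inner / (sqrt(rowSums(ref^2)) * sqrt(sum(q^2)))
      sqrt(pmax(0, 2 * (1 - cosine)))
    },
    dot = -inner                     # larger inner product ranks first
  )
  # Smaller score ranks first; ties are broken by row order.
  rownames(ref)[order(score)[seq_len(k)]]
}

top2 <- sapply(
  c("euclidean", "manhattan", "angular", "dot"),
  function(m) exact_top_k(ref, q, k = 2L, metric = m)
)
top2
```

Under "euclidean" and "manhattan" the nearest row is unit_x, while under "dot" double_x ranks first because of its larger magnitude.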
The practical interpretation is:

- use "euclidean" when straight-line distance in the original space is what you care about, and especially when you want the most direct comparison with bigKNN
- use "angular" when vector direction matters more than magnitude
- use "manhattan" when coordinatewise absolute deviations are a more natural notion of difference than Euclidean distance
- use "dot" when inner-product style ranking is closer to the scoring rule you want

For non-Euclidean metrics, treat the returned distance matrix as the Annoy-backend distance for that metric rather than as something you can compare directly to Euclidean values.
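The difference in scale is easy to see by computing each quantity exactly for one query. This is a base R sketch, not bigANNOY output: it assumes Annoy's documented angular convention, sqrt(2 * (1 - cosine)), and it shows the raw inner product for "dot", whereas the value the backend reports for "dot" may follow a different sign or scaling convention.

```r
# Toy reference set and first query, copied from the example above.
ref <- matrix(
  c(1.0, 0.0, 0.0,
    2.0, 0.0, 0.0,
    0.0, 1.0, 0.0,
    0.8, 0.2, 0.0,
    0.0, 0.0, 1.0,
    1.0, 1.0, 0.0),
  ncol = 3, byrow = TRUE,
  dimnames = list(
    c("unit_x", "double_x", "unit_y", "tilted_x", "unit_z", "diag_xy"),
    NULL
  )
)
q <- c(1.0, 0.0, 0.0)

diffs <- sweep(ref, 2, q)        # row-wise difference from the query
inner <- as.vector(ref %*% q)    # raw inner products
cosine <- inner / (sqrt(rowSums(ref^2)) * sqrt(sum(q^2)))

exact_scores <- data.frame(
  label = rownames(ref),
  euclidean = sqrt(rowSums(diffs^2)),
  manhattan = rowSums(abs(diffs)),
  angular = sqrt(pmax(0, 2 * (1 - cosine))),  # assumed Annoy convention
  dot = inner,
  row.names = NULL
)

print(exact_scores, digits = 3)
```

For this query, unit_x and double_x are both at angular distance 0, yet they sit at Euclidean distances 0 and 1, and double_x has the largest inner product. The columns answer different questions, so their numbers should not be compared across metrics.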
The most important build-time controls are:
- n_trees
- seed
- build_threads
- block_size
- load_mode

n_trees is the main quality-versus-build-cost knob at index build time.
seed makes index construction reproducible. This is especially useful when
you are benchmarking different settings and want to reduce one source of
variation between runs.
build_threads is passed to the native C++ backend.
- -1L means "use Annoy's default"

block_size controls how many rows are processed per streamed block while
building and searching. This is mostly an execution-behavior knob, not a
quality knob.
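The general idea of blocked processing can be sketched in base R. This is an illustration only and does not reproduce bigANNOY's internals; block_ranges() is a hypothetical helper. It shows why the block size changes how the work is chunked but not the result.

```r
# Hypothetical helper: split row indices 1..n_rows into consecutive blocks.
block_ranges <- function(n_rows, block_size) {
  starts <- seq.int(1L, n_rows, by = block_size)
  ends <- pmin(starts + block_size - 1L, n_rows)
  data.frame(start = starts, end = ends)
}

blocks <- block_ranges(n_rows = 10L, block_size = 4L)
blocks

# A per-row computation gives the same answer whether rows are handled
# all at once or one block at a time.
x <- matrix(seq_len(30), nrow = 10)
all_at_once <- rowSums(x)
by_block <- unlist(lapply(seq_len(nrow(blocks)), function(i) {
  rows <- blocks$start[i]:blocks$end[i]
  rowSums(x[rows, , drop = FALSE])
}))

identical(all_at_once, by_block)
```

Smaller blocks typically mean less data held in memory at once and more loop iterations; larger blocks mean the opposite.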
load_mode controls session behavior, not search quality:

- "lazy" delays opening the native handle until first search
- "eager" opens the handle immediately

Here is a simple side-by-side example.
lazy_index <- annoy_build_bigmatrix(
  ref_big,
  path = file.path(tune_dir, "lazy.ann"),
  metric = "euclidean",
  n_trees = 8L,
  seed = 123L,
  load_mode = "lazy"
)

eager_index <- annoy_build_bigmatrix(
  ref_big,
  path = file.path(tune_dir, "eager.ann"),
  metric = "euclidean",
  n_trees = 25L,
  seed = 123L,
  load_mode = "eager"
)

c(
  lazy_loaded = annoy_is_loaded(lazy_index),
  eager_loaded = annoy_is_loaded(eager_index)
)
The most important search-time controls are:
- k
- search_k
- block_size
- prefault

k is simply the number of neighbours you want returned. It changes the shape
of the result and the amount of work the search must do.
search_k is the main quality-versus-search-cost knob at query time.
- -1L lets Annoy use its default search budget

When you start tuning, this is usually the first knob to increase.
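Quality in this setting is commonly measured as recall against exact neighbours. The sketch below shows one common definition of recall at k in base R. The helper name and the index matrices are hypothetical (one row per query, one column per returned neighbour), and the benchmark helpers in bigANNOY may define their own recall column differently.

```r
# Hypothetical helper: fraction of exact neighbours that the approximate
# search also returned, pooled over all queries.
recall_at_k <- function(approx_index, exact_index) {
  stopifnot(identical(dim(approx_index), dim(exact_index)))
  hits <- vapply(seq_len(nrow(exact_index)), function(i) {
    length(intersect(approx_index[i, ], exact_index[i, ]))
  }, integer(1))
  sum(hits) / length(exact_index)
}

# Illustrative index matrices for two queries with k = 2.
exact_index <- matrix(c(1L, 4L,
                        4L, 1L), ncol = 2, byrow = TRUE)
approx_index <- matrix(c(1L, 4L,
                         4L, 2L), ncol = 2, byrow = TRUE)

recall_at_k(approx_index, exact_index)  # 3 of 4 exact neighbours found: 0.75
```

If recall measured this way is too low for a workload, raising search_k is the natural first experiment.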
At search time, block_size controls how many query rows are processed per
block. As with build-time blocking, this affects execution behavior more than
quality.
prefault controls how the persisted Annoy index is loaded by the native
backend. It can be useful for repeated search workloads on some platforms, but
it is not guaranteed to have the same effect everywhere.
reopened <- annoy_open_index(
  eager_index$path,
  prefault = TRUE,
  load_mode = "eager"
)

result <- annoy_search_bigmatrix(
  reopened,
  query = query_dense,
  k = 2L,
  search_k = 100L,
  prefault = TRUE
)
Because prefault depends on platform and OS support, it is best treated as a
workload-specific optimization rather than as a universal default.
Once you know which metric is appropriate, the next question is usually how far
to push n_trees and search_k.
The benchmark helpers are the easiest way to study that trade-off.
has_bigknn <- length(find.package("bigKNN", quiet = TRUE)) > 0L

# Request the exact comparison only when bigKNN is installed.
tuning_suite <- benchmark_annoy_recall_suite(
  n_ref = 200L,
  n_query = 20L,
  n_dim = 6L,
  k = 3L,
  n_trees = c(5L, 20L),
  search_k = c(-1L, 50L, 200L),
  metric = "euclidean",
  exact = has_bigknn,
  path_dir = tune_dir
)

# recall_at_k is only shown when the exact comparison was run.
summary_cols <- c("n_trees", "search_k", "build_elapsed", "search_elapsed")
if (has_bigknn) {
  summary_cols <- c(summary_cols, "recall_at_k")
}

tuning_suite$summary[, summary_cols]
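That table is the practical center of most tuning work: it puts build time and search time side by side for each n_trees and search_k combination, along with recall_at_k when the exact comparison is available.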
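One way to read such a table is to pick the fastest search setting that still meets a recall target. The sketch below assumes a summary with the columns shown above, including recall_at_k. pick_setting() is a hypothetical helper, and the numbers in toy_summary are placeholders for illustration, not measurements.

```r
# Hypothetical helper: fastest-searching row whose recall meets the target.
pick_setting <- function(summary, min_recall) {
  ok <- summary[summary$recall_at_k >= min_recall, , drop = FALSE]
  if (nrow(ok) == 0L) {
    return(NULL)  # no setting reaches the target
  }
  ok[which.min(ok$search_elapsed), , drop = FALSE]
}

# Placeholder values only; substitute tuning_suite$summary from a real run.
toy_summary <- data.frame(
  n_trees = c(5L, 5L, 20L, 20L),
  search_k = c(50L, 200L, 50L, 200L),
  search_elapsed = c(0.010, 0.020, 0.012, 0.025),
  recall_at_k = c(0.80, 0.93, 0.90, 0.99)
)

chosen <- pick_setting(toy_summary, min_recall = 0.9)
chosen
```

The same rule could rank on build_elapsed instead when index construction, rather than query latency, is the bottleneck.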
bigANNOY also exposes a few package options that are useful in repeated
tuning sessions.
list(
  block_size_default = getOption("bigANNOY.block_size", 1024L),
  progress_default = getOption("bigANNOY.progress", FALSE),
  backend_default = getOption("bigANNOY.backend", "cpp")
)
In practice:
- use options(bigANNOY.block_size = ...) when you want a session-wide block size default
- use options(bigANNOY.progress = TRUE) when you want progress messages during long runs

A useful workflow is:
1. start with a baseline n_trees and a modest search_k
2. benchmark a small grid of n_trees by search_k
3. increase search_k first if quality is too low
4. adjust block_size, load_mode, and prefault only after the main quality-versus-latency trade-off is understood

The most important ideas in bigANNOY tuning are:
- n_trees mostly controls build-time quality investment
- search_k mostly controls query-time quality investment
- block_size, load_mode, and prefault mostly affect execution behavior rather than neighbour semantics
- "euclidean" is the metric that compares most directly with bigKNN

The next vignette after this one is usually Validation and Sharing Indexes, which focuses on sidecar metadata, persisted files, and safe reuse across sessions.