```r
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
options(bigANNOY.progress = FALSE)
set.seed(20260326)
```
bigANNOY includes exported benchmark helpers so you can measure three related
things with the same interface:

- approximate Annoy build and search performance on a given dataset
- recall against an exact bigKNN baseline
- overhead relative to a plain RcppAnnoy implementation

This vignette shows how to use those helpers for both quick one-off runs and small parameter sweeps.
The package currently exports four benchmark functions:
- `benchmark_annoy_bigmatrix()` for one build-and-search configuration
- `benchmark_annoy_recall_suite()` for a grid of `n_trees` and `search_k` settings on the same dataset
- `benchmark_annoy_vs_rcppannoy()` for a direct comparison between the package's bigmemory workflow and a dense RcppAnnoy baseline
- `benchmark_annoy_volume_suite()` for scaling studies across larger synthetic data sizes

These helpers can work with:

- `big.matrix` inputs
- descriptors
- descriptor paths
- external pointers

They can also write summaries to CSV so results can be saved outside the current R session, and the comparison helpers add byte-oriented fields for the reference data, query data, Annoy index file, and total persisted artifacts.
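As a quick illustration of those input forms, here is a sketch using the bigmemory package directly. The claim that each of these forms can be passed as reference input comes from the list above; the variable names are arbitrary.

```r
library(bigmemory)

# An in-memory big.matrix built from a dense matrix.
bm <- as.big.matrix(matrix(rnorm(50 * 4), nrow = 50))

# Its descriptor object; for a file-backed matrix, a saved descriptor
# file path would play the same role.
desc <- describe(bm)

# The underlying external pointer slot.
ptr <- bm@address
```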
```r
library(bigANNOY)
```
We will write any temporary benchmark files into a dedicated directory so the workflow is easy to inspect.
```r
bench_dir <- tempfile("bigannoy-benchmark-")
dir.create(bench_dir, recursive = TRUE, showWarnings = FALSE)
bench_dir
```
The simplest benchmark call uses synthetic data. This is useful when you want
a quick sense of how build and search times respond to `n_trees`, `search_k`,
and the problem dimensions.
```r
single_csv <- file.path(bench_dir, "single.csv")
single <- benchmark_annoy_bigmatrix(
  n_ref = 200L,
  n_query = 20L,
  n_dim = 6L,
  k = 3L,
  n_trees = 10L,
  search_k = 50L,
  exact = FALSE,
  path_dir = bench_dir,
  output_path = single_csv,
  load_mode = "eager"
)
single$summary
```
The returned object contains more than just the summary row.
```r
names(single)
single$params
single$exact_available
```
Because `exact = FALSE`, the benchmark skips the exact bigKNN comparison and
focuses only on the approximate Annoy path.
The benchmark helpers also validate the built Annoy index before measuring the search step. That helps ensure the timing result corresponds to a usable, reopenable index rather than a partially successful build.
```r
single$validation$valid
single$validation$checks[, c("check", "passed", "severity")]
```
The same summary is also written to CSV when `output_path` is supplied.
```r
read.csv(single_csv, stringsAsFactors = FALSE)
```
One subtle but important detail is how synthetic data generation works:

- When `x = NULL` and `query` is omitted, the benchmark generates a separate synthetic query matrix.
- When `x = NULL` and `query = NULL` is supplied explicitly, the benchmark runs a self-search on the reference matrix.

That difference is reflected in the `self_search` and `n_query` fields.
```r
external_run <- benchmark_annoy_bigmatrix(
  n_ref = 120L,
  n_query = 12L,
  n_dim = 5L,
  k = 3L,
  n_trees = 8L,
  exact = FALSE,
  path_dir = bench_dir
)
self_run <- benchmark_annoy_bigmatrix(
  n_ref = 120L,
  query = NULL,
  n_dim = 5L,
  k = 3L,
  n_trees = 8L,
  exact = FALSE,
  path_dir = bench_dir
)
shape_cols <- c("self_search", "n_ref", "n_query", "k")
rbind(
  external = external_run[["summary"]][, shape_cols],
  self = self_run[["summary"]][, shape_cols]
)
```
That distinction matters when you are benchmarking workflows that mirror either training-set neighbour search or truly external query traffic.
For tuning work, a single benchmark point is usually not enough. The suite
helper runs a grid of `n_trees` and `search_k` values on the same dataset so
you can compare trade-offs more systematically.
```r
suite_csv <- file.path(bench_dir, "suite.csv")
suite <- benchmark_annoy_recall_suite(
  n_ref = 200L,
  n_query = 20L,
  n_dim = 6L,
  k = 3L,
  n_trees = c(5L, 10L),
  search_k = c(-1L, 50L),
  exact = FALSE,
  path_dir = bench_dir,
  output_path = suite_csv,
  load_mode = "eager"
)
suite$summary
```
Each row corresponds to one (`n_trees`, `search_k`) configuration on the same
underlying benchmark dataset.
The saved CSV contains the same summary table.
```r
read.csv(suite_csv, stringsAsFactors = FALSE)
```
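Once a suite has run, a common next step is to filter its summary for configurations that meet a recall target (recall columns are available when the exact baseline is enabled) and then pick the fastest. A minimal base-R sketch, using a mock data frame in place of `suite$summary`; the column names follow this vignette, and the values are invented for illustration.

```r
# Mock stand-in for suite$summary with invented timings and recalls.
mock <- data.frame(
  n_trees        = c(5L, 5L, 10L, 10L),
  search_k       = c(-1L, 50L, -1L, 50L),
  search_elapsed = c(0.010, 0.018, 0.012, 0.021),
  recall_at_k    = c(0.82, 0.91, 0.88, 0.95)
)

# Keep configurations meeting a recall target, then take the fastest one.
ok <- mock[mock$recall_at_k >= 0.90, ]
ok[which.min(ok$search_elapsed), c("n_trees", "search_k")]
```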
For Euclidean workloads, the benchmark helpers can optionally compare Annoy
results against the exact bigKNN baseline and report:

- `exact_elapsed`
- `recall_at_k`

That comparison is only available when the runtime package bigKNN is
installed.
```r
if (length(find.package("bigKNN", quiet = TRUE)) > 0L) {
  exact_run <- benchmark_annoy_bigmatrix(
    n_ref = 150L,
    n_query = 15L,
    n_dim = 5L,
    k = 3L,
    n_trees = 10L,
    search_k = 50L,
    metric = "euclidean",
    exact = TRUE,
    path_dir = bench_dir
  )
  exact_run$exact_available
  exact_run$summary[, c("build_elapsed", "search_elapsed", "exact_elapsed", "recall_at_k")]
} else {
  "Exact baseline example skipped because bigKNN is not installed."
}
```
This is the most direct way to answer the practical question, "How much search speed am I buying, and what recall do I lose in return?"
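The recall number itself has a simple interpretation: for each query, count how many approximate neighbours appear in the exact top-k set, then average across queries. The sketch below is an illustrative definition of that idea, not necessarily the package's internal implementation; the function name and matrix layout (one row of neighbour indices per query) are assumptions for this example.

```r
# Illustrative recall-at-k: average per-query overlap between approximate
# and exact neighbour index sets, each stored as one row per query.
recall_at_k <- function(approx_idx, exact_idx) {
  stopifnot(ncol(approx_idx) == ncol(exact_idx))
  k <- ncol(exact_idx)
  per_query <- vapply(
    seq_len(nrow(exact_idx)),
    function(i) length(intersect(approx_idx[i, ], exact_idx[i, ])) / k,
    numeric(1)
  )
  mean(per_query)
}

approx <- rbind(c(1, 2, 4), c(5, 6, 9))
exact  <- rbind(c(1, 2, 3), c(5, 6, 7))
recall_at_k(approx, exact)  # each row overlaps in 2 of 3, so 2/3
```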
Synthetic data is convenient, but real benchmarking usually needs real data. Both benchmark helpers can accept user-supplied reference and query inputs.
```r
ref <- matrix(rnorm(80 * 4), nrow = 80, ncol = 4)
query <- matrix(rnorm(12 * 4), nrow = 12, ncol = 4)
user_run <- benchmark_annoy_bigmatrix(
  x = ref,
  query = query,
  k = 3L,
  n_trees = 12L,
  search_k = 40L,
  exact = FALSE,
  filebacked = TRUE,
  path_dir = bench_dir,
  load_mode = "eager"
)
user_run$summary[, c(
  "filebacked", "self_search", "n_ref", "n_query", "n_dim",
  "build_elapsed", "search_elapsed"
)]
```
When `filebacked = TRUE`, dense reference inputs are first converted into a
file-backed `big.matrix` before the Annoy build starts. That can be useful
when you want the benchmark workflow to resemble the package's real persisted
data path more closely.
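For reference, that conversion can be approximated manually with bigmemory. The file names below are arbitrary choices for this sketch, and the helper's internal naming and layout may differ.

```r
library(bigmemory)

# Convert a dense matrix into a file-backed big.matrix; ref.bin and
# ref.desc are illustrative names, written under the benchmark directory.
fb <- as.big.matrix(
  matrix(rnorm(80 * 4), nrow = 80),
  backingfile    = "ref.bin",
  descriptorfile = "ref.desc",
  backingpath    = bench_dir
)
is.filebacked(fb)
```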
When you want to understand the cost of the bigmemory-oriented wrapper
itself, the most useful benchmark is not an exact Euclidean baseline. It is a
direct comparison with plain RcppAnnoy, using the same synthetic dataset, the
same metric, the same `n_trees`, and the same `search_k`.

That is what `benchmark_annoy_vs_rcppannoy()` provides.
```r
compare_csv <- file.path(bench_dir, "compare.csv")
compare_run <- benchmark_annoy_vs_rcppannoy(
  n_ref = 200L,
  n_query = 20L,
  n_dim = 6L,
  k = 3L,
  n_trees = 10L,
  search_k = 50L,
  exact = FALSE,
  path_dir = bench_dir,
  output_path = compare_csv,
  load_mode = "eager"
)
compare_run$summary[, c(
  "implementation", "reference_storage", "n_ref", "n_query", "n_dim",
  "total_data_bytes", "index_bytes", "build_elapsed", "search_elapsed"
)]
```
This benchmark is useful for a different question from the earlier exact baseline:

- `benchmark_annoy_bigmatrix()` asks how approximate Annoy behaves on a given dataset and, optionally, how much recall it loses against exact bigKNN.
- `benchmark_annoy_vs_rcppannoy()` asks how much overhead or benefit comes from the package's bigmemory and persistence workflow relative to direct RcppAnnoy.

The output also includes data-volume fields:

- `ref_bytes`: estimated bytes in the reference matrix
- `query_bytes`: estimated bytes in the query matrix
- `total_data_bytes`: reference plus effective query volume
- `index_bytes`: bytes in the saved Annoy index
- `metadata_bytes`: bytes in the sidecar metadata file
- `artifact_bytes`: persisted Annoy artifacts written by the workflow

The generated CSV contains the same comparison table.
```r
read.csv(compare_csv, stringsAsFactors = FALSE)[, c(
  "implementation", "ref_bytes", "query_bytes",
  "index_bytes", "metadata_bytes", "artifact_bytes"
)]
```
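As a rough sanity check on those byte fields: numeric (double) storage in R uses 8 bytes per element, so the reference-matrix volume should be close to `n_ref * n_dim * 8`. This is a back-of-envelope estimate, not necessarily the exact rule the package uses to compute `ref_bytes`.

```r
# Back-of-envelope data volume for an n_rows x n_cols double matrix.
estimate_matrix_bytes <- function(n_rows, n_cols, bytes_per_element = 8) {
  n_rows * n_cols * bytes_per_element
}

estimate_matrix_bytes(5000, 50)  # 2e6 bytes, roughly 1.9 MiB
```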
In practice, the comparison table helps answer two operational questions: whether bigANNOY is close enough to plain RcppAnnoy on build and search speed for this workload, and how much data and artifact volume each implementation leaves on disk.

A single comparison point is useful, but it does not tell you whether the
wrapper overhead stays modest as the problem gets larger. The volume suite runs
the same bigANNOY versus RcppAnnoy comparison across a grid of synthetic
data sizes.
```r
volume_csv <- file.path(bench_dir, "volume.csv")
volume_run <- benchmark_annoy_volume_suite(
  n_ref = c(200L, 500L),
  n_query = 20L,
  n_dim = c(6L, 12L),
  k = 3L,
  n_trees = 10L,
  search_k = 50L,
  exact = FALSE,
  path_dir = bench_dir,
  output_path = volume_csv,
  load_mode = "eager"
)
volume_run$summary[, c(
  "implementation", "n_ref", "n_dim", "total_data_bytes",
  "index_bytes", "build_elapsed", "search_elapsed"
)]
```
This kind of table is especially useful when you want to prepare a more formal benchmark note for a package release or for internal performance regression tracking, because it makes the bigANNOY versus direct RcppAnnoy gap visible across more than one benchmark point.

The most useful summary fields are:

- `build_elapsed`: time spent creating the Annoy index
- `search_elapsed`: time spent running the search step
- `exact_elapsed`: time spent on the exact Euclidean baseline, when available
- `recall_at_k`: average overlap with the exact top-k neighbours
- `implementation`: whether the row came from bigANNOY or direct RcppAnnoy
- `n_trees`: index quality/size control at build time
- `search_k`: query effort control at search time
- `self_search`: whether the benchmark searched the reference rows against themselves
- `filebacked`: whether dense reference data was converted into a file-backed `big.matrix`
- `ref_bytes`, `query_bytes`, and `index_bytes`: the rough data and artifact volume associated with the benchmark

In practice:

- Raise `search_k` first when recall is too low.
- Raise `n_trees` when higher search budgets alone are not enough.
- Track `search_elapsed` and `recall_at_k` together instead of optimizing either one in isolation.
- Use `benchmark_annoy_vs_rcppannoy()` when you want to reason about package overhead rather than approximate-versus-exact quality.
- Use `benchmark_annoy_volume_suite()` when you need a more formal scaling table for release notes or internal reports.

The package also installs a command-line benchmark script. That is convenient when you want to run a benchmark outside an interactive R session or save CSV output from shell scripts.
The installed path is:
```r
system.file("benchmarks", "benchmark_annoy.R", package = "bigANNOY")
```
Example single-run command:
```sh
Rscript "$(R -q -e 'cat(system.file("benchmarks", "benchmark_annoy.R", package = "bigANNOY"))')" \
  --mode=single \
  --n_ref=5000 \
  --n_query=500 \
  --n_dim=50 \
  --k=20 \
  --n_trees=100 \
  --search_k=5000 \
  --load_mode=eager
```
Example suite command:
```sh
Rscript "$(R -q -e 'cat(system.file("benchmarks", "benchmark_annoy.R", package = "bigANNOY"))')" \
  --mode=suite \
  --n_ref=5000 \
  --n_query=500 \
  --n_dim=50 \
  --k=20 \
  --suite_trees=10,50,100 \
  --suite_search_k=-1,2000,10000 \
  --output_path=/tmp/bigannoy_suite.csv
```
Example direct-comparison command:
```sh
Rscript "$(R -q -e 'cat(system.file("benchmarks", "benchmark_annoy.R", package = "bigANNOY"))')" \
  --mode=compare \
  --n_ref=5000 \
  --n_query=500 \
  --n_dim=50 \
  --k=20 \
  --n_trees=100 \
  --search_k=5000 \
  --load_mode=eager
```
Example volume-suite command:
```sh
Rscript "$(R -q -e 'cat(system.file("benchmarks", "benchmark_annoy.R", package = "bigANNOY"))')" \
  --mode=volume \
  --suite_n_ref=2000,5000,10000 \
  --suite_n_query=200 \
  --suite_n_dim=20,50 \
  --k=10 \
  --n_trees=50 \
  --search_k=1000 \
  --output_path=/tmp/bigannoy_volume.csv
```
A practical tuning workflow usually looks like this:

1. Run a small `n_trees` by `search_k` grid with `benchmark_annoy_recall_suite()`.
2. Enable the exact baseline where bigKNN is available, so each configuration gets a recall score.
3. Pick the cheapest configuration that meets your recall target.

bigANNOY's benchmark helpers are designed to make performance work part of
the normal package workflow, not a separate ad hoc script:

- `benchmark_annoy_bigmatrix()` for one configuration
- `benchmark_annoy_recall_suite()` for parameter sweeps
- `benchmark_annoy_vs_rcppannoy()` for direct implementation comparison
- `benchmark_annoy_volume_suite()` for speed and size scaling studies
- an optional exact recall baseline when bigKNN is installed

The next vignette to read after this one is usually Metrics and Tuning, which goes deeper on how to choose metrics and search/build controls.