Metrics and Tuning

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

options(bigANNOY.progress = FALSE)
set.seed(20260326)

bigANNOY exposes two kinds of choices that matter in practice:

  1. which distance metric the index is built with
  2. which build-time and query-time parameters trade result quality against compute cost

This vignette walks through both with small concrete examples and then ends with a lightweight tuning workflow you can reuse on your own data.

Load the Packages

library(bigANNOY)
library(bigmemory)

A Small Dataset for Metric Comparisons

To make metric behavior easier to see, we will use a tiny reference set with a few deliberately different vector directions and magnitudes.

tune_dir <- tempfile("bigannoy-tuning-")
dir.create(tune_dir, recursive = TRUE, showWarnings = FALSE)

ref_labels <- c(
  "unit_x",
  "double_x",
  "unit_y",
  "tilted_x",
  "unit_z",
  "diag_xy"
)

ref_dense <- matrix(
  c(
    1.0, 0.0, 0.0,
    2.0, 0.0, 0.0,
    0.0, 1.0, 0.0,
    0.8, 0.2, 0.0,
    0.0, 0.0, 1.0,
    1.0, 1.0, 0.0
  ),
  ncol = 3,
  byrow = TRUE
)

query_dense <- matrix(
  c(
    1.0, 0.0, 0.0,
    0.9, 0.1, 0.0
  ),
  ncol = 3,
  byrow = TRUE
)

ref_big <- as.big.matrix(ref_dense)

data.frame(
  index = seq_along(ref_labels),
  label = ref_labels,
  ref_dense,
  row.names = NULL
)

Supported Metrics

bigANNOY currently supports:

  1. euclidean: straight-line (L2) distance, sensitive to vector magnitude
  2. angular: Annoy's cosine-based distance, which compares direction and ignores magnitude
  3. manhattan: L1 (city-block) distance
  4. dot: an inner-product score, where larger magnitudes can rank higher

The most important rule of thumb is that distances are only directly comparable within the same metric. A Euclidean distance and an angular distance are not on the same scale and should not be interpreted as if they meant the same thing.
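To see why the scales differ, compare unit_x and double_x by hand. Upstream Annoy defines its angular distance as sqrt(2 * (1 - cosine similarity)); assuming bigANNOY inherits that convention, a few lines of base R are enough:

```r
# Base-R sketch (no bigANNOY needed): one pair of vectors, two metrics.
unit_x   <- c(1, 0, 0)
double_x <- c(2, 0, 0)

euclidean <- sqrt(sum((unit_x - double_x)^2))

cosine  <- sum(unit_x * double_x) /
  (sqrt(sum(unit_x^2)) * sqrt(sum(double_x^2)))
# Annoy's angular distance, assuming the upstream sqrt(2 * (1 - cos)) form.
angular <- sqrt(2 * (1 - cosine))

unname(round(c(euclidean, angular), 3))
#> [1] 1 0
```

The two rows point the same way, so angular calls them identical (distance 0), while euclidean separates them by a full unit.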

Compare Metrics on the Same Queries

Here is the same search performed under all four metrics.

metric_table <- do.call(
  rbind,
  lapply(c("euclidean", "angular", "manhattan", "dot"), function(metric) {
    index_path <- file.path(tune_dir, sprintf("%s.ann", metric))

    idx <- annoy_build_bigmatrix(
      ref_big,
      path = index_path,
      metric = metric,
      n_trees = 20L,
      seed = 123L,
      load_mode = "eager"
    )

    res <- annoy_search_bigmatrix(
      idx,
      query = query_dense,
      k = 2L,
      search_k = 100L
    )

    data.frame(
      metric = metric,
      q1_top1 = ref_labels[res$index[1, 1]],
      q1_distance = round(res$distance[1, 1], 3),
      q2_top1 = ref_labels[res$index[2, 1]],
      q2_distance = round(res$distance[2, 1], 3),
      stringsAsFactors = FALSE
    )
  })
)

metric_table

Even in this toy example, the choice of metric changes how rows are ranked.

The practical interpretation is:

  1. euclidean and manhattan reward closeness in absolute position, so magnitude matters
  2. angular rewards similarity of direction, so unit_x and double_x are equally good matches for the first query
  3. dot can favour rows with larger magnitudes even when their direction matches less well

For non-Euclidean metrics, treat the returned distance matrix as the Annoy-backend distance for that metric rather than as something you can compare directly to Euclidean values.

Build-Time Controls

The most important build-time controls are n_trees, seed, build_threads, block_size, and load_mode, each covered below.

n_trees

n_trees is the main quality-versus-build-cost knob at index build time. More trees generally improve search quality and enlarge the index file, at the price of a longer build.

seed

seed makes index construction reproducible. This is especially useful when you are benchmarking different settings and want to reduce one source of variation between runs.

build_threads

build_threads is passed through to the native C++ backend, where it bounds how many threads tree construction may use.

block_size

block_size controls how many rows are processed per streamed block while building and searching. This is mostly an execution-behavior knob, not a quality knob.
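A quick back-of-the-envelope helper (plain base R, not a package API) shows how the block count, and with it progress granularity and peak per-block memory, scales with block_size:

```r
# Number of streamed blocks implied by a row count and block size.
n_blocks <- function(n_rows, block_size) ceiling(n_rows / block_size)

n_blocks(1e6, 1024L)
#> [1] 977
n_blocks(1e6, 8192L)
#> [1] 123
```

Larger blocks mean fewer, bigger chunks of work; neither choice changes which neighbours are returned.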

load_mode

load_mode controls session behavior, not search quality:

  1. eager loads the persisted index into the session as soon as the build or open call returns
  2. lazy defers loading until the index is first needed for a search

Here is a simple side-by-side example.

lazy_index <- annoy_build_bigmatrix(
  ref_big,
  path = file.path(tune_dir, "lazy.ann"),
  metric = "euclidean",
  n_trees = 8L,
  seed = 123L,
  load_mode = "lazy"
)

eager_index <- annoy_build_bigmatrix(
  ref_big,
  path = file.path(tune_dir, "eager.ann"),
  metric = "euclidean",
  n_trees = 25L,
  seed = 123L,
  load_mode = "eager"
)

c(
  lazy_loaded = annoy_is_loaded(lazy_index),
  eager_loaded = annoy_is_loaded(eager_index)
)

Query-Time Controls

The most important search-time controls are k, search_k, block_size, and prefault.

k

k is simply the number of neighbours you want returned. It changes the shape of the result and the amount of work the search must do.

search_k

search_k is the main quality-versus-search-cost knob at query time. Larger values let the backend inspect more candidate nodes, which usually raises recall at the cost of latency; a value of -1 asks the backend for its default budget.

When you start tuning, this is usually the first knob to increase.
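Upstream Annoy documents -1 as a sentinel meaning roughly n_trees * k nodes inspected; assuming bigANNOY forwards that convention, a tiny helper (not part of the package) makes the budget explicit:

```r
# Effective node-inspection budget, assuming Annoy's documented
# n_trees * k default when search_k is -1.
effective_search_k <- function(search_k, n_trees, k) {
  if (search_k < 0) n_trees * k else search_k
}

effective_search_k(-1L, n_trees = 20L, k = 3L)
#> [1] 60
effective_search_k(500L, n_trees = 20L, k = 3L)
#> [1] 500
```

This is also why raising k alone already increases work, even before you touch search_k.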

block_size

At search time, block_size controls how many query rows are processed per block. As with build-time blocking, this affects execution behavior more than quality.

prefault

prefault controls how the persisted Annoy index is loaded by the native backend. It can be useful for repeated search workloads on some platforms, but it is not guaranteed to have the same effect everywhere.

reopened <- annoy_open_index(
  eager_index$path,
  prefault = TRUE,
  load_mode = "eager"
)

result <- annoy_search_bigmatrix(
  reopened,
  query = query_dense,
  k = 2L,
  search_k = 100L,
  prefault = TRUE
)

Because prefault depends on platform and OS support, it is best treated as a workload-specific optimization rather than as a universal default.

Use the Benchmark Helpers to Tune n_trees and search_k

Once you know which metric is appropriate, the next question is usually how far to push n_trees and search_k.

The benchmark helpers are the easiest way to study that trade-off.

if (length(find.package("bigKNN", quiet = TRUE)) > 0L) {
  tuning_suite <- benchmark_annoy_recall_suite(
    n_ref = 200L,
    n_query = 20L,
    n_dim = 6L,
    k = 3L,
    n_trees = c(5L, 20L),
    search_k = c(-1L, 50L, 200L),
    metric = "euclidean",
    exact = TRUE,
    path_dir = tune_dir
  )

  tuning_suite$summary[, c(
    "n_trees",
    "search_k",
    "build_elapsed",
    "search_elapsed",
    "recall_at_k"
  )]
} else {
  tuning_suite <- benchmark_annoy_recall_suite(
    n_ref = 200L,
    n_query = 20L,
    n_dim = 6L,
    k = 3L,
    n_trees = c(5L, 20L),
    search_k = c(-1L, 50L, 200L),
    metric = "euclidean",
    exact = FALSE,
    path_dir = tune_dir
  )

  tuning_suite$summary[, c(
    "n_trees",
    "search_k",
    "build_elapsed",
    "search_elapsed"
  )]
}

That table is the practical center of most tuning work:

  1. recall_at_k (available when an exact baseline can be computed) tells you how much quality a setting buys
  2. build_elapsed and search_elapsed tell you what that quality costs
  3. the goal is the smallest n_trees and search_k that still meet your quality target

Package-Level Defaults

bigANNOY also exposes a few package options that are useful in repeated tuning sessions.

list(
  block_size_default = getOption("bigANNOY.block_size", 1024L),
  progress_default = getOption("bigANNOY.progress", FALSE),
  backend_default = getOption("bigANNOY.backend", "cpp")
)
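If you are about to run many builds and searches, you can set these once per session instead of passing arguments everywhere; a sketch, assuming the package reads the option names queried above:

```r
# Session-wide defaults for a tuning session. The option names match the
# getOption() lookups above; whether every call consults them is an
# assumption to verify against the package docs.
options(
  bigANNOY.block_size = 4096L,
  bigANNOY.progress   = FALSE
)

getOption("bigANNOY.block_size")
#> [1] 4096
```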

In practice:

  1. bigANNOY.block_size supplies the default streaming block size when you do not pass one explicitly
  2. bigANNOY.progress toggles progress reporting, which this vignette disables at the top
  3. bigANNOY.backend selects the compute backend, with "cpp" as the default

A Practical Tuning Pattern

A useful workflow is:

  1. choose the metric that best matches the meaning of similarity in your data
  2. start with moderate n_trees and a modest search_k
  3. benchmark a small grid of n_trees by search_k
  4. increase search_k first if quality is too low
  5. rebuild with more trees when higher search budgets alone are not enough
  6. revisit block_size, load_mode, and prefault only after the main quality-versus-latency trade-off is understood
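Step 3 above can be sketched with a plain base-R grid; the crossing itself is ordinary expand.grid, and benchmark_annoy_recall_suite() (shown earlier) accepts the underlying n_trees and search_k vectors directly:

```r
# Step 3: enumerate a small n_trees-by-search_k grid to benchmark.
grid <- expand.grid(
  n_trees  = c(10L, 50L),
  search_k = c(-1L, 200L, 1000L)
)

nrow(grid)
#> [1] 6
```

Six build/search combinations is usually enough for a first pass; widen the grid only around the best-performing corner.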

Recap

The most important ideas in bigANNOY tuning are:

  1. distances are only comparable within a single metric, so pick the metric first
  2. n_trees trades quality against build cost; search_k trades quality against search cost
  3. block_size, load_mode, and prefault shape execution behavior, not result quality
  4. a small benchmark grid over n_trees and search_k is the quickest route to a good operating point

The next vignette after this one is usually Validation and Sharing Indexes, which focuses on sidecar metadata, persisted files, and safe reuse across sessions.




bigANNOY documentation built on April 1, 2026, 9:07 a.m.