bigANNOY Versus bigKNN

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

options(bigANNOY.progress = FALSE)
set.seed(20260326)

bigANNOY and bigKNN are meant to complement each other, not compete for the same role. bigKNN provides exact nearest-neighbour search, while bigANNOY provides fast approximate search backed by a persisted Annoy index.

That makes them a natural pair: one establishes the correct answer, and the other trades a small, measurable amount of accuracy for speed.

This vignette explains how to think about that split and how to compare the two packages in practice.

The Core Difference

At a high level, the packages answer slightly different questions.

bigKNN asks: what are the exact k nearest neighbours of each query point?

bigANNOY asks: what are approximately the k nearest neighbours, and how quickly can they be found?

That distinction has consequences: exact search guarantees correctness but must consider the full reference set for every query, while approximate search trades a controllable amount of recall for lower query latency and a reusable on-disk index.
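To make the distinction concrete, exact Euclidean k-nearest-neighbour search can be written in a few lines of base R. This brute-force sketch is purely illustrative (it is not how bigKNN is implemented), but it shows exactly what answer approximate search is trying to recover quickly:

```r
# Brute-force exact kNN: compute every query-to-reference distance,
# then keep the k smallest per query. Illustrative only.
exact_knn <- function(ref, query, k) {
  idx <- matrix(0L, nrow(query), k)
  dst <- matrix(0,  nrow(query), k)
  for (i in seq_len(nrow(query))) {
    d <- sqrt(colSums((t(ref) - query[i, ])^2))  # distances to all references
    ord <- order(d)[seq_len(k)]                  # k smallest, ascending
    idx[i, ] <- ord
    dst[i, ] <- d[ord]
  }
  list(index = idx, distance = dst)
}

ref   <- matrix(rnorm(50 * 4), 50, 4)
query <- matrix(rnorm(3 * 4), 3, 4)
res   <- exact_knn(ref, query, k = 5L)
res$index
```

The cost of this guarantee is the full distance scan per query, which is precisely what an Annoy index avoids.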

When To Use Which Package

Use bigKNN when correctness matters most: you need exact neighbours, you are establishing a benchmark baseline, or the data is small enough that exact search is already fast.

Use bigANNOY when latency or scale is the constraint: you can accept approximate neighbours, you want to build an index once and reuse it across sessions, or query volume makes exact search too slow.

In other words: bigKNN answers "what is correct?", and bigANNOY answers "what is fast enough?".

Shared Result Shape

One of the most useful design choices in bigANNOY is that its result object is intentionally aligned with bigKNN.

The returned components are conceptually parallel: both results carry an index matrix with one row per query, a distance matrix of matching dimensions, an exact flag, and a backend label.

For bigANNOY, exact = FALSE and backend = "annoy"; for bigKNN, exact is TRUE and backend identifies the exact search path.

That shared shape matters because it keeps downstream workflows simple: the same inspection, plotting, or recall code can consume either result without branching on which package produced it.
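As a small illustration of what that alignment enables, here is a generic summariser that relies only on the shared components. The summarise_result helper and the stand-in result below are hypothetical, written for this sketch rather than taken from either package:

```r
# A generic summariser that relies only on the shared result shape:
# $index, $distance, $exact, $backend. Either package's result would do.
summarise_result <- function(res) {
  data.frame(
    backend   = res$backend,
    exact     = res$exact,
    n_queries = nrow(res$index),
    k         = ncol(res$index),
    mean_dist = mean(res$distance)
  )
}

# A stand-in result with the shared shape (not real package output):
fake <- list(
  index    = matrix(1:10, nrow = 2),
  distance = matrix(runif(10), nrow = 2),
  exact    = FALSE,
  backend  = "annoy"
)
summarise_result(fake)
```

Code written this way does not care whether the neighbours came from an exact scan or an Annoy index.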

Load the Packages You Need

This vignette always uses bigANNOY. The bigKNN parts are optional and only run when bigKNN is installed.

library(bigANNOY)
library(bigmemory)

A Small Comparison Dataset

We will create a small reference matrix and a separate query matrix. This is large enough to show the workflow clearly without making the vignette slow.

compare_dir <- tempfile("bigannoy-vs-bigknn-")
dir.create(compare_dir, recursive = TRUE, showWarnings = FALSE)

ref_dense <- matrix(rnorm(120 * 6), nrow = 120, ncol = 6)
query_dense <- matrix(rnorm(15 * 6), nrow = 15, ncol = 6)

ref_big <- as.big.matrix(ref_dense)
dim(ref_big)
dim(query_dense)

Approximate Search with bigANNOY

bigANNOY first builds an Annoy index and then searches that persisted index.

annoy_index <- annoy_build_bigmatrix(
  ref_big,
  path = file.path(compare_dir, "ref.ann"),
  metric = "euclidean",
  n_trees = 20L,
  seed = 123L,
  load_mode = "eager"
)

approx_result <- annoy_search_bigmatrix(
  annoy_index,
  query = query_dense,
  k = 5L,
  search_k = 100L
)

names(approx_result)
approx_result$exact
approx_result$backend
approx_result$index[1:3, ]
round(approx_result$distance[1:3, ], 3)

This is the standard approximate Euclidean workflow in bigANNOY.

Exact Search with bigKNN When Available

If bigKNN is installed, the exact Euclidean comparison is straightforward because the result structure is deliberately similar.

if (length(find.package("bigKNN", quiet = TRUE)) > 0L) {
  knn_bigmatrix <- get("knn_bigmatrix", envir = asNamespace("bigKNN"))

  exact_result <- knn_bigmatrix(
    ref_big,
    query = query_dense,
    k = 5L,
    metric = "euclidean",
    block_size = 64L,
    exclude_self = FALSE
  )

  list(
    names = names(exact_result),
    exact = exact_result$exact,
    backend = exact_result$backend,
    index_head = exact_result$index[1:3, ],
    distance_head = round(exact_result$distance[1:3, ], 3)
  )
} else {
  "bigKNN is not installed in this session, so the exact comparison example is skipped."
}

The exact result uses the same high-level structure, but now exact is expected to be TRUE and the backend identifies the exact search path.

What Does "Aligned Result Shape" Buy You?

The aligned result shape means you can compare exact and approximate neighbour sets directly when metric = "euclidean" and both were run with the same k.

When bigKNN is available, a simple overlap-style recall comparison looks like this:

if (length(find.package("bigKNN", quiet = TRUE)) > 0L) {
  knn_bigmatrix <- get("knn_bigmatrix", envir = asNamespace("bigKNN"))

  exact_result <- knn_bigmatrix(
    ref_big,
    query = query_dense,
    k = 5L,
    metric = "euclidean",
    block_size = 64L,
    exclude_self = FALSE
  )

  recall_at_5 <- mean(vapply(seq_len(nrow(query_dense)), function(i) {
    length(intersect(approx_result$index[i, ], exact_result$index[i, ])) / 5
  }, numeric(1L)))

  recall_at_5
} else {
  "Recall example skipped because bigKNN is not installed."
}

That is the core evaluation pattern: run both searches with the same metric and the same k, intersect each query's neighbour indices, and average the per-query overlap to get recall@k.
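The overlap computation above can be factored into a small base-R helper that works for any pair of index matrices with one row per query. This generalises the recall@5 snippet to arbitrary k; recall_at_k is written for this vignette and is not a function exported by either package:

```r
# recall@k between an approximate and an exact index matrix:
# the mean fraction of exact neighbours recovered per query row.
recall_at_k <- function(approx_idx, exact_idx) {
  stopifnot(nrow(approx_idx) == nrow(exact_idx))
  k <- ncol(exact_idx)
  per_query <- vapply(seq_len(nrow(exact_idx)), function(i) {
    length(intersect(approx_idx[i, ], exact_idx[i, ])) / k
  }, numeric(1L))
  mean(per_query)
}

# Sanity checks on toy index matrices:
a <- matrix(1:10, nrow = 2, byrow = TRUE)
recall_at_k(a, a)         # identical neighbour sets -> 1
recall_at_k(a, a + 100L)  # disjoint neighbour sets  -> 0
```

Because it only touches the index component, the helper is indifferent to which backend produced each result.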

Why bigANNOY Still Matters When bigKNN Exists

If exact search exists, why use approximate search at all?

Because operationally, the best answer is not always the exact answer.

bigANNOY adds capabilities that solve a different problem: a persisted on-disk index that is built once and reused across sessions, lower query latency at scale, and an explicit speed/accuracy trade-off tuned through n_trees and search_k.

So the two packages fit a common progression:

  1. use bigKNN to establish correctness and a benchmark baseline
  2. use bigANNOY to explore how much latency you can save
  3. compare recall against the exact baseline
  4. choose the operating point that is acceptable for the application
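Step 4 of that progression can be made mechanical. Given a table of candidate configurations with measured recall and search time, pick the fastest configuration whose recall clears your target. The numbers below are invented purely for illustration:

```r
# Hypothetical sweep results: recall and elapsed search time per search_k.
sweep <- data.frame(
  search_k = c(50L, 100L, 200L, 400L),
  recall   = c(0.82, 0.91, 0.97, 0.99),
  elapsed  = c(0.004, 0.007, 0.013, 0.024)
)

# Fastest configuration meeting a recall target of 0.95.
ok <- sweep[sweep$recall >= 0.95, ]
chosen <- ok[which.min(ok$elapsed), ]
chosen$search_k  # -> 200
```

In practice the sweep table would come from repeated benchmark runs rather than hand-entered values, but the selection logic is the same.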

Benchmark Integration

The benchmark helpers in bigANNOY already support this pairing directly for Euclidean workloads. If bigKNN is available, they can report exact timing and recall automatically.

bench <- benchmark_annoy_bigmatrix(
  n_ref = 200L,
  n_query = 20L,
  n_dim = 6L,
  k = 5L,
  n_trees = 20L,
  search_k = 100L,
  metric = "euclidean",
  exact = length(find.package("bigKNN", quiet = TRUE)) > 0L,
  path_dir = compare_dir,
  load_mode = "eager"
)

bench$summary[, c(
  "metric",
  "n_trees",
  "search_k",
  "build_elapsed",
  "search_elapsed",
  "exact_elapsed",
  "recall_at_k"
)]

This is usually the easiest way to decide whether an approximate search configuration is worth adopting.

A Practical Decision Framework

Here is a simple way to decide between the two packages for a Euclidean workflow.

Start with bigKNN when the dataset is modest, you need a trusted exact answer, or you are building the baseline that later comparisons will be judged against.

Move toward bigANNOY when exact search is too slow for your query volume, you want to build an index once and reuse it across sessions, or a small, measured loss of recall is acceptable.

Keep both in the workflow when you need ongoing evidence that the approximate configuration still meets your recall target as the data evolves.

Important Boundaries

There are also a few boundaries worth keeping clear: recall comparisons only make sense when both packages are run with the same metric and the same k; bigKNN is optional, so comparison code should degrade gracefully when it is not installed; and approximate results are, by design, approximate.

That last point is easy to forget. The question is not whether approximate search is exact. The question is whether the approximation quality is good enough for the application you care about.

Recap

The best way to think about the pair is this: bigKNN establishes the correct answer and serves as the benchmark baseline, while bigANNOY provides the speed and persisted-index workflows that you validate against that baseline.

If you are beginning a new Euclidean workflow, a strong default is to start with bigKNN as the baseline, then move to bigANNOY once latency, scale, or persisted-index workflows become the limiting factor.




bigANNOY documentation built on April 1, 2026, 9:07 a.m.