In LTLA/kmknn: Nearest Neighbor Detection for Bioconductor Packages

require(knitr)
opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE)
library(BiocStyle)
self <- Biocpkg("BiocNeighbors")
set.seed(99999)

Overview

The r self package implements a variety of nearest-neighbor (NN) search algorithms with a consistent interface. This includes exact search methods like vantage point trees, k-means k-nearest neighbors, and an exhaustive brute-force search, as well as approximate methods like Approximate Nearest Neighbors Oh Yeah (Annoy) and hierarchical navigable small worlds (HNSW). We provides methods to search nearest neighbors within a dataset, query across datasets, and (for some algorithms) find all neighbors within a certain distance.

Finding nearest neighbors

The findKNN() function will find the k-nearest neighbors within a dataset, returning one matrix of neighbor identities and another matrix of (sorted) distances to each neighbor.

# Mocking up a matrix with 'nobs' points in the rows
# and 'ndim' dimensions in the columns.
nobs <- 1000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)

library(BiocNeighbors)
fout <- findKNN(data, k=10)
str(fout)

Each row of the index matrix corresponds to a row in data and contains the row indices in data that are its nearest neighbors. For example, the 3rd point in data has the following nearest neighbors with their (sorted) distances:

fout$index[3,]
fout$distance[3,]

Users can easily change the algorithm by setting the BNPARAM= argument. For example, the code below switches to the Annoy algorithm for a faster, less accurate search.

aout <- findKNN(data, k=10, BNPARAM=AnnoyParam())
str(aout)

The same mechanism can be used to change the distance metric from the Euclidean default.

vout <- findKNN(data, k=10, BNPARAM=VptreeParam(distance="Manhattan"))
str(vout)

If the number of neighbors differs for each point, we can supply an integer vector to k= instead. This yields a list of vectors containing the neighbors and their distances to each point.

var.k <- sample(10, nrow(data), replace=TRUE)

# use I() to distinguish between scalar and length-1 vectors.
var.out <- findKNN(data, k=I(var.k))

head(var.out$index)
head(var.out$distance)

queryKNN() is a related function that will find the k-nearest neighbors in one dataset based on query points in another dataset. Here, the rows of the output matrices correspond to rows of our query matrix.

nquery <- 100
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)

qout <- queryKNN(data, query, k=5)
str(qout)

When performing multiple calls to findKNN() or queryKNN() on the same data, advanced users may prefer to build the index first with buildIndex(). This can be efficiently reused without repeating the index construction in each call.

prebuilt <- buildIndex(data)
out1 <- findKNN(prebuilt, k=5)
out2 <- queryKNN(prebuilt, query, k=5)

Finding all neighbors within range

In some applications, we need to find all neighboring points within a certain distance threshold. This is achieved using the findNeighbors() function:

nobs <- 8000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)

fout <- findNeighbors(data, threshold=1)
head(fout$index)
head(fout$distance)

Each entry of the index list corresponds to a point in data and contains the row indices in data that are within threshold. For example, the 3rd point in data has the following neighbors with the associated distances:

fout$index[[3]]
fout$distance[[3]]

Again, we can switch algorithms by specifying BNPARAM=. However, keep in mind that not all algorithms (particularly the approximate methods) support this range-based search.

vparam <- VptreeParam(distance="Manhattan")
vout <- findNeighbors(data, threshold=1, BNPARAM=vparam)

queryNeighbors() is a related function that will identify all points within a certain distance of a query point. Each entry of the returned index and distance corresponds to a row of query and describes its neighbors in data.

nquery <- 100
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
qout <- queryNeighbors(data, query, threshold=1)
head(qout$index)

As described above, advanced users can built the index first before repeated calls to findNeighbors() and queryNeighbors().

prebuilt <- buildIndex(data)
out1 <- findNeighbors(prebuilt, threshold=1)
out2 <- queryNeighbors(prebuilt, query,threshold=1)

If only the number of neighbors is of interest, we can set get.index=FALSE and get.distance=FALSE in the findNeighbors()/queryNeighbors() call. This will count the number of neighbors without storing their identities/distances for greater memory efficiency.

num <- findNeighbors(data, threshold=1, get.index=FALSE, get.distance=FALSE)
head(num)

Usage in downstream C++ libraries

R/Bioconductor package developers can use r self to perform nearest-neighbor searches in their own C++ code. This uses the interfaces in the knncolle library to abstract away the underlying search methods. First, we add some linking instructions in the DESCRIPTION to make the header files available:

LinkingTo: Rcpp, assorthead, BiocNeighbors

This includes the BiocNeighbors.h file that implements the BiocNeigbors::Prebuilt class:

system.file("include", "BiocNeighbors.h", package="BiocNeighbors")

When a buildIndex() method returns an external pointer, that pointer is expected to refer to an instance of a BiocNeighbors::Prebuilt. Thus, downstream code just needs to accept a pointer and cast it to a BiocNeighbors::Prebuilt for use in nearest-neighbor searches. (The same approach can be applied to obtain a BiocNeighbors::Builder for construction of new indices.)

SEXP do_something(SEXP raw_ptr, int k) {
    BiocNeighbors::PrebuiltPointer ptr(raw_ptr);
    const auto& prebuilt = ptr->index;
    auto searcher = prebuilt->initialize();

    std::vector<int> indices;
    std::vector<double> distances;
    searcher->search(0, k, &indices, &distances);

    return Rcpp::List::create(
        Rcpp::IntegerVector(indices.begin(), indices.end()),
        Rcpp::IntegerVector(distances.begin(), distances.end())
    );
}

Some extra work is required for cosine-normalized indices, which is indicated by ptr->cosine == true. Any query data should also be L2-normalized before use in Searcher::search() or Searcher::search_all(). (This is not required for overloads that accept an index for searching within a dataset, only for queries between datasets.)

Extending to new methods

Introduction

Developers can extend the r self framework to more algorithms through the creation of new BiocNeighborParam classes. Users can then switch between algorithms by simply changing the BNPARAM= argument in their calls to findKNN(), etc. In general, we recommend putting these extensions in a new R/Bioconductor package that adds more methods to the r self generics. Developers can choose to write pure R extensions or they can write them partially in C++.

In R

Each extension package should:

Implement a SomeNewParam S4 class that inherits from BiocNeighborParam.
Implement a SomeNewIndex S4 class that inherits from BiocNeighborIndex and contains an arbitrary index structure.
Implement a buildIndex() S4 method, dispatching on a matrix type for X= and a SomeNewParam type for BNPARAM=. This should return an instance of SomeNewIndex.
Implement two methods for each of findKNN() and queryKNN() (and optionally findNeighbors() queryNeighbors()). One method should dispatch on a matrix type for X= and a SomeNewParam type for BNPARAM=. The other method should dispatch on a prebuilt SomeNewIndex type for X= and an ANY type for BNPARAM= (as the BNPARAM= is ignored for prebuilt indices).

In C++

Alternatively, each extension package should:

Implement C++ classes that inherit from the Builder, Prebuilt and Searcher interfaces in the knncolle library. This requires LinkingTo: assorthead, BiocNeighbors in the DESCRIPTION.
Implement a SomeNewParam S4 class that inherits from BiocNeighborParam.
Implement a SomeNewIndex S4 class that inherits from BiocNeighborGenericIndex. This should contain a ptr slot for an external pointer and a names slot for the observation names.
Implement a defineBuilder() method, dispatching on a matrix type for X= and a SomeNewParam type for BNPARAM=. This should call into C++ and return a list containing builder, an external pointer to an instance of a BiocNeighbors::Builder (see definition in BiocNeighbors.h); and class, the constructor for the new SomeNewIndex class.

No new methods are required for findKNN(), queryNeighbors(), etc. as a BiocNeighborGeneric method is already available for each generic. This approach also allows the new method to be used in C++ code of downstream packages that accept a BiocNeighbors::Prebuilt instance.