Matching Vectors Based on Euclidean Distance
In zoomerjoin: Superlatively Fast Fuzzy Joins

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(zoomerjoin)

Introduction

The flagship feature of zoomerjoin is are the tidy joins for strings using the Jaccard distance, but zoomerjoin also allows you to join vectors using Euclidean distance. This can be useful for joining addresses or coordinates in space.

Unlike other nearest-neighbor methods such as KD-trees, the joins do not slow down as the dimension of the coordinates increases, so zoomerjoin can be used can be used to find close points in a high-dimensional space (such as word embeddings).

Demonstration

For this demonstration, I create a simulated dataset of 10^5 points distributed uniformly within a 100-dimensional hypercube. I join this to another dataset which is a copy of the first with each point shifted an tiny random amount.

n <- 10^5 # number of data points
d <- 10^2 # dimension

# Create a matrix of 10^6 observations in R^100
X <- matrix(runif(n * d), n, d)
# Second Dataset is a copy of the first with points shifted an infinitesimal
# amount
X_2 <- as.data.frame(X + matrix(rnorm(n * d, 0, .0001), n, d))
X <- as.data.frame(X)

I now want to join these two datasets together. The Euclidean joins take 3 hyperparameters: n_bands, band_width, and r. Which all have to be chosen for the problem domain (although the defaults are generally sensible).

I use the euclidean_probability function in the package to understand the probability that two observations at distance of .01 from each other are indentified as a match at a variety of hyperparameter configurations.

euclidean_probability(.01, n_bands = 5, band_width = 8, r = .25)
euclidean_probability(.1, n_bands = 5, band_width = 8, r = .25)

euclidean_probability(.01, n_bands = 10, band_width = 4, r = .15)
euclidean_probability(.1, n_bands = 10, band_width = 4, r = .15)

euclidean_probability(.01, n_bands = 40, band_width = 8, r = .15)
euclidean_probability(.1, n_bands = 40, band_width = 8, r = .15)

Using n_bands=40, band_width=8, and r=.15 seems to provide a good balance between identifying all true matches (as pairs less than .01 apart are guaranteed to be found) with reducing the number of un-promising comparisons (as pairs greater than .1 apart are unlikely to be compared). I then use the euclidean_inner_join to find all matching pairs across the two datasets:

set.seed(1)
start <- Sys.time()
joined_out <- euclidean_inner_join(
  X,
  X_2,
  threshold = .01,
  n_bands = 40,
  band_width = 8,
  r = .15
)
n_matches <- nrow(joined_out)
time_taken <- Sys.time() - start
print(paste("found", n_matches, "matches in", round(time_taken), "seconds"))

Zoomerjoin is able to easily find all pairs in just under 30s (perhaps longer on the runner that renders the website), even though the points lie in high-dimensional (d=100) space. This makes zoomerjoin a useful tool when trying to join or find matches between datasets of word or document embeddings.