knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(zoomerjoin)
The flagship feature of zoomerjoin is are the tidy joins for strings using the Jaccard distance, but zoomerjoin also allows you to join vectors using Euclidean distance. This can be useful for joining addresses or coordinates in space.
Unlike other nearest-neighbor methods such as KD-trees, the joins do not slow down as the dimension of the coordinates increases, so zoomerjoin can be used can be used to find close points in a high-dimensional space (such as word embeddings).
For this demonstration, I create a simulated dataset of 10^5 points distributed uniformly within a 100-dimensional hypercube. I join this to another dataset which is a copy of the first with each point shifted an tiny random amount.
n <- 10^5 # number of data points d <- 10^2 # dimension # Create a matrix of 10^6 observations in R^100 X <- matrix(runif(n * d), n, d) # Second Dataset is a copy of the first with points shifted an infinitesimal # amount X_2 <- as.data.frame(X + matrix(rnorm(n * d, 0, .0001), n, d)) X <- as.data.frame(X)
I now want to join these two datasets together. The Euclidean joins take 3
hyperparameters: n_bands
, band_width
, and r
. Which all have to be chosen
for the problem domain (although the defaults are generally sensible).
I use the euclidean_probability
function in the package to understand the
probability that two observations at distance of .01 from each other are
indentified as a match at a variety of hyperparameter configurations.
euclidean_probability(.01, n_bands = 5, band_width = 8, r = .25) euclidean_probability(.1, n_bands = 5, band_width = 8, r = .25) euclidean_probability(.01, n_bands = 10, band_width = 4, r = .15) euclidean_probability(.1, n_bands = 10, band_width = 4, r = .15) euclidean_probability(.01, n_bands = 40, band_width = 8, r = .15) euclidean_probability(.1, n_bands = 40, band_width = 8, r = .15)
Using n_bands=40
, band_width=8
, and r=.15
seems to provide a good balance
between identifying all true matches (as pairs less than .01 apart are
guaranteed to be found) with reducing the number of un-promising comparisons
(as pairs greater than .1 apart are unlikely to be compared). I then use the
euclidean_inner_join
to find all matching pairs across the two datasets:
set.seed(1) start <- Sys.time() joined_out <- euclidean_inner_join( X, X_2, threshold = .01, n_bands = 40, band_width = 8, r = .15 ) n_matches <- nrow(joined_out) time_taken <- Sys.time() - start print(paste("found", n_matches, "matches in", round(time_taken), "seconds"))
Zoomerjoin is able to easily find all pairs in just under 30s (perhaps longer on the runner that renders the website), even though the points lie in high-dimensional (d=100) space. This makes zoomerjoin a useful tool when trying to join or find matches between datasets of word or document embeddings.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.