View source: R/lsh_probability.R
lsh_probability | R Documentation |
Functions to help choose the correct parameters for the lsh
and
minhash_generator
functions. Use lsh_threshold
to
determine the minimum Jaccard similarity for two documents for them to likely
be considered a match. Use lsh_probability
to determine the
probability that a pair of documents with a known Jaccard similarity will be
detected.
lsh_probability(h, b, s)
lsh_threshold(h, b)
h |
The number of minhash signatures. |
b |
The number of LSH bands. |
s |
The Jaccard similarity. |
Locality sensitive hashing returns a list of possible matches for
similar documents. How likely is it that a pair of documents will be detected
as a possible match? If h
is the number of minhash signatures,
b
is the number of bands in the LSH function (implying then that the
number of rows r = h / b
), and s
is the actual Jaccard
similarity of the two documents, then the probability p
that the two
documents will be marked as a candidate pair is given by this equation.
p = 1 - (1 - s^{r})^{b}
According to MMDS,
that equation approximates an S-curve. This implies that there is a threshold
(t
) for s
approximated by this equation.
t = \frac{1}{b}^{\frac{1}{r}}
Jure Leskovec, Anand Rajaraman, and Jeff Ullman, Mining of Massive Datasets (Cambridge University Press, 2011), ch. 3.
# Threshold for default values
lsh_threshold(h = 200, b = 40)
# Probability for varying values of s
lsh_probability(h = 200, b = 40, s = .25)
lsh_probability(h = 200, b = 40, s = .50)
lsh_probability(h = 200, b = 40, s = .75)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.