Description Usage Arguments Details References Examples

View source: R/lsh_probability.R

Functions to help choose the correct parameters for the `lsh`

and
`minhash_generator`

functions. Use `lsh_threshold`

to
determine the minimum Jaccard similarity for two documents for them to likely
be considered a match. Use `lsh_probability`

to determine the
probability that a pair of documents with a known Jaccard similarity will be
detected.

1 2 3 | ```
lsh_probability(h, b, s)
lsh_threshold(h, b)
``` |

`h` |
The number of minhash signatures. |

`b` |
The number of LSH bands. |

`s` |
The Jaccard similarity. |

Locality sensitive hashing returns a list of possible matches for
similar documents. How likely is it that a pair of documents will be detected
as a possible match? If `h`

is the number of minhash signatures,
`b`

is the number of bands in the LSH function (implying then that the
number of rows `r = h / b`

), and `s`

is the actual Jaccard
similarity of the two documents, then the probability `p`

that the two
documents will be marked as a candidate pair is given by this equation.

*p = 1 - (1 - s^{r})^{b}*

According to MMDS,
that equation approxmiates an S-curve. This implies that there is a threshold
(`t`

) for `s`

approximated by this equation.

*t = \frac{1}{b}^{\frac{1}{r}}*

Jure Leskovec, Anand Rajaraman, and Jeff Ullman,
*Mining of Massive Datasets*
(Cambridge University Press, 2011), ch. 3.

1 2 3 4 5 6 7 | ```
# Threshold for default values
lsh_threshold(h = 200, b = 40)
# Probability for varying values of s
lsh_probability(h = 200, b = 40, s = .25)
lsh_probability(h = 200, b = 40, s = .50)
lsh_probability(h = 200, b = 40, s = .75)
``` |

```
[1] 0.4781762
[1] 0.03832775
[1] 0.7191538
[1] 0.9999803
```

textreuse documentation built on May 30, 2017, 3:32 a.m.

Embedding an R snippet on your website

Add the following code to your website.

For more information on customizing the embed code, read Embedding Snippets.