# lsh_probability: Probability that a candidate pair will be detected with LSH

In textreuse: Detect Text Reuse and Document Similarity

## Description

Functions to help choose appropriate parameters for the lsh and minhash_generator functions. Use lsh_threshold to determine the minimum Jaccard similarity at which two documents are likely to be considered a match. Use lsh_probability to determine the probability that a pair of documents with a known Jaccard similarity will be detected.

## Usage

    lsh_probability(h, b, s)

    lsh_threshold(h, b)

## Arguments

| Argument | Description |
|----------|-------------|
| h | The number of minhash signatures. |
| b | The number of LSH bands. |
| s | The Jaccard similarity. |

## Details

Locality sensitive hashing returns a list of possible matches for similar documents. How likely is it that a pair of documents will be detected as a possible match? Let h be the number of minhash signatures, b the number of bands in the LSH function (so the number of rows per band is r = h / b), and s the actual Jaccard similarity of the two documents. The probability p that the two documents will be marked as a candidate pair is then given by the following equation.

p = 1 - (1 - s^{r})^{b}
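As a minimal sketch, the formula above can be written directly in base R. The helper name `candidate_probability` is hypothetical and is not claimed to be how textreuse implements lsh_probability; it also assumes that b divides h evenly.

```r
# Hypothetical helper mirroring p = 1 - (1 - s^r)^b; not the package source.
candidate_probability <- function(h, b, s) {
  stopifnot(h %% b == 0)  # assumption: the bands divide the hashes evenly
  r <- h / b              # rows per band
  1 - (1 - s^r)^b
}

# With h = 200 and b = 40 there are r = 5 rows per band.
candidate_probability(h = 200, b = 40, s = 0.5)
```

For s = 0.5 this should agree with the value of about 0.719 that the example output reports for lsh_probability with the same arguments.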

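According to Mining of Massive Datasets (MMDS; see References), that equation approximates an S-curve. This implies that there is a threshold (t) for s, the similarity at which the curve rises most steeply, approximated by this equation.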

t = \left(\frac{1}{b}\right)^{\frac{1}{r}}
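The threshold approximation can be sketched the same way in base R. The helper name `approx_threshold` is hypothetical and only illustrates the formula; it is not taken from the package source.

```r
# Hypothetical helper mirroring t = (1/b)^(1/r); not the package source.
approx_threshold <- function(h, b) {
  r <- h / b        # rows per band
  (1 / b)^(1 / r)
}

thr <- approx_threshold(h = 200, b = 40)
thr

# At s = t, s^r equals 1/b, so the detection probability
# reduces to 1 - (1 - 1/b)^b.
1 - (1 - thr^(200 / 40))^40
```

The first value should agree with the lsh_threshold result of about 0.478 shown in the example output. The second shows that a pair sitting exactly at the threshold is more likely than not to be detected, but far from certain.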

## References

Jure Leskovec, Anand Rajaraman, and Jeff Ullman, Mining of Massive Datasets (Cambridge University Press, 2011), ch. 3.

## Examples

    # Threshold for default values
    lsh_threshold(h = 200, b = 40)

    # Probability for varying values of s
    lsh_probability(h = 200, b = 40, s = .25)
    lsh_probability(h = 200, b = 40, s = .50)
    lsh_probability(h = 200, b = 40, s = .75)

### Example output

[1] 0.4781762
[1] 0.03832775
[1] 0.7191538
[1] 0.9999803
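
To illustrate how these formulas can guide the choice of b, the sketch below tabulates the approximate threshold for several band counts at a fixed h. It uses only base R and the threshold formula from the Details section; the candidate band counts are arbitrary illustrative values, chosen so that each divides h evenly.

```r
h <- 200
bands <- c(20, 40, 50, 100)             # illustrative values; each divides h
rows <- h / bands                       # rows per band for each choice of b
thresholds <- (1 / bands)^(1 / rows)    # t = (1/b)^(1/r)

data.frame(b = bands, r = rows, threshold = thresholds)
```

With h held fixed, more bands mean fewer rows per band and a lower threshold, so more pairs become candidates.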


textreuse documentation built on July 8, 2020, 6:40 p.m.