lsh: Locality sensitive hashing for minhash
In textreuse: Detect Text Reuse and Document Similarity

Description Usage Arguments Details Value References See Also Examples

Locality sensitive hashing (LSH) discovers potential matches among a corpus of documents quickly, so that only likely pairs can be compared.

1	lsh(x, bands, progress = interactive())

`x`	A `TextReuseCorpus` or `TextReuseTextDocument`.
`bands`	The number of bands to use for locality sensitive hashing. The number of hashes in the documents in the corpus must be evenly divisible by the number of bands. See `lsh_threshold` and `lsh_probability` for guidance in selecting the number of bands and hashes.
`progress`	Display a progress bar while comparing documents.

Locality sensitive hashing is a technique for detecting document similarity that does not require pairwise comparisons. When comparing pairs of documents, the number of pairs grows rapidly, so that only the smallest corpora can be compared pairwise in a reasonable amount of computation time. Locality sensitive hashing, on the other hand, takes a document which has been tokenized and hashed using a minhash algorithm. (See minhash_generator.) Each set of minhash signatures is then broken into bands comprised of a certain number of rows. (For example, 200 minhash signatures might be broken down into 20 bands each containing 10 rows.) Each band is then hashed to a bucket. Documents with identical rows in a band will be hashed to the same bucket. The likelihood that a document will be marked as a potential duplicate is proportional to the number of bands and inversely proportional to the number of rows in each band.

This function returns a data frame with the additional class lsh_buckets. The LSH technique only requires that the signatures for each document be calculated once. So it is possible, as long as one uses the same minhash function and the same number of bands, to combine the outputs from this function at different times. The output can thus be treated as a kind of cache of LSH signatures.

To extract pairs of documents from the output of this function, see lsh_candidates.

A data frame (with the additional class lsh_buckets), containing a column with the document IDs and a column with their LSH signatures, or buckets.

Jure Leskovec, Anand Rajaraman, and Jeff Ullman, Mining of Massive Datasets (Cambridge University Press, 2011), ch. 3. See also Matthew Casperson, "Minhash for Dummies" (November 14, 2013).

minhash_generator, lsh_candidates, lsh_query, lsh_probability, lsh_threshold

dir <- system.file("extdata/legal", package = "textreuse")
minhash <- minhash_generator(200, seed = 235)
corpus <- TextReuseCorpus(dir = dir,
                          tokenizer = tokenize_ngrams, n = 5,
                          minhash_func = minhash)
buckets <- lsh(corpus, bands = 50)
buckets

# A tibble: 150 x 2
            doc                          buckets
          <chr>                            <chr>
 1 ca1851-match f98ef40288d182d970d569b9843a7815
 2 ca1851-match a32dce9275e80b4f8d02469fea256544
 3 ca1851-match 6a51a6761626e52df253779366be28ed
 4 ca1851-match 00fb5704f5f57d98a8b330362bb2e0b5
 5 ca1851-match 759317127c45bbffeffe15080d6db2b0
 6 ca1851-match 1a7dc9a6f10f13a005b014eeaeae07b4
 7 ca1851-match d9e25df6576a990b84a64fce3b942f4c
 8 ca1851-match 41a67cd0279e6aa06d966113e88c66b5
 9 ca1851-match 26747bdebb2d4bc4d1659daefbb3e154
10 ca1851-match cb44f753f81245032ebf07b2d22f8b16
# ... with 140 more rows