lsh_compare: Compare candidates identified by LSH

Description

The lsh_candidates function only identifies potential matches; it cannot estimate the actual similarity of the documents. This function takes a data frame returned by lsh_candidates and applies a comparison function to each candidate pair of documents in the corpus, thereby calculating a similarity score for each pair. Note that since your corpus will have minhash signatures rather than hashes for the tokens themselves, you will probably wish to use tokenize to calculate new hashes. This can be done for just the potentially similar documents. See the package vignettes for details.
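The pairwise comparison described above can be illustrated with a minimal base R sketch. This is not the package's implementation: compare_pairs, jaccard, toy_corpus, and toy_candidates are hypothetical names, the corpus is a plain named list of token vectors rather than a TextReuseCorpus, and the column names a, b, and score are assumed for illustration.

```r
# Hypothetical comparison function: Jaccard similarity of two token sets.
jaccard <- function(x, y) {
  length(intersect(x, y)) / length(union(x, y))
}

# Toy "corpus": a named list of token vectors (not a TextReuseCorpus).
toy_corpus <- list(
  doc1 = c("a", "b", "c", "d"),
  doc2 = c("b", "c", "d", "e"),
  doc3 = c("x", "y", "z")
)

# Toy candidate pairs with an empty score column (column names are assumed).
toy_candidates <- data.frame(
  a = c("doc1", "doc1"),
  b = c("doc2", "doc3"),
  score = NA_real_,
  stringsAsFactors = FALSE
)

# Apply the comparison function f to each candidate pair and store the result.
compare_pairs <- function(candidates, corpus, f) {
  candidates$score <- mapply(
    function(a, b) f(corpus[[a]], corpus[[b]]),
    candidates$a, candidates$b,
    USE.NAMES = FALSE
  )
  candidates
}

result <- compare_pairs(toy_candidates, toy_corpus, jaccard)
```

Only the listed pairs are compared, which is the point of generating candidates first: the comparison function never runs on pairs that were not flagged as potential matches.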

Usage

lsh_compare(candidates, corpus, f, progress = interactive())

Arguments

candidates

A data frame returned by lsh_candidates.

corpus

The same TextReuseCorpus that was used to generate the candidates.

f

A comparison function such as jaccard_similarity.

progress

Whether to display a progress bar while comparing documents. Defaults to TRUE in interactive sessions.

Value

A data frame of candidate pairs in which the score column holds the values calculated by f.
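Once scores are available, a typical next step is to keep only the pairs above some cutoff. The following base R sketch shows one way to do that on a toy data frame; the column names a, b, and score, the threshold value, and the sample scores are all assumptions made for illustration, not output from the package.

```r
# Toy scored result; the column names a, b, and score are assumed.
scored <- data.frame(
  a = c("doc1", "doc1", "doc2"),
  b = c("doc2", "doc3", "doc3"),
  score = c(0.6, 0.05, 0.9),
  stringsAsFactors = FALSE
)

# Hypothetical cutoff; an appropriate value depends on the corpus and on f.
threshold <- 0.5

# Keep pairs at or above the cutoff, dropping any that were never scored.
matches <- scored[!is.na(scored$score) & scored$score >= threshold, ]

# Order the remaining pairs from most to least similar.
matches <- matches[order(matches$score, decreasing = TRUE), ]
```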

Examples

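# Locate the sample legal texts shipped with the package
dir <- system.file("extdata/legal", package = "textreuse")
# Create a minhash function producing 200 minhashes per document
minhash <- minhash_generator(200, seed = 234)
# Load the corpus, tokenizing into 5-grams and computing minhash signatures
corpus <- TextReuseCorpus(dir = dir,
                          tokenizer = tokenize_ngrams, n = 5,
                          minhash_func = minhash)
# Split the signatures into 50 bands and hash them into buckets
buckets <- lsh(corpus, bands = 50)
# Extract the pairs of documents that share a bucket
candidates <- lsh_candidates(buckets)
# Score each candidate pair with the comparison function
lsh_compare(candidates, corpus, jaccard_similarity)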

