minhash_generator | R Documentation |
A minhash value is calculated by hashing the strings in a character vector to
integers and then selecting the minimum value. Repeated minhash values are
generated by using different hash functions: these different hash functions
are created by performing a bitwise XOR operation (bitwXor) with a vector of
random integers. Since it is vital that the same random integers be used for
each document, this function generates another function which will always use
the same integers. The returned function is intended to be passed to the
hash_func parameter of TextReuseTextDocument.
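The idea described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's actual implementation: toy_hash and sketch_minhash_generator are hypothetical names, and the string hash is a simple polynomial hash chosen only so that the example is self-contained.

```r
# Hypothetical toy string hash (an assumption for illustration only).
# Arithmetic is done in doubles and reduced modulo 2^31 - 1 so that the
# result always fits in an R integer.
toy_hash <- function(x) {
  vapply(x, function(s) {
    h <- 7
    for (code in utf8ToInt(s)) {
      h <- (h * 31 + code) %% 2147483647
    }
    as.integer(h)
  }, integer(1), USE.NAMES = FALSE)
}

# Hypothetical generator: draws the random integers once and returns a
# closure that reuses them on every call.
sketch_minhash_generator <- function(n = 200, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  r <- sample.int(.Machine$integer.max, n)
  function(x) {
    stopifnot(is.character(x))
    h <- toy_hash(x)
    # One minhash per random integer: XOR every hash, keep the minimum.
    vapply(r, function(i) min(bitwXor(h, i)), integer(1))
  }
}

mh <- sketch_minhash_generator(n = 5, seed = 1)
mh(c("the", "quick", "brown", "fox"))
```

Because the random integers are captured in the closure, every document hashed with the same returned function is compared on the same footing, and the order of the tokens does not affect the result.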
minhash_generator(n = 200, seed = NULL)
n |
The number of minhashes that the returned function should generate. |
seed |
An optional parameter to set the seed used in generating the random numbers, to ensure that the same minhash function is used on repeated applications. |
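A short illustration of the seed argument, assuming the textreuse package is installed: two generators created with the same seed are expected to return identical minhashes for the same input. The token vector here is made up for the example.

```r
library(textreuse)

# Two generators built with the same n and seed
a <- minhash_generator(n = 10, seed = 253)
b <- minhash_generator(n = 10, seed = 253)

# A made-up character vector of tokens
words <- c("the", "quick", "brown", "fox")

# Compare the minhashes produced by the two functions
identical(a(words), b(words))
```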
A function which will take a character vector and return n minhashes.
Jure Leskovec, Anand Rajaraman, and Jeff Ullman, Mining of Massive Datasets (Cambridge University Press, 2011), ch. 3. See also Matthew Casperson, "Minhash for Dummies" (November 14, 2013).
lsh
set.seed(253)
minhash <- minhash_generator(10)
# Example with a TextReuseTextDocument
file <- system.file("extdata/legal/ny1850-match.txt", package = "textreuse")
doc <- TextReuseTextDocument(file = file, hash_func = minhash,
                             keep_tokens = TRUE)
hashes(doc)
# Example with a character vector
is.character(tokens(doc))
minhash(tokens(doc))