minhash_generator: Generate a minhash function

Description

A minhash value is calculated by hashing the strings in a character vector to integers and then selecting the minimum value. Repeated minhash values are generated by using different hash functions: these different hash functions are created by performing a bitwise XOR operation (bitwXor) between the hashed integers and each element of a vector of random integers. Since it is vital that the same random integers be used for each document, this function generates another function which will always use the same integers. The returned function is intended to be passed to the hash_func parameter of TextReuseTextDocument.
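The procedure described above can be sketched in base R. This is a minimal illustration, not the package's implementation: toy_hash and toy_minhash_generator are hypothetical names, the polynomial string hash stands in for whatever hash the package actually uses, and drawing the random integers with sample.int is an assumption.

```r
# Hypothetical stand-in for a string hash: maps one string to a
# non-negative integer. Doubles are used so the arithmetic cannot overflow.
toy_hash <- function(s) {
  h <- 7
  for (code in utf8ToInt(s)) {
    h <- (h * 31 + code) %% 2147483647
  }
  as.integer(h)
}

# Returns a closure that always reuses the same n random integers.
toy_minhash_generator <- function(n = 200, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  # Assumption: one random integer per simulated hash function.
  r <- sample.int(.Machine$integer.max, n)
  function(x) {
    stopifnot(is.character(x))
    # Hash every string once.
    h <- vapply(x, toy_hash, integer(1), USE.NAMES = FALSE)
    # XOR with each random integer simulates a different hash function;
    # the minimum under each one is a minhash.
    vapply(r, function(i) min(bitwXor(h, i)), integer(1))
  }
}

toy_minhash <- toy_minhash_generator(n = 10, seed = 253)
toy_tokens <- c("the quick brown", "quick brown fox", "brown fox jumps")
toy_signature <- toy_minhash(toy_tokens)
```

Because the random integers are captured when the generator is called, every document passed to toy_minhash is hashed with the same n simulated hash functions, which is what makes the resulting signatures comparable across documents.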

Usage

minhash_generator(n = 200, seed = NULL)

Arguments

n

The number of minhashes that the returned function should generate.

seed

An optional parameter that sets the seed used in generating the random numbers, ensuring that the same minhash function is produced on repeated applications.

Value

A function which will take a character vector and return n minhashes.

References

Jure Leskovec, Anand Rajaraman, and Jeff Ullman, Mining of Massive Datasets (Cambridge University Press, 2011), ch. 3. See also Matthew Casperson, "Minhash for Dummies" (November 14, 2013).

See Also

lsh

Examples

set.seed(253)
minhash <- minhash_generator(10)

# Example with a TextReuseTextDocument
file <- system.file("extdata/legal/ny1850-match.txt", package = "textreuse")
doc <- TextReuseTextDocument(file = file, hash_func = minhash,
                             keep_tokens = TRUE)
hashes(doc)

# Example with a character vector
is.character(tokens(doc))
minhash(tokens(doc))

Example output

 [1] -2134129939 -2140827722 -2142357869 -2145069965 -2146068904 -2145814060
 [7] -2127259801 -2145360837 -2144429817 -2133224999
[1] TRUE
 [1] -2134129939 -2140827722 -2142357869 -2145069965 -2146068904 -2145814060
 [7] -2127259801 -2145360837 -2144429817 -2133224999

textreuse documentation built on May 30, 2017, 3:32 a.m.