textrank_candidates_lsh: Use locality-sensitive hashing to get combinations of...
In jwijffels/textrank: Summarize Text by Ranking Sentences and Finding Keywords


View source: R/textrank.R

This functionality is useful when there are many sentences and most of them have no words in common. Instead of computing the Jaccard distance among all possible combinations of sentences, as is done by textrank_candidates_all, we can reduce the number of combinations by using the Minhash algorithm. This function returns only the combinations of sentences which fall in the same Minhash bucket.
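The bucketing idea can be sketched in base R. This is a toy illustration only, not the package's internals: the hash family, `word_code`, `toy_minhash` and `bucket_keys` are hypothetical helpers invented for this sketch, and the three sentences are made-up data.

```r
# Toy MinHash + banding in base R (illustrative assumption, not the textrank/textreuse code).
sentences <- list(
  s1 = c("data", "scientist", "statistics", "r"),
  s2 = c("data", "scientist", "statistics", "python"),
  s3 = c("coffee", "office", "friendly", "team")
)

# Hypothetical hash family: h_i(word) = (a_i * code(word) + b_i) mod p
set.seed(42)
n_hashes <- 20
p <- 10007
a <- sample.int(p - 1, n_hashes)
b <- sample.int(p - 1, n_hashes)
word_code <- function(w) {
  chars <- utf8ToInt(w)
  sum(chars * seq_along(chars)) %% p
}

# Signature = for each hash function, the minimum hash value over the words
toy_minhash <- function(words) {
  codes <- vapply(words, word_code, numeric(1))
  vapply(seq_len(n_hashes), function(i) min((a[i] * codes + b[i]) %% p), numeric(1))
}

# Split a signature into bands; each band becomes one bucket key
bands <- 10
rows <- n_hashes / bands
bucket_keys <- function(sig) {
  sig_matrix <- matrix(sig, nrow = rows)  # one column per band
  keys <- apply(sig_matrix, 2, paste, collapse = "-")
  paste(seq_len(bands), keys, sep = ":")
}

sigs <- lapply(sentences, toy_minhash)
buckets <- do.call(rbind, lapply(names(sigs), function(id) {
  data.frame(sentence_id = id, bucket = bucket_keys(sigs[[id]]), stringsAsFactors = FALSE)
}))

# Sentences sharing at least one bucket become candidate pairs
pairs <- merge(buckets, buckets, by = "bucket")
pairs <- pairs[pairs$sentence_id.x < pairs$sentence_id.y, c("sentence_id.x", "sentence_id.y")]
pairs <- unique(pairs)
names(pairs) <- c("textrank_id_1", "textrank_id_2")
print(pairs)
```

Only pairs that collide in at least one band are kept, so sentences without shared terms are unlikely to be compared at all.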

textrank_candidates_lsh(x, sentence_id, minhashFUN, bands)

`x`	a character vector of words or terms
`sentence_id`	a character vector of sentence identifiers, indicating for each word/term in `x` the sentence it belongs to. The length of `sentence_id` should be the same as the length of `x`
`minhashFUN`	a function which returns a minhash of a character vector. See the examples or look at `minhash_generator`
`bands`	integer indicating the number of bands into which the minhashes are broken down. Note that the number of minhash signatures should always be a multiple of the number of locality-sensitive hashing bands. See the example
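The interplay between the number of minhashes and `bands` can be sketched with the standard banding formula for locality-sensitive hashing, which should correspond to what `lsh_probability` in textreuse reports. The helper name `lsh_detect_prob` is a hypothetical one used only for this illustration.

```r
# Probability that two sentences with Jaccard similarity s share at least one bucket,
# given h minhashes split into b bands of r = h / b rows each (standard LSH banding formula).
lsh_detect_prob <- function(h, b, s) {
  stopifnot(h %% b == 0)  # the number of minhashes must be a multiple of the number of bands
  r <- h / b
  1 - (1 - s^r)^b
}

# Many narrow bands: low-overlap pairs are still picked up as candidates
many_bands <- lsh_detect_prob(h = 1000, b = 500, s = 0.1)
# Fewer, wider bands: the same low-overlap pair is almost never a candidate
few_bands <- lsh_detect_prob(h = 1000, b = 100, s = 0.1)
print(c(many_bands = many_bands, few_bands = few_bands))
```

More bands make the candidate search more permissive (more pairs, more computation), while fewer bands keep only pairs with high overlap.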

A data.frame with 2 columns, textrank_id_1 and textrank_id_2, containing the identifiers (sentence_id) of pairs of sentences whose terms fall in the same minhash bucket. This data.frame can be used as input to the textrank_sentences algorithm.
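To illustrate why a reduced candidate list helps, the following base R sketch computes a Jaccard overlap only for the pairs listed in a data.frame of this shape. The toy `terminology` and `candidates` data and the `jaccard` helper are assumptions made for this example; it is not the code path used inside textrank_sentences.

```r
# Toy terminology: one row per (sentence, term), made-up data
terminology <- data.frame(
  textrank_id = c(1, 1, 1, 2, 2, 2, 3, 3),
  lemma = c("data", "scientist", "statistics",
            "data", "scientist", "python",
            "coffee", "office"),
  stringsAsFactors = FALSE
)

# Candidate pairs in the same layout as the returned data.frame
candidates <- data.frame(textrank_id_1 = 1, textrank_id_2 = 2)

# Hypothetical helper: Jaccard overlap of two term sets
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

terms_by_sentence <- split(terminology$lemma, terminology$textrank_id)
candidates$similarity <- mapply(
  function(i, j) jaccard(terms_by_sentence[[as.character(i)]],
                         terms_by_sentence[[as.character(j)]]),
  candidates$textrank_id_1, candidates$textrank_id_2
)
print(candidates)

# Number of pairs an all-combinations approach would have to evaluate
n_all_pairs <- choose(length(terms_by_sentence), 2)
```

Here only 1 of the 3 possible sentence pairs is evaluated; with thousands of sentences the saving grows quadratically.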

textrank_sentences

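library(textrank)
library(textreuse)
library(udpipe)
# With 1000 minhashes in 500 bands, a 10 percent Jaccard overlap will be detected well
lsh_probability(h = 1000, b = 500, s = 0.1)

# 1000 minhashes: a multiple of the 500 bands used below
minhash <- minhash_generator(n = 1000, seed = 123456789)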

data(joboffer)
joboffer$textrank_id <- unique_identifier(joboffer, c("doc_id", "paragraph_id", "sentence_id"))
sentences <- unique(joboffer[, c("textrank_id", "sentence")])
terminology <- subset(joboffer, upos %in% c("NOUN", "ADJ"), select = c("textrank_id", "lemma"))
candidates <- textrank_candidates_lsh(x = terminology$lemma, sentence_id = terminology$textrank_id,
                                      minhashFUN = minhash, bands = 500)
head(candidates)
tr <- textrank_sentences(data = sentences, terminology = terminology,
                         textrank_candidates = candidates)
summary(tr, n = 2)