This functionality is useful when there are many sentences and most of them have no words in common.
Instead of computing the Jaccard distance between all possible combinations of sentences, as
textrank_candidates_all does, we can reduce the number of combinations by using the Minhash algorithm.
This function sets up the combinations of sentences which are in the same Minhash bucket.
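To make the cost of the exhaustive approach concrete, here is a minimal base R sketch of the Jaccard similarity between the term sets of every pair of sentences. The helper `jaccard_similarity` and the toy `terms` list are hypothetical names introduced for illustration; this is not the package's implementation.

```r
# Jaccard similarity between two sets of terms:
# size of the intersection divided by size of the union
jaccard_similarity <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  length(intersect(a, b)) / length(union(a, b))
}

# Hypothetical toy sentences, each reduced to its terms
terms <- list(
  s1 = c("data", "scientist", "statistics"),
  s2 = c("data", "engineer", "statistics"),
  s3 = c("office", "location", "brussels")
)

# The exhaustive approach compares every pair of sentences
pairs <- t(combn(names(terms), 2))
similarity <- apply(pairs, 1, function(p) {
  jaccard_similarity(terms[[p[1]]], terms[[p[2]]])
})
result <- data.frame(id_1 = pairs[, 1], id_2 = pairs[, 2], similarity = similarity)
print(result)

# The number of pairs grows quadratically with the number of sentences
n_pairs <- choose(1000, 2)
print(n_pairs)
```

Only one of the three toy pairs shares any terms, which is the situation where restricting the comparison to sentences in the same bucket pays off.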
textrank_candidates_lsh(x, sentence_id, minhashFUN, bands)
x | a character vector of words or terms
sentence_id | a character vector of identifiers of the sentences to which the words/terms provided in x belong
minhashFUN | a function which returns a minhash of a character vector. See the examples or look at minhash_generator, which the examples load from the textreuse package
bands | integer indicating the number of bands into which the minhashes are broken down
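The effect of the number of bands can be sketched with the standard banding formula for locality sensitive hashing: with h minhashes split into b bands of h / b rows each, two sentences with Jaccard similarity s share at least one bucket with probability 1 - (1 - s^(h/b))^b. The function name `lsh_detection_probability` below is hypothetical, and the sketch assumes h is a multiple of b.

```r
# Probability that two sentences with Jaccard similarity s end up in at least
# one common bucket, given h minhashes split into b bands (standard banding formula).
# Assumption: h is a multiple of b, so every band has the same number of rows.
lsh_detection_probability <- function(h, b, s) {
  stopifnot(h %% b == 0)
  r <- h / b               # rows (minhashes) per band
  1 - (1 - s^r)^b
}

# Many narrow bands: a 10 percent overlap is very likely to be detected (roughly 0.99)
p_many_bands <- lsh_detection_probability(h = 1000, b = 500, s = 0.1)
print(p_many_bands)

# Few wide bands: the same overlap is very unlikely to be detected
p_few_bands <- lsh_detection_probability(h = 1000, b = 50, s = 0.1)
print(p_few_bands)
```

More bands make the candidate set more inclusive at the cost of more pairs to score afterwards, so the choice is a trade-off between recall and speed.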
A data.frame with 2 columns, textrank_id_1 and textrank_id_2, containing the identifiers (as given in
sentence_id) of sentences which contained terms in the same minhash bucket.
This data.frame can be used as input to the textrank_sentences algorithm.
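As an illustration of the shape of this result, the following base R sketch turns a bucket assignment into pairs of sentence identifiers. The `buckets` data.frame is hypothetical toy data and the pairing logic is an assumption about the general idea, not the package's internal code.

```r
# Hypothetical bucket assignment: one row per (sentence, bucket) combination
buckets <- data.frame(
  sentence_id = c("s1", "s2", "s3", "s1", "s4"),
  bucket      = c("b1", "b1", "b2", "b3", "b3"),
  stringsAsFactors = FALSE
)

# For every bucket holding at least 2 sentences, list all pairs of sentences in it
pairs_per_bucket <- lapply(split(buckets$sentence_id, buckets$bucket), function(ids) {
  ids <- sort(unique(ids))
  if (length(ids) < 2) return(NULL)
  pairs <- t(combn(ids, 2))
  data.frame(textrank_id_1 = pairs[, 1], textrank_id_2 = pairs[, 2],
             stringsAsFactors = FALSE)
})

# Drop buckets without pairs, stack the rest and remove duplicated pairs
pairs_per_bucket <- Filter(Negate(is.null), pairs_per_bucket)
candidates_sketch <- unique(do.call(rbind, unname(pairs_per_bucket)))
rownames(candidates_sketch) <- NULL
print(candidates_sketch)
```

Sentence s3 sits alone in its bucket, so it contributes no candidate pair and would never be compared with the other sentences.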
library(textrank)
library(textreuse)
library(udpipe)

# Check how well a given Jaccard overlap is detected with 1000 minhashes in 500 bands
lsh_probability(h = 1000, b = 500, s = 0.1) # A 10 percent Jaccard overlap will be detected well

# Minhash function producing 1000 minhashes per sentence
minhash <- minhash_generator(n = 1000, seed = 123456789)

# Build one identifier per sentence and keep nouns and adjectives as terminology
data(joboffer)
joboffer$textrank_id <- unique_identifier(joboffer, c("doc_id", "paragraph_id", "sentence_id"))
sentences <- unique(joboffer[, c("textrank_id", "sentence")])
terminology <- subset(joboffer, upos %in% c("NOUN", "ADJ"), select = c("textrank_id", "lemma"))

# Candidate sentence pairs: sentences which fall in the same minhash bucket
candidates <- textrank_candidates_lsh(x = terminology$lemma, sentence_id = terminology$textrank_id,
                                      minhashFUN = minhash, bands = 500)
head(candidates)

# Run textrank on the candidate pairs only
tr <- textrank_sentences(data = sentences, terminology = terminology,
                         textrank_candidates = candidates)
summary(tr, n = 2)