textrank_sentences: Textrank - extract relevant sentences
In bnosac/textrank: Summarize Text by Ranking Sentences and Finding Keywords


View source: R/textrank.R

The textrank algorithm is a technique to rank sentences in order of importance.

To find relevant sentences, the textrank algorithm needs 2 inputs: a data.frame (data) with sentences and a data.frame (terminology) containing the tokens which are part of each sentence.
Based on these 2 datasets, it calculates the pairwise distance between sentences by computing how many terms overlap (Jaccard distance, implemented in textrank_jaccard; despite the name, a larger value means more shared terms). These pairwise distances are then passed on to Google's pagerank algorithm to identify the most relevant sentences.
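
The two steps described above can be sketched in base R. This is a minimal illustration on made-up toy data, not the package's implementation: the names `tokens`, `jaccard` and `pagerank_sketch` are hypothetical, the damping factor of 0.85 is an assumption, and the handling of sentences without any overlap may differ from what `page_rank` does.

```r
# Toy input: three sentences represented by their tokens (hypothetical data)
tokens <- list(
  s1 = c("data", "science", "team"),
  s2 = c("data", "science", "project"),
  s3 = c("office", "coffee")
)

# Token overlap: number of shared tokens divided by number of distinct tokens
jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

# Symmetric weight matrix over all sentence pairs
ids <- names(tokens)
n <- length(ids)
w <- matrix(0, n, n, dimnames = list(ids, ids))
for (i in seq_len(n - 1)) {
  for (j in (i + 1):n) {
    w[i, j] <- w[j, i] <- jaccard(tokens[[i]], tokens[[j]])
  }
}

# Weighted PageRank by power iteration (damping factor is an assumption)
pagerank_sketch <- function(w, damping = 0.85, iterations = 100) {
  n <- nrow(w)
  out <- rowSums(w)
  # Row-normalise the weights; a sentence without links spreads its score evenly
  p <- w / ifelse(out > 0, out, 1)
  p[out == 0, ] <- 1 / n
  rank <- rep(1 / n, n)
  for (k in seq_len(iterations)) {
    rank <- (1 - damping) / n + damping * as.vector(rank %*% p)
  }
  setNames(rank, rownames(w))
}

scores <- pagerank_sketch(w)
sort(scores, decreasing = TRUE)
```

In this toy case the two sentences that share tokens reinforce each other and end up ranked above the isolated one.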

If data contains many sentences, it makes sense not to compute all pairwise sentence distances but instead to limit the calculation of the Jaccard distance to the sentence combinations selected by the Minhash algorithm. This is implemented in textrank_candidates_lsh and an example is shown below.
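
To give an idea of how Minhash narrows down the sentence combinations, here is a rough base R sketch of the general technique (minhash signatures split into bands). It is not how `textrank_candidates_lsh` is implemented, which relies on the textreuse package; the toy data, the use of random vocabulary permutations as hash functions, and the values of `n_hashes` and `bands` are all assumptions made for illustration.

```r
# Toy input in the long format described for `terminology` (hypothetical data)
terminology <- data.frame(
  textrank_id = c(1, 1, 1, 2, 2, 2, 3, 3),
  lemma = c("data", "science", "team",
            "data", "science", "team",
            "office", "coffee"),
  stringsAsFactors = FALSE
)

set.seed(42)
vocabulary <- unique(terminology$lemma)
n_hashes <- 20
bands <- 10
band_of <- rep(seq_len(bands), each = n_hashes / bands)  # 2 hashes per band

# Each "hash function" is a random permutation of the vocabulary
# (matrix with one row per token and one column per hash function)
permutations <- replicate(n_hashes, sample(length(vocabulary)))

# Minhash signature: per permutation, the smallest permuted index
# among the tokens of the sentence
token_sets <- split(match(terminology$lemma, vocabulary), terminology$textrank_id)
signatures <- sapply(token_sets, function(idx) {
  apply(permutations[idx, , drop = FALSE], 2, min)
})

# One bucket key per band and sentence
ids <- colnames(signatures)
keys <- sapply(ids, function(id) {
  vapply(split(signatures[, id], band_of), paste, character(1), collapse = "-")
})

# Two sentences become a candidate pair if they share a bucket in any band
pairs <- t(combn(ids, 2))
same_bucket <- apply(pairs, 1, function(p) any(keys[, p[1]] == keys[, p[2]]))
candidates <- data.frame(
  textrank_id_1 = pairs[same_bucket, 1],
  textrank_id_2 = pairs[same_bucket, 2],
  stringsAsFactors = FALSE
)
candidates
```

Sentences with identical token sets always land in the same buckets, while sentences without any token in common never do, so the pairwise distance only has to be computed for the remaining candidate pairs.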

textrank_sentences(
  data,
  terminology,
  textrank_dist = textrank_jaccard,
  textrank_candidates = textrank_candidates_all(data$textrank_id),
  max = 1000,
  options_pagerank = list(directed = FALSE),
  ...
)

`data`	a data.frame with 1 row per sentence where the first column is an identifier of a sentence (e.g. textrank_id) and the second column is the raw sentence. See the example.
`terminology`	a data.frame with one row per token indicating which token is part of each sentence. The first column in this data.frame is the identifier which corresponds to the first column of `data` and the second column indicates the token which is part of the sentence and which will be passed on to `textrank_dist`. See the example.
`textrank_dist`	a function which calculates the distance between 2 sentences which are represented by vectors of tokens. The first 2 arguments of the function are the tokens in sentence1 and sentence2. The function should return a numeric value of length one. The larger the value, the stronger the connection between the 2 vectors. Defaults to the Jaccard distance (`textrank_jaccard`), indicating the percent of common tokens.
`textrank_candidates`	a data.frame of candidate sentence to sentence comparisons with columns textrank_id_1 and textrank_id_2 indicating for which combination of sentences we want to compute the Jaccard distance or the distance function as provided in `textrank_dist`. See for example `textrank_candidates_all` or `textrank_candidates_lsh`.
`max`	integer limiting the number of sentence to sentence combinations to compute. If provided, only this maximum number of rows is taken from `textrank_candidates`.
`options_pagerank`	a list of arguments passed on to `page_rank`
`...`	arguments passed on to `textrank_dist`
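
As an illustration of the shapes these arguments expect, the sketch below defines a custom distance function and a candidates data.frame in base R. Everything in it is hypothetical: `overlap_log` is just one possible alternative to `textrank_jaccard` (token overlap normalised by the log of the sentence lengths), the sentence ids are made up, and the sampling step only assumes that `max` keeps a random subset of rows, as the sampling example suggests.

```r
# Hypothetical distance function: takes the tokens of 2 sentences as its
# first 2 arguments and returns a single number, larger meaning more similar
overlap_log <- function(termsa, termsb, ...) {
  termsa <- unique(termsa)
  termsb <- unique(termsb)
  denominator <- log(length(termsa)) + log(length(termsb))
  # Guard against empty or single-token sentences
  if (denominator <= 0) return(0)
  length(intersect(termsa, termsb)) / denominator
}

overlap_log(c("data", "science", "team"), c("data", "science", "project"))

# Candidates in the documented layout: one row per sentence pair to compare
ids <- c(1, 2, 3, 4)                    # made-up sentence identifiers
pairs <- t(combn(ids, 2))
candidates <- data.frame(textrank_id_1 = pairs[, 1],
                         textrank_id_2 = pairs[, 2])

# Assumed effect of `max`: keep at most this many randomly chosen pairs
max_pairs <- 3
set.seed(1)
if (nrow(candidates) > max_pairs) {
  candidates <- candidates[sample(nrow(candidates), max_pairs), ]
}
candidates
```

A function like this could then be supplied through `textrank_dist = overlap_log`, and a data.frame like this through `textrank_candidates = candidates`.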

an object of class textrank_sentences which is a list with elements:

sentences: a data.frame with columns textrank_id, sentence and textrank where the textrank is the Google Pagerank importance metric of the sentence
sentences_dist: a data.frame with columns textrank_id_1, textrank_id_2 (the sentence id) and weight which is the result of the computed distance between the 2 sentences
pagerank: the result of a call to page_rank

See also: page_rank, textrank_candidates_all, textrank_candidates_lsh, textrank_jaccard

library(udpipe)
data(joboffer)
head(joboffer)
joboffer$textrank_id <- unique_identifier(joboffer, c("doc_id", "paragraph_id", "sentence_id"))
sentences <- unique(joboffer[, c("textrank_id", "sentence")])
cat(sentences$sentence)
terminology <- subset(joboffer, upos %in% c("NOUN", "ADJ"), select = c("textrank_id", "lemma"))
head(terminology)

## Textrank for finding the most relevant sentences
tr <- textrank_sentences(data = sentences, terminology = terminology)
summary(tr, n = 2)
summary(tr, n = 5, keep.sentence.order = TRUE)

## Not run: 
## Using minhash to reduce sentence combinations - relevant if you have a lot of sentences
library(textreuse)
minhash <- minhash_generator(n = 1000, seed = 123456789)
candidates <- textrank_candidates_lsh(x = terminology$lemma, sentence_id = terminology$textrank_id,
                                      minhashFUN = minhash, bands = 500)
tr <- textrank_sentences(data = sentences, terminology = terminology,
                         textrank_candidates = candidates)
summary(tr, n = 2)

## End(Not run)
## You can also reduce the number of sentence combinations by sampling
tr <- textrank_sentences(data = sentences, terminology = terminology, max = 100)
tr
summary(tr, n = 2)