textrank_keywords: Textrank - extract relevant keywords


View source: R/textrank.R

Description

The textrank algorithm finds relevant keywords in text, where a keyword is a combination of words that follow one another.

To find relevant keywords, the textrank algorithm constructs a word network by looking at which words follow one another. A link is set up between two words if they follow one another, and the link gets a higher weight if these two words occur next to each other more frequently in the text.
The 'Pagerank' algorithm is then applied to the resulting network to get the importance of each word. The top 1/3 of these words are kept and considered relevant. After this, a keywords table is constructed by combining the relevant words if they follow one another in the text.
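The steps above can be illustrated with a minimal base R sketch. This is not the package's implementation: the toy `words` vector, the `pagerank()` helper (a plain power iteration with an assumed damping factor of 0.85), the rounding with `ceiling()`, and the merging of maximal runs only are all assumptions made for illustration.

```r
# Toy input: an already tokenised text (hypothetical data).
words <- c("data", "science", "team", "data", "science", "project",
           "data", "analysis", "team")

# 1. Word network: count how often two words occur next to each other.
from  <- head(words, -1)
to    <- tail(words, -1)
vocab <- sort(unique(words))
w <- matrix(0, length(vocab), length(vocab), dimnames = list(vocab, vocab))
for (i in seq_along(from)) {
  w[from[i], to[i]] <- w[from[i], to[i]] + 1
  w[to[i], from[i]] <- w[to[i], from[i]] + 1  # undirected link
}

# 2. PageRank by power iteration (hypothetical helper, assumed damping 0.85).
pagerank <- function(w, damping = 0.85, iter = 100) {
  n     <- nrow(w)
  trans <- w / rowSums(w)  # each row divided by its total link weight
  pr    <- rep(1 / n, n)
  for (k in seq_len(iter)) {
    pr <- (1 - damping) / n + damping * as.vector(pr %*% trans)
  }
  setNames(pr, rownames(w))
}
pr <- pagerank(w)

# 3. Keep the top 1/3 of the words (rounding up is an assumption).
keep <- names(sort(pr, decreasing = TRUE))[seq_len(ceiling(length(pr) / 3))]

# 4. Paste kept words together when they follow one another in the text.
#    Simplification: only maximal runs of kept words are merged.
is_kept <- words %in% keep
runs    <- rle(is_kept)
ends    <- cumsum(runs$lengths)
starts  <- ends - runs$lengths + 1
phrases <- mapply(function(s, e) paste(words[s:e], collapse = "-"),
                  starts[runs$values], ends[runs$values])
table(phrases)
```

In this toy text the pair "data" and "science" is linked twice, so that link carries the highest weight, and both words end up among the kept words.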

Usage

textrank_keywords(
  x,
  relevant = rep(TRUE, length(x)),
  p = 1/3,
  ngram_max = 5,
  sep = "-"
)

Arguments

x

a character vector of words.

relevant

a logical vector indicating whether each word is relevant. In the standard textrank algorithm, this is normally done by part-of-speech tagging and selecting the words which are nouns and adjectives.

p

fraction (between 0 and 1) of relevant words to keep. Defaults to 1/3. Can also be an integer, which then indicates how many words to keep. Specify +Inf if you want to keep all words.
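As a short illustration of the three ways to specify p, the calls below use the joboffer data shipped with the package. The object names `kw_third`, `kw_ten` and `kw_all` are made up for this sketch, and no particular output is claimed.

```r
library(textrank)
data(joboffer)

is_relevant <- joboffer$upos %in% c("NOUN", "ADJ")

# p as a fraction: keep the top third of the relevant words (the default)
kw_third <- textrank_keywords(joboffer$lemma, relevant = is_relevant, p = 1/3)

# p as an integer: keep the 10 highest ranked words
kw_ten <- textrank_keywords(joboffer$lemma, relevant = is_relevant, p = 10)

# p = +Inf: keep all relevant words
kw_all <- textrank_keywords(joboffer$lemma, relevant = is_relevant, p = +Inf)

head(kw_all$keywords)
```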

ngram_max

integer giving the maximum number of words following one another that are combined into one keyword

sep

character string with the separator used to paste consecutive relevant words together
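A small sketch of how ngram_max and sep might be used together, again on the joboffer data. The object name `kw_short` is made up for this sketch, and it assumes the `keyword` and `ngram` columns shown in the example output of this page.

```r
library(textrank)
data(joboffer)

# Combine at most 2 consecutive relevant words and join them with a space
kw_short <- textrank_keywords(joboffer$lemma,
                              relevant = joboffer$upos %in% c("NOUN", "ADJ"),
                              ngram_max = 2, sep = " ")

# Keywords made of more than one word
subset(kw_short$keywords, ngram > 1)
```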

Value

an object of class textrank_keywords which is a list with elements:

See Also

page_rank

Examples

data(joboffer)
keywords <- textrank_keywords(joboffer$lemma,
                              relevant = joboffer$upos %in% c("NOUN", "VERB", "ADJ"))
subset(keywords$keywords, ngram > 1 & freq > 1)
keywords <- textrank_keywords(joboffer$lemma,
                              relevant = joboffer$upos %in% c("NOUN"),
                              p = 1/2, sep = " ")
subset(keywords$keywords, ngram > 1)

## plotting pagerank to see the relevance of each word
barplot(sort(keywords$pagerank$vector), horiz = TRUE,
        las = 2, cex.names = 0.5, col = "lightblue", xlab = "Pagerank")

Example output

               keyword ngram freq
4        data-analysis     2    4
9         data-science     2    3
14 consultancy-service     2    2
               keyword ngram freq
3        data analysis     2    5
5         data science     2    3
8  consultancy service     2    2
10   research question     2    1
12         text mining     2    1
14       master degree     2    1

textrank documentation built on Oct. 23, 2020, 5:21 p.m.