Getting started

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  progress = FALSE,
  error = FALSE, 
  message = FALSE
)

options(digits = 2)

What is the RAKE algorithm?

The Rapid Automatic Keyword Extraction (RAKE) algorithm was first described in Rose et al. as a way to quickly extract keywords from documents. The algorithm involves two main steps:

1. Identify candidate keywords. A candidate keyword is any set of contiguous words (i.e., any n-gram) that doesn't contain a phrase delimiter or stop word.[^1] A phrase delimiter is a punctuation character that marks the beginning or end of a phrase (e.g., a period or comma). Splitting up text based on phrase delimiters/stop words is the essential idea behind RAKE. According to the authors:

"RAKE is based on our observation that keywords frequently contain multiple words but rarely contain standard punctuation or stop words, such as the function words 'and', 'the', and 'of', or other words with minimal lexical meaning."

In addition to using stop words and phrase delimiters to identify the candidate keywords, you can also use a word's part-of-speech (POS). For example, most keywords don't contain verbs, so you may want to treat verbs as if they were stop words. You can use slowrake()'s stop_pos parameter to choose which parts-of-speech to exclude from your candidate keywords.

2. Calculate each keyword's score. A keyword's score (i.e., its degree of "keywordness") is the sum of its member word scores. For example, the score for the keyword "dog leash" is calculated by adding the score for the word "dog" to the score for the word "leash." A member word's score is equal to its degree divided by its frequency, where degree is the number of times that the word co-occurs with other words (in other keywords), and frequency is the total number of times that the word occurs overall.
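Step 1 above can be sketched in a few lines of base R. This is only an illustration of the splitting idea, not slowraker's implementation: extract_candidates() and toy_stop_words are hypothetical names, every punctuation character is treated as a phrase delimiter, and part-of-speech filtering is left out.

```r
# Hypothetical helper (not part of slowraker): split text into candidate
# keywords using phrase delimiters and stop words.
extract_candidates <- function(txt, stop_words) {
  # Assumption: any punctuation character acts as a phrase delimiter.
  phrases <- unlist(strsplit(tolower(txt), "[[:punct:]]+"))
  candidates <- list()
  for (phrase in phrases) {
    words <- unlist(strsplit(trimws(phrase), "[[:space:]]+"))
    words <- words[nzchar(words)]
    if (length(words) == 0) next
    # Each run of consecutive non-stop words forms one candidate keyword.
    is_stop <- words %in% stop_words
    run_id <- cumsum(is_stop)
    runs <- split(words[!is_stop], run_id[!is_stop])
    candidates <- c(candidates, unname(runs))
  }
  vapply(candidates, paste, character(1), collapse = " ")
}

# A toy stop word list, made up for this example.
toy_stop_words <- c("a", "of", "the", "and", "over", "are")

candidates <- extract_candidates(
  "Compatibility of systems of linear constraints over the set of natural numbers.",
  toy_stop_words
)
candidates
```

With this toy stop word list, the sentence splits into "compatibility", "systems", "linear constraints", "set", and "natural numbers".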

See Rose et al. for more details on how RAKE works.
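The scoring step can also be sketched in base R. The helper below, score_keywords(), is a hypothetical name and not part of slowraker. It assumes one reading of the definition above: each occurrence of a word adds the number of other words in the same candidate keyword to that word's degree. Implementations can differ on whether a word's co-occurrence with itself counts, so treat the numbers as illustrative.

```r
# Hypothetical helper (not part of slowraker): score candidate keywords.
score_keywords <- function(candidates) {
  member_words <- strsplit(candidates, " ", fixed = TRUE)
  all_words <- unlist(member_words)

  # Frequency: total number of times each word occurs.
  freq <- table(all_words)

  # Degree (assumed reading): for every occurrence of a word, count the
  # other words in the same candidate keyword.
  others <- rep(lengths(member_words) - 1, times = lengths(member_words))
  degree <- tapply(others, all_words, sum)

  # Word score = degree / frequency.
  word_score <- as.vector(degree) / as.vector(freq[names(degree)])
  names(word_score) <- names(degree)

  # A keyword's score is the sum of its member word scores.
  keyword_score <- vapply(
    member_words,
    function(w) sum(word_score[w]),
    numeric(1)
  )
  data.frame(keyword = candidates, score = keyword_score, stringsAsFactors = FALSE)
}

res <- score_keywords(c("dog leash", "dog", "leash law"))
res
```

Here "dog" occurs twice and co-occurs with one other word, so its word score is 1/2. "leash" occurs twice and co-occurs with two other words, so its score is 2/2 = 1, and "law" scores 1/1 = 1. The keyword scores are therefore 1.5 for "dog leash", 0.5 for "dog", and 2 for "leash law". Under this reading, a word that only ever appears on its own gets a score of 0.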

Examples

RAKE is completely unsupervised (i.e., it requires no training data), so it's relatively easy to use. Let's take a look at a few basic examples that demonstrate slowrake()'s various parameters.

library(slowraker)

txt <- "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types of systems and systems of mixed types."
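# Use the default settings
slowrake(txt = txt)[[1]]
# Turn off stemming
slowrake(txt = txt, stem = FALSE)[[1]]
# Add "diophantine" to the smart_words stop word list
slowrake(txt = txt, stop_words = c(smart_words, "diophantine"))[[1]]
# Don't exclude any words based on their part-of-speech
slowrake(txt = txt, stop_pos = NULL)[[1]]
# Treat every word whose POS tag doesn't start with "N" (i.e., every non-noun) as a stop word
slowrake(txt = txt, stop_pos = pos_tags$tag[!grepl("^N", pos_tags$tag)])[[1]]
# Sum the frequencies for each keyword/stem pair, then sort by frequency
res <- slowrake(txt = txt)[[1]]
res2 <- aggregate(freq ~ keyword + stem, data = res, FUN = sum)
res2[order(res2$freq, decreasing = TRUE), ]
# Run RAKE on a vector of documents (the first 10 abstracts in dog_pubs)
slowrake(txt = dog_pubs$abstract[1:10])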

[^1]: Technically, the original version of RAKE allows some keywords to contain stop words, but slowrake() does not allow for this.




slowraker documentation built on May 2, 2019, 3:26 p.m.