keywords_rake | R Documentation |
RAKE is a basic algorithm which tries to identify keywords in text. Keywords are
defined as a sequence of words following one another.
The algorithm goes as follows:
Candidate keywords are extracted by looking for contiguous sequences of words which do not contain irrelevant words.
A score is calculated for each word which is part of any candidate keyword. This is done as follows:
Among the words of the candidate keywords, the algorithm counts how many times each word occurs and how many times it co-occurs with other words.
Each word gets a score which is the ratio of the word degree (how many times it co-occurs with other words) to the word frequency.
A RAKE score for the full candidate keyword is calculated by summing up the scores of each of the words which define the candidate keyword.
The resulting keywords are returned as a data.frame together with their RAKE score.
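The steps above can be sketched in base R on a toy example. This is a minimal illustration of the description only, not the package's internal code: the `tokens` data frame and all variable names are hypothetical, degree is counted as the number of other words sharing a candidate keyword (one reading of the description above), and the sketch ignores `ngram_max` and `n_min`.

```r
# Toy input (hypothetical): one row per token, in reading order.
tokens <- data.frame(
  word     = c("nice", "quiet", "room", "with", "quiet", "garden"),
  relevant = c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Step 1: candidate keywords are runs of consecutive relevant words.
runs   <- rle(tokens$relevant)
run_id <- rep(seq_along(runs$lengths), runs$lengths)
candidates <- split(tokens$word[tokens$relevant], run_id[tokens$relevant])

# Step 2: per word, count occurrences (frequency) and co-occurrences
# with the other words of the same candidate (degree, as assumed here).
words  <- unlist(candidates, use.names = FALSE)
others <- rep(lengths(candidates) - 1, lengths(candidates))
freq   <- tapply(others, words, length)
degree <- tapply(others, words, sum)
word_score <- degree / freq

# Step 3: the score of a candidate is the sum of its word scores.
rake <- sapply(candidates, function(kw) sum(word_score[kw]))
result <- data.frame(
  keyword = sapply(candidates, paste, collapse = " "),
  ngram   = lengths(candidates),
  rake    = rake,
  row.names = NULL,
  stringsAsFactors = FALSE
)
result
```

In this toy case "quiet" occurs twice with a total degree of 3, so its word score is 1.5, and "nice quiet room" scores 2 + 1.5 + 2 = 5.5 while "quiet garden" scores 1.5 + 1 = 2.5.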
keywords_rake(x, term, group, relevant = rep(TRUE, nrow(x)), ngram_max = 2, n_min = 2, sep = " ")
x |
a data.frame with one row per term as returned by |
term |
character string with a column in the data frame |
group |
a character vector with 1 or several columns from |
relevant |
a logical vector of the same length as |
ngram_max |
integer indicating the maximum number of words that there should be in each keyword |
n_min |
integer indicating how many times a keyword should at least occur in the data in order to be returned. Defaults to 2. |
sep |
character string with the separator which will be used to |
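As a hedged illustration of the `relevant` argument: the usage above defaults it to `rep(TRUE, nrow(x))`, and the examples on this page build it from part-of-speech tags. The small data frame below is hypothetical and only shows the shape of such a vector; it does not call the package.

```r
# Hypothetical annotated tokens, one row per term (not real package data).
x <- data.frame(
  doc_id = c("d1", "d1", "d1", "d2", "d2"),
  lemma  = c("nice", "room", "be", "quiet", "street"),
  xpos   = c("JJ", "NN", "VB", "JJ", "NN"),
  stringsAsFactors = FALSE
)

# One logical flag per row: TRUE for nouns and adjectives only.
relevant <- x$xpos %in% c("NN", "JJ")

# The flags line up with the rows of x, like the default rep(TRUE, nrow(x)).
stopifnot(is.logical(relevant), length(relevant) == nrow(x))

# Terms that would be allowed inside candidate keywords.
x$lemma[relevant]
```

Rows flagged `FALSE` (here the verb "be") act as breaks between candidate keywords.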
a data.frame with columns keyword, ngram, freq and rake which is ordered from low to high rake
keyword: the keyword
ngram: how many terms are in the keyword
freq: how many times did the keyword occur
rake: the ratio of the degree to the frequency as explained in the description, summed up for all words from the keyword
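Since the result is a plain data.frame, it can be re-ordered and filtered with base R regardless of the order in which rows come back. The sketch below uses a made-up data frame with the documented columns; the values are invented for illustration and are not output of the function.

```r
# Hypothetical result with the documented columns (values are made up).
keywords <- data.frame(
  keyword = c("city centre", "nice room", "metro station"),
  ngram   = c(2L, 2L, 2L),
  freq    = c(5L, 3L, 2L),
  rake    = c(1.8, 2.4, 2.0),
  stringsAsFactors = FALSE
)

# Re-order explicitly so the highest RAKE scores come first.
top <- keywords[order(keywords$rake, decreasing = TRUE), ]

# Optionally keep only multi-word keywords seen at least 3 times.
frequent <- subset(top, ngram > 1 & freq >= 3)
frequent
```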
Rose, S., Engel, D., Cramer, N. & Cowley, W. (2010). Automatic Keyword Extraction from Individual Documents. In: Text Mining: Applications and Theory, pp. 1-20. doi:10.1002/9780470689646.ch1.
library(udpipe)
data(brussels_reviews_anno)

# Dutch reviews: keywords per document, using the default ngram_max and n_min
x <- subset(brussels_reviews_anno, language == "nl")
keywords <- keywords_rake(x = x, term = "lemma", group = "doc_id",
                          relevant = x$xpos %in% c("NN", "JJ"))
head(keywords)

# French reviews: keywords per sentence, up to 10 words, joined with "-"
x <- subset(brussels_reviews_anno, language == "fr")
keywords <- keywords_rake(x = x, term = "lemma", group = c("doc_id", "sentence_id"),
                          relevant = x$xpos %in% c("NN", "JJ"),
                          ngram_max = 10, n_min = 2, sep = "-")
head(keywords)