View source: R/keyword_extract.R
keyword_extract | R Documentation |
When raw text such as abstracts or articles is available but keywords are not, it can be useful to extract
keywords first. The minimum required input is a data.frame with a document ID column and a raw text column,
and a user-defined dictionary should also be provided. The make_dict
function can be used to construct a custom
dictionary from a character vector containing the vocabulary. If no dictionary is provided,
the function returns all the n-gram tokens without filtering (not recommended).
keyword_extract(dt, id = "id", text, dict = NULL, stopword = NULL, n_max = 4, n_min = 1)
dt |
A data.frame containing at least two columns with document ID and text strings for extraction. |
id |
Quoted characters specifying the column name of document ID. Default uses "id". |
text |
Quoted characters specifying the column name of raw text for extraction. |
dict |
A data.table with two columns, namely "id" and "keyword" (set as key).
This should be exported by |
stopword |
A vector containing the stop words to be used. Default uses |
n_max |
The maximum number of words in the n-gram. This must be an integer greater than or equal to 1. Default uses 4. |
n_min |
The minimum number of words in the n-gram. This must be an integer greater than or equal to 1, and less than or equal to n_max. Default uses 1. |
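To make the roles of n_min and n_max concrete, the following base R sketch lists every n-gram of a single clause whose length lies in that range. The helper ngram_range is hypothetical and written only for illustration; it is not part of akc and is not necessarily how keyword_extract builds its tokens.

```r
# Hypothetical helper (not part of akc): list all n-grams of a clause
# whose length in words lies between n_min and n_max.
ngram_range <- function(clause, n_min = 1, n_max = 4) {
  stopifnot(n_min >= 1, n_min <= n_max)
  # Split the clause into words on runs of whitespace
  words <- strsplit(trimws(clause), "\\s+")[[1]]
  words <- words[nzchar(words)]
  out <- character(0)
  for (n in n_min:n_max) {
    # A clause shorter than n words has no n-grams of that size
    if (length(words) < n) break
    starts <- seq_len(length(words) - n + 1)
    grams <- vapply(
      starts,
      function(i) paste(words[i:(i + n - 1)], collapse = " "),
      character(1)
    )
    out <- c(out, grams)
  }
  out
}

# Unigrams and bigrams of a four-word clause
ngram_range("deep learning for text", n_min = 1, n_max = 2)
```

With n_min = 1 and n_max = 2, the call returns the four single words followed by the three adjacent word pairs.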
In the keyword extraction procedure of akc, the raw text is first split
into independent clauses (that is, split at the punctuation marks [,;!?.]).
The n-grams of the clauses are then extracted. Finally, only the phrases represented by
n-grams that appear in the dictionary created by the user (using make_dict)
are kept. The user can also specify the n of the n-grams.
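The three steps described above can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the akc implementation: the function extract_sketch is hypothetical, lowercasing the text and the dictionary is an assumption made here for matching, and the dictionary is a plain character vector rather than the data.table expected by keyword_extract.

```r
# Hypothetical illustration of the three steps; not akc's actual code.
extract_sketch <- function(text, dict_words, n_min = 1, n_max = 4) {
  # Step 1: split the text into clauses at the punctuation marks , ; ! ? .
  # (lowercasing is an assumption made for this sketch)
  clauses <- trimws(strsplit(tolower(text), "[,;!?.]")[[1]])
  clauses <- clauses[nzchar(clauses)]

  # Step 2: collect the n-grams inside each clause, so that no n-gram
  # spans a clause boundary
  grams <- unlist(lapply(clauses, function(clause) {
    words <- strsplit(clause, "\\s+")[[1]]
    unlist(lapply(n_min:n_max, function(n) {
      if (length(words) < n) return(character(0))
      vapply(
        seq_len(length(words) - n + 1),
        function(i) paste(words[i:(i + n - 1)], collapse = " "),
        character(1)
      )
    }))
  }))

  # Step 3: keep only the n-grams that appear in the dictionary
  unique(grams[grams %in% tolower(dict_words)])
}

extract_sketch(
  "We study text mining. Network analysis, however, is not covered.",
  dict_words = c("text mining", "network analysis", "machine learning")
)
```

Because n-grams are built clause by clause, a phrase such as "mining network" that straddles the sentence boundary is never produced, even if it were in the dictionary.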
This function can take some time if the sample size is large, so it is suggested to run a test with system.time first. Nonetheless, it has already been optimized with data.table code and performs well on big data.
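One way to follow the system.time suggestion is to time the extraction on a small subset of rows before running it on the full data. The sketch below uses a hypothetical helper, time_on_subset, and a toy stand-in function so that it runs without akc; the toy data and the stand-in are assumptions made purely for illustration.

```r
# Hypothetical helper: time a function on the first n rows of a data.frame.
time_on_subset <- function(dt, n, fun, ...) {
  subset_dt <- head(dt, n)
  # system.time evaluates the expression and records the elapsed time
  timing <- system.time(result <- fun(subset_dt, ...))
  list(
    rows = nrow(subset_dt),
    elapsed = unname(timing["elapsed"]),
    result = result
  )
}

# Toy stand-in data and function (assumptions for this sketch only)
toy <- data.frame(id = 1:1000, abstract = rep("text mining is fun", 1000))
res <- time_on_subset(toy, 100, function(d) nchar(d$abstract))

res$rows     # number of rows actually timed
res$elapsed  # elapsed seconds for the subset
```

With akc loaded, the same helper could presumably be called as time_on_subset(bibli_data_table, 100, keyword_extract, id = "id", text = "abstract", dict = my_dict), with my_dict built as in the example on this page, to gauge the cost before processing every row.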
A data.frame (tibble) with two columns, namely document ID and extracted keyword.
make_dict
library(akc)
library(dplyr)

bibli_data_table %>%
  keyword_clean(id = "id", keyword = "keyword") %>%
  pull(keyword) %>%
  make_dict -> my_dict

tidytext::stop_words %>%
  pull(word) %>%
  unique() -> my_stopword

bibli_data_table %>%
  keyword_extract(id = "id", text = "abstract",
                  dict = my_dict, stopword = my_stopword)