keyword_clean: Automatic keyword cleaning and transfer to tidy format

View source: R/keyword_clean.R

keyword_clean R Documentation

Description

Carry out several keyword cleaning processes automatically and return a tidy table with document ID and keywords.

Usage

keyword_clean(
  df,
  id = "id",
  keyword = "keyword",
  sep = ";",
  rmParentheses = TRUE,
  rmNumber = TRUE,
  lemmatize = FALSE,
  lemmatize_dict = NULL
)

Arguments

df

A data.frame containing at least two columns with document ID and keyword strings with separators.

id

A quoted character string specifying the name of the document ID column. Default uses "id".

keyword

A quoted character string specifying the name of the keyword column. Default uses "keyword".

sep

Separator(s) of keywords. Default uses ";".

rmParentheses

Remove the contents in the parentheses (including the parentheses) or not. Default uses TRUE.

rmNumber

Remove pure number sequences or not. Default uses TRUE.

lemmatize

Lemmatize the keywords or not. Lemmatization is performed by the 'lemmatize_strings' function in the 'textstem' package. Default uses FALSE.

lemmatize_dict

A dictionary of base terms and lemmas to use for replacement. Only used when the lemmatize parameter is TRUE. The first column should be the full word form in lower case, while the second column is the corresponding replacement lemma. Default uses NULL, which applies the default dictionary used by the lemmatize_strings function.
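As an illustration of the dictionary layout described above, the following is a minimal sketch rather than an example taken from the package. The object name my_dict, its column names, and the word pairs are hypothetical; only the two-column layout (lower-case word form first, lemma second) comes from the argument description. The call is guarded so that it only runs when 'akc' and 'textstem' are installed.

```r
# Hypothetical custom dictionary: column 1 holds the full word form in
# lower case, column 2 the replacement lemma (column names are arbitrary).
my_dict <- data.frame(
  token = c("networks", "analyses"),
  lemma = c("network", "analysis"),
  stringsAsFactors = FALSE
)

# Only attempt the call when both packages are available.
if (requireNamespace("akc", quietly = TRUE) &&
    requireNamespace("textstem", quietly = TRUE)) {
  cleaned <- akc::keyword_clean(
    akc::bibli_data_table,
    lemmatize = TRUE,          # the dictionary is ignored unless this is TRUE
    lemmatize_dict = my_dict
  )
  print(head(cleaned))
}
```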

Details

The entire cleaning process includes the following steps:
1. Split the text on the separators.
2. Remove the contents of parentheses (including the parentheses themselves).
3. Remove white space from the start and end of each string and reduce repeated white space inside a string.
4. Remove all empty strings and pure number sequences.
5. Convert all letters to lower case.
6. Lemmatize the keywords.

Some of these steps can be suppressed or activated through the parameters. Lemmatization is not used by default; it is suggested to use keyword_merge to merge the keywords afterward.
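The steps above can be approximated in base R as follows. This is a sketch for illustration only, not the package's implementation: the helper name clean_keywords_sketch and the toy data are hypothetical, the separator is assumed to be a fixed string (the real function may treat it differently), and lemmatization (step 6) is left out.

```r
# Hypothetical helper approximating steps 1-5 (no lemmatization).
clean_keywords_sketch <- function(df, id = "id", keyword = "keyword",
                                  sep = ";", rmParentheses = TRUE,
                                  rmNumber = TRUE) {
  # 1. Split each keyword string on the separator (assumed fixed string)
  parts <- strsplit(as.character(df[[keyword]]), sep, fixed = TRUE)
  out <- data.frame(
    id = rep(df[[id]], lengths(parts)),
    keyword = unlist(parts),
    stringsAsFactors = FALSE
  )
  # 2. Remove parenthesised content, parentheses included
  if (rmParentheses) out$keyword <- gsub("\\([^)]*\\)", "", out$keyword)
  # 3. Trim leading/trailing white space and collapse internal runs
  out$keyword <- gsub("\\s+", " ", trimws(out$keyword))
  # 4. Drop empty strings and, optionally, pure number sequences
  keep <- nzchar(out$keyword)
  if (rmNumber) keep <- keep & !grepl("^[0-9]+$", out$keyword)
  out <- out[keep, , drop = FALSE]
  # 5. Convert to lower case
  out$keyword <- tolower(out$keyword)
  rownames(out) <- NULL
  out
}

# Toy input (made up for this sketch)
toy <- data.frame(
  id = c(1, 2),
  keyword = c("Machine Learning (ML);  Text   Mining ;2020", "NLP;;Big Data"),
  stringsAsFactors = FALSE
)
res <- clean_keywords_sketch(toy)
print(res)
```

The result has one row per surviving keyword, which mirrors the tidy two-column shape described under Value.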

Value

A tbl with two columns, namely document ID and cleaned keywords.

See Also

keyword_merge

Examples

library(akc)

bibli_data_table

bibli_data_table %>%
  keyword_clean(id = "id", keyword = "keyword")

akc documentation built on Jan. 6, 2023, 9:09 a.m.