tm_clean: Clean subject line text prior to analysis

tm_clean R Documentation

Description

This function processes the Subject column in a Meeting Query by tokenising it with tidytext::unnest_tokens() and removing any stopwords supplied through the stopwords argument. It is a sub-function that feeds into tm_freq(), tm_cooc(), and tm_wordcloud(). By default it returns a data frame of tokens (words or ngrams), one row per token.
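The two steps described above, tokenisation followed by stopword removal, can be sketched outside the package with dplyr and tidytext. This is a minimal illustration and not the package's actual implementation: the toy data frame and the custom_stop table are hypothetical, and any further filtering that tm_clean() may apply is not reproduced here.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data standing in for a Meeting Query;
# only the Subject column matters for this sketch.
toy <- data.frame(
  Subject = c("Weekly project sync", "Project budget review"),
  stringsAsFactors = FALSE
)

# Hypothetical custom stopwords, as a single-column data frame named 'word'.
custom_stop <- data.frame(word = "weekly", stringsAsFactors = FALSE)

tokens <- toy %>%
  mutate(line = row_number()) %>%                       # ID for each subject line
  unnest_tokens(output = word, input = Subject,
                token = "words") %>%                    # one row per lower-cased token
  anti_join(custom_stop, by = "word")                   # drop the custom stopwords

tokens
```

The result keeps a line identifier next to each surviving token, which matches the two-column shape listed under Value.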

Usage

tm_clean(data, token = "words", stopwords = NULL, ...)

Arguments

data

A Meeting Query dataset in the form of a data frame.

token

A character string, either "words" or "ngrams", determining the type of tokenisation to return.

stopwords

A character vector OR a single-column data frame labelled 'word' containing custom stopwords to remove.

...

Additional parameters to pass to tidytext::unnest_tokens().

Value

A data frame with two columns:

  • line

  • word

See Also

Other Text-mining: meeting_tm_report(), pairwise_count(), subject_validate_report(), subject_validate(), tm_cooc(), tm_freq(), tm_wordcloud()

Examples

# words
tm_clean(mt_data)

# ngrams
tm_clean(mt_data, token = "ngrams")
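The stopwords and ... arguments are not covered by the examples above. The following sketch shows how they might be used, assuming the wpa package and its mt_data sample dataset are available. The stopword values are made up for illustration, and n = 2 is passed through ... on the assumption that tidytext::unnest_tokens() forwards it to the ngram tokeniser.

```r
library(wpa)

# Custom stopwords as a character vector (illustrative values only)
cleaned_vec <- tm_clean(mt_data, stopwords = c("meeting", "sync"))

# The same stopwords as a single-column data frame labelled 'word'
cleaned_df <- tm_clean(
  mt_data,
  stopwords = data.frame(word = c("meeting", "sync"))
)

# Bigrams: n is forwarded to tidytext::unnest_tokens() via ...
cleaned_bigrams <- tm_clean(mt_data, token = "ngrams", n = 2)
```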


wpa documentation built on Aug. 21, 2023, 5:11 p.m.