extract-term-ngrams: Split a text source into tokens and terms by date of...

Description

These functions transform a text source into a dataframe of individual terms and tokens, each with an occurrence date. The terms/tokens can be extracted as ngrams of a specified length. terms_by_date is a wrapper around the functions for specific types of ngrams (unigrams_by_date and bigrams_by_date).

Usage

terms_by_date(textData, textColumn, dateColumn, removeNumbers = TRUE,
  wordStemming = TRUE, customStopwords = NULL, tokenType = "unigram")

unigrams_by_date(textData, textColumn, dateColumn, removeNumbers = TRUE,
  wordStemming = TRUE, customStopwords = NULL)

bigrams_by_date(textData, textColumn, dateColumn, removeNumbers = TRUE,
  wordStemming = TRUE, customStopwords = NULL)

Arguments

textData

a dataframe containing the text to be processed

textColumn

a character string specifying the column name in textData containing the text to be processed

dateColumn

a character string specifying the column name in textData specifying a publication date for the text in textColumn

removeNumbers

a Boolean indicating whether numbers should be removed from the result; default is TRUE.

wordStemming

a Boolean indicating whether words in the text should be reduced to the word stem; default is TRUE.

customStopwords

a character vector specifying additional stopwords that should be removed from the result

tokenType

a character string specifying the length of the consecutive token sequence to extract; currently only "bigram" (two-word sequences) and "unigram" (single words) are supported, with "unigram" as the default.

Details

The text input (textColumn) is split with a word tokenizer, default stopwords (see tidytext) are removed, and the tokens are then processed and filtered according to the function's options. A term is the character sequence obtained after all NLP processing options offered by the function have been applied. The most important of these is stemming, for which the Porter stemmer from the SnowballC package is used.
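
The processing steps described above can be approximated with tidytext, dplyr and SnowballC. This is a minimal sketch under stated assumptions, not the package's actual implementation: the input data, the use of the tidytext stop_words dataset as the default stopword list, and the digit filter used to drop numbers are all illustrative choices. The output column names follow the Value description.

```r
library(dplyr)
library(tidytext)
library(SnowballC)

# Hypothetical input data; the column names `text` and `published` are made up.
docs <- data.frame(
  published = as.Date(c("2021-01-05", "2021-02-10")),
  text = c("Forests are burning in 2020", "Burning forests release carbon"),
  stringsAsFactors = FALSE
)

sketch <- docs %>%
  # split the text into lower-cased single-word tokens
  unnest_tokens(output = token, input = text, token = "words") %>%
  # drop stopwords (assumption: the stop_words dataset shipped with tidytext)
  anti_join(stop_words, by = c("token" = "word")) %>%
  # drop tokens containing digits (assumed analogue of removeNumbers = TRUE)
  filter(!grepl("[0-9]", token)) %>%
  # reduce each token to its Porter stem (analogue of wordStemming = TRUE)
  mutate(term = wordStem(token, language = "porter")) %>%
  select(occur = published, token, term)

print(sketch)
```

Each row of sketch pairs one surviving token with its stemmed term and the date of the document it came from. Skipping the mutate step and copying token into term would correspond to running without stemming.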

Value

a dataframe with three columns listing all individual term occurrences in the provided text source. Here occur is the publication date associated with an original token, which has been processed/reduced to term. If no stemming has been applied, term and token in the result are identical.
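
As an illustration of how the functions might be called, the sketch below builds a small dataframe and passes it to terms_by_date and bigrams_by_date using the argument names from the Usage section. The data, the column names body and published, and the custom stopword are hypothetical. The calls are guarded so that the snippet also runs when topicsplorrr is not installed.

```r
# Hypothetical input: the column names `body` and `published` are made up here.
news <- data.frame(
  body = c("Forests are burning across the region.",
           "New policies aim to protect burning forests."),
  published = as.Date(c("2021-06-01", "2021-06-15")),
  stringsAsFactors = FALSE
)

# Only run the package calls when topicsplorrr is available.
if (requireNamespace("topicsplorrr", quietly = TRUE)) {
  # stemmed unigrams, with one additional (hypothetical) stopword removed
  unigrams <- topicsplorrr::terms_by_date(
    textData = news, textColumn = "body", dateColumn = "published",
    customStopwords = c("region"), tokenType = "unigram"
  )

  # unstemmed bigrams via the dedicated function
  bigrams <- topicsplorrr::bigrams_by_date(
    textData = news, textColumn = "body", dateColumn = "published",
    wordStemming = FALSE
  )

  print(head(unigrams))
  print(head(bigrams))
}
```

Note that textColumn and dateColumn are passed as character strings naming columns of textData, as described under Arguments.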


sdaume/topicsplorrr documentation built on Dec. 22, 2021, 11:11 p.m.