terms_dfm: Create a document-feature-matrix from a text source


View source: R/term-extract.R

Description

terms_dfm takes a text source with text objects associated with unique document identifiers and creates a document-feature-matrix, which can be used as input for an stm topic modeller.

Usage

terms_dfm(textData, textColumn, documentIdColumn,
  removeStopwords = FALSE, removeNumbers = FALSE,
  wordStemming = FALSE, customStopwords = NULL)

Arguments

textData

a dataframe containing the text to be processed, with each row representing a distinct document

textColumn

the column name in textData containing the text to be processed

documentIdColumn

the column name in textData specifying a unique identifier for the document with the content given in textColumn

removeStopwords

a Boolean indicating whether standard stopwords (see tidytext) should be removed from the result; default is FALSE.

removeNumbers

a Boolean indicating whether numbers should be removed from the result; default is FALSE.

wordStemming

a Boolean indicating whether words in the text should be reduced to the word stem; default is FALSE. If TRUE, the Porter stemmer from the SnowballC package is applied.

customStopwords

a character vector specifying additional stopwords that should be removed from the result

Details

Text input (textColumn) is split with a word tokenizer and tokens are further processed and filtered according to the function's options. Since the result is primarily intended as input for a topic modeller, stopwords (see tidytext) are not removed by default.
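The processing described above can be approximated with tidytext, SnowballC and quanteda directly. The following is a minimal sketch, not the package's actual implementation: the data frame text_data, its columns doc_id and body, the vector custom_stopwords, and the digit-only pattern used to drop numbers are all assumptions made for illustration.

```r
library(dplyr)
library(tidytext)
library(SnowballC)

# Hypothetical input: one row per document, with an id and a text column
text_data <- data.frame(
  doc_id = c("a", "b"),
  body   = c("The cats are running in 2021.", "A cat ran and the dogs barked."),
  stringsAsFactors = FALSE
)

# Illustrative extra stopwords (plays the role of customStopwords)
custom_stopwords <- c("barked")

term_counts <- text_data %>%
  # word tokenizer; also lowercases and strips punctuation
  unnest_tokens(output = term, input = body) %>%
  # comparable to removeStopwords = TRUE (tidytext's stop_words table)
  anti_join(stop_words, by = c("term" = "word")) %>%
  # comparable to customStopwords
  filter(!term %in% custom_stopwords) %>%
  # comparable to removeNumbers = TRUE (assumed rule: digit-only tokens)
  filter(!grepl("^[0-9]+$", term)) %>%
  # comparable to wordStemming = TRUE (Porter stemmer)
  mutate(term = wordStem(term, language = "porter")) %>%
  # one row per document/term pair with its frequency
  count(doc_id, term, name = "n")

# Cast the tidy counts into a quanteda::dfm keyed by document id
sketch_dfm <- cast_dfm(term_counts, document = doc_id, term = term, value = n)
```

Each filter step in the sketch corresponds to one option of terms_dfm, so leaving a step out mimics keeping that option at its default of FALSE.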

Value

a document-feature-matrix of type quanteda::dfm (similar to a document-term-matrix), where a document is identified by the value in the documentIdColumn specified in the text source (i.e. textData), and a feature or term is a character sequence obtained after tokenization and all other NLP processing options have been applied to the text associated with a document.
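To illustrate what such a return value looks like, the sketch below builds a small quanteda::dfm by hand and inspects it. The document names and texts are invented for the example, and the dfm is created with quanteda itself rather than with terms_dfm; the conversion step shows one way a dfm can be prepared for stm, which may not be needed since stm also accepts a dfm directly.

```r
library(quanteda)

# Hypothetical documents; the vector names act as document identifiers
docs <- c(
  doc_1 = "topic models need term counts",
  doc_2 = "term counts per document form a matrix"
)

# Tokenize and build the document-feature-matrix
example_dfm <- dfm(tokens(docs))

# Rows are documents (named by identifier), columns are features
docnames(example_dfm)
featnames(example_dfm)
ndoc(example_dfm)
nfeat(example_dfm)

# Convert to the list format used by the stm package
stm_input <- convert(example_dfm, to = "stm")
names(stm_input)
```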


sdaume/topicsplorrr documentation built on Dec. 22, 2021, 11:11 p.m.