PreprocessText: Preprocess character vectors
In dimalik/EntropyEstimator: Implements LZ estimator for texts

View source: R/utils.R

Description

Provides elementary preprocessing for a character vector that has been read in, such as lowercasing and bag-of-words (bow) normalization. The bow normalization step replaces each element of the vector with a numeric value (its ID). This is useful for non-ASCII texts, or for texts containing words with boundary symbols, where regular expressions can fail.
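
As a rough illustration of the two steps described above, the following base R sketch lowercases a character vector and then maps each distinct word to an ID. The helper name `sketch_preprocess` is hypothetical, and assigning IDs in order of first appearance is an assumption; the package's actual implementation in R/utils.R may differ.

```r
# Hypothetical helper, not part of EntropyEstimator: it only illustrates the idea.
sketch_preprocess <- function(text, lower = FALSE, bow = TRUE) {
  if (lower) {
    text <- tolower(text)
  }
  if (bow) {
    # Assumption: IDs are assigned in order of first appearance.
    ids <- match(text, unique(text))
    # Returned as character, in line with the documented return type.
    text <- as.character(ids)
  }
  text
}

words <- c("The", "cat", "saw", "the", "other", "cat")
ids <- sketch_preprocess(words, lower = TRUE, bow = TRUE)
print(ids)
```

Because the mapping relies on exact matching of whole elements rather than on a regular expression, it is unaffected by characters inside a word such as apostrophes, periods, or non-ASCII letters.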

Usage

PreprocessText(text, lower = FALSE, bow = TRUE)

Arguments

`text`	A character vector containing the text, as returned by `scan`.
`lower`	Logical. Whether to lowercase all words.
`bow`	Logical. Whether to replace each word with an ID tag (useful for non-ASCII texts).
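
A minimal sketch of how the `text` argument might be produced with base R's `scan`. The temporary file and its contents are made up for illustration, and setting `quote = ""` is a suggestion (so that apostrophes in running text are not treated as quote characters), not something the package requires.

```r
# Write a small sample file so the example is self-contained.
path <- tempfile(fileext = ".txt")
writeLines("a small text with a small vocabulary", path)

# Read whitespace-separated tokens into a character vector.
tokens <- scan(path, what = character(), quote = "", quiet = TRUE)
print(tokens)

# The result could then be passed on, e.g.:
# PreprocessText(tokens, lower = TRUE, bow = TRUE)
```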

Value

A character vector.

Examples

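# A tokenized sentence, one word per element
txt <- c("This", "is", "a", "Sentence", "containing", "UPPERCASE", "lowercase", "and", "sy.mb'ols")
# Lowercase every word, then replace each word with its ID
txt.norm <- PreprocessText(txt, lower = TRUE, bow = TRUE)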

dimalik/EntropyEstimator documentation built on Sept. 3, 2024, 5:15 a.m.