PreprocessText: Preprocess character vectors

Description

Provides elementary preprocessing for a character vector that has been read in (for example with scan), such as lowercasing and bag-of-words (bow) normalization. The bow normalization step replaces each element of the vector with a numeric value (its ID). This can be useful for non-ASCII texts, or for texts containing words with boundary symbols, where regular expressions can fail.
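To make the two steps concrete, here is a minimal base R sketch of what lowercasing followed by ID substitution could look like. It is not the package's implementation: the function name sketch_preprocess is hypothetical, and it assumes that IDs are assigned in order of first appearance and returned as character strings, neither of which is stated in this page.

```r
# Hypothetical illustration only; PreprocessText itself may work differently.
sketch_preprocess <- function(text, lower = FALSE, bow = TRUE) {
  # Optionally fold every word to lowercase first.
  if (lower) text <- tolower(text)
  if (bow) {
    # Assumption: each distinct word gets an integer ID in order of first
    # appearance; the IDs are returned as strings so the result stays a
    # character vector.
    text <- as.character(match(text, unique(text)))
  }
  text
}

words <- c("The", "cat", "saw", "the", "cat")
ids <- sketch_preprocess(words, lower = TRUE, bow = TRUE)
print(ids)
```

Because identical words map to the same ID, later processing can compare IDs instead of matching the original strings with regular expressions.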

Usage

PreprocessText(text, lower = FALSE, bow = TRUE)

Arguments

text

A character vector. This contains the text as returned by scan.

lower

Boolean. Whether or not to lowercase all words.

bow

Boolean. Whether or not to substitute each word with an ID tag (useful for non-ASCII texts).
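As a sketch of how a suitable text argument might be produced, the snippet below uses base R's scan to split a string into a character vector of whitespace-separated words. The input string is made up for illustration, and quote = "" is set so that apostrophes and quotation marks inside words are kept as ordinary characters.

```r
# Illustrative input; any whitespace-separated text would do.
raw <- "the quick brown fox"

# what = character() makes scan() return a character vector of tokens;
# quiet = TRUE suppresses the "Read N items" message.
tokens <- scan(text = raw, what = character(), quote = "", quiet = TRUE)
print(tokens)

# tokens could then be passed as the text argument of PreprocessText.
```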

Value

A character vector.

See Also

tolower

Examples

txt <- c("This", "is", "a", "Sentence", "containing", "UPPERCASE", "lowercase", "and", "sy.mb'ols")
txt.norm <- PreprocessText(txt, lower = TRUE, bow = TRUE)

dimalik/EntropyEstimator documentation built on May 15, 2019, 8:44 a.m.