preprocess | R Documentation |
A minimal text preprocessing utility.
preprocess(input, erase = "[^.?!:;'[:alnum:][:space:]]", lower_case = TRUE)
input | a character vector. |
erase | a length one character vector. Regular expression matching parts of text to be erased from input. The default removes anything that is not alphanumeric ([:alnum:]), white space ([:space:]), an apostrophe, or one of the punctuation characters . ? ! : ; |
lower_case | a length one logical vector. If TRUE, converts everything to lower case. |
The expressions preprocess(x, erase = pattern, lower_case = TRUE) and preprocess(x, erase = pattern, lower_case = FALSE) are roughly equivalent to tolower(gsub(pattern, "", x)) and gsub(pattern, "", x), respectively, provided that the regular expression 'pattern' is correctly recognized by R.
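The rough equivalence stated above can be sketched with base R alone. The helper name preprocess_sketch is hypothetical and is not part of the package. It only mirrors the gsub/tolower correspondence described here, so its results may differ from those of preprocess itself.

```r
# Hypothetical helper, not part of the package: reproduces the rough
# equivalence described above using base R only.
preprocess_sketch <- function(input,
                              erase = "[^.?!:;'[:alnum:][:space:]]",
                              lower_case = TRUE) {
  stopifnot(is.character(input),
            is.character(erase), length(erase) == 1L,
            is.logical(lower_case), length(lower_case) == 1L)
  out <- gsub(erase, "", input)  # delete every match of the pattern
  if (lower_case) tolower(out) else out
}

x <- "#This Is An Example@-@!#"
lowered  <- preprocess_sketch(x)                      # "this is an example!"
original <- preprocess_sketch(x, lower_case = FALSE)  # "This Is An Example!"
```

A user-defined function of this kind is also one way to keep preprocessing fully under the user's control when reproducibility matters.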
Note. This function, like tknz_sent, is included in the library for illustrative purposes only and is not optimized for performance. Furthermore, for performance reasons, the function has separate implementations for Windows and UNIX OS types, so results obtained on the two may differ slightly. In contexts that require full reproducibility, users are encouraged to define their own custom preprocessing and tokenization functions, or to work with externally processed data.
a character vector containing the processed output.
Valerio Gherardi
preprocess("#This Is An Example@-@!#")