Description Usage Arguments Details Value Author(s) Examples
Description

Functions for common preprocessing tasks of text documents.
Usage

tokenize(x, lines = FALSE, eol = "\n")
remove_stopwords(x, words, lines = FALSE)
Arguments

x        a vector of character.
eol      the end-of-line character to use.
words    a vector of character (tokens).
lines    assume the components are lines of text.
Details

tokenize is a simple regular expression based parser that splits the
components of a vector of character into tokens while protecting infix
punctuation. If lines = TRUE, assume x was imported with readLines and
end-of-line markers need to be added back to the components.
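The splitting behaviour described above can be sketched in Python. The regular expression below is an illustration, not tau's actual pattern: it keeps infix punctuation (apostrophes, dots, @) inside tokens and emits whitespace runs and remaining punctuation as separate tokens.

```python
import re

# Illustrative only: word characters optionally joined by infix
# punctuation form one token; whitespace runs and any other single
# character are emitted as their own tokens.
TOKEN = re.compile(r"\w+(?:[.'@]\w+)*|\s+|[^\w\s]")

def tokenize(text):
    return TOKEN.findall(text)

tokens = tokenize('"It\'s almost noon," it@dot.net said.')
# Every character is preserved, so joining the tokens
# reconstructs the input unchanged.
assert "".join(tokens) == '"It\'s almost noon," it@dot.net said.'
print(tokens)
```

Because nothing is dropped, the round trip mirrors the paste(x, collapse = "") reconstruction shown in the Examples.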
remove_stopwords removes the tokens given in words from x. If
lines = FALSE, it assumes the components of both vectors contain tokens
which can be compared using match. Otherwise, it assumes the tokens in x
are delimited by word boundaries (including infix punctuation) and uses
regular expression matching.
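The two modes can be sketched in Python as follows. This is not tau's implementation; in particular, case-insensitive line matching and longest-stopword-first ordering are assumptions chosen so the sketch reproduces the output shown in the Examples.

```python
import re

def remove_stopwords(x, words, lines=False):
    if not lines:
        # Token mode: drop components that exactly equal a stopword
        # (analogous to comparing tokens with match in R).
        drop = set(words)
        return [t for t in x if t not in drop]
    # Line mode: delete stopwords delimited by word boundaries.
    # Longer stopwords are tried first so "it's" wins over its
    # prefix "it" (an assumption of this sketch).
    alts = sorted(words, key=len, reverse=True)
    pat = re.compile(r"\b(?:" + "|".join(map(re.escape, alts)) + r")\b",
                     re.IGNORECASE)
    return [pat.sub("", line) for line in x]

print(remove_stopwords(['"It\'s almost noon," it@dot.net said.'],
                       ["it", "it's"], lines=True))
```

Note that word boundaries also fire around infix punctuation, which is why the "it" inside "it@dot.net" is removed in line mode.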
Value

The same type of object as x.
Author(s)

Christian Buchta
Examples

txt <- "\"It's almost noon,\" it@dot.net said."
## split
x <- tokenize(txt)
x
## reconstruct
t <- paste(x, collapse = "")
t
if (require("tm", quietly = TRUE)) {
    words <- readLines(system.file("stopwords", "english.dat",
                                   package = "tm"))
    remove_stopwords(x, words)
    remove_stopwords(t, words, lines = TRUE)
} else
    remove_stopwords(t, words = c("it", "it's"), lines = TRUE)
[1] "\"" "It's" " " "almost" " "
[6] "noon" "," "\"" " " "it@dot.net"
[11] " " "said" "."
[1] "\"It's almost noon,\" it@dot.net said."
[1] "\" almost noon,\" @dot.net said."