util: Preprocessing of Text Documents

Description Usage Arguments Details Value Author(s) Examples

Description

Functions for common preprocessing tasks of text documents.

Usage

tokenize(x, lines = FALSE, eol = "\n")
remove_stopwords(x, words, lines = FALSE)

Arguments

x

a character vector.

eol

the end-of-line character to use.

words

a character vector of tokens.

lines

logical; assume the components of x are lines of text.

Details

tokenize is a simple regular-expression-based parser that splits the components of a character vector into tokens while protecting infix punctuation. If lines = TRUE, it is assumed that x was imported with readLines and that the end-of-line markers need to be added back to the components.
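The round-trip behaviour with lines = TRUE can be sketched as follows (assuming the tau package is installed; the input strings are illustrative):

```r
library("tau")

## Components as if imported with readLines(), i.e. without "\n".
x <- c("first line", "second line")
toks <- tokenize(x, lines = TRUE)

## With lines = TRUE the end-of-line markers are added back, so
## collapsing the tokens reconstructs the text including "\n".
paste(toks, collapse = "")
```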

remove_stopwords removes the tokens given in words from x. If lines = FALSE, it assumes the components of both vectors contain tokens which can be compared using match. Otherwise, it assumes the tokens in x are delimited by word boundaries (including infix punctuation) and uses regular expression matching.
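The two matching modes can be contrasted in a short sketch (again assuming tau is installed; the stopword set here is illustrative, not the package's):

```r
library("tau")

words <- c("the", "on")

## lines = FALSE (the default): both arguments hold tokens,
## which are compared component-wise using match().
toks <- tokenize("the cat sat on the mat")
remove_stopwords(toks, words)

## lines = TRUE: x is plain text; the stopwords are matched at
## word boundaries via regular expressions.
remove_stopwords("the cat sat on the mat", words, lines = TRUE)
```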

Value

The same type of object as x.

Author(s)

Christian Buchta

Examples

txt <- "\"It's almost noon,\" it@dot.net said."
## split
x <- tokenize(txt)
x
## reconstruct
t <- paste(x, collapse = "")
t

if (require("tm", quietly = TRUE)) {
    words <- readLines(system.file("stopwords", "english.dat",
                       package = "tm"))
    remove_stopwords(x, words)
    remove_stopwords(t, words, lines = TRUE)
} else
    remove_stopwords(t, words = c("it", "it's"), lines = TRUE)

Example output

 [1] "\""         "It's"       " "          "almost"     " "         
 [6] "noon"       ","          "\""         " "          "it@dot.net"
[11] " "          "said"       "."         
[1] "\"It's almost noon,\" it@dot.net said."
[1] "\" almost noon,\" @dot.net said."

tau documentation built on July 21, 2021, 5:07 p.m.
