Description
Conduct a series of preprocessing steps on raw documents. By default, only a minimal amount of preprocessing occurs: documents that are blank or NA are removed, and the remaining documents are tokenized by splitting on whitespace. The user can optionally filter documents, perform global substitutions using regular expressions, remove stopwords, and perform stemming.
Usage

preprocess(data, exact, partial, subs, stopwords, cutoff, verbose,
  quiet, stem, hash)
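As a sketch of how the arguments fit together, the following hypothetical call combines filtering, substitution, and stopword removal. The documents, filter strings, and substitution patterns here are invented for illustration; argument defaults are not taken from the package.

```r
# Hypothetical corpus: each element is one raw document.
docs <- c("The tree grew tall.",
          "SPAM OFFER",                         # dropped: exact match below
          "Visit us for $100 off",
          NA, "")                               # blank/NA documents dropped by default

out <- preprocess(
  data      = docs,
  exact     = c("SPAM OFFER"),                  # discard documents equal to this string
  partial   = c("unsubscribe"),                 # discard documents containing this string
  subs      = c("\\$[0-9]+", "entdollaramount"),  # odd element removed, even inserted
  stopwords = c("the", "for"),
  cutoff    = 1,                                # keep tokens appearing at least once
  stem      = FALSE
)
```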
Arguments

data: a character vector containing the raw corpus, where each element is a document.

exact: a (case-sensitive) character vector in which each element is a string, phrase, or longer snippet of text; a document is discarded from data if the entire document matches an element of exact.

partial: a (case-sensitive) character vector in which each element is a string, phrase, or longer snippet of text; a document is discarded from data if any part of the document matches an element of partial.

subs: a character vector of regular expressions where the odd-numbered elements are removed from the corpus and the subsequent even-numbered elements are inserted in their place. These substitutions are performed using the gsub function.

stopwords: a character vector of tokens that should be excluded from the vocabulary.

cutoff: the minimum number of times a token must appear in the corpus in order to be included in the vocabulary.

verbose: logical. If set to TRUE, the function retains the indices of the elements of exact and partial that caused each document to be discarded (reported via the category element of the returned value).

quiet: logical. Should a summary of the preprocessing steps be printed to the screen?

stem: logical. Should the Porter stemmer be used to stem the tokens in the vocabulary?

hash: a length-1 character vector indicating the prefix of substitution replacements that should be replaced with a '#' symbol after tokenizing. Set to "ent" by default, where "ent" stands for "entity" and is often used as a prefix to a substitution replacement for a class of terms, such as dollar amounts ("entdollaramount") and timestamps ("entdatestamp", "enttimeofday").
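The odd/even pairing used by subs can be mimicked in base R with gsub. This is only a sketch of the scheme described above; the pattern/replacement pairs are invented for illustration.

```r
# Odd-numbered elements are patterns to remove; the following even-numbered
# elements are the replacements inserted in their place.
subs <- c("\\$[0-9]+",                  "entdollaramount",  # pair 1
          "[0-9]{4}-[0-9]{2}-[0-9]{2}", "entdatestamp")     # pair 2

doc <- "Paid $250 on 2020-01-15"
for (i in seq(1, length(subs), by = 2)) {
  doc <- gsub(subs[i], subs[i + 1], doc)
}
doc
# "Paid entdollaramount on entdatestamp"
```

After tokenizing, replacements that begin with the hash prefix ("ent" by default) would have that prefix replaced by "#", e.g. "entdollaramount" becomes "#dollaramount".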
Value

Returns a list of length five. The first element, term.id, is an integer vector containing the index in the vocabulary of each token in the corpus. For example, if the 4th token in the corpus is "tree" and "tree" is the 50th element of the vocabulary, then the 4th element of term.id will be 50. The second element, doc.id, is an integer vector recording which document each token belongs to. The third element, vocab, is the vocabulary of the corpus, containing all the terms (i.e. unique tokens) in the data; it is sorted in decreasing order of term frequency by default. The fourth element, category, has length equal to the number of input documents in data. If the value of an element in this vector is 0, the corresponding document was retained; otherwise, it was discarded. A positive value indicates an exact or partial match, and if verbose == TRUE the value points to the relevant element of exact or partial. A value of -1 means the document contained no tokens in the vocabulary after removing stopwords and applying the cutoff. The fifth element, call, is a named list returning the arguments supplied to the preprocess function: exact, partial, subs, stopwords, stem, and cutoff.
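The relationship among term.id, doc.id, and vocab can be illustrated with a toy example. The element names follow the Value description above, but the values are invented; a real object would come from preprocess().

```r
# Toy return-value fragment (values invented for illustration):
vocab   <- c("tree", "tall", "grew")   # sorted by term frequency in practice
term.id <- c(1, 3, 2)                  # token i of the corpus is vocab[term.id[i]]
doc.id  <- c(1, 1, 1)                  # all three tokens belong to document 1

# Reconstruct the token stream from the indices:
vocab[term.id]
# "tree" "grew" "tall"
```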