preprocess: Preprocess raw documents according to various options


Description

Conduct a series of preprocessing steps on raw documents. By default, only a limited amount of preprocessing is performed: documents that are blank or NA are removed, and the remaining documents are tokenized by splitting on whitespace. The user can optionally filter documents, perform global substitutions using regular expressions, remove stopwords, and perform stemming.
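A minimal sketch of this default behavior, using a small made-up character vector:

# blank and NA documents are dropped; the rest are tokenized on whitespace
docs <- c("the cat sat on the mat", "", NA, "the dog chased the cat")
out <- preprocess(data = docs, cutoff = 1)   # cutoff = 1 keeps every token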

Usage

preprocess(data, exact = NULL, partial = NULL, subs = NULL,
  stopwords = NULL, cutoff = 2, verbose = FALSE, quiet = FALSE,
  stem = FALSE, hash = "ent")

Arguments

data

a character vector containing the raw corpus, where each element is a document.

exact

a case-sensitive character vector in which each element is a string, phrase, or longer snippet of text; a document is discarded from the data if the entire document matches an element of exact.

partial

a case-sensitive character vector in which each element is a string, phrase, or longer snippet of text; a document is discarded from the data if any part of the document matches an element of partial.
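For example, a sketch of using exact and partial together; the filter strings below are hypothetical:

data(APcorpus)
# discard documents that consist entirely of the first string, and discard
# any document containing the second string anywhere in its text
filtered <- preprocess(data = APcorpus,
                       exact   = c("No story transmitted."),
                       partial = c("CORRECTION:"))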

subs

a character vector of regular expressions and replacements, where each odd-numbered element is a pattern to be removed from the corpus and the subsequent even-numbered element is the replacement inserted in its place. These substitutions are performed using the gsub() function after forcing the raw text to lowercase.
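For example, a sketch with two hypothetical pattern/replacement pairs:

data(APcorpus)
# odd positions hold the patterns, even positions hold their replacements;
# each pair is applied with gsub() to the lowercased text
subs <- c("\\$[0-9,]+",                "entdollaramount",
          "[0-9]{1,2}:[0-9]{2} [ap]m", "enttimeofday")
out <- preprocess(data = APcorpus, subs = subs)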

stopwords

character vector of tokens that should be excluded from the vocabulary.

cutoff

the minimum number of times a token must appear in the corpus in order to be included in the vocabulary.

verbose

logical. If TRUE, the function retains the indices of the elements of exact and partial that were matched. For instance, if verbose = TRUE and a document exactly matches the third element of exact, then the corresponding value of category in the returned list will be 3.

quiet

logical. Should printing of a summary of the preprocessing steps to the screen be suppressed?

stem

logical. Should the Porter stemmer be used to stem the tokens in the vocabulary?

hash

a length-1 character vector indicating the prefix of substitution replacements that should be replaced with a '#' symbol after tokenizing. Set to "ent" by default, where "ent" stands for "entity" and is often used as a prefix for substitution replacements representing a class of terms, such as dollar amounts ("entdollaramount") or timestamps ("entdatestamp", "enttimeofday").
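For instance, a sketch of overriding the default prefix; the pattern and replacement below are hypothetical:

data(APcorpus)
# use a custom prefix for the substitution replacement and tell preprocess()
# to treat that prefix as the marker to be replaced with a '#' symbol
out <- preprocess(data = APcorpus,
                  subs = c("\\$[0-9,]+", "zzzdollaramount"),
                  hash = "zzz")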

Value

Returns a list of length five:

term.id

an integer vector containing the index in the vocabulary of each token in the corpus. For example, if the 4th token in the corpus is "tree" and "tree" is the 50th element of the vocabulary, then the 4th element of term.id will be 50.

doc.id

an integer vector indicating the document to which each token belongs.

vocab

the vocabulary of the corpus, containing all the terms (i.e. unique tokens) in the data, sorted in decreasing order of term frequency by default.

category

an integer vector whose length equals the number of input documents in data. A value of 0 means the corresponding document was retained; any other value means it was discarded. A positive value indicates an exact or partial match, and if verbose = TRUE the value points to the matching element of exact or partial. A value of -1 means the document contained no tokens in the vocabulary after removing stopwords and applying the cutoff.

call

a named list recording the arguments supplied to the preprocess function: exact, partial, subs, stopwords, stem, and cutoff.
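A sketch of how these components fit together, using the data objects from the Examples section below:

data(APcorpus)
data(stopwords)
input <- preprocess(data = APcorpus, stopwords = stopwords, cutoff = 5)

# recover the tokens of the corpus from term.id and the vocabulary
tokens <- input$vocab[input$term.id]

# group the tokens back into their documents
docs.tokenized <- split(tokens, input$doc.id)

# category: 0 = retained, -1 = empty after stopword removal and the cutoff,
# positive = matched an exact/partial filter (index kept when verbose = TRUE)
table(input$category)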

Examples

data(APcorpus)
data(stopwords)
input <- preprocess(data=APcorpus, exact=NULL, partial=NULL, subs=NULL,
                    stopwords=stopwords, cutoff=5, verbose=FALSE,
                    quiet=FALSE, stem=FALSE, hash="ent")
