preprocess.newdocs: Preprocess raw version of new documents based on previously...


Description

This function performs the same preprocessing steps as preprocess(), but for a set of new documents whose topic proportions we wish to estimate given the topics from a previously fit model. The key difference is that the vocabulary is not constructed from the words that occur in the new documents; instead, it is supplied as input from the previously fit model.
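For orientation, here is a minimal sketch of the intended workflow. The objects raw.train.docs and raw.new.docs, and the assumption that preprocess() returns its vocabulary in a component named vocab, are illustrative and not taken from this page.

library(LDAtools)

## Build the vocabulary from the training corpus (schematic):
train <- preprocess(data = raw.train.docs)

## Preprocess new documents against the training vocabulary, rather
## than constructing a new vocabulary from the new documents:
new <- preprocess.newdocs(data = raw.new.docs, vocab = train$vocab)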

Usage

preprocess.newdocs(data = character(), vocab = character(), exact = NULL,
  partial = NULL, subs = NULL, verbose = FALSE, quiet = FALSE)

Arguments

data

a character vector containing the raw corpus, where each element is a document.

vocab

a character vector containing the vocabulary of the previously fit topic model; the topic proportions of the new documents entered in data are estimated over this vocabulary.

exact

a (case-sensitive) character vector in which each element is a string, phrase, or longer snippet of text that results in a document being discarded from the data if the entire document matches an element of exact.

partial

a (case-sensitive) character vector in which each element is a string, phrase, or longer snippet of text that results in a document being discarded from the data if any part of the document matches an element of partial.

subs

a character vector of regular expressions where each odd-numbered element is a pattern to be removed from the corpus and the following even-numbered element is the replacement inserted in its place. These substitutions are performed using the gsub() function after forcing the raw text to lowercase.
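A hypothetical subs vector, to illustrate the odd/even pairing (these patterns and replacements are examples only, not defaults):

subs <- c("\\$[0-9,.]+",         "entdollaramount",  ## pattern 1, replacement 1
          "[0-9]{1,2}:[0-9]{2}", "enttimeofday")     ## pattern 2, replacement 2

## Applied roughly as:
##   text <- tolower(text)
##   text <- gsub(subs[1], subs[2], text)
##   text <- gsub(subs[3], subs[4], text)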

verbose

logical. If set to TRUE, the function retains the indices of the elements of exact and partial that were matched. For instance, if a document exactly matches the third element of exact, then the corresponding value of category will be 3.

quiet

logical. If set to TRUE, the summary of the preprocessing steps that is otherwise printed to the screen is suppressed.

stem

logical. Should the Porter stemmer be used to stem the tokens in the vocabulary?

hash

a length-one character vector giving the prefix of substitution replacements that should be replaced with a '#' symbol after tokenizing. Defaults to "ent" (short for "entity"), which is often used as a prefix for a class of terms, such as dollar amounts ("entdollaramount") and timestamps ("entdatestamp", "enttimeofday").
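As a rough illustration of the replacement described above (the exact internal call may differ):

gsub("^ent", "#", c("entdollaramount", "enttimeofday"))
## [1] "#dollaramount" "#timeofday"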

Value

Returns a list of length three:

term.id

an integer vector containing the index in the vocabulary of each token in the corpus. For example, if the 4th token in the corpus is "tree" and "tree" is the 50th element of the vocabulary, then the 4th element of term.id will be 50.

doc.id

an integer vector giving the document to which each token belongs.

category

an integer vector whose length equals the number of documents. If an element is 0, the corresponding document was retained; otherwise it was discarded. A positive value indicates an exact or partial match, and if verbose = TRUE the value points to the relevant element of exact or partial. A value of -1 means the document contained no tokens in the vocabulary.
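A brief sketch of inspecting the result, reusing the assumed raw.new.docs and train$vocab objects from the sketch above:

new <- preprocess.newdocs(data = raw.new.docs, vocab = train$vocab,
                          verbose = TRUE)

## Map token indices back to words, grouped by document:
split(train$vocab[new$term.id], new$doc.id)

## Retained documents have category == 0:
which(new$category == 0)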

