make_instances: Create MALLET instances from a document frame

make_instances R Documentation

Create MALLET instances from a document frame

Description

Given a data frame of document IDs and texts (one per document), such as that returned by wordcounts_texts, create a MALLET InstanceList object. This function is a simple wrapper for mallet.import.

N.B. MALLET does tokenization, stopword removal, and casefolding on these texts, but if you have used wordcounts_texts, you may already have done those tasks yourself. To ensure MALLET does no further stoplisting, pass stoplist_file=NULL (the default). To ensure MALLET does no extra tokenization, pass token.regex="\\S+" (whitespace tokenization, which is not the default). To prevent MALLET from casefolding, pass preserve.case=TRUE. Or, equivalently, use the function wordcounts_instances instead.
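The three settings described above can be combined in one call. The following is a minimal sketch, not the package's own example: the two-row docs frame is made up for illustration, and the call is guarded so that it only runs where dfrtopics (and with it rJava and MALLET) is installed. The argument names are spelled as this page spells them.

```r
# Hypothetical two-document frame with the id and text columns
# that make_instances expects (one row per document)
docs <- data.frame(
    id = c("doc1", "doc2"),
    text = c("the whale surfaced", "the Whale dived"),
    stringsAsFactors = FALSE
)

# Only attempt the call where dfrtopics is available
if (requireNamespace("dfrtopics", quietly = TRUE)) {
    insts <- dfrtopics::make_instances(
        docs,
        stoplist_file = NULL,   # no stopword removal (the default)
        token.regex = "\\S+",   # split on whitespace only
        preserve.case = TRUE    # keep "Whale" distinct from "whale"
    )
}
```

Note the doubled backslash: inside an R string literal, the regular expression \S+ has to be written "\\S+".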

Usage

make_instances(docs, stoplist_file = NULL, ...)

Arguments

docs

data frame with id and text columns

stoplist_file

name of a text file with one stopword per line. If the file exists, it is passed on to MALLET. If it does not exist, or if this is NULL (the default), no words are removed.

...

further arguments passed on to mallet.import. One that may need adjusting is token.regex, the pattern that defines what counts as a token.
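As a rough illustration of these arguments, the sketch below writes a small stoplist file and previews, in base R, how two candidate token patterns split a string before one of them is handed to MALLET. The helper preview_tokens and the sample data are hypothetical. The letters-only pattern is assumed to be the mallet.import default. Base R's PCRE engine only approximates MALLET's Java regular expressions, so treat the preview as a guide rather than a guarantee.

```r
# Hypothetical helper: show the tokens a pattern would pick out (base R only)
preview_tokens <- function(text, pattern) {
    regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
}

sample_text <- "state-of-the-art 1850s"

# Runs of letters (assumed to be the mallet.import default pattern)
letters_only <- preview_tokens(sample_text, "[\\p{L}]+")
# Runs of non-whitespace characters
whitespace <- preview_tokens(sample_text, "\\S+")

# A stoplist file has one stopword per line
stoplist_path <- tempfile(fileext = ".txt")
writeLines(c("the", "of", "and"), stoplist_path)

# Made-up document frame for illustration
docs <- data.frame(
    id = "doc1",
    text = sample_text,
    stringsAsFactors = FALSE
)

# Only attempt the call where dfrtopics is available
if (requireNamespace("dfrtopics", quietly = TRUE)) {
    insts <- dfrtopics::make_instances(
        docs,
        stoplist_file = stoplist_path,
        token.regex = "\\S+"    # passed through ... to mallet.import
    )
}
```

Here the letters-only pattern breaks the hyphenated compound into separate words and drops the digits, while whitespace tokenization keeps "state-of-the-art" and "1850s" whole.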

Details

An InstanceList is the form in which MALLET represents a corpus, and it is the object passed on to the model-training routines (see train_model). If saved to disk (see write_instances), the same corpus may be used with command-line MALLET.

If Java gives out-of-memory errors, try increasing the Java heap size to a large value, such as 4 GB, by setting options(java.parameters="-Xmx4g") before loading this package (or rJava).
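The ordering matters, so a session that needs the larger heap might begin as sketched below. The guard around library() is only there so that the snippet also runs where dfrtopics is not installed.

```r
# Set the heap size first, before rJava or dfrtopics is loaded
options(java.parameters = "-Xmx4g")

# Load the package only afterwards
if (requireNamespace("dfrtopics", quietly = TRUE)) {
    library(dfrtopics)
}
```

If rJava has already been loaded in the session, restart R before setting the option.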

Value

an rJava reference to a MALLET InstanceList

See Also

train_model, write_instances


agoldst/dfrtopics documentation built on July 15, 2022, 4:13 p.m.