make_instances | R Documentation |
Given a data frame of document IDs and texts (one per doc), such as that returned by wordcounts_texts, create a MALLET InstanceList object. This function is a simple wrapper for mallet.import. N.B. MALLET does tokenization, stopword removal, and casefolding on these texts, but if you have used wordcounts_texts, you may have already done those tasks yourself. To ensure MALLET does no further stoplisting, pass stoplist_file=NULL (the default). To ensure MALLET does no extra tokenization, pass token.regex="\\S+" (whitespace tokenization, which is not the default; the backslash must be doubled inside an R string). To prevent MALLET from casefolding, pass preserve.case=TRUE. Or, equivalently, use the function wordcounts_instances instead.
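The settings described above can be sketched as a single call. This is an illustration under stated assumptions, not a definitive recipe: the column names id and text are assumed (the argument description does not state them), the dfrtopics:: prefix assumes that is the package providing make_instances, and the sample documents are invented. The regmatches line is only a base-R preview of what whitespace tokenization would keep; MALLET does its own tokenizing.

```r
# Assumption: `docs` needs columns named `id` and `text`.
docs <- data.frame(
    id = c("doc1", "doc2"),
    text = c("The quick brown fox", "jumps over the lazy dog"),
    stringsAsFactors = FALSE
)

# Base-R preview of whitespace tokenization with the same pattern.
# Note the doubled backslash required inside an R string literal.
tokens <- regmatches(docs$text, gregexpr("\\S+", docs$text, perl = TRUE))

# Only attempt the import if the package (assumed here to be dfrtopics,
# which in turn needs rJava and a working Java) is installed.
if (requireNamespace("dfrtopics", quietly = TRUE)) {
    il <- dfrtopics::make_instances(
        docs,
        stoplist_file = NULL,   # no further stoplisting
        token.regex = "\\S+",   # no extra tokenization
        preserve.case = TRUE    # no casefolding
    )
}
```

The extra arguments token.regex and preserve.case travel through ... to the underlying import function, which is why they do not appear in the usage line.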
make_instances(docs, stoplist_file = NULL, ...)
docs |
data frame with |
stoplist_file |
name of a text file with one stopword per line, passed
on to MALLET, if it exists. If it does not, or if this is |
... |
passed on to |
The InstanceList object is the form in which MALLET understands a corpus. These are the objects passed on to the model-training routines. If saved to disk, the same corpus may be used with command-line MALLET.
If Java gives out-of-memory errors, try increasing the Java heap size to a
large value, like 4GB, by setting options(java.parameters="-Xmx4g")
before loading this package (or rJava).
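Because the heap option is only read when the Java virtual machine starts, the order of operations matters. A minimal sketch of that ordering follows; the warning check is an addition for illustration, not part of the package.

```r
# Warn if rJava is already loaded: in that case the JVM may have started
# and the new heap setting would not take effect in this session.
if ("rJava" %in% loadedNamespaces()) {
    warning("rJava is already loaded; restart R before changing the heap size")
}

# Request a 4 GB maximum heap for the JVM.
options(java.parameters = "-Xmx4g")

# Load rJava, or the package that depends on it, only after this point.
heap_setting <- getOption("java.parameters")
```

If the warning appears, restarting the R session and setting the option first is the safest route.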
an rJava reference to a MALLET InstanceList
train_model, write_instances