wordcounts_instances: Create MALLET instances from a word-counts data frame

wordcounts_instances    R Documentation

Create MALLET instances from a word-counts data frame

Description

Given a data frame representing documents as feature counts, create a MALLET InstanceList object which can then be passed on to train_model or saved to disk for later use with write_instances. This function is a small convenience wrapper for make_instances that ensures no further stopword removal, tokenization, or casefolding is done.

Usage

wordcounts_instances(
  counts,
  shuffle = FALSE,
  sep = " ",
  token_regex = "\\S+",
  preserve_case = TRUE
)

Arguments

counts

data frame with id, word, weight columns

shuffle

randomize word order before passing on to MALLET? (See wordcounts_texts.)

sep

separator to use between words

token_regex

regular expression matching a token. Ordinarily, this should correspond to sep (hence the default, whitespace tokenization), since no further tokenization should be done.

preserve_case

if FALSE, all words are lowercased by MALLET
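
To make the expected shape of counts and the effect of shuffle concrete, here is a minimal base-R sketch. The data and the helper expand_counts are hypothetical illustrations, not the package's implementation: the sketch assumes each word is repeated weight times and the results are joined with sep to form one text per document.

```r
# Hypothetical word counts for two documents (columns: id, word, weight).
counts <- data.frame(
  id = c("doc1", "doc1", "doc2", "doc2"),
  word = c("whale", "sea", "whale", "ship"),
  weight = c(3, 1, 1, 2),
  stringsAsFactors = FALSE
)

# Illustrative helper (not part of dfrtopics): repeat each word `weight`
# times, optionally shuffle the result, and join with `sep`.
expand_counts <- function(counts, shuffle = FALSE, sep = " ") {
  vapply(split(counts, counts$id), function(d) {
    words <- rep(d$word, times = d$weight)
    if (shuffle) {
      words <- sample(words)
    }
    paste(words, collapse = sep)
  }, character(1))
}

texts <- expand_counts(counts)
set.seed(1)  # only so the shuffled order is reproducible
shuffled <- expand_counts(counts, shuffle = TRUE)
```

Shuffling changes only the order of the words within each document, so the per-document counts are unchanged.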

Details

If your tokens themselves contain whitespace, change the sep parameter and adjust the token_regex accordingly.
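
The following base-R sketch illustrates why sep and token_regex need to agree when tokens contain whitespace. The data, the tab separator, and the helper functions are assumptions chosen for illustration; the sketch only mimics tokenization with R's own regex engine and does not call MALLET.

```r
# Hypothetical counts in which one token ("new york") contains a space.
counts <- data.frame(
  id = c("doc1", "doc1", "doc2"),
  word = c("new york", "subway", "new york"),
  weight = c(2, 1, 1),
  stringsAsFactors = FALSE
)

# Illustrative helper: repeat each word `weight` times and join with `sep`.
expand_doc <- function(words, weights, sep) {
  paste(rep(words, times = weights), collapse = sep)
}

sep <- "\t"  # assumed separator that never occurs inside a token
texts <- vapply(
  split(counts, counts$id),
  function(d) expand_doc(d$word, d$weight, sep),
  character(1)
)

# Illustrative helper: extract every match of `token_regex` from `text`.
tokenize <- function(text, token_regex) {
  regmatches(text, gregexpr(token_regex, text, perl = TRUE))[[1]]
}

# A pattern matching runs of non-tab characters keeps "new york" whole.
tokens_tab <- tokenize(texts[["doc1"]], "[^\t]+")
# Whitespace tokenization (the default pattern) would split it in two.
tokens_ws <- tokenize(texts[["doc1"]], "\\S+")
```

Under these assumptions, the corresponding call would be wordcounts_instances(counts, sep = "\t", token_regex = "[^\t]+"), which requires a working rJava and MALLET setup.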

Value

an rJava reference to a MALLET InstanceList

See Also

make_instances, which this wraps; train_model; write_instances


agoldst/dfrtopics documentation built on July 15, 2022, 4:13 p.m.