wordcounts_texts: Convert long-format word-counts into documents

wordcounts_texts    R Documentation

Convert long-format word-counts into documents

Description

This naively "inflates" word counts into a bag of words, for sending to MALLET.

Usage

wordcounts_texts(counts, shuffle = FALSE, sep = " ")

Arguments

counts

long-format data frame like that returned by read_wordcounts

shuffle

if TRUE, randomize the word order within each document before pasting the words together. FALSE by default.

sep

word separator in inflated bags. A space, by default.

Details

You can pass the result of read_wordcounts directly to this function, but normally you'll want to filter or otherwise manipulate the words first.

It is not straightforward to supply feature vectors directly to MALLET; MALLET really wants to featurize each text itself. So our task is to take the word counts supplied by DfR and reassemble the texts. If DfR tells us that word w occurs N times, we simply paste N copies of w together, separated by spaces (or by the value of sep, if given). LDA should not care about word order, but if you are nervous about how the decidedly non-natural word order this produces might affect the modeling process, you can randomize the order (it still won't be natural). Thanks to David Mimno for suggesting this via his own mallet code.

This wastes a lot of memory, but it is the simple way to get DfR files into MALLET.
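
The inflation step described above can be sketched in base R. This is only an illustration, not the package's actual implementation: the helper name inflate_counts is hypothetical, and the sketch assumes the input has columns named id, word, and weight (with weight holding the number of occurrences), which may not match your data exactly.

```r
# Hypothetical helper for illustration; not the dfrtopics implementation.
# Assumes `counts` has columns id, word, weight (weight = occurrence count).
inflate_counts <- function(counts, shuffle = FALSE, sep = " ") {
    # Group row indices by document id (split() orders groups by sorted id)
    rows <- split(seq_len(nrow(counts)), counts$id)
    texts <- vapply(rows, function(i) {
        # Repeat each word as many times as it occurs in the document
        words <- rep(as.character(counts$word[i]), times = counts$weight[i])
        if (shuffle) {
            words <- sample(words)  # random permutation of the word tokens
        }
        paste(words, collapse = sep)
    }, character(1))
    data.frame(id = names(texts), text = unname(texts),
               stringsAsFactors = FALSE)
}

# Toy input standing in for real DfR word counts
counts <- data.frame(
    id = c("doc1", "doc1", "doc2"),
    word = c("whale", "sea", "ship"),
    weight = c(2, 1, 3),
    stringsAsFactors = FALSE
)
result <- inflate_counts(counts)
# result$text is c("whale whale sea", "ship ship ship")
```

Each output string holds one token per counted occurrence, which is where the memory cost comes from: a document with a million word tokens becomes a string of a million words, however few distinct word types it contains.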

Value

a data frame with two columns: id, the document id; text, the full document text as a single line (with the words in meaningless order)

See Also

read_wordcounts


agoldst/dfrtopics documentation built on July 15, 2022, 4:13 p.m.