wordcounts_texts | R Documentation |
This naively "inflates" word counts into a bag of words, for sending to MALLET.
wordcounts_texts(counts, shuffle = FALSE, sep = " ")
counts | long-format data frame like that returned by read_wordcounts |
shuffle | if TRUE, randomize the order of words in each inflated text |
sep |
word separator in inflated bags. A space, by default. |
You can pass the result from read_wordcounts directly to this function, but normally you'll want to filter or otherwise manipulate the words first.
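As a rough illustration of such filtering, the base R sketch below drops stopwords and rare words from a toy long-format data frame. The column names id, word, and weight, the stopword list, and the threshold are all assumptions made for the example; check them against your own data.

```r
# Toy long-format word counts (column names id, word, weight are assumed).
counts <- data.frame(
    id = c("doc1", "doc1", "doc2", "doc2"),
    word = c("the", "whale", "the", "ship"),
    weight = c(10, 2, 7, 3),
    stringsAsFactors = FALSE
)

# Hypothetical stopword list and minimum count, chosen only for illustration.
stopwords <- c("the", "of", "and")
min_count <- 2

# Keep rows whose word is not a stopword and whose count meets the threshold.
keep <- !(counts$word %in% stopwords) & counts$weight >= min_count
filtered <- counts[keep, ]

# The filtered frame, rather than the raw one, would then be inflated:
# texts <- wordcounts_texts(filtered)
```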
It is not straightforward to supply feature vectors directly to MALLET; MALLET really wants to featurize each text itself. So our task is to take the word counts supplied by DfR and reassemble the texts. If DfR tells us word w occurs N times, we simply paste N copies of w together, separated by spaces (or the value of sep if given). Though LDA should not care about word order, if you are nervous about how the decidedly non-natural word order this produces might affect the modeling process, you can randomize the word order by setting shuffle = TRUE (it still won't be natural). Thanks to David Mimno for suggesting this via his own mallet code.
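The inflation step described above can be sketched in base R as follows. This is a minimal illustration, not the package's actual implementation: inflate_counts is a hypothetical helper, and the input column names id, word, and weight are assumptions.

```r
# Toy long-format word counts (column names id, word, weight are assumed).
counts <- data.frame(
    id = c("doc1", "doc1", "doc2"),
    word = c("whale", "sea", "ship"),
    weight = c(2, 1, 3),
    stringsAsFactors = FALSE
)

# Hypothetical helper: paste N copies of each word into one string per document.
inflate_counts <- function(counts, shuffle = FALSE, sep = " ") {
    # Group the row indices by document id.
    rows <- split(seq_len(nrow(counts)), counts$id)
    texts <- vapply(rows, function(i) {
        # Repeat each word as many times as its count.
        words <- rep(counts$word[i], times = counts$weight[i])
        if (shuffle) {
            # Randomize the order of the repeated words.
            words <- sample(words)
        }
        paste(words, collapse = sep)
    }, character(1))
    data.frame(id = names(texts), text = unname(texts),
               stringsAsFactors = FALSE)
}

texts <- inflate_counts(counts)
shuffled <- inflate_counts(counts, shuffle = TRUE)
```

With shuffle = TRUE each text contains the same words in a random order, so the per-document counts are unchanged. Because every occurrence of every word is written out in full, memory use grows with the total token count rather than the vocabulary size.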
A big waste of memory, but this is the simple way to get DfR files into MALLET.
a data frame with two columns: id, the document id; text, the full document text as a single line (with the words in meaningless order)
read_wordcounts