simplify_state | R Documentation |
This function reads in the Gibbs sampling state output by MALLET (a gzipped
text file) and writes a CSV file giving the number of times each word type
in each document is assigned to each topic. Because the MALLET state file
is often too big to handle in memory all at once, the "simplification" is
done by reading and writing in chunks. This will not be as fast as it should
be (arRgh!); on a fast personal computer, performing this operation on a
model of a 60 million-word corpus takes six or seven minutes. This function
is not meant to be called directly; the main interfaces to the Gibbs
sampling state output from MALLET are load_sampling_state and
load_from_mallet_state (which call this function when needed).
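The chunked approach described above can be sketched in base R. This is a minimal illustration, not the package's actual implementation: the helper tally_state_chunks and the toy state file are hypothetical, and the sketch assumes the usual MALLET state layout of three header lines followed by one space-separated line per token (doc source pos typeindex type topic).

```r
# Build a tiny stand-in for a gzipped MALLET state file (made-up data).
state_lines <- c(
  "#doc source pos typeindex type topic",
  "#alpha : 0.5 0.5",
  "#beta : 0.01",
  "0 NA 0 0 apple 1",
  "0 NA 1 1 pear 0",
  "0 NA 2 0 apple 1",
  "1 NA 0 1 pear 0",
  "1 NA 1 1 pear 1"
)
state_path <- tempfile(fileext = ".gz")
gz <- gzfile(state_path, "w")
writeLines(state_lines, gz)
close(gz)

# Hypothetical helper: tally (document, word, topic) counts chunk by chunk,
# so that only chunk_size lines of raw state are held in memory at once.
tally_state_chunks <- function(path, chunk_size = 2L) {
  con <- gzfile(path, "r")
  on.exit(close(con))
  readLines(con, n = 3L)  # skip the three header lines (assumed layout)
  totals <- NULL
  repeat {
    chunk <- readLines(con, n = chunk_size)
    if (length(chunk) == 0L) break
    parts <- do.call(rbind, strsplit(chunk, " ", fixed = TRUE))
    counts <- data.frame(
      document = as.integer(parts[, 1]),
      word = as.integer(parts[, 4]),   # typeindex column
      topic = as.integer(parts[, 6]),
      count = 1L
    )
    # Merge this chunk's tokens into the running totals.
    totals <- aggregate(count ~ document + word + topic,
                        data = rbind(totals, counts), FUN = sum)
  }
  totals <- totals[order(totals$document, totals$word, totals$topic), ]
  rownames(totals) <- NULL
  totals
}

simplified <- tally_state_chunks(state_path)
print(simplified)
```

The toy file deliberately places the two identical (document, word, topic) tokens in different chunks, so the merge step across chunks is exercised.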
simplify_state(state_file, outfile, chunk_size = getOption("dfrtopics.state_chunk_size"))
state_file |
the MALLET state file. Supply either a file name or a connection |
outfile |
the name of the output file (will be clobbered) |
chunk_size |
number of lines to read at a time (sometimes multiple
chunks are written at once). The total number of lines to read is the total
number of tokens (plus three). A count of chunks read is displayed unless
the package option |
The resulting file has a header document,word,topic,count describing
its columns. Note that this file uses zero-based indices for topics, words,
and documents, not 1-based indices. It can be loaded with
read_sampling_state, but the recommended interface is load_sampling_state
(q.v.).
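To make the zero-based indexing concrete, the following sketch reads a file with the header described above using base R's read.csv (not read_sampling_state) and shifts the indices to R's 1-based convention. The file contents are made up for illustration, and whether a shift is needed depends on how the indices are used downstream.

```r
# Write a small file with the documented header (illustrative rows only).
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "document,word,topic,count",
  "0,0,1,2",
  "0,1,0,1",
  "1,1,0,1"
), csv_path)

# Read it back; the indices are still zero-based at this point.
ss <- read.csv(csv_path)

# Shift to 1-based indices so they can be used to subscript R vectors.
ss_one_based <- transform(ss,
  document = document + 1L,
  word = word + 1L,
  topic = topic + 1L
)
print(ss_one_based)
```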
This function formerly relied on a Python script, but in order to reduce
external dependencies it now uses R code only. However, R's gzip support is
somewhat flaky. If this function reports errors from gzcon or zlib or
similar, try manually decompressing the file and passing
state_file=file("unzipped-state.txt").
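One way to do the manual decompression from within R is to call an external gunzip, which bypasses R's own gzip handling. This is a sketch under assumptions: it requires a gunzip executable on the PATH, and the file names and contents here are placeholders rather than real MALLET output.

```r
# Create a small gzipped text file to stand in for the state file.
demo_lines <- c("#doc source pos typeindex type topic", "0 NA 0 0 apple 1")
gz_path <- tempfile(fileext = ".gz")
gz <- gzfile(gz_path, "w")
writeLines(demo_lines, gz)
close(gz)

plain_path <- tempfile(fileext = ".txt")
have_gunzip <- nzchar(Sys.which("gunzip"))

if (have_gunzip) {
  # gunzip -c writes the decompressed text to stdout, which system2
  # redirects into plain_path; the original .gz file is left in place.
  system2("gunzip", c("-c", shQuote(gz_path)), stdout = plain_path)

  # The plain-text file can be wrapped in a connection, as suggested above.
  state_con <- file(plain_path)
  print(readLines(state_con, n = 1L))
  close(state_con)
}
```

A connection created this way, file(plain_path), is what would be supplied as the state_file argument.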
load_sampling_state, sampling_state, read_sampling_state