simplify_state: Reduce a MALLET sampling state on disk to a simplified form

simplify_state    R Documentation

Description

This function reads in the Gibbs sampling state output by MALLET (a gzipped text file) and writes a CSV file giving the number of times each word type in each document is assigned to each topic. Because the MALLET state file is often too big to handle in memory all at once, the "simplification" is done by reading and writing in chunks. This will not be as fast as it should be (arRgh!); on a fast personal computer, performing this operation on a model of a 60-million-word corpus takes six or seven minutes. This function is not meant to be called directly; the main interfaces to the Gibbs sampling state output from MALLET are load_sampling_state and load_from_mallet_state (which call this function when needed).
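The chunked tallying described above can be illustrated with a minimal base-R sketch. This is not the package's implementation: tally_state_chunked is a hypothetical helper, and it assumes the state file consists of header lines beginning with "#" followed by space-separated rows of doc, source, pos, typeindex, type, topic. Unlike simplify_state, it keeps the running tallies in memory rather than writing them out as it goes.

```r
# Hypothetical helper for illustration only; not part of dfrtopics.
# Assumes rows of: doc source pos typeindex type topic (space-separated),
# preceded by header lines that start with "#".
tally_state_chunked <- function(state_file, chunk_size = 10000L) {
  con <- gzfile(state_file, open = "rt")
  on.exit(close(con))
  totals <- NULL
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break              # end of file
    lines <- lines[!startsWith(lines, "#")]    # drop header lines
    if (length(lines) == 0) next
    chunk <- read.table(
      text = lines, sep = " ", quote = "", comment.char = "",
      col.names = c("document", "source", "pos", "word", "type", "topic"),
      colClasses = c("integer", "character", "integer",
                     "integer", "character", "integer")
    )
    chunk$count <- 1L
    # Tally (document, word, topic) triples within this chunk
    counts <- aggregate(count ~ document + word + topic, data = chunk, FUN = sum)
    totals <- rbind(totals, counts)
  }
  if (is.null(totals)) {
    return(data.frame(document = integer(), word = integer(),
                      topic = integer(), count = integer()))
  }
  # Merge tallies for triples that were split across chunk boundaries
  aggregate(count ~ document + word + topic, data = totals, FUN = sum)
}
```

The result has the same four columns as the simplified CSV, with the zero-based indices left as they appear in the state file.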

Usage

simplify_state(
  state_file,
  outfile,
  chunk_size = getOption("dfrtopics.state_chunk_size")
)

Arguments

state_file

the MALLET state file. Supply either a file name or a connection

outfile

the name of the output file (will be clobbered)

chunk_size

number of lines to read at a time (sometimes multiple chunks are written at once). The total number of lines to read is the total number of tokens (plus three). A count of chunks read is displayed unless the package option dfrtopics.verbose is FALSE. The chunk size appears to make little difference to performance.
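As a rough illustration of how these settings relate, the sketch below sets the two package options named above and estimates how many chunks would be read. The chunk size of 10000 and the token count are illustrative assumptions, not package defaults.

```r
# Illustrative values only; neither number is a documented default.
old <- options(
  dfrtopics.state_chunk_size = 10000L,  # lines read per chunk
  dfrtopics.verbose = FALSE             # suppress the chunk counter
)

n_tokens <- 60e6                        # e.g. a 60-million-word corpus
chunk_size <- getOption("dfrtopics.state_chunk_size")

# Total lines = tokens plus three header lines
n_chunks <- ceiling((n_tokens + 3) / chunk_size)

options(old)                            # restore the previous settings
```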

Details

The resulting file has a header document,word,topic,count describing its columns. Note that this file uses zero-based indices for topics, words, and documents, not 1-based indices. It can be loaded with read_sampling_state, but the recommended interface is load_sampling_state (q.v.).
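For inspecting a simplified file by hand, the zero-based indices can be shifted to R's 1-based convention after reading. The sketch below uses plain read.csv on a toy stand-in file rather than the package's own readers, which remain the recommended route; the file contents are made up for illustration.

```r
# Toy stand-in for a simplified state file (made-up contents)
simplified <- tempfile(fileext = ".csv")
writeLines(c("document,word,topic,count",
             "0,0,1,2",
             "0,1,0,1",
             "1,0,0,1"), simplified)

ss <- read.csv(simplified)

# Shift the three index columns from zero-based to 1-based;
# the count column is left unchanged
idx <- c("document", "word", "topic")
ss[idx] <- ss[idx] + 1L
```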

This function formerly relied on a Python script, but in order to reduce external dependencies it now uses R code only. However, R's gzip support is somewhat flaky. If this function reports errors from gzcon or zlib or similar, try manually decompressing the file and passing state_file=file("unzipped-state.txt").
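One way to do the manual decompression from within R is sketched below. decompress_state is a hypothetical helper, not part of dfrtopics, and it reads through R's own gzfile connection; if that also fails on a given file, an external tool such as gunzip is the fallback.

```r
# Hypothetical helper for illustration only; not part of dfrtopics.
# Copies a gzipped text file to a plain text file in chunks of lines.
decompress_state <- function(gz_path, out_path, chunk_size = 100000L) {
  src <- gzfile(gz_path, open = "rt")
  on.exit(close(src), add = TRUE)
  dst <- file(out_path, open = "wt")
  on.exit(close(dst), add = TRUE)
  repeat {
    lines <- readLines(src, n = chunk_size)
    if (length(lines) == 0) break   # end of file
    writeLines(lines, dst)
  }
  invisible(out_path)
}

# Usage (file names are placeholders; requires dfrtopics):
# decompress_state("state.gz", "unzipped-state.txt")
# simplify_state(state_file = file("unzipped-state.txt"),
#                outfile = "state-simple.csv")
```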

See Also

load_sampling_state, sampling_state, read_sampling_state


agoldst/dfrtopics documentation built on July 15, 2022, 4:13 p.m.