simplify_state: Reduce a MALLET sampling state on disk to a simplified form
In agoldst/dfrtopics: Tools for exploring topic models of text

simplify_state

R Documentation

Reduce a MALLET sampling state on disk to a simplified form

Description

This function reads in the Gibbs sampling state output by MALLET (a gzipped text file) and writes a CSV file giving the number of times each word type in each document is assigned to each topic. Because the MALLET state file is often too big to handle in memory all at once, the "simplification" is done by reading and writing in chunks. This will not be as fast as it should be (arRgh!): on a fast personal computer, performing this operation on a model of a 60-million-word corpus takes six or seven minutes. This function is not meant to be called directly; the main interfaces to the Gibbs sampling state output from MALLET are load_sampling_state and load_from_mallet_state, which call this function when needed.
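The chunked tallying described above can be illustrated with a minimal sketch in base R. This is not the package's implementation: `tally_state` is a hypothetical helper, the toy file stands in for a real MALLET state file (three header lines, then one line per token with the fields doc, source, pos, typeindex, type, topic), and the tiny chunk size is chosen only to force several reads.

```r
# Toy stand-in for a MALLET state file: three header lines, then one
# line per token (doc source pos typeindex type topic).
state_lines <- c(
  "#doc source pos typeindex type topic",
  "#alpha : 0.5 0.5",
  "#beta : 0.01",
  "0 NA 0 0 whale 1",
  "0 NA 1 1 ship 0",
  "0 NA 2 0 whale 1",
  "1 NA 0 1 ship 0",
  "1 NA 1 1 ship 1"
)
state_path <- tempfile(fileext = ".gz")
con <- gzfile(state_path, "w")
writeLines(state_lines, con)
close(con)

# Hypothetical helper (not part of dfrtopics): read the state in chunks
# and keep a running count of (document, word, topic) assignments.
tally_state <- function(path, chunk_size = 2L) {
  con <- gzfile(path, "r")
  on.exit(close(con))
  readLines(con, n = 3L)  # skip the three header lines
  counts <- NULL
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break
    fields <- do.call(rbind, strsplit(lines, " ", fixed = TRUE))
    chunk <- data.frame(
      document = as.integer(fields[, 1]),
      word = as.integer(fields[, 4]),
      topic = as.integer(fields[, 6]),
      count = 1L
    )
    # Collapse after every chunk so the running table stays small.
    counts <- aggregate(count ~ document + word + topic,
                        data = rbind(counts, chunk), FUN = sum)
  }
  counts[order(counts$document, counts$word, counts$topic), ]
}

result <- tally_state(state_path)
print(result)
```

Only the running table of counts is held in memory between reads, which is the point of working in chunks; the real function additionally writes its results out to `outfile` as it goes.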

Usage

simplify_state(
  state_file,
  outfile,
  chunk_size = getOption("dfrtopics.state_chunk_size")
)

Arguments

`state_file`	the MALLET state file. Supply either a file name or a connection.
`outfile`	the name of the output file (an existing file of that name will be clobbered).
`chunk_size`	number of lines to read at a time (sometimes multiple chunks are written at once). The total number of lines to read is the total number of tokens (plus three). A count of chunks read is displayed unless the package option `dfrtopics.verbose` is FALSE. The chunk size appears to make little difference to performance.
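As a rough illustration of how the chunk size relates to the amount of reading, the sketch below sets the two package options named above and estimates the number of chunks for a 60-million-token corpus. The value 10000 is an arbitrary choice for the example, not the package default, and the arithmetic simply applies the "tokens plus three" line count stated above.

```r
# Set the package options for this session, remembering the old values.
# 10000 is an arbitrary illustrative chunk size, not the package default.
old <- options(
  dfrtopics.state_chunk_size = 10000L,
  dfrtopics.verbose = FALSE       # suppress the chunk counter
)

n_tokens <- 6e7                   # e.g. a 60-million-word corpus
n_lines <- n_tokens + 3           # one line per token, plus three
n_chunks <- ceiling(n_lines / getOption("dfrtopics.state_chunk_size"))
print(n_chunks)

# Restore the previous option values.
options(old)
```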

Details

The resulting file has a header `document,word,topic,count` describing its columns. Note that this file uses zero-based indices for topics, words, and documents, not the 1-based indices that are usual in R. It can be loaded with read_sampling_state, but the recommended interface is load_sampling_state (q.v.).
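If the simplified file is ever inspected by hand rather than through load_sampling_state, the zero-based indices need shifting before they can index R vectors. The following sketch uses only base R and a made-up four-row file in the documented column layout; it does not reproduce what read_sampling_state does internally.

```r
# A made-up simplified state file in the documented column layout.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "document,word,topic,count",
  "0,0,1,2",
  "0,1,0,1",
  "1,1,0,1",
  "1,1,1,1"
), csv_path)

ss <- read.csv(csv_path)

# Shift the zero-based indices to R's 1-based convention.
ss_r <- transform(ss,
  document = document + 1L,
  word = word + 1L,
  topic = topic + 1L
)

# Token counts per document and topic, summed over word types.
doc_topics <- xtabs(count ~ document + topic, data = ss_r)
print(doc_topics)
```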

This function formerly relied on a Python script, but in order to reduce external dependencies it now uses R code only. However, R's gzip support is somewhat flaky. If this function reports errors from gzcon or zlib or similar, try manually decompressing the file and passing state_file=file("unzipped-state.txt").
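One way to carry out the manual decompression from within an R session is to hand the work to an external tool. The sketch below assumes a `gunzip` executable is available on the PATH (typical on Unix-like systems, not guaranteed elsewhere); the file names are temporary stand-ins, and the call to simplify_state is left commented out because it requires the package and a real state file.

```r
# Toy gzipped file standing in for a real MALLET state file.
gz_path <- tempfile(fileext = ".gz")
con <- gzfile(gz_path, "w")
writeLines(c("#doc source pos typeindex type topic",
             "#alpha : 0.5 0.5",
             "#beta : 0.01",
             "0 NA 0 0 whale 1"), con)
close(con)

# Decompress outside of R's own gzip handling.
# Assumption: `gunzip` is installed and on the PATH.
txt_path <- tempfile(fileext = ".txt")
status <- system2("gunzip", c("-c", shQuote(gz_path)), stdout = txt_path)
if (status != 0) stop("gunzip failed with status ", status)

# Pass a plain file connection instead of the gzipped file name.
state_con <- file(txt_path)
# simplify_state(state_con, "state.csv")   # requires dfrtopics

first_line <- readLines(state_con, n = 1L)
print(first_line)
close(state_con)
```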
