read_sampling_state: Read in a Gibbs sampling state

View source: R/sampling_state.R

read_sampling_state    R Documentation

Read in a Gibbs sampling state

Description

This function reads a Gibbs sampling state, represented as document,word,topic,count rows, into a big.matrix. The state gives the model's assignments of words to topics within documents. MALLET itself remembers token order, but in ordinary LDA the words are assumed exchangeable within documents, which is why the simplified state stores only counts. The recommended interface to this sampling state is load_sampling_state, which calls this function.

Usage

read_sampling_state(filename, data_type = "integer", big_workdir = tempdir())

Arguments

filename

the name of a CSV file holding the simplified state: a header row and four columns, document,word,topic,count, where the documents, words, and topics are zero-indexed. Create the file from MALLET output using simplify_state.
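
To make the expected file layout concrete, here is a minimal sketch that writes a tiny simplified state by hand and inspects it with base R. The values and the temporary file are made up for illustration; in practice the file would come from simplify_state, and it would be read with this function rather than read.csv.

```r
# A tiny, made-up simplified state: 2 documents, 3 word types, 2 topics.
# Document, word, and topic values are zero-indexed, as in the file on disk.
state_csv <- c(
    "document,word,topic,count",
    "0,0,0,2",
    "0,1,1,1",
    "1,1,1,3",
    "1,2,0,1"
)

# Hypothetical location for the example file
state_file <- tempfile(fileext = ".csv")
writeLines(state_csv, state_file)

# Inspect the raw file with base R (not the package reader)
raw_state <- read.csv(state_file)
str(raw_state)
```

Each row says how many tokens of a given word type in a given document were assigned to a given topic.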

data_type

the C++ type to store the data in. If all values have magnitude less than 2^15, you can get away with "short", but guess what? Linguistic data hates you, and a typical vocabulary can easily include more word types than that, so the default is "integer".
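
The choice between "short" and "integer" comes down to whether the largest value in any column fits in 16 signed bits. The sketch below shows that check in base R; choose_data_type is a hypothetical helper written for this example, not part of the package.

```r
# Largest magnitude a 16-bit signed "short" can hold
short_max <- 2^15 - 1

# Hypothetical helper: pick a storage type from the largest value
# expected anywhere in the state (document, word, topic, or count)
choose_data_type <- function(max_value) {
    if (max_value <= short_max) "short" else "integer"
}

choose_data_type(9999)    # a small vocabulary fits in "short"
choose_data_type(50000)   # a larger vocabulary needs "integer"
```

When in doubt, the default "integer" is the safer choice, at the cost of a larger matrix.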

big_workdir

the working directory where read.big.matrix will store its temporary files. By default, uses tempdir, but if you have more scratch space elsewhere, use that for handling large sampling states.

Details

N.B. The MALLET sampling state, and the "simplified state" written to disk by simplify_state, index documents, words, and topics from zero, but the big.matrix returned by this function indexes them from one, for convenience within R.

Value

a big.matrix with four columns, document,word,topic,count. Documents, words, and topics are one-indexed in the result, so these values may be used as indices to the vectors returned by doc_ids, vocabulary, doc_topics, etc.
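
As an illustration of why one-indexing is convenient, the sketch below uses the result's columns directly as indices. It is an assumption-laden stand-in: an ordinary integer matrix plays the role of the returned big.matrix, and the vocab and ids vectors are made-up substitutes for what vocabulary and doc_ids would return.

```r
# Stand-in for the returned big.matrix: an ordinary integer matrix with
# the same four columns, already one-indexed (all values are made up)
ss <- matrix(
    c(1L, 1L, 1L, 2L,
      1L, 2L, 2L, 1L,
      2L, 2L, 2L, 3L,
      2L, 3L, 1L, 1L),
    ncol = 4, byrow = TRUE,
    dimnames = list(NULL, c("document", "word", "topic", "count"))
)

# Hypothetical stand-ins for the vectors returned by vocabulary() and doc_ids()
vocab <- c("whale", "sea", "ship")
ids <- c("doc_a", "doc_b")

# Because the values are one-indexed, they index these vectors directly
labeled <- data.frame(
    doc = ids[ss[, "document"]],
    word = vocab[ss[, "word"]],
    topic = ss[, "topic"],
    count = ss[, "count"]
)

# Total count of each word assigned to each topic
xtabs(count ~ word + topic, data = labeled)
```

With a real sampling state the matrix may be too large to convert to a data frame in one step, so this pattern is best applied to a subset of rows.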

See Also

load_mallet_state, write_mallet_state, tdm_topic, simplify_state, and package bigmemory.


agoldst/dfrtopics documentation built on July 15, 2022, 4:13 p.m.