rmultinom_sparse: Draw from multinomial distributions

rmultinom_sparse    R Documentation

Draw from multinomial distributions

Description

According to the generative model of LDA, documents are drawn from mixtures of multinomial distributions over the vocabulary. When we simulate from the posterior, our task in practice is: for each document d, given the number of words n allocated to topic k in d, generate the result of n multinomial trials with word probabilities given by topic k. This function tries to do this efficiently given a vector of n values (one for each document) and a vector of word weights for the topic, yielding a simulated term-document matrix of within-topic word counts.

Usage

rmultinom_sparse(nn, probs)

Arguments

nn

vector of trial sizes: nn[i] gives the number of words to draw for the ith sample (that is, the ith document).

probs

vector of word weights: probs[j]/sum(probs) gives the probability of word j in a single trial. The vector need not be normalized.
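As a small illustration of how unnormalized weights turn into per-trial probabilities, the sketch below uses base R's rmultinom as a stand-in, since it also normalizes its prob argument internally. The weight values are made up for the example.

```r
# Hypothetical unnormalized word weights for a four-word vocabulary
probs <- c(4, 1, 0, 5)

# Probability of each word in a single trial: weight divided by total weight
p <- probs / sum(probs)
print(p)  # 0.4 0.1 0.0 0.5

# Base rmultinom accepts the unnormalized weights directly
set.seed(42)
draws <- rmultinom(3, size = 20, prob = probs)

# The zero-weight word (row 3) is never drawn
print(draws[3, ])

# Each of the 3 samples (columns) contains exactly 20 words
print(colSums(draws))
```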

Details

R's built-in rmultinom has two disadvantages here. First, it is set up to generate many samples, each with the same number of trials. But we require varying the number of trials to correspond to our varying numbers of words allocated to the given topic, so we would have to call rmultinom once for each document and then cbind the results (rmultinom returns each sample as a column). Second, because the vocabulary can be large and topics typically allocate most of the probability to only a few words, most elements of each sample vector will be zero. But the built-in function cannot take advantage of this sparsity and will require space for a full simulated term-document matrix. This function, by contrast, returns a sparse Matrix.
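For comparison, here is a minimal sketch of the per-document baseline with base rmultinom, binding the samples column-wise so that documents are columns. The vocabulary size, weights, and trial sizes are invented for illustration; the last line shows how much of the dense result is zero.

```r
set.seed(1)

V <- 2000                                    # hypothetical vocabulary size
probs <- c(rep(100, 10), rep(0.01, V - 10))  # a few words dominate the topic
nn <- c(30, 5, 12, 80)                       # words allocated to the topic, per document

# One rmultinom call per document; each call returns a V x 1 matrix
per_doc <- lapply(nn, function(n) rmultinom(1, size = n, prob = probs))

# Bind column-wise: terms in rows, documents in columns
dense <- do.call(cbind, per_doc)

print(dim(dense))        # 2000 4
print(colSums(dense))    # 30 5 12 80

# Fraction of cells that are zero; at most sum(nn) of the V * 4 cells can be nonzero
print(mean(dense == 0))
```

Even in this toy case the dense matrix stores 8000 cells to hold at most 127 nonzero counts, which is the waste a sparse result avoids.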

Note that the parameters are not the same as rmultinom's. The equivalent of rmultinom(n, size, prob) is rmultinom_sparse(rep(size, n), prob).
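The argument correspondence can be sketched as follows. Only the base rmultinom call is actually run here; the rmultinom_sparse call is left as a comment because it assumes dfrtopics is loaded. The correspondence is one of distribution and layout: nothing in this documentation suggests the two functions would return identical draws under the same seed.

```r
n <- 4                    # number of samples
size <- 12                # trials per sample
prob <- c(0.2, 0.3, 0.5)  # illustrative word probabilities

set.seed(3)
ref <- rmultinom(n, size, prob)  # dense 3 x 4 matrix, one sample per column

# The equivalent trial-size vector repeats size once per sample
nn <- rep(size, n)
print(nn)  # 12 12 12 12

# With dfrtopics loaded, the corresponding call would be:
# sp <- rmultinom_sparse(nn, prob)
```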

Value

sparse Matrix of sampled term-document counts, with terms in rows and documents in columns. Notice that this means individual multinomial samples are columns of the returned matrix.
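To make the returned layout concrete, the following is a rough sketch of one way to build such a matrix without ever allocating the dense version. It is not the dfrtopics implementation: rmultinom_sparse_sketch is a hypothetical helper that draws one word index per trial with sample.int and lets Matrix::sparseMatrix sum the repeated (term, document) pairs into counts. It assumes the Matrix package is installed and that at least one entry of nn is positive.

```r
library(Matrix)

# Hypothetical helper for illustration only, not the package's own code
rmultinom_sparse_sketch <- function(nn, probs) {
    V <- length(probs)
    # Draw nn[d] word indices for each document d, with replacement
    i <- unlist(lapply(nn, function(n) {
        sample.int(V, size = n, replace = TRUE, prob = probs)
    }))
    # Document index for every drawn word
    j <- rep(seq_along(nn), times = nn)
    # Duplicated (i, j) pairs are summed, turning draws into counts
    sparseMatrix(i = i, j = j, x = rep(1, length(i)),
                 dims = c(V, length(nn)))
}

set.seed(7)
probs <- c(rep(20, 3), rep(0.001, 497))  # made-up weights, 500-word vocabulary
nn <- c(5, 2, 12)                        # made-up trial sizes for 3 documents

m <- rmultinom_sparse_sketch(nn, probs)
print(dim(m))      # 500 3: terms in rows, documents in columns
print(colSums(m))  # 5 2 12: each column is one multinomial sample
```

Drawing n indices with replacement and counting them is equivalent to a single multinomial sample of size n, so each column has the intended distribution. Storage scales with the number of distinct words drawn rather than with the vocabulary size.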

See Also

imi_check and mi_check, which use this function


agoldst/dfrtopics documentation built on July 15, 2022, 4:13 p.m.