tidylda: Fit a Latent Dirichlet Allocation topic model

View source: R/tidylda-fit-methods.R

tidyldaR Documentation

Fit a Latent Dirichlet Allocation topic model

Description

Fit a Latent Dirichlet Allocation topic model using collapsed Gibbs sampling.

Usage

tidylda(
  data,
  k,
  iterations = NULL,
  burnin = -1,
  alpha = 0.1,
  eta = 0.05,
  optimize_alpha = FALSE,
  calc_likelihood = TRUE,
  calc_r2 = FALSE,
  threads = 1,
  return_data = FALSE,
  verbose = TRUE,
  ...
)

Arguments

data

A document term matrix or term co-occurrence matrix. The preferred class is a dgCMatrix-class. However there is support for any Matrix-class object as well as several other commonly-used classes such as matrix, dfm, DocumentTermMatrix, and simple_triplet_matrix

k

Integer number of topics.

iterations

Integer number of iterations for the Gibbs sampler to run.

burnin

Integer number of burnin iterations. If burnin is greater than -1, the resulting "beta" and "theta" matrices are an average over all iterations greater than burnin.

alpha

Numeric scalar or vector of length k. This is the prior for topics over documents.

eta

Numeric scalar, numeric vector of length ncol(data), or numeric matrix with k rows and ncol(data) columns. This is the prior for words over topics.

optimize_alpha

Logical. Do you want to optimize alpha every iteration? Defaults to FALSE. See 'details' below for more information.

calc_likelihood

Logical. Do you want to calculate the log likelihood every iteration? Useful for assessing convergence. Defaults to TRUE.

calc_r2

Logical. Do you want to calculate R-squared after the model is trained? Defaults to FALSE. See calc_lda_r2.

threads

Number of parallel threads, defaults to 1. See Details, below.

return_data

Logical. Do you want data returned as part of the model object?

verbose

Logical. Do you want to print a progress bar out to the console? Defaults to TRUE.

...

Additional arguments, currently unused

Details

This function calls a collapsed Gibbs sampler for Latent Dirichlet Allocation written using the excellent Rcpp package. Some implementation notes follow:

Topic-token and topic-document assignments are not initialized based on a uniform-random sampling, as is common. Instead, topic-token probabilities (i.e. beta) are initialized by sampling from a Dirichlet distribution with eta as its parameter. The same is done for topic-document probabilities (i.e. theta) using alpha. Then an internal function is called (initialize_topic_counts) to run a single Gibbs iteration to initialize assignments of tokens to topics and topics to documents.

When you use burn-in iterations (i.e. burnin = TRUE), the resulting beta and theta matrices are calculated by averaging over every iteration after the specified number of burn-in iterations. If you do not use burn-in iterations, then the matrices are calculated from the last run only. Ideally, you'd burn in every iteration before convergence, then average over the chain after its converged (and thus every observation is independent).

If you set optimize_alpha to TRUE, then each element of alpha is proportional to the number of times each topic has be sampled that iteration averaged with the value of alpha from the previous iteration. This lets you start with a symmetric alpha and drift into an asymmetric one. However, (a) this probably means that convergence will take longer to happen or convergence may not happen at all. And (b) I make no guarantees that doing this will give you any benefit or that it won't hurt your model. Caveat emptor!

The log likelihood calculation is the same that can be found on page 9 of https://arxiv.org/pdf/1510.08628.pdf. The only difference is that the version in tidylda allows eta to be a vector or matrix. (Vector used in this function, matrix used for model updates in refit.tidylda. At present, the log likelihood function appears to be ok for assessing convergence. i.e. It has the right shape. However, it is, as of this writing, returning positive numbers, rather than the expected negative numbers. Looking into that, but in the meantime caveat emptor once again.

Parallelism, is not currently implemented. The threads argument is a placeholder for planned enhancements.

Value

Returns an S3 object of class tidylda. See new_tidylda.

Examples

# load some data
data(nih_sample_dtm)

# fit a model
set.seed(12345)
m <- tidylda(
  data = nih_sample_dtm[1:20, ], k = 5,
  iterations = 200, burnin = 175
)

str(m)

# predict on held-out documents using gibbs sampling "fold in"
p1 <- predict(m, nih_sample_dtm[21:100, ],
  method = "gibbs",
  iterations = 200, burnin = 175
)

# predict on held-out documents using the dot product method
p2 <- predict(m, nih_sample_dtm[21:100, ], method = "dot")

# compare the methods
barplot(rbind(p1[1, ], p2[1, ]), beside = TRUE, col = c("red", "blue"))

tidylda documentation built on July 26, 2023, 5:34 p.m.