gibbs_sldax: Fit supervised or unsupervised topic models (SLDAX or LDA)
In ktw5691/psychtm: Text Mining Methods for Psychological Research


gibbs_sldax() is used to fit both supervised and unsupervised topic models.

gibbs_sldax(
  formula,
  data,
  m = 100,
  burn = 0,
  thin = 1,
  docs,
  V,
  K = 2L,
  model = c("lda", "slda", "sldax", "slda_logit", "sldax_logit"),
  sample_beta = TRUE,
  sample_theta = TRUE,
  interaction_xcol = -1L,
  alpha_ = 1,
  gamma_ = 1,
  mu0 = NULL,
  sigma0 = NULL,
  a0 = NULL,
  b0 = NULL,
  eta_start = NULL,
  constrain_eta = FALSE,
  proposal_sd = NULL,
  return_assignments = FALSE,
  correct_ls = TRUE,
  verbose = FALSE,
  display_progress = FALSE
)

`formula`	An object of class `formula`: a symbolic description of the model to be fitted.
`data`	An optional data frame containing the variables in the model.
`m`	The number of iterations to run the Gibbs sampler (default: `100`).
`burn`	The number of iterations to discard as the burn-in period (default: `0`).
`thin`	The thinning interval: every `thin`-th iteration after the burn-in period is kept (default: `1`, keep all).
`docs`	A D x max(N_d) matrix of word indices for all documents.
`V`	The number of unique terms in the vocabulary.
`K`	The number of topics.
`model`	A string denoting the type of model to fit (default: `"lda"`). See 'Details'.
`sample_beta`	A logical (default = `TRUE`): If `TRUE`, the topic-vocabulary distributions are sampled from their full conditional distribution.
`sample_theta`	A logical (default = `TRUE`): If `TRUE`, the topic proportions will be sampled. CAUTION: This can be memory-intensive.
`interaction_xcol`	EXPERIMENTAL: The column number of the design matrix for the additional predictors for which an interaction with the K topics is desired (default: `-1L`, no interaction). Currently only supports a single continuous predictor or a two-category categorical predictor represented as a single dummy-coded column.
`alpha_`	The hyper-parameter for the prior on the topic proportions (default: `1.0`).
`gamma_`	The hyper-parameter for the prior on the topic-specific vocabulary probabilities (default: `1.0`).
`mu0`	An optional q x 1 mean vector for the prior on the regression coefficients. See 'Details'.
`sigma0`	A q x q variance-covariance matrix for the prior on the regression coefficients. See 'Details'.
`a0`	The shape parameter for the prior on sigma2 (default: `0.001`).
`b0`	The scale parameter for the prior on sigma2 (default: `0.001`).
`eta_start`	A q x 1 vector of starting values for the regression coefficients.
`constrain_eta`	A logical (default = `FALSE`): If `TRUE`, the regression coefficients will be constrained so that they are in descending order; if `FALSE`, no constraints will be applied.
`proposal_sd`	The proposal standard deviations for drawing the regression coefficients, N(0, proposal_sd(j)), j = 1, …, q. Only used for `model = "slda_logit"` and `model = "sldax_logit"` (default: `2.38` for all coefficients).
`return_assignments`	A logical (default = `FALSE`): If `TRUE`, returns a D x max(N_d) x M array of topic assignments in slot `@topics`. CAUTION: This can be memory-intensive.
`correct_ls`	Run the Stephens (2000) label-switching correction algorithm on the posterior draws? (default = `TRUE`).
`verbose`	Should parameter draws be output during sampling? (default: `FALSE`).
`display_progress`	Show progress bar? (default: `FALSE`). Do not use with `verbose = TRUE`.
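As an illustration of the `docs` layout described above, the following base R sketch packs a ragged list of word indices into a D x max(N_d) matrix. The toy corpus, the 1-based indexing, and the use of `0` to pad short documents are assumptions made for this sketch; in practice `prep_docs()` builds this matrix for you.

```r
# Toy corpus (hypothetical): each document is a vector of vocabulary indices.
# Assumption: indices are 1-based and 0 marks unused (padding) cells.
toy_docs <- list(
  c(1L, 3L, 2L),
  c(2L, 2L),
  c(4L, 1L, 3L, 3L)
)

max_len <- max(lengths(toy_docs))  # max(N_d), the longest document

# Pad each document to max_len, then stack documents as rows.
pad_doc <- function(d) c(d, rep(0L, max_len - length(d)))
docs_mat <- t(vapply(toy_docs, pad_doc, integer(max_len)))

dim(docs_mat)  # D rows (documents) by max(N_d) columns

# Number of distinct terms actually observed in the toy corpus.
n_terms <- length(unique(docs_mat[docs_mat > 0L]))
```

A matrix of this shape would be passed as `docs`, with `V` set to the vocabulary size.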

The number of regression coefficients q in supervised topic models is determined as follows. For the SLDA model with only the K topics as predictors, q = K. For the SLDAX model with K topics and p additional predictors, there are two possibilities: (1) if no interaction between an additional covariate and the K topics is desired (default: interaction_xcol = -1L), q = p + K; (2) if an interaction between an additional covariate and the K topics is desired (e.g., interaction_xcol = 1), q = p + 2K - 1. If you supply custom values for the prior parameters mu0 or sigma0, be sure that mu0 has length q and that sigma0 has q rows and q columns. If you supply custom starting values in eta_start, be sure that eta_start also has length q.
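The counting rule for q can be checked with a small helper. This is only a sketch: `n_coef()` is a hypothetical function written for illustration and is not part of psychtm.

```r
# Hypothetical helper (not part of psychtm): number of regression
# coefficients q implied by K topics, p additional predictors, and an
# optional interaction between one predictor and the topics.
n_coef <- function(K, p = 0L, interaction_xcol = -1L) {
  if (interaction_xcol > 0L) {
    p + 2L * K - 1L  # p predictors + K topics + (K - 1) interaction terms
  } else {
    p + K            # p predictors + K topics
  }
}

n_coef(K = 2)                               # SLDA: q = K = 2
n_coef(K = 2, p = 1)                        # SLDAX, no interaction: q = 3
n_coef(K = 3, p = 2, interaction_xcol = 1)  # SLDAX with interaction: q = 7
```

The result is the length that mu0 and eta_start need, and the number of rows and columns of sigma0.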

For model, one of c("lda", "slda", "sldax", "slda_logit", "sldax_logit").

"lda": unsupervised topic model;
"slda": supervised topic model with a continuous outcome;
"sldax": supervised topic model with a continuous outcome and additional predictors of the outcome;
"slda_logit": supervised topic model with a dichotomous outcome (0/1);
"sldax_logit": supervised topic model with a dichotomous outcome (0/1) and additional predictors of the outcome.

For mu0, the first p elements correspond to coefficients for the p additional predictors (if none, p = 0), elements p + 1 to p + K correspond to coefficients for the K topics, and elements p + K + 1 to p + 2K - 1 correspond to coefficients for the interaction (if any) between one additional predictor and the K topics. By default, mu0 is a vector of q zeros.

For sigma0, the first p rows/columns correspond to coefficients for the p additional predictors (if none, p = 0), while rows/columns p + 1 to p + K correspond to coefficients for the K topics, and rows/columns p + K + 1 to p + 2K - 1 correspond to coefficients for the interaction (if any) between one additional predictor and the K topics. By default, we use an identity matrix for model = "slda" and model = "sldax" and a diagonal matrix with diagonal elements (variances) of 6.25 for model = "slda_logit" and model = "sldax_logit".
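The ordering of mu0 and sigma0 can be made concrete with a base R sketch for p = 1 additional predictor, K = 2 topics, and no interaction. The values mirror the defaults described above; the `labels` vector and the object names are illustrative only.

```r
K <- 2L  # topics
p <- 1L  # additional predictors
q <- p + K  # no interaction, so q = p + K = 3

# Prior mean: q zeros, matching the described default.
mu0 <- rep(0, q)

# Prior covariance: identity for "slda"/"sldax"; variances of 6.25 on the
# diagonal for "slda_logit"/"sldax_logit", per the description above.
sigma0 <- diag(1, q)
sigma0_logit <- diag(6.25, q)

# Illustrative labels for which coefficient each position refers to:
# predictors first, then topics.
labels <- c(paste0("x", seq_len(p)), paste0("topic", seq_len(K)))
data.frame(position = seq_len(q), coefficient = labels)
```

Custom priors built this way would be supplied through the `mu0` and `sigma0` arguments. With an interaction, K - 1 further positions follow the topic coefficients.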

An object of class Sldax.

Other Gibbs sampler: gibbs_logistic(), gibbs_mlr()

library(psychtm)
library(lda) # Required if using `prep_docs()`

data(teacher_rate)  # Synthetic student ratings of instructors
docs_vocab <- prep_docs(teacher_rate, "doc")
vocab_len <- length(docs_vocab$vocab)
m1 <- gibbs_sldax(rating ~ I(grade - 1), m = 2,
                  data = teacher_rate, docs = docs_vocab$documents,
                  V = vocab_len, K = 2, model = "sldax")