gibbs_sldax: Fit supervised or unsupervised topic models (SLDAX or LDA)


View source: R/helper-functions.R

Description

gibbs_sldax() is used to fit both supervised and unsupervised topic models.

Usage

gibbs_sldax(
  formula,
  data,
  m = 100,
  burn = 0,
  thin = 1,
  docs,
  V,
  K = 2L,
  model = c("lda", "slda", "sldax", "slda_logit", "sldax_logit"),
  sample_beta = TRUE,
  sample_theta = TRUE,
  interaction_xcol = -1L,
  alpha_ = 1,
  gamma_ = 1,
  mu0 = NULL,
  sigma0 = NULL,
  a0 = NULL,
  b0 = NULL,
  eta_start = NULL,
  constrain_eta = FALSE,
  proposal_sd = NULL,
  return_assignments = FALSE,
  correct_ls = TRUE,
  verbose = FALSE,
  display_progress = FALSE
)

Arguments

formula

An object of class formula: a symbolic description of the model to be fitted.

data

An optional data frame containing the variables in the model.

m

The number of iterations to run the Gibbs sampler (default: 100).

burn

The number of iterations to discard as the burn-in period (default: 0).

thin

The thinning period: after the burn-in period, every thin-th iteration is kept (default: 1, keep all iterations).

docs

A D x max(N_d) matrix of word indices for all documents.

V

The number of unique terms in the vocabulary.

K

The number of topics.

model

A string denoting the type of model to fit (default: "lda"). See 'Details'.

sample_beta

A logical (default = TRUE): If TRUE, the topic-vocabulary distributions are sampled from their full conditional distribution.

sample_theta

A logical (default = TRUE): If TRUE, the topic proportions will be sampled. CAUTION: This can be memory-intensive.

interaction_xcol

EXPERIMENTAL: The column number of the design matrix for the additional predictors for which an interaction with the K topics is desired (default: -1L, no interaction). Currently only supports a single continuous predictor or a two-category categorical predictor represented as a single dummy-coded column.

alpha_

The hyper-parameter for the prior on the topic proportions (default: 1.0).

gamma_

The hyper-parameter for the prior on the topic-specific vocabulary probabilities (default: 1.0).

mu0

An optional q x 1 mean vector for the prior on the regression coefficients. See 'Details'.

sigma0

A q x q variance-covariance matrix for the prior on the regression coefficients. See 'Details'.

a0

The shape parameter for the prior on sigma2 (default: 0.001).

b0

The scale parameter for the prior on sigma2 (default: 0.001).

eta_start

A q x 1 vector of starting values for the regression coefficients.

constrain_eta

A logical (default = FALSE): If TRUE, the regression coefficients will be constrained so that they are in descending order; if FALSE, no constraints will be applied.

proposal_sd

The proposal standard deviations for drawing the regression coefficients, N(0, proposal_sd(j)), j = 1, …, q. Only used for model = "slda_logit" and model = "sldax_logit" (default: 2.38 for all coefficients).

return_assignments

A logical (default = FALSE): If TRUE, returns an N x max(N_d) x M array of topic assignments in slot @topics. CAUTION: This can be memory-intensive.

correct_ls

Should the Stephens (2000) label-switching correction algorithm be run on the posterior draws? (default: TRUE).

verbose

Should parameter draws be output during sampling? (default: FALSE).

display_progress

Show progress bar? (default: FALSE). Do not use with verbose = TRUE.
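As an illustration of the D x max(N_d) layout expected by docs, the following sketch builds such a matrix by hand from a toy list of word-index vectors. It is not taken from the package: the object doc_list is hypothetical, and padding shorter documents with 0 is an assumption about the convention (in practice prep_docs() prepares this matrix, as in 'Examples').

```r
# Toy corpus: each element holds the vocabulary indices of one document.
# (Hypothetical data; real input would come from prep_docs().)
doc_list <- list(c(1L, 3L, 2L), c(2L, 2L), c(4L, 1L, 3L, 3L))

# Length of the longest document, max(N_d)
max_len <- max(lengths(doc_list))

# Pad each document to max_len and stack the rows into a D x max(N_d) matrix.
# ASSUMPTION: 0 marks "no word" in padded positions.
docs <- t(vapply(
  doc_list,
  function(w) c(w, rep(0L, max_len - length(w))),
  integer(max_len)
))

# Vocabulary size for this toy corpus, where every term appears at least once
V <- length(unique(docs[docs > 0]))

dim(docs)
```

Here docs has one row per document and one column per word position, and V counts the distinct nonzero indices.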

Details

The number of regression coefficients q in supervised topic models is determined as follows. For the SLDA model with only the K topics as predictors, q = K. For the SLDAX model with K topics and p additional predictors, there are two possibilities: (1) if no interaction between an additional covariate and the K topics is desired (default: interaction_xcol = -1L), q = p + K; (2) if an interaction between an additional covariate and the K topics is desired (e.g., interaction_xcol = 1), q = p + 2K - 1. If you supply custom values for the prior parameters mu0 or sigma0, make sure that mu0 has length q and that sigma0 has q rows and q columns. If you supply custom starting values in eta_start, make sure that eta_start also has length q.
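The rule for q above can be written as a small helper. This is only a sketch: n_coefs() is a hypothetical function, not part of the package, and it treats any positive interaction_xcol as "interaction requested".

```r
# Hypothetical helper (not exported by the package): number of regression
# coefficients q implied by K topics, p additional predictors, and an
# optional interaction between one predictor and the topics.
n_coefs <- function(K, p = 0, interaction_xcol = -1L) {
  stopifnot(K >= 1, p >= 0)
  if (interaction_xcol > 0) {
    p + 2 * K - 1  # interaction case: q = p + 2K - 1
  } else {
    p + K          # no interaction: q = p + K (SLDA when p = 0)
  }
}

n_coefs(K = 2)                               # SLDA: q = K = 2
n_coefs(K = 2, p = 1)                        # SLDAX, no interaction: q = 3
n_coefs(K = 3, p = 2, interaction_xcol = 1)  # SLDAX with interaction: q = 7
```

Checking q this way before building mu0, sigma0, or eta_start may help catch dimension mismatches early.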

For model, one of c("lda", "slda", "sldax", "slda_logit", "sldax_logit").

For mu0, the first p elements correspond to coefficients for the p additional predictors (if none, p = 0), while elements p + 1 to p + K correspond to coefficients for the K topics, and elements p + K + 1 to p + 2K - 1 correspond to coefficients for the interaction (if any) between one additional predictor and the K topics. By default, we use a vector of q 0s.

For sigma0, the first p rows/columns correspond to coefficients for the p additional predictors (if none, p = 0), while rows/columns p + 1 to p + K correspond to coefficients for the K topics, and rows/columns p + K + 1 to p + 2K - 1 correspond to coefficients for the interaction (if any) between one additional predictor and the K topics. By default, we use an identity matrix for model = "slda" and model = "sldax" and a diagonal matrix with diagonal elements (variances) of 6.25 for model = "slda_logit" and model = "sldax_logit".
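The defaults described above for mu0 and sigma0 can be reproduced and then customized as in the following sketch. The helper default_priors() is hypothetical and only mirrors the documented defaults; it is not the package's internal code.

```r
# Hypothetical helper mirroring the documented defaults:
# mu0 is a vector of q zeros; sigma0 is the identity matrix for
# "slda"/"sldax" and 6.25 * identity for the "_logit" models.
default_priors <- function(q,
                           model = c("slda", "sldax",
                                     "slda_logit", "sldax_logit")) {
  model <- match.arg(model)
  prior_var <- if (model %in% c("slda_logit", "sldax_logit")) 6.25 else 1
  list(mu0 = rep(0, q), sigma0 = diag(prior_var, nrow = q))
}

# SLDAX with p = 1 additional predictor and K = 2 topics, no interaction
p <- 1
K <- 2
q <- p + K
priors <- default_priors(q, "sldax")

# Element/row/column order: the p additional predictors first, then the K topics
coef_names <- c(paste0("x", seq_len(p)), paste0("topic", seq_len(K)))
names(priors$mu0) <- coef_names
dimnames(priors$sigma0) <- list(coef_names, coef_names)

# Example customization: a tighter prior variance on the predictor's coefficient
priors$sigma0["x1", "x1"] <- 0.5
```

The resulting priors$mu0 and priors$sigma0 could then be passed as the mu0 and sigma0 arguments.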

Value

An object of class Sldax.

See Also

Other Gibbs sampler: gibbs_logistic(), gibbs_mlr()

Examples

library(psychtm) # Provides `gibbs_sldax()`, `prep_docs()`, and `teacher_rate`
library(lda) # Required if using `prep_docs()`

data(teacher_rate)  # Synthetic student ratings of instructors
docs_vocab <- prep_docs(teacher_rate, "doc")
vocab_len <- length(docs_vocab$vocab)
m1 <- gibbs_sldax(rating ~ I(grade - 1), m = 2,
                  data = teacher_rate, docs = docs_vocab$documents,
                  V = vocab_len, K = 2, model = "sldax")

ktw5691/psychtm documentation built on Nov. 3, 2021, 9:10 a.m.