multiSTM: Analyze Stability of Local STM Mode
In bstewart/stm: Estimation of the Structural Topic Model

multiSTM

R Documentation

Description

This function performs a suite of tests aimed at assessing the global behavior of an STM model, which may have multiple modes. The function takes in a collection of differently initialized STM fitted objects and selects a reference model against which all others are benchmarked for stability. The function returns an output of S3 class 'MultimodDiagnostic', with associated plotting methods for quick inspection of the test results.

Usage

multiSTM(
  mod.out = NULL,
  ref.model = NULL,
  align.global = FALSE,
  mass.threshold = 1,
  reg.formula = NULL,
  metadata = NULL,
  reg.nsims = 100,
  reg.parameter.index = 2,
  verbose = TRUE,
  from.disk = FALSE
)

Arguments

`mod.out`	The output of a `selectModel()` run. This is a list of model outputs the user has to choose from, which all take the same form as the output from an STM model. Currently only works with models without content covariates.
`ref.model`	An integer referencing the element of the list in `mod.out` which contains the desired reference model. When set to the default value of `NULL` this chooses the model with the largest value of the approximate variational bound.
`align.global`	A boolean parameter specifying how to align the topics of two different STM fitted models. The alignment is performed by solving the linear sum assignment problem using the Hungarian algorithm. If `align.global` is set to `TRUE`, the Hungarian algorithm is run globally on the topic-word matrices of the two models being compared. The rows of the matrices are aligned so as to minimize the sum of their inner products, which matches each topic in the current runout to a unique topic in the reference model. If `align.global` is instead set to `FALSE`, the alignment problem is solved locally: each topic in the current runout is matched to the one topic in the reference model that yields the minimum inner product. Multiple topics in the current runout can therefore be matched to a single topic in the reference model, and there is no guarantee that every topic in the reference model will be matched.
`mass.threshold`	A parameter specifying the portion of the probability mass of topics to be used for model analysis. The tail of the probability mass is disregarded accordingly. If `mass.threshold` is different from 1, both the full-mass and partial-mass analyses are carried out.
`reg.formula`	A formula for estimating a regression for each model in the ensemble, where the documents are the units, the outcome is the proportion of each document about a topic in an STM model, and the covariates are the document-level metadata. The formula should have an integer or a vector of numbers on the left-hand side, and an equation with covariates on the right-hand side. If the left-hand side is left blank, the regression is performed on all topics in the model. The formula is exclusively used for building calls to `estimateEffect()`, so see the documentation for `estimateEffect()` for greater detail about the regression procedure. If `reg.formula` is null, the covariate effect stability analysis routines are not performed. The regressions incorporate uncertainty by using an approximation to the average covariance matrix formed using the global parameters.
`metadata`	A dataframe where the predictor variables in `reg.formula` can be found. It is necessary to include this argument if `reg.formula` is specified.
`reg.nsims`	The number of simulated draws from the variational posterior for each call of `estimateEffect()`. Defaults to 100.
`reg.parameter.index`	If `reg.formula` is specified, the function analyzes the stability across runs of the regression coefficient for one particular predictor variable. This argument specifies which predictor variable is to be analyzed. A value of 1 corresponds to the intercept, a value of 2 corresponds to the first predictor variable in `reg.formula`, and so on. Support for multiple concurrent covariate effect stability analyses is forthcoming.
`verbose`	If set to `TRUE`, the function will report progress.
`from.disk`	If set to `TRUE`, `multiSTM()` will load the input models from disk rather than from RAM. This option is particularly useful for dealing with large numbers of models, and is intended to be used in conjunction with the `to.disk` option of `selectModel()`. `multiSTM()` inspects the current directory for RData files.
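
The difference between the two alignment modes described for `align.global` can be sketched on a toy example. The code below is a minimal illustration in base R, not the package's implementation: `score`, `align_local()`, `align_global()`, and `permutations()` are hypothetical names, the values are invented, and the global step uses a brute-force search over permutations in place of the Hungarian algorithm, which is only practical for very small numbers of topics. It follows the minimum-inner-product criterion as stated in the argument description.

```r
# Toy inner-product matrix (invented values): rows are topics in the
# current run, columns are topics in the reference model.
score <- matrix(c(0.1, 0.5, 0.9,
                  0.2, 0.4, 0.8,
                  0.7, 0.3, 0.6),
                nrow = 3, byrow = TRUE)

# Local alignment: every row independently picks the column with the
# smallest score, so several rows may pick the same column.
align_local <- function(score) apply(score, 1, which.min)

# Helper: all permutations of a vector (hypothetical, for tiny inputs only).
permutations <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v)) {
    for (p in permutations(v[-i])) out[[length(out) + 1]] <- c(v[i], p)
  }
  out
}

# Global alignment: choose the one-to-one assignment with the smallest
# total score. Brute force stands in for the Hungarian algorithm here.
align_global <- function(score) {
  K <- nrow(score)
  perms <- permutations(seq_len(K))
  totals <- vapply(perms,
                   function(p) sum(score[cbind(seq_len(K), p)]),
                   numeric(1))
  perms[[which.min(totals)]]
}

local_match  <- align_local(score)   # may contain repeated reference topics
global_match <- align_global(score)  # each reference topic used exactly once
```

In this toy matrix the local match assigns two topics of the current run to the same reference topic and leaves one reference topic unmatched, whereas the global match is a permutation of the reference topics.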

Details

The purpose of this function is to automate and generalize the stability analysis routines for topic models that are introduced in Roberts, Margaret E., Brandon M. Stewart, and Dustin Tingley: "Navigating the Local Modes of Big Data: The Case of Topic Models" (2014). For more detailed discussion regarding the background and motivation for multimodality analysis, please refer to the original article. See also the documentation for plot.MultimodDiagnostic for help with the plotting methods associated with this function.

Value

An object of 'MultimodDiagnostic' S3 class, consisting of a list with the following components:

`N`	The number of fitted models in the list of model outputs that was supplied to the function for the purpose of stability analysis.
`K`	The number of topics in the models.
`glob.max`	The index of the reference model in the list of model outputs (`mod.out`) that was supplied to the function. The reference model is selected as the one with the maximum bound value at convergence.
`lb`	A list of the maximum bound value at convergence for each of the fitted models in the list of model outputs. The list has length N.
`lmat`	A K-by-N matrix reporting the L1-distance of each topic from the corresponding one in the reference model. This is defined as `L_{1}=\sum_{v}\lvert\beta_{k,v}^{ref}-\beta_{k,v}^{cand}\rvert`, where the beta matrices are the topic-word matrices for the reference and the candidate model.
`tmat`	A K-by-N matrix reporting the number of "top documents" shared by the reference model and the candidate model. The "top documents" for a given topic are defined as the 10 documents in the reference corpus with highest topical frequency.
`wmat`	A K-by-N matrix reporting the number of "top words" shared by the reference model and the candidate model. The "top words" for a given topic are defined as the 10 highest-frequency words.
`lmod`	A vector of length N consisting of the row sums of the `lmat` matrix.
`tmod`	A vector of length N consisting of the row sums of the `tmat` matrix.
`wmod`	A vector of length N consisting of the row sums of the `wmat` matrix.
`semcoh`	Semantic coherence values for each topic within each model in the list of model outputs.
`L1mat`	A K-by-N matrix reporting the limited-mass L1-distance of each topic from the corresponding one in the reference model. Similar to `lmat`, but computed using only the top portion of the probability mass for each topic, as specified by the `mass.threshold` parameter. `NULL` if `mass.threshold == 1`.
`L1mod`	A vector of length N consisting of the row means of the `L1mat` matrix.
`mass.threshold`	The mass threshold argument that was supplied to the function.
`cov.effects`	A list of length N containing the output of the run of `estimateEffect()` on each candidate model with the given regression formula. `NULL` if no regression formula is given.
`var.matrix`	A K-by-N matrix containing the estimated variance for each of the fitted regression parameters. `NULL` if no regression formula is given.
`confidence.ratings`	A vector of length N, where each entry specifies the proportion of regression coefficient estimates in a candidate model that fall within the .95 confidence interval for the corresponding estimate in the reference model.
`align.global`	The alignment control argument that was supplied to the function.
`reg.formula`	The regression formula that was supplied to the function.
`reg.nsims`	The `reg.nsims` argument that was supplied to the function.
`reg.parameter.index`	The `reg.parameter.index` argument that was supplied to the function.
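
The per-topic quantities stored in `lmat` and `wmat` can be illustrated with a small base R sketch. This is an assumption-laden illustration rather than the package's code: `beta_ref`, `beta_cand`, `l1_distance()`, and `shared_top_words()` are hypothetical names, the matrices hold invented probabilities, and the topics are assumed to be already aligned row by row.

```r
# Toy topic-word matrices (K = 2 topics, V = 4 words); each row sums to 1.
# Values are invented for illustration.
beta_ref  <- rbind(c(0.50, 0.30, 0.15, 0.05),
                   c(0.10, 0.05, 0.45, 0.40))
beta_cand <- rbind(c(0.45, 0.35, 0.10, 0.10),
                   c(0.05, 0.45, 0.40, 0.10))

# L1 distance per topic: sum over words of |beta_ref - beta_cand|.
l1_distance <- function(ref, cand) rowSums(abs(ref - cand))

# Number of top-n words shared by each pair of aligned topics.
shared_top_words <- function(ref, cand, n = 10) {
  vapply(seq_len(nrow(ref)), function(k) {
    top_ref  <- order(ref[k, ],  decreasing = TRUE)[seq_len(n)]
    top_cand <- order(cand[k, ], decreasing = TRUE)[seq_len(n)]
    length(intersect(top_ref, top_cand))
  }, integer(1))
}

l1      <- l1_distance(beta_ref, beta_cand)             # one value per topic
overlap <- shared_top_words(beta_ref, beta_cand, n = 2)  # n = 2 for the toy vocabulary
```

Applied to one candidate model, each of these vectors would correspond to a single column of the K-by-N matrices described above; `n = 2` is used only because the toy vocabulary has four words.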

Author(s)

Antonio Coppola (Harvard University), Brandon Stewart (Princeton University), Dustin Tingley (Harvard University)

References

Roberts, M., Stewart, B., & Tingley, D. (2016). "Navigating the Local Modes of Big Data: The Case of Topic Models." In Data Analytics in Social Science, Government, and Industry. New York: Cambridge University Press.

Examples


## Not run: 

# Example using Gadarian data
temp<-textProcessor(documents=gadarian$open.ended.response, 
                    metadata=gadarian)
meta<-temp$meta
vocab<-temp$vocab
docs<-temp$documents
out <- prepDocuments(docs, vocab, meta)
docs<-out$documents
vocab<-out$vocab
meta <-out$meta
set.seed(02138)
mod.out <- selectModel(docs, vocab, K=3, 
                       prevalence=~treatment + s(pid_rep), 
                       data=meta, runs=20)

out <- multiSTM(mod.out, mass.threshold = .75, 
                reg.formula = ~ treatment,
                metadata = gadarian)
plot(out)

# Same example as above, but loading from disk
mod.out <- selectModel(docs, vocab, K=3, 
                       prevalence=~treatment + s(pid_rep), 
                       data=meta, runs=20, to.disk=TRUE)

out <- multiSTM(from.disk=TRUE, mass.threshold = .75, 
                reg.formula = ~ treatment,
                metadata = gadarian)

## End(Not run)

bstewart/stm documentation built on Jan. 3, 2024, 6:58 p.m.