multiSTM | R Documentation |
This function performs a suite of tests aimed at assessing the global behavior of an STM model, which may have multiple modes. The function takes in a collection of differently initialized STM fitted objects and selects a reference model against which all others are benchmarked for stability. The function returns an output of S3 class 'MultimodDiagnostic', with associated plotting methods for quick inspection of the test results.
multiSTM(
  mod.out = NULL,
  ref.model = NULL,
  align.global = FALSE,
  mass.threshold = 1,
  reg.formula = NULL,
  metadata = NULL,
  reg.nsims = 100,
  reg.parameter.index = 2,
  verbose = TRUE,
  from.disk = FALSE
)
mod.out |
The output of a |
ref.model |
An integer referencing the element of the list in
|
align.global |
A boolean parameter specifying how to align the topics
of two different STM fitted models. The alignment is performed by solving
the linear sum assignment problem using the Hungarian algorithm. If
|
mass.threshold |
A parameter specifying the portion of the probability
mass of topics to be used for model analysis. The tail of the probability
mass is disregarded accordingly. If |
reg.formula |
A formula for estimating a regression for each model in
the ensemble, where the documents are the units, the outcome is the
proportion of each document about a topic in an STM model, and the
covariates are the document-level metadata. The formula should have an
integer or a vector of numbers on the left-hand side, and an equation with
covariates on the right-hand side. If the left-hand side is left blank, the
regression is performed on all topics in the model. The formula is
exclusively used for building calls to |
metadata |
A dataframe where the predictor variables in
|
reg.nsims |
The number of simulated draws from the variational
posterior for each call of |
reg.parameter.index |
If |
verbose |
If set to |
from.disk |
If set to |
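As a rough illustration of the alignment step described under align.global, the sketch below matches the topics of two small topic-word matrices by minimizing the total distance between matched topics. It is not the package's code: the cost function (L1 distance between rows) is an assumption, the matrices `beta_ref` and `beta_cand` and the helpers `permutations` and `align_topics` are hypothetical, and an exhaustive search over permutations stands in for the Hungarian algorithm. Both approaches return an optimal assignment, but exhaustive search is only practical for very small K.

```r
# Hypothetical topic-word matrices: rows are topics, columns are words.
beta_ref <- rbind(c(0.7, 0.2, 0.1),
                  c(0.1, 0.8, 0.1),
                  c(0.2, 0.2, 0.6))
# Candidate model: the same topics, stored in a different order.
beta_cand <- beta_ref[c(3, 1, 2), ]

# All permutations of the vector v, one per row (small inputs only).
permutations <- function(v) {
  if (length(v) <= 1) return(matrix(v, nrow = 1))
  do.call(rbind, lapply(seq_along(v), function(i) {
    cbind(v[i], permutations(v[-i]))
  }))
}

# Returns, for each reference topic, the index of the matched candidate topic.
align_topics <- function(beta_ref, beta_cand) {
  K <- nrow(beta_ref)
  # cost[i, j]: assumed cost, the L1 distance between reference topic i
  # and candidate topic j.
  cost <- matrix(0, K, K)
  for (i in seq_len(K)) {
    for (j in seq_len(K)) {
      cost[i, j] <- sum(abs(beta_ref[i, ] - beta_cand[j, ]))
    }
  }
  perms <- permutations(seq_len(K))
  # Total cost of matching reference topic i to candidate topic p[i].
  totals <- apply(perms, 1, function(p) sum(cost[cbind(seq_len(K), p)]))
  perms[which.min(totals), ]
}

matching <- align_topics(beta_ref, beta_cand)
# Reorder the candidate topics so that row k corresponds to reference topic k.
aligned <- beta_cand[matching, ]
```

For realistic values of K, a linear sum assignment solver such as `clue::solve_LSAP` would replace the exhaustive search.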
The purpose of this function is to automate and generalize the stability
analysis routines for topic models that are introduced in Roberts, Margaret
E., Brandon M. Stewart, and Dustin Tingley: "Navigating the Local Modes of
Big Data: The Case of Topic Models" (2014). For more detailed discussion
regarding the background and motivation for multimodality analysis, please
refer to the original article. See also the documentation for plot.MultimodDiagnostic for help with the plotting methods associated with this function.
An object of 'MultimodDiagnostic' S3 class, consisting of a list with the following components:
N |
The number of fitted models in the list of model outputs that was supplied to the function for the purpose of stability analysis. |
K |
The number of topics in the models. |
glob.max |
The index of the reference model in the list of model
outputs ( |
lb |
A list of the maximum bound value at convergence for each of the fitted models in the list of model outputs. The list has length N. |
lmat |
A K-by-N matrix reporting the L1-distance of each topic from the corresponding one in the reference model. This is defined as:
$$L_{1} = \sum_{v} \left| \beta_{k,v}^{ref} - \beta_{k,v}^{cand} \right|$$
where the beta matrices are the topic-word matrices for the reference and the candidate model. |
tmat |
A K-by-N matrix reporting the number of "top documents" shared by the reference model and the candidate model. The "top documents" for a given topic are defined as the 10 documents in the reference corpus with highest topical frequency. |
wmat |
A K-by-N matrix reporting the number of "top words" shared by the reference model and the candidate model. The "top words" for a given topic are defined as the 10 highest-frequency words. |
lmod |
A vector of length N consisting of the row sums of the
|
tmod |
A vector of length N consisting of the row
sums of the |
wmod |
A vector of length N consisting
of the row sums of the |
semcoh |
Semantic coherence values for each topic within each model in the list of model outputs. |
L1mat |
A K-by-N matrix reporting the limited-mass L1-distance of each
topic from the corresponding one in the reference model. Similar to
|
L1mod |
A vector of length N
consisting of the row means of the |
mass.threshold |
The mass threshold argument that was supplied to the function. |
cov.effects |
A list of length N containing the output of
the run of |
var.matrix |
A K-by-N matrix containing the estimated variance for each
of the fitted regression parameters. |
confidence.ratings |
A vector of length N, where each entry specifies the proportion of regression coefficient estimates in a candidate model that fall within the .95 confidence interval for the corresponding estimate in the reference model. |
align.global |
The alignment control argument that was supplied to the function. |
reg.formula |
The regression formula that was supplied to the function. |
reg.nsims |
The
|
reg.parameter.index |
The |
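To make the lmat and wmat definitions above concrete, the following base R sketch computes, for a single candidate model, the per-topic L1 distance to the reference model and the number of shared top words. The vocabulary, the two matrices, the helper names, and the choice of 2 top words (the description above uses 10) are illustrative assumptions, and the topics are assumed to be aligned already.

```r
vocab <- c("tax", "vote", "border", "job", "school")

# Hypothetical, already aligned topic-word matrices (each row sums to 1).
beta_ref <- rbind(c(0.40, 0.30, 0.10, 0.10, 0.10),
                  c(0.05, 0.05, 0.50, 0.30, 0.10))
beta_cand <- rbind(c(0.35, 0.35, 0.10, 0.10, 0.10),
                   c(0.05, 0.30, 0.45, 0.10, 0.10))

# Per-topic L1 distance: sum over words of the absolute differences.
l1_distance <- rowSums(abs(beta_ref - beta_cand))

# Indices of the n highest-probability words of each topic, as a list.
top_words <- function(beta, n) {
  lapply(seq_len(nrow(beta)), function(k) {
    order(beta[k, ], decreasing = TRUE)[seq_len(n)]
  })
}

# Number of top words shared by corresponding topics of two models.
shared_top_words <- function(beta_a, beta_b, n) {
  top_a <- top_words(beta_a, n)
  top_b <- top_words(beta_b, n)
  vapply(seq_len(nrow(beta_a)),
         function(k) length(intersect(top_a[[k]], top_b[[k]])),
         integer(1))
}

shared <- shared_top_words(beta_ref, beta_cand, n = 2)

# Top words of the second reference topic, as strings.
ref_topic2_words <- vocab[top_words(beta_ref, 2)[[2]]]
```

Repeating this for each candidate model and binding the results column-wise would give matrices with the K-by-N shape described for lmat and wmat.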
Antonio Coppola (Harvard University), Brandon Stewart (Princeton University), Dustin Tingley (Harvard University)
Roberts, M., Stewart, B., & Tingley, D. (2016). "Navigating the Local Modes of Big Data: The Case of Topic Models." In Data Analytics in Social Science, Government, and Industry. New York: Cambridge University Press.
plot.MultimodDiagnostic
selectModel
estimateEffect
## Not run:
library(stm)
# Example using the Gadarian data
temp <- textProcessor(documents = gadarian$open.ended.response,
                      metadata = gadarian)
meta <- temp$meta
vocab <- temp$vocab
docs <- temp$documents
out <- prepDocuments(docs, vocab, meta)
docs <- out$documents
vocab <- out$vocab
meta <- out$meta
# Fit a set of differently initialized models
set.seed(02138)
mod.out <- selectModel(docs, vocab, K = 3,
                       prevalence = ~ treatment + s(pid_rep),
                       data = meta, runs = 20)
# Run the stability diagnostics and plot the results
out <- multiSTM(mod.out, mass.threshold = .75,
                reg.formula = ~ treatment,
                metadata = gadarian)
plot(out)
# Same example as above, but loading from disk
mod.out <- selectModel(docs, vocab, K = 3,
                       prevalence = ~ treatment + s(pid_rep),
                       data = meta, runs = 20, to.disk = TRUE)
out <- multiSTM(from.disk = TRUE, mass.threshold = .75,
                reg.formula = ~ treatment,
                metadata = gadarian)
## End(Not run)
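The reg.formula argument accepts either a topic selection on the left-hand side or no left-hand side at all. The base R snippet below only builds such formula objects and inspects them; it does not call the package, and the helpers `has_lhs` and `selected_topics` are hypothetical names used for illustration. The covariate name `treatment` is taken from the example above.

```r
# Left-hand side blank: per the argument description, the regression
# is performed on all topics in the model.
f_all <- ~ treatment

# Left-hand side given as a single topic number or a vector of topic numbers.
f_one <- 1 ~ treatment
f_some <- c(1, 3) ~ treatment

# A one-sided formula has length 2, a two-sided formula has length 3.
has_lhs <- function(f) length(f) == 3

# Evaluate the left-hand side to recover the selected topics (NULL if blank).
selected_topics <- function(f) {
  if (has_lhs(f)) eval(f[[2]]) else NULL
}

selected_topics(f_some)  # topics 1 and 3
all.vars(f_some)         # covariate names used on the right-hand side
```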