searchK | R Documentation |
With user-specified initialization, this function runs selectModel for different user-specified topic numbers and computes diagnostic properties for the returned model. These include exclusivity, semantic coherence, heldout likelihood, bound, lbound, and residual dispersion.
searchK(
documents,
vocab,
K,
init.type = "Spectral",
N = floor(0.1 * length(documents)),
proportion = 0.5,
heldout.seed = NULL,
M = 10,
cores = 1,
...
)
documents |
The documents to be used for the stm model |
vocab |
The vocabulary to be used for the stm model |
K |
A vector of different topic numbers |
init.type |
The method of initialization. See |
N |
Number of docs to be partially held out |
proportion |
Proportion of each partially held-out document to be held out. |
heldout.seed |
If desired, a seed to use when holding out documents for later heldout likelihood computation |
M |
M value for exclusivity computation |
cores |
Number of CPUs to use for parallel computation |
... |
Other diagnostics parameters. |
See the vignette for interpretation of each of these measures. Each of these measures is also available in exported functions:
exclusivity |
exclusivity
semantic coherence |
semanticCoherence
heldout likelihood |
make.heldout and eval.heldout
bound |
calculated by stm, accessible by max(model$convergence$bound)
lbound |
a correction to the bound that makes the bounds directly comparable: max(model$convergence$bound) + lfactorial(model$settings$dim$K)
residual dispersion |
checkResiduals
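The bound and lbound expressions can be tried out without fitting a model. The following is a minimal sketch, assuming a stand-in list that mimics only the two fields those expressions read; the `model` object and its bound values are hypothetical and are not produced by stm.

```r
# Hypothetical stand-in for a fitted stm model. Only the fields read by the
# bound and lbound expressions are present, and the bound values are invented.
model <- list(
  convergence = list(bound = c(-1250.4, -1190.7, -1185.2)),
  settings = list(dim = list(K = 5))
)

# bound: the largest value recorded in the convergence trace
bound <- max(model$convergence$bound)

# lbound: the bound plus log(K!), computed with base R's lfactorial()
lbound <- bound + lfactorial(model$settings$dim$K)

print(c(bound = bound, lbound = lbound))
```

Because lfactorial(K) grows with K, the correction is larger for models with more topics, so it shifts each model's bound by a different amount when bounds are compared across K.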
Because the heldout likelihood must be calculated, N documents each have proportion of their tokens held out at random. This means that even with the default spectral initialization the results can change from run to run. When the number of heldout documents is low or the documents are very short, the results can also be quite unstable. For example, the gadarian code demonstration below has heldout results based on only 34 documents and approximately 150 tokens in total, which can lead to quite disparate results across runs. By contrast, the default settings for the poliblog5k dataset would yield a heldout sample of 500 documents with approximately 50000 tokens. We should expect this to be substantially more stable.
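The arithmetic behind these heldout sample sizes can be sketched as follows. The helper `heldout_size` is hypothetical and not part of stm. The corpus size of 341 documents and the mean document lengths of 9 and 200 tokens are assumptions chosen to be consistent with the figures quoted above, not values taken from the package.

```r
# Sketch: approximate size of the heldout sample under the default settings.
# heldout_size() is a hypothetical helper; mean_tokens is an assumed average
# document length used only to approximate the token count.
heldout_size <- function(n_docs, mean_tokens, proportion = 0.5) {
  n_heldout <- floor(0.1 * n_docs)  # default N = floor(0.1 * length(documents))
  list(
    docs = n_heldout,
    tokens = n_heldout * proportion * mean_tokens
  )
}

# Assumed small corpus of short documents (gadarian-like)
small <- heldout_size(n_docs = 341, mean_tokens = 9)

# Assumed larger corpus of longer documents (poliblog5k-like)
large <- heldout_size(n_docs = 5000, mean_tokens = 200)

print(small)
print(large)
```

If run-to-run variation in the heldout sample is a concern, the heldout.seed argument can be set so that the same seed is used when holding out documents.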
exclus |
Exclusivity of each model. |
semcoh |
Semantic coherence of each model. |
heldout |
Heldout likelihood for each model. |
residual |
Residual for each model. |
bound |
Bound for each model. |
lbound |
lbound for each model. |
em.its |
Total number of EM iterations used in fitting the model. |
plot.searchK
make.heldout
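library(stm)
# Preprocess the gadarian open-ended responses, keeping the metadata aligned
temp <- textProcessor(documents = gadarian$open.ended.response, metadata = gadarian)
out <- prepDocuments(temp$documents, temp$vocab, temp$meta)
documents <- out$documents
vocab <- out$vocab
meta <- out$meta
# Set the random seed before the search
set.seed(02138)
# Candidate numbers of topics to compare
K <- c(5, 10, 15)
kresult <- searchK(documents, vocab, K, prevalence = ~treatment + s(pid_rep), data = meta)
# Plot the diagnostics for each value of K
plot(kresult)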