foreign_model: Glue for results from other topic-modeling packages

foreign_model R Documentation

Glue for results from other topic-modeling packages

Description

Besides this package and mallet, on which it builds, there are several other topic-modeling packages for R. topicmodels provides a topic-modeling infrastructure and supplies functions for estimating both ordinary LDA and Correlated Topic Models (CTM) in several ways. I have tried to make it possible to use at least some of dfrtopics's functions with results from topicmodels' LDA and CTM functions. I have also aimed to make it possible to interface with the stm package and its Structural Topic Model (stm). Given a model from either of these two packages, apply foreign_model to obtain an object that can be used with (some of) the functions in dfrtopics. Use unwrap to get back the original model object.

Usage

foreign_model(x, metadata = NULL)

unwrap(x)

Arguments

x

model for translation from topicmodels or stm

metadata

metadata frame to attach to model. For converting from stm, supply the same metadata as was given to stm. Conversion from LDA can use a superset of the document metadata, provided the rownames of the modeled DocumentTermMatrix can be matched against metadata$id.
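The id-matching requirement for LDA conversion can be pictured with a small base-R sketch. It only illustrates matching document ids against a larger metadata frame; it is not the code foreign_model itself runs, and the ids, the pubdate values, and all variable names are made up for the example.

```r
# Hypothetical ids standing in for rownames(dtm) of a modeled DocumentTermMatrix
dtm_ids <- c("doc3", "doc1")

# Metadata covering a superset of the modeled documents (invented values)
meta_all <- data.frame(
    id = c("doc1", "doc2", "doc3"),
    pubdate = c("1990-01-01", "1991-01-01", "1992-01-01"),
    stringsAsFactors = FALSE
)

# Look up each modeled document in the metadata, in model order
idx <- match(dtm_ids, meta_all$id)

# Every modeled document must be found for the metadata to be usable
stopifnot(!anyNA(idx))

# Rows of metadata aligned one-to-one with the rows of the matrix
meta_aligned <- meta_all[idx, , drop = FALSE]
```

If any modeled document is missing from metadata$id, match returns NA for it, which is the situation the superset condition rules out.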

Details

Most of this package emerged out of my particular need to wrangle MALLET, and as a result I did not take account of the topicmodels infrastructure (which, furthermore, has been refined over time). I wish I had, since that infrastructure is elegant and extensible, using S4 rather than S3. For now, I am not going to overhaul my own class structure. As a stopgap, the strategy adopted here is to provide "wrapper" objects for TopicModel-class and stm objects that can respond to many of the same messages as mallet_model does. This is not the best way to do things, but it's straightforward.
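The wrapper strategy can be sketched in a few lines of base R. This is a minimal illustration of the general S3 pattern, assuming a toy "foreign" model stored as a list with a theta matrix; the names wrap_sketch, doc_topics_sketch, and unwrap_sketch are hypothetical and are not the package's actual internals.

```r
# Hypothetical wrapper constructor: keep the original model untouched inside
# a list and give the list its own S3 class.
wrap_sketch <- function(model, metadata = NULL) {
    structure(list(model = model, metadata = metadata),
              class = "wrapper_sketch")
}

# A generic the wrapper should answer, standing in for doc_topics
doc_topics_sketch <- function(x, ...) UseMethod("doc_topics_sketch")

# The method translates the request into the foreign model's own layout
doc_topics_sketch.wrapper_sketch <- function(x, ...) {
    x$model$theta
}

# Recover the original model object
unwrap_sketch <- function(x) x$model

# Toy "foreign" model: 2 documents by 3 topics
toy <- list(theta = matrix(c(0.2, 0.5, 0.3,
                             0.6, 0.1, 0.3),
                           nrow = 2, byrow = TRUE))
w <- wrap_sketch(toy)
dt <- doc_topics_sketch(w)
```

The appeal of the pattern is that each supported generic needs only one short method per wrapped class, at the cost of silently lacking any generic for which no method has been written.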

Not all functionality is supported. Anything that requires MALLET's assignments of topics to individual words (the "sampling state") does not at present work. Note too that doc_topics and topic_words applied to a TopicModel or an stm return parameter estimates of the probabilities of topics in documents or words in topics. In MALLET terminology these are "smoothed and normalized," not raw sampling weights. For this reason hyperparameters does not return true hyperparameter values for these models—which are, in any case, defined variously for the various estimation procedures. Instead, hyperparameters returns dummy values of zero so that tw_smooth_normalize and dt_smooth_normalize will not incorrectly add anything to the posteriors. The actual hyperparameters should be retrieved from the underlying model if needed.
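The reason for the dummy zeros can be seen in a small base-R sketch of smoothing and normalizing. The helper smooth_normalize_rows is hypothetical and only assumes that smoothing means adding a per-topic hyperparameter to each column before dividing by row sums; it is not the implementation of dt_smooth_normalize.

```r
# Hypothetical helper: add hyper[k] to column k, then make each row sum to 1
smooth_normalize_rows <- function(m, hyper) {
    smoothed <- sweep(m, 2, hyper, "+")
    smoothed / rowSums(smoothed)
}

# Raw sampling weights of the kind MALLET reports: 2 documents by 3 topics
raw <- matrix(c(8, 2, 0,
                1, 1, 8),
              nrow = 2, byrow = TRUE)
alpha <- c(0.1, 0.1, 0.1)

# Smoothed and normalized estimates
est <- smooth_normalize_rows(raw, alpha)

# Estimates that are already normalized pass through unchanged when the
# hyperparameters are zero, which is what the dummy values rely on
same <- smooth_normalize_rows(est, rep(0, 3))

# Reusing a nonzero alpha on them would shift the probabilities
shifted <- smooth_normalize_rows(est, alpha)
```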

align_topics will work with glue objects and should help compare variant models and estimation strategies.

It is possible to apply dfr_browser to a glue object to explore a model, with two caveats. First, the implication of using the normalized posteriors is that all documents are given equal weight in the display, whereas the display of a model from mallet by default weights documents by their lengths; for a more comparable display of a mallet model m, use dfr_browser(m, proper=TRUE). Second, at present the display of an stm object will not use any explicit estimates of the effects of time covariates. It simply takes the average estimated topic proportion of all documents in each year. To examine the actual estimates, together with uncertainties, use the estimateEffect function or the interactive visualization provided by the stmBrowser package, for which the kludges here are no substitute.
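The difference between equal and length-based weighting of documents can be made concrete with toy numbers. The sketch below is plain base R with invented data and variable names; it illustrates the two kinds of yearly average and does not reproduce the computation dfr_browser performs.

```r
# Toy document-topic proportions: 3 documents by 2 topics (invented)
theta <- matrix(c(0.9, 0.1,
                  0.1, 0.9,
                  0.5, 0.5),
                nrow = 3, byrow = TRUE)
years <- c(1990, 1990, 1991)
doc_len <- c(100, 900, 500)   # document lengths in words (invented)

# Equal weight: plain mean of the proportions of each year's documents
equal_wt <- rowsum(theta, years) / as.vector(table(years))

# Length weight: each document counts in proportion to its word count
length_wt <- rowsum(theta * doc_len, years) /
    as.vector(rowsum(doc_len, years))
```

In this toy case the two 1990 documents average to an even split between the topics under equal weighting, while the length-weighted average is dominated by the longer document.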

Value

A wrapper object that works with most of the functions that accept an object of class mallet_model.

See Also

wordcounts_DocumentTermMatrix and wordcounts_stm_inputs to prepare wordcount data for input to these other packages' modeling procedures.

Examples


## Not run: 
# aligning three models from three packages

counts <- read_wordcounts(...) # etc.
meta <- read_dfr_metadata(...) # etc.

library(stm)
corpus <- wordcounts_stm_inputs(counts, meta)
m_stm <- stm(documents=corpus$documents,
    vocab=corpus$vocab,
    data=corpus$data,
    K=25, prevalence= ~ s(journaltitle))
m_stm_glue <- foreign_model(m_stm, corpus$data)

library(topicmodels)
dtm <- wordcounts_DocumentTermMatrix(counts)
m_lda <- LDA(dtm,
    k=25, control=list(alpha=0.1))
m_lda_glue <- foreign_model(m_lda, meta)

insts <- wordcounts_instances(counts)
m_mallet <- train_model(insts, n_topics=25,
    metadata=meta)

# %>% is the magrittr pipe; attach magrittr or dplyr if it is not already available
model_distances(list(m_stm_glue, m_lda_glue, m_mallet), 100) %>%
    align_topics() %>%
    alignment_frame()

## End(Not run)


agoldst/dfrtopics documentation built on July 15, 2022, 4:13 p.m.