alignCorpus: Align the vocabulary of a new corpus to an old corpus
In stm: Estimation of the Structural Topic Model

Description

Takes a list containing the documents, vocabulary and (optionally) metadata for a corpus of previously unseen documents and aligns it to an existing vocabulary. This prepares the new documents for use with fitNewDocuments.

Usage

alignCorpus(new, old.vocab, verbose = TRUE)

Arguments

`new`	a list (such as those produced by `textProcessor` or `prepDocuments`) containing a list of documents in `stm` format, a character vector containing the vocabulary and, optionally, a `data.frame` containing metadata. These elements should be named `documents`, `vocab`, and `meta` respectively. This is the new set of unseen documents, which will be returned with the vocab renumbered and all words not appearing in `old.vocab` removed.
`old.vocab`	a character vector containing the vocabulary that you want to align to. In general this will be the vocab used in your original stm model fit; for an stm object called `mod` it can be accessed as `mod$vocab`.
`verbose`	a logical indicating whether information about the new corpus should be printed to the screen. Defaults to `TRUE`.
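
To make the expected shape of `new` concrete, here is a minimal hand-built sketch in base R. It assumes the usual `stm` document format, in which each document is a two-row integer matrix (row 1 holds 1-based indices into the vocabulary, row 2 holds the counts). The object names and the toy data (`toy_vocab`, `toy_documents`, `toy_meta`, `toy_new`) are hypothetical and only illustrate the layout; in practice this list would normally come from `textProcessor` or `prepDocuments`.

```r
# Hypothetical toy corpus built by hand; all names and values are illustrative.
toy_vocab <- c("economy", "immigration", "worry")

# Each document is a two-row integer matrix:
#   row 1 = indices into the vocabulary (1-based), row 2 = counts.
toy_documents <- list(
  matrix(c(1L, 3L,
           2L, 1L), nrow = 2, byrow = TRUE),  # "economy" x2, "worry" x1
  matrix(c(2L,
           4L), nrow = 2, byrow = TRUE)       # "immigration" x4
)

# One row of metadata per document.
toy_meta <- data.frame(treatment = c(0, 1))

# The list uses the element names documents, vocab and meta.
toy_new <- list(documents = toy_documents, vocab = toy_vocab, meta = toy_meta)
str(toy_new, max.level = 1)
```

A list shaped like this could be passed as the `new` argument.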

Details

When estimating topic proportions for previously unseen documents using fitNewDocuments, the new documents must have the same vocabulary, ordered in the same way, as the original model. This function helps with that process.
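
The idea of aligning to an old vocabulary can be sketched in a few lines of base R. This is not the package's implementation: `align_sketch` is a hypothetical helper written for illustration, it ignores metadata and the summary fields, and it assumes that the returned vocabulary is simply the old one. It renumbers each word index by its position in the old vocabulary, drops words that do not appear there, and drops documents left with no words.

```r
# Illustrative helper, NOT part of stm: remap a corpus onto old_vocab.
align_sketch <- function(documents, new_vocab, old_vocab) {
  # Position of each new word in the old vocabulary; 0 if it is absent.
  id <- match(new_vocab, old_vocab, nomatch = 0L)
  aligned <- lapply(documents, function(doc) {
    doc[1, ] <- id[doc[1, ]]                      # renumber word indices
    doc <- doc[, doc[1, ] != 0L, drop = FALSE]    # drop words not in old_vocab
    doc[, order(doc[1, ]), drop = FALSE]          # keep indices in ascending order
  })
  keep <- vapply(aligned, ncol, integer(1)) > 0L  # documents that still have words
  list(documents = aligned[keep],
       vocab = old_vocab,                         # assumption: old vocab is returned
       docs.removed = which(!keep))
}

# Toy data (hypothetical): "border" is not in the old vocabulary.
old_vocab <- c("economy", "jobs", "worry")
new_vocab <- c("border", "economy", "worry")
docs <- list(
  matrix(c(1L, 2L, 3L,
           1L, 1L, 2L), nrow = 2, byrow = TRUE),  # border x1, economy x1, worry x2
  matrix(c(1L,
           5L), nrow = 2, byrow = TRUE)           # border x5 only
)

res <- align_sketch(docs, new_vocab, old_vocab)
res$documents[[1]]  # indices now refer to old_vocab: "economy" = 1, "worry" = 3
res$docs.removed    # the second document lost all of its words
```

In this toy case the first document keeps "economy" and "worry" under their old-vocabulary indices, while the second document is dropped because its only word is unknown to the old vocabulary.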

Note: the code is not built for speed or memory efficiency. If you are working with a very large corpus of new texts, you might consider building the object yourself using quanteda or some other option.

Value

`documents`	A list containing the documents in the stm format.
`vocab`	Character vector of vocabulary.
`meta`	Data frame or matrix containing the user-supplied metadata for the retained documents.
`docs.removed`	Indices (corresponding to the original data passed) of documents removed because they contain no words.
`words.removed`	Words dropped from `new`.
`tokens.removed`	The total number of tokens dropped from the new documents.
`wordcounts`	Counts of the number of times each word in the old vocab appears in the new documents.
`prop.overlap`	Length-two vector used to populate the message printed when `verbose = TRUE`.
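
As a rough illustration of what `words.removed` and `tokens.removed` describe, the following base-R sketch computes the analogous quantities for a toy corpus. The data and variable names are hypothetical, and the calculation is an assumption about the meaning of these fields based on their descriptions here, not a copy of the package internals.

```r
# Toy data (hypothetical): "border" does not appear in the old vocabulary.
old_vocab <- c("economy", "jobs", "worry")
new_vocab <- c("border", "economy", "worry")
docs <- list(
  matrix(c(1L, 2L, 3L,
           1L, 1L, 2L), nrow = 2, byrow = TRUE),  # border x1, economy x1, worry x2
  matrix(c(1L,
           5L), nrow = 2, byrow = TRUE)           # border x5
)

# Which new words also exist in the old vocabulary?
in_old <- new_vocab %in% old_vocab

# Word types that would be dropped.
words_removed <- new_vocab[!in_old]

# Total token count attached to those dropped word types.
tokens_removed <- sum(vapply(
  docs,
  function(doc) sum(doc[2, !in_old[doc[1, ]]]),
  integer(1)
))

words_removed
tokens_removed
```

A large number of dropped tokens relative to the corpus size suggests that the new documents differ substantially in vocabulary from the ones used to fit the original model.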

Examples

library(stm)
# We process an original set that is just the first 100 documents.
temp <- textProcessor(documents = gadarian$open.ended.response[1:100],
                      metadata = gadarian[1:100, ])
out <- prepDocuments(temp$documents, temp$vocab, temp$meta)
set.seed(02138)
# Maximum EM iterations is set low to make this run fast; run models to convergence!
mod.out <- stm(out$documents, out$vocab, 3, prevalence = ~treatment + s(pid_rep),
               data = out$meta, max.em.its = 5)
# Now we process the remaining documents.
temp <- textProcessor(documents = gadarian$open.ended.response[101:nrow(gadarian)],
                      metadata = gadarian[101:nrow(gadarian), ])
# Note we don't run prepDocuments here because we don't want to drop any words:
# we want every word that showed up in the old documents.
newdocs <- alignCorpus(new = temp, old.vocab = mod.out$vocab)
# The printout gives some helpful feedback on what has been retained and lost.
# Now we can fit our new held-out documents.
fitNewDocuments(model = mod.out, documents = newdocs$documents, newData = newdocs$meta,
                origData = out$meta, prevalence = ~treatment + s(pid_rep),
                prevalencePrior = "Covariate")

stm documentation built on June 24, 2024, 5:18 p.m.