alignCorpus: Align the vocabulary of a new corpus to an old corpus

alignCorpus R Documentation

Description

Function that takes in a list of documents, vocab and (optionally) metadata for a corpus of previously unseen documents and aligns them to an old vocabulary. Helps preprocess documents for fitNewDocuments.

Usage

alignCorpus(new, old.vocab, verbose = TRUE)

Arguments

new

a list (such as those produced by textProcessor or prepDocuments) containing a list of documents in stm format, a character vector containing the vocabulary and, optionally, a data.frame containing metadata. These should be labeled documents, vocab, and meta respectively. This is the new set of unseen documents, which will be returned with the vocab renumbered and all words not appearing in old.vocab removed.

old.vocab

a character vector containing the vocabulary that you want to align to. In general this will be the vocab used in your original stm model fit; for an stm object called mod, it can be accessed as mod$vocab.

verbose

a logical indicating whether information about the new corpus should be printed to the screen. Defaults to TRUE.
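
To make the expected shape of new concrete, here is a minimal hand-built sketch. It assumes the usual stm document format, in which each document is a two-row integer matrix (first row: 1-based indices into vocab; second row: counts). The words, counts, and the treatment column are invented for illustration.

```r
# Hypothetical toy corpus in the list layout described for the `new` argument.
new_vocab <- c("border", "economy", "jobs", "worry")

new_docs <- list(
  # document 1: "border" x2, "jobs" x1
  matrix(c(1L, 3L,
           2L, 1L), nrow = 2, byrow = TRUE),
  # document 2: "economy" x1, "worry" x3
  matrix(c(2L, 4L,
           1L, 3L), nrow = 2, byrow = TRUE)
)

# One metadata row per document (invented covariate).
new_meta <- data.frame(treatment = c(0, 1))

# The three elements must be named documents, vocab, and meta.
new <- list(documents = new_docs, vocab = new_vocab, meta = new_meta)
str(new, max.level = 1)
```

In practice this list would normally come from textProcessor rather than being assembled by hand.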

Details

When estimating topic proportions for previously unseen documents using fitNewDocuments, the new documents must have the same vocabulary, ordered in the same way, as the original model. This function helps with that process.

Note: the code is not really built for speed or memory efficiency. If you are trying to do this with a really large corpus of new texts, you might consider building the object yourself using quanteda or some other option.
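
For readers who do want to build the aligned object themselves, the core re-indexing idea can be sketched in base R. This is an illustrative sketch, not the stm implementation: align_sketch is a hypothetical helper, it assumes documents are two-row integer matrices (indices over counts), and it returns only a subset of the fields listed under Value.

```r
# Sketch only: NOT the code used by alignCorpus().
# Re-index stm-format documents from positions in new_vocab to positions
# in old_vocab, dropping words that do not appear in old_vocab.
align_sketch <- function(documents, new_vocab, old_vocab) {
  # Position of each new word in the old vocabulary (NA if absent).
  lookup <- match(new_vocab, old_vocab)
  aligned <- lapply(documents, function(doc) {
    idx <- lookup[doc[1, ]]
    keep <- !is.na(idx)
    out <- matrix(c(idx[keep], doc[2, keep]), nrow = 2, byrow = TRUE)
    # Sort columns so indices follow the order of the old vocabulary.
    out[, order(out[1, ]), drop = FALSE]
  })
  # Documents left with no words cannot be kept.
  empty <- vapply(aligned, ncol, integer(1)) == 0L
  list(documents = aligned[!empty],
       vocab = old_vocab,
       docs.removed = which(empty))
}

# Invented toy data for illustration.
old_vocab <- c("economy", "immigration", "jobs")
new_vocab <- c("border", "economy", "jobs", "worry")
docs <- list(
  matrix(c(1L, 3L, 2L, 1L), nrow = 2, byrow = TRUE),  # border x2, jobs x1
  matrix(c(1L, 4L, 1L, 3L), nrow = 2, byrow = TRUE),  # border x1, worry x3
  matrix(c(2L, 3L, 1L, 2L), nrow = 2, byrow = TRUE)   # economy x1, jobs x2
)
res <- align_sketch(docs, new_vocab, old_vocab)
res$documents
res$docs.removed
```

The second toy document shares no words with old_vocab, so it ends up empty and is reported in docs.removed instead of being returned.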

Value

documents

A list containing the documents in the stm format.

vocab

Character vector of vocabulary.

meta

Data frame or matrix containing the user-supplied metadata for the retained documents.

docs.removed

Document indices (corresponding to the original data passed) of documents removed because they contain no words.

words.removed

Words dropped from new.

tokens.removed

The total number of tokens dropped from the new documents.

wordcounts

Counts of times the old vocab appears in the new documents.

prop.overlap

Length-two vector used to populate the message printed by verbose.
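
Because docs.removed indexes the documents as they were originally passed in, it can be used to keep external per-document data in step with the retained documents. A minimal sketch, using a hand-made stand-in for the returned list (the values and the doc_ids vector are invented, and only the docs.removed field is used):

```r
# Hypothetical stand-in for the relevant part of an alignCorpus() result.
res <- list(docs.removed = c(2L, 5L))

# External per-document data for the six documents originally passed in
# (invented for illustration).
doc_ids <- c("a", "b", "c", "d", "e", "f")

# setdiff() also copes with an empty or NULL docs.removed:
# in that case every document is kept.
kept <- setdiff(seq_along(doc_ids), res$docs.removed)
doc_ids_kept <- doc_ids[kept]
doc_ids_kept
```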

See Also

prepDocuments, fitNewDocuments

Examples

library(stm)
#we process an original set that is just the first 100 documents
temp<-textProcessor(documents=gadarian$open.ended.response[1:100],metadata=gadarian[1:100,])
out <- prepDocuments(temp$documents, temp$vocab, temp$meta)
set.seed(02138)
#Maximum EM iterations are set low to make this run fast; in practice, run models to convergence!
mod.out <- stm(out$documents, out$vocab, 3, prevalence=~treatment + s(pid_rep), 
              data=out$meta, max.em.its=5)
#now we process the remaining documents
temp<-textProcessor(documents=gadarian$open.ended.response[101:nrow(gadarian)],
                    metadata=gadarian[101:nrow(gadarian),])
#note we don't run prepDocuments here because we don't want to drop any words- we want
#every word that showed up in the old documents.
newdocs <- alignCorpus(new=temp, old.vocab=mod.out$vocab)
#we get some helpful feedback on what has been retained and lost in the print out.
#and now we can fit our new held-out documents
fitNewDocuments(model=mod.out, documents=newdocs$documents, newData=newdocs$meta,
                origData=out$meta, prevalence=~treatment + s(pid_rep),
                prevalencePrior="Covariate")

bstewart/stm documentation built on Jan. 3, 2024, 6:58 p.m.