Nothing
#' Fit New Documents
#'
#' A function for predicting thetas for an unseen document based on the previously fit model.
#'
#' Due to the existence of the metadata in the model, this isn't as simple as in models
#' without side information such as Latent Dirichlet Allocation. There are four scenarios:
#' models without covariate information, models with prevalence covariates only, models with
#' content covariates only and models with both. When there is not covariate information the
#' choice is essentially whether or not to use prior information.
#'
#' We offer three types of choices (and may offer more in the future):
#' \describe{
#' \item{"None"}{No prior is used. In the prevalence case this means that the model simply
#' maximizes the likelihood of seeing the words given the word-topic distribution. This will
#' in general produce more sharply peaked topic distributions than the prior. This can be used
#' even without the covariates. This is not an option for topical content covariate models. If
#' you do not observe the topical content covariate, use the "Average" option.}
#' \item{"Average"}{We use a prior that is based on the average over the documents in the training
#' set. This does not require the unseen documents to observe the covariates. In a model that originally
#' had covariates we need to adjust our estimate of the variance-covariance matrix sigma to accommodate that
#' we no longer have the covariate information. So we recalculate the variance based on what it would have
#' been if we didn't have any covariates. This helps avoid an edge case where the covariates are extremely
#' influential and we don't want that strength applied to the new covariate-less setting. In the case of
#' content covariates this essentially use the \code{\link{sageLabels}} approach to create a
#' marginalized distribution over words for each topic.
#' }
#' \item{"Covariate"}{We use the same covariate driven prior that existed in the original model.
#' This requires that the test covariates be observed for all previously unseen documents.}
#' }
#'
#' If you fit a document that was used during training with the options to replicate the initial
#' \code{\link{stm}} model fit you will not necessarily get exactly the same result. \code{\link{stm}}
#' updates the topic-word distributions last so they may shifted since the document-topic proportions
#' were updated. If the original model converged, they should be very close.
#'
#' By default the function returns only the MAP estimate of the normalized document-topic proportions
#' theta. By selecting \code{returnPrior=TRUE} you can get the various model parameters used to complete
#' the fit. By selecting \code{returnPosterior=TRUE} you can get the full variational posterior. Please
#' note that the full variational posterior is very memory intensive. For a sense of scale it requires an
#' extra \eqn{K^2 + K \times (V'+1) + 1} doubles per document where V' is the number of unique tokens in the document.
#'
#' \strong{Testing:} Getting the prevalence covariates right in the unseen documents can be tricky. However
#' as long as you leave \code{test} set to \code{TRUE} the code will automatically run a test to make sure
#' that everything lines up. See the internal function \code{\link{makeDesignMatrix}} for more on what is
#' going on here.
#'
#' \strong{Passing a Design Matrix} Advanced users may wish to circumvent this process and pass their
#' own design matrix possibly because they used their own function for transforming the original input
#' variables. This can be done by passing the design matrix using the \code{designMatrix} argument
#' The columns need to match the ordering of the design matrix for the original \code{stm} object.
#' The design matrix in an stm model called \code{stmobj} can be found in \code{stmobj$settings$covariates$X}
#' which can in turn be used to check that you have formatted your result correctly. If you are going to
#' try this we recommend that you read the documentation for \code{\link{makeDesignMatrix}} to understand
#' some of the challenges involved.
#'
#' If you want even more fine-grained control we recommend you directly use the
#' optimization function \code{\link{optimizeDocument}}
#'
#' @param model the originally fit STM object.
#' @param documents the new documents to be fit. These documents must be in the stm format and
#' be numbered in the same way as the documents in the original model with the same dimension of vocabulary.
#' See \code{\link{alignCorpus}} or the \pkg{quanteda} feature \link[quanteda]{dfm_select}
#' for a way to do this.
#' @param newData the metadata for the prevalence prior which goes with the unseen documents. As in
#' the original data this cannot have any missing data.
#' @param origData the original metadata used to fit the STM object.
#' @param prevalence the original formula passed to prevalence when \code{stm} was called. The function
#' will try to reconstruct this.
#' @param betaIndex a vector which indicates which level of the content covariate is used
#' for each unseen document. If originally passed as a factor, this can be passed as a factor or
#' character vector as well but it must not have any levels not included in the original factor.
#' @param prevalencePrior three options described in detail below. Defaults to "Average" when
#' \code{data} is not provided and "Covariate" when it is.
#' @param contentPrior two options described in detail below. Defaults to "Average" when
#' \code{betaIndex} is not provided and "Covariate" when it is.
#' @param returnPosterior the function always returns the posterior mode of theta
#' (document-topic proportions), If set to TRUE this will return the full variational
#' posterior. Note that this will return a dense K-by-K matrix for every document
#' which can be very memory intensive if you are processing a lot of documents.
#' @param returnPriors the function always returns the options that were set for the prior
#' (either by the user or chosen internally by the defaults). In the case of content covariates
#' using the covariate prior this will be a set of indices to the original beta matrix so as
#' not to make the object too large.
#' @param designMatrix an option for advanced users to pass an already constructed design matrix for
#' prevalence covariates. This will override the options in \code{newData}, \code{origData} and
#' \code{test}. See details below- please do not attempt to use without reading carefully.
#' @param test a test of the functions ability to reconstruct the original functions.
#' @param verbose Should a dot be printed every time 1 percent of the documents are fit.
#'
#' @return an object of class fitNewDocuments
#'
#' \item{theta}{a matrix with one row per document contain the document-topic proportions at the posterior mode}
#' \item{eta}{the mean of the variational posterior, only provided when posterior is requested.
#' Matrix of same dimension as theta}
#' \item{nu}{a list with one element per document containing the covariance matrix of the variational posterior.
#' Only provided when posterior is requested.}
#' \item{phis}{a list with one element per K by V' matrix containing the variational distribution for each token
#' (where V' is the number of unique words in the given document. They are in the order of appearance in the document.
#' For words repeated more than once the sum of the column is the number of times that token appeared. This is only
#' provided if the posterior is requested.}
#' \item{bound}{a vector with one element per document containing the approximate variational lower bound. This is only
#' provided if the posterior is requested.}
#' \item{beta}{A list where each element contains the unlogged topic-word distribution for each level of the content covariate.
#' This is only provided if prior is requested.}
#' \item{betaindex}{a vector with one element per document indicating which element of the beta list the documents pairs with.
#' This is only provided if prior is requested.}
#' \item{mu}{a matrix where each column includes the K-1 dimension prior mean for each document. This is only provided if prior is requested.}
#' \item{sigma}{a K-1 by K-1 matrix containing the prior covariance. This is only provided if prior is requested.}
#' @seealso \code{\link{alignCorpus}} \code{\link{optimizeDocument}} \code{\link{make.heldout}} \code{\link{makeDesignMatrix}}
#' @examples
#' #An example using the Gadarian data. From Raw text to fitted model.
#' #(for a case where documents are all not processed at once see the help
#' # file for alignCorpus)
#' temp<-textProcessor(documents=gadarian$open.ended.response,metadata=gadarian)
#' out <- prepDocuments(temp$documents, temp$vocab, temp$meta)
#' set.seed(02138)
#' #Maximum EM its is set low to make this run fast, run models to convergence!
#' mod.out <- stm(out$documents, out$vocab, 3, prevalence=~treatment + s(pid_rep),
#' data=out$meta, max.em.its=5)
#' fitNewDocuments(model=mod.out, documents=out$documents[1:5], newData=out$meta[1:5,],
#' origData=out$meta, prevalence=~treatment + s(pid_rep),
#' prevalencePrior="Covariate")
#' @export
fitNewDocuments <- function(model=NULL, documents=NULL, newData=NULL,
origData=NULL, prevalence=NULL, betaIndex=NULL,
prevalencePrior=c("Average","Covariate","None"),
contentPrior=c("Average","Covariate"),
returnPosterior=FALSE,
returnPriors=FALSE,
designMatrix=NULL, test=TRUE,
verbose=TRUE) {
prevalencePrior <- match.arg(prevalencePrior)
contentPrior <- match.arg(contentPrior)
#Some checks from the stm() function
if(!is.list(documents)) stop("documents must be a list, see documentation.")
if(!all(unlist(lapply(documents, is.matrix)))) stop("Each list element in documents must be a matrix. See documentation.")
if(any(unlist(lapply(documents, function(x) anyDuplicated(x[1,]))))) {
stop("Duplicate term indices within a document. See documentation for proper format.")
}
#Some additional checks here:
if(!is.null(newData)) {
if(nrow(newData)!=length(documents)) stop("Rows in the dataset does not match the length of the new documents")
if(max(unlist(lapply(documents, function(x) x[1,]))) > length(model$vocab)) stop("Vocabulary in new documents is larger than the original fitted model.")
}
#set up the counter
if(verbose) ctevery <- ifelse(length(documents)>100, floor(length(documents)/100), 1)
#Check arguments for prevalencePrior and content prior
if(!is.null(prevalencePrior)) {
prevtype <- match.arg(prevalencePrior)
if(prevtype == "Covariate" & is.null(newData)) stop("Must specify data to use covariate type prevalencePrior")
} else {
#if the data is supplied, use it.
prevtype <- ifelse(is.null(newData), "Average", "Covariate")
}
if(!is.null(contentPrior)){
contenttype <- match.arg(contentPrior)
if(contenttype == "Covariate" & is.null(betaIndex)) stop("Must specify betaIndex to use covariate type contentPrior")
} else {
#if the data is supplied use it.
contenttype <- ifelse(is.null(betaIndex), "Average", "Covariate")
}
if(is.null(model)) stop("Please specify a fitted STM object for model")
if(is.null(documents)) stop("Please specify an stm formatted document list for documents argument")
K <- model$settings$dim$K
#Start by Generating mu and sigma
#if None- set to 0's with a broad prior.
if(prevtype == "None"){
mu <- matrix(0, nrow=(K-1), ncol=length(documents))
sigma <- diag(1000, nrow=(K-1))
}
#if Average set to the average
if(prevtype == "Average") {
#if no covariates just use the actual model parameters
if(is.null(model$mu$gamma)) {
mu <- c(model$mu$mu)
sigma <- model$sigma
} else {
#otherwise recalculate the covariance
mu <- model$mu$mu
if(ncol(mu)==1) {
covariance <- crossprod(sweep(model$eta, 2, STATS=as.numeric(mu), FUN="-"))
} else {
covariance <- crossprod(model$eta - t(mu))
}
mu <- rowMeans(mu) #replace with the new averaged mu
newcovariance <- crossprod(sweep(model$eta, 2, STATS=mu))
#now to get sigma we just subtract off and replace
sigma <- model$sigma - covariance/nrow(model$eta) + newcovariance/nrow(model$eta)
}
}
if(prevtype=="Covariate"){
#Covariates will require a mu and sigma
if(!is.null(designMatrix)) {
#if we have manually been passed the design matrix use it here
if(ncol(model$settings$covariates$X) != ncol(designMatrix)) stop("designMatrix columns do not match original design matrix.")
X <- designMatrix
} else {
#First we need to find the formula
if(!is.null(prevalence)) {
#easiest case is that the user provides it.
if(!inherits(prevalence, "formula")) stop("prevalence argument must provide a formula if specified")
formula <- prevalence
} else {
#okay the user hasn't provided it, let's find it.
if(!is.null(model$setting$covariates$formula)) {
#if its a new stm object we've saved it.
formula <- model$settings$covariates$formula
} else {
#third case it isn't in the object, let's reconstruct it from the call
call <- model$settings$call
#grab the formula
#(code here matches the element of the call, grabs that component, drops the outside parentheses,
# converts to a formula)
formula <- as.formula(call[c(match(c("prevalence"), names(call),0L))][[1L]])
}
}
#okay now we have the formula we can call makeDesignMatrix
X <- makeDesignMatrix(formula, origData, newData, test=test)
}
sigma <- model$sigma
mu <- t(X%*%model$mu$gamma)
}
#Generate the Content Prior beta and betaindex
if(contentPrior == "Average") {
if(length(model$beta$logbeta) == 1) {
#in this scenario we have the customary case of no content covariates
betaindex <- rep(1,length(documents))
beta <- list(exp(model$beta$logbeta[[1]])) #put it in a list we can treat it the same
} else {
#in this scenario we have content covariates but no actual data so
#instead we marginalize.
betalist <- lapply(model$beta$logbeta, exp)
weights <- tabulate(model$settings$covariates$betaindex)
weights <- weights/sum(weights)
beta <- betalist[[1]]*weights[1]
for(i in 2:length(betalist)) {
beta <- beta + betalist[[i]]*weights[i]
}
beta <- list(beta) #put it in a list so we can treat them all the same
#we have now essentially decided to use a single beta
betaindex <- rep(1,length(documents))
}
} else {
#in this scenario we are going to have different betas
#based on what the user passed us.
levels <- model$settings$covariates$yvarlevels
betaindex <- as.numeric(factor(betaIndex, levels = levels))
if(any(is.na(betaindex))) stop("NA in the betaIndex variable. This could be due to the factor names not matching the original model.")
beta <- lapply(model$beta$logbeta, exp)
}
#####
#Now actually do the optimization
#####
#precompute terms for the loop over documents
sigobj <- try(chol.default(sigma), silent=TRUE)
if(inherits(sigobj,"try-error")) {
sigmaentropy <- (.5*determinant(sigma, logarithm=TRUE)$modulus[1])
siginv <- solve(sigma)
} else {
sigmaentropy <- sum(log(diag(sigobj)))
siginv <- chol2inv(sigobj)
}
lambda <- vector(mode="list", length=length(documents))
if(returnPosterior) {
nu <- vector(mode="list", length=length(documents))
phi <- vector(mode="list", length=length(documents))
bound <- vector(length=length(documents))
}
for(i in 1:length(documents)) {
#update components
doc <- documents[[i]]
aspect <- betaindex[i]
init <- rep(0, K-1)
#deal with the special case of mu as a vector.
if(!inherits(mu,"numeric")) {
mu.i <- mu[,i]
} else {
mu.i <- mu
}
#infer the document
results <- optimizeDocument(doc, eta=init, mu=mu.i, beta=beta[[aspect]],
sigmainv=siginv, sigmaentropy=sigmaentropy,
posterior=returnPosterior)
lambda[[i]] <- c(results$lambda)
if(returnPosterior) {
nu[[i]] <- results$nu
phi[[i]] <- results$phi
bound[i] <- results$bound
}
if(verbose && i%%ctevery==0) cat(".")
}
lambda <- do.call(rbind, lambda)
lambda <- cbind(lambda,0)
theta <- exp(lambda - row.lse(lambda))
#Return the results. Only use theta if they haven't
#asked for the priors or posterior.
toReturn <- list(theta=theta, eta=list(),
nu=list(), phi=list(),
bound=c(), beta=list(),
betaindex=c(),
mu=c(), sigma=c())
if(returnPosterior) {
toReturn$nu <- nu
toReturn$eta <- lambda[,-ncol(lambda), drop=FALSE]
toReturn$phi <- phi
toReturn$bound <- bound
}
if(returnPriors) {
toReturn$beta <- beta
toReturn$mu <- mu
toReturn$sigma <- sigma
toReturn$betaindex <- betaindex
}
class(toReturn) <- "fitNewDocuments"
return(toReturn)
}
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.