R/TS.R
In LDATS: Latent Dirichlet Allocation Coupled with Time Series Analyses

Documented in check_formula check_TS_inputs est_changepoints est_regressors logLik.TS_fit measure_eta_vcov measure_rho_vcov package_TS prep_pbar print.TS_fit summarize_etas summarize_rhos TS TS_control update_pbar

#' @title Conduct a single multinomial Bayesian Time Series analysis 
#'
#' @description This is the main interface function for the LDATS application
#'   of Bayesian change point Time Series analyses (Christensen \emph{et al.}
#'   2018), which extends the model of Western and Kleykamp (2004;
#'   see also Ruggieri 2013) to multinomial (proportional) response data using
#'   softmax regression (Ripley 1996, Venables and Ripley 2002, Bishop 2006) 
#'   using a generalized linear modeling approach (McCullagh and Nelder 1989).
#'   The models are fit using parallel tempering Markov Chain Monte Carlo
#'   (ptMCMC) methods (Earl and Deem 2005) to locate change points and 
#'   neural networks (Ripley 1996, Venables and Ripley 2002, Bishop 2006) to
#'   estimate regressors. \cr \cr
#'   \code{check_TS_inputs} checks that the inputs to 
#'   \code{TS} are of proper classes for a full analysis.
#'
#' @param data \code{data.frame} including [1] the time variable (indicated 
#'   in \code{timename}), [2] the predictor variables (required by
#'   \code{formula}) and [3], the multinomial response variable (indicated in
#'   \code{formula}) as verified by \code{\link{check_timename}} and 
#'   \code{\link{check_formula}}. Note that the response variables should be
#'   formatted as a \code{data.frame} object named as indicated by the 
#'   \code{response} entry in the \code{control} list, such as \code{gamma} 
#'   for a standard TS analysis on LDA output. See \code{Examples}.
#'
#' @param formula \code{\link[stats]{formula}} defining the regression between
#'   relationship the change points. Any 
#'   predictor variable included must also be a column in 
#'   \code{data} and any (multinomial) response variable must be a set of
#'   columns in \code{data}, as verified by \code{\link{check_formula}}.
#'
#' @param nchangepoints \code{integer} corresponding to the number of 
#'   change points to include in the model. 0 is a valid input (corresponding
#'   to no change points, so a singular time series model), and the current 
#'   implementation can reasonably include up to 6 change points. The 
#'   number of change points is used to dictate the segmentation of the 
#'   time series into chunks fit with separate models dictated by 
#'   \code{formula}.
#'
#' @param timename \code{character} element indicating the time variable
#'   used in the time series. Defaults to \code{"time"}. The variable must be
#'   integer-conformable or a \code{Date}. If the variable named
#'   is a \code{Date}, the input is converted to an integer, resulting in the
#'   timestep being 1 day, which is often not desired behavior. 
#'
#' @param weights Optional class \code{numeric} vector of weights for each 
#'   document. Defaults to \code{NULL}, translating to an equal weight for
#'   each document. When using \code{multinom_TS} in a standard LDATS 
#'   analysis, it is advisable to weight the documents by their total size,
#'   as the result of \code{\link[topicmodels]{LDA}} is a matrix of 
#'   proportions, which does not account for size differences among documents.
#'   For most models, a scaling of the weights (so that the average is 1) is
#'   most appropriate, and this is accomplished using \code{document_weights}.
#'
#' @param control A \code{list} of parameters to control the fitting of the
#'   Time Series model including the parallel tempering Markov Chain 
#'   Monte Carlo (ptMCMC) controls. Values not input assume defaults set by 
#'   \code{\link{TS_control}}.
#'
#' @return \code{TS}: \code{TS_fit}-class list containing the following
#'   elements, many of
#'   which are hidden for \code{print}ing, but are accessible:
#'   \describe{
#'     \item{data}{\code{data} input to the function.}
#'     \item{formula}{\code{\link[stats]{formula}} input to the function.}
#'     \item{nchangepoints}{\code{nchangepoints} input to the function.}
#'     \item{weights}{\code{weights} input to the function.}
#'     \item{control}{\code{control} input to the function.}
#'     \item{lls}{Iteration-by-iteration 
#'                \link[=logLik.multinom_TS_fit]{logLik} values for the
#'                 full time series fit by \code{\link{multinom_TS}}.}
#'     \item{rhos}{Iteration-by-iteration change point estimates from
#'                 \code{\link{est_changepoints}}.}
#'     \item{etas}{Iteration-by-iteration marginal regressor estimates from
#'                 \code{\link{est_regressors}}, which have been 
#'                 unconditioned with respect to the change point locations.}
#'     \item{ptMCMC_diagnostics}{ptMCMC diagnostics, 
#'                                see \code{\link{diagnose_ptMCMC}}}
#'     \item{rho_summary}{Summary table describing \code{rhos} (the change
#'                        point locations), 
#'                        see \code{\link{summarize_rhos}}.}
#'     \item{rho_vcov}{Variance-covariance matrix for the estimates of
#'                      \code{rhos} (the change point locations), see 
#'                      \code{\link{measure_rho_vcov}}.}
#'     \item{eta_summary}{Summary table describing \code{ets} (the 
#'                        regressors), 
#'                        see \code{\link{summarize_etas}}.}
#'     \item{eta_vcov}{Variance-covariance matrix for the estimates of
#'                      \code{etas} (the regressors), see 
#'                      \code{\link{measure_eta_vcov}}.}
#'     \item{logLik}{Across-iteration average of log-likelihoods 
#'                    (\code{lls}).}
#'     \item{nparams}{Total number of parameters in the full model,
#'                    including the change point locations and regressors.}
#'     \item{deviance}{Penalized negative log-likelihood, based on 
#'                     \code{logLik} and \code{nparams}.}
#'   } 
#'   \code{check_TS_inputs}: An error message is thrown if any input
#'   is not proper, else \code{NULL}.
#'
#' @references
#'   Bishop, C. M. 2006. \emph{Pattern Recognition and Machine Learning}. 
#'    Springer, New York, NY, USA.
#'
#'   Christensen, E., D. J. Harris, and S. K. M. Ernest. 2018.
#'   Long-term community change through multiple rapid transitions in a 
#'   desert rodent community. \emph{Ecology} \strong{99}:1523-1529. 
#'   \href{https://pubmed.ncbi.nlm.nih.gov/29718539/}{link}.
#'
#'   Earl, D. J. and M. W. Deem. 2005. Parallel tempering: theory, 
#'   applications, and new perspectives. \emph{Physical Chemistry Chemical 
#'   Physics} \strong{7}: 3910-3916.
#'   \href{https://pubs.rsc.org/en/content/articlelanding/2005/CP/b509983h}{link}.
#'
#'   McCullagh, P. and J. A. Nelder. 1989. \emph{Generalized Linear Models}.
#'   2nd Edition. Chapman and Hall, New York, NY, USA.
#'
#'   Ripley, B. D. 1996. \emph{Pattern Recognition and Neural Networks}. 
#'   Cambridge University Press, Cambridge, UK.
#'
#'   Ruggieri, E. 2013. A Bayesian approach to detecting change points in 
#'   climactic records. \emph{International Journal of Climatology}
#'   \strong{33}:520-528.
#'   \href{https://mathcs.holycross.edu/~eruggier/joc3447.pdf}{link}.
#'
#'   Venables, W. N. and B. D. Ripley. 2002. \emph{Modern and Applied
#'   Statistics with S}. Fourth Edition. Springer, New York, NY, USA.
#'
#'   Western, B. and M. Kleykamp. 2004. A Bayesian change point model for 
#'   historical time series analysis. \emph{Political Analysis}
#'   \strong{12}:354-374.
#'   \href{https://www.cambridge.org/core/journals/political-analysis/article/abs/bayesian-change-point-model-for-historical-time-series-analysis/F7D2EDBBC211278EC6C6CB43FE170812}{link}.
#'
#' @examples 
#'   data(rodents)
#'   document_term_table <- rodents$document_term_table
#'   document_covariate_table <- rodents$document_covariate_table
#'   LDA_models <- LDA_set(document_term_table, topics = 2)[[1]]
#'   data <- document_covariate_table
#'   data$gamma <- LDA_models@gamma
#'   weights <- document_weights(document_term_table)
#' \donttest{
#'   TSmod <- TS(data, gamma ~ 1, nchangepoints = 1, "newmoon", weights)
#' }
#'   check_TS_inputs(data, timename = "newmoon")
#'
#' @export
#'
TS <- function(data, formula = gamma ~ 1, nchangepoints = 0, 
               timename = "time", weights = NULL, control = list()){
  check_TS_inputs(data, formula, nchangepoints, timename, weights, control)
  control <- do.call("TS_control", control)
  set.seed(control$seed)
  data <- data[order(data[,timename]), ]
  rho_dist <- est_changepoints(data, formula, nchangepoints, timename, 
                               weights, control)
  eta_dist <- est_regressors(rho_dist, data, formula, timename, weights, 
                             control)
  package_TS(data, formula, timename, weights, control, rho_dist, eta_dist)
}

#' @rdname TS
#' 
#' @export
#'
check_TS_inputs <- function(data, formula = gamma ~ 1, nchangepoints = 0, 
                            timename = "time", weights = NULL, 
                            control = list()){
  check_formula(data, formula)  
  check_nchangepoints(nchangepoints)
  check_weights(weights)
  check_timename(data, timename)
  check_control(control)
  return() 
}

#' @title Summarize the Time Series model 
#'
#' @description Calculate relevant summaries for the run of a Time Series
#'   model within \code{\link{TS}} and package the output as a
#'   \code{TS_fit}-class object.
#'
#' @param data \code{data.frame} including [1] the time variable (indicated 
#'   in \code{timename}), [2] the predictor variables (required by
#'   \code{formula}) and [3], the multinomial response variable (indicated in
#'   \code{formula}) as verified by \code{\link{check_timename}} and 
#'   \code{\link{check_formula}}. Note that the response variables should be
#'   formatted as a \code{data.frame} object named as indicated by the 
#'   \code{response} entry in the \code{control} list, such as \code{gamma} 
#'   for a standard TS analysis on LDA output. 
#'
#' @param formula \code{\link[stats]{formula}} defining the regression between
#'   relationship the change points. Any 
#'   predictor variable included must also be a column in 
#'   \code{data} and any (multinomial) response variable must be a set of
#'   columns in \code{data}, as verified by \code{\link{check_formula}}.
#'
#' @param timename \code{character} element indicating the time variable
#'   used in the time series. 
#'
#' @param weights Optional class \code{numeric} vector of weights for each 
#'   document. Defaults to \code{NULL}, translating to an equal weight for
#'   each document. When using \code{multinom_TS} in a standard LDATS 
#'   analysis, it is advisable to weight the documents by their total size,
#'   as the result of \code{\link[topicmodels]{LDA}} is a matrix of 
#'   proportions, which does not account for size differences among documents.
#'   For most models, a scaling of the weights (so that the average is 1) is
#'   most appropriate, and this is accomplished using \code{document_weights}.
#'
#' @param control A \code{list} of parameters to control the fitting of the
#'   Time Series model including the parallel tempering Markov Chain 
#'   Monte Carlo (ptMCMC) controls. Values not input assume defaults set by 
#'   \code{\link{TS_control}}.
#'
#' @param rho_dist List of saved data objects from the ptMCMC estimation of
#'   change point locations returned by \code{\link{est_changepoints}}
#'   (unless \code{nchangepoints} is 0, then \code{NULL}).
#'
#' @param eta_dist Matrix of draws (rows) from the marginal posteriors of the 
#'   coefficients across the segments (columns), as estimated by
#'   \code{\link{est_regressors}}. 
#'
#' @return \code{TS_fit}-class list containing the following elements, many of
#'   which are hidden for \code{print}ing, but are accessible:
#'   \describe{
#'     \item{data}{\code{data} input to the function.}
#'     \item{formula}{\code{\link[stats]{formula}} input to the function.}
#'     \item{nchangepoints}{\code{nchangepoints} input to the function.}
#'     \item{weights}{\code{weights} input to the function.}
#'     \item{timename}{\code{timename} input to the function.}
#'     \item{control}{\code{control} input to the function.}
#'     \item{lls}{Iteration-by-iteration 
#'                \link[=logLik.multinom_TS_fit]{logLik} values for the
#'                 full time series fit by \code{\link{multinom_TS}}.}
#'     \item{rhos}{Iteration-by-iteration change point estimates from
#'                 \code{\link{est_changepoints}}.}
#'     \item{etas}{Iteration-by-iteration marginal regressor estimates from
#'                 \code{\link{est_regressors}}, which have been 
#'                 unconditioned with respect to the change point locations.}
#'     \item{ptMCMC_diagnostics}{ptMCMC diagnostics, 
#'                                see \code{\link{diagnose_ptMCMC}}}
#'     \item{rho_summary}{Summary table describing \code{rhos} (the change
#'                        point locations), 
#'                        see \code{\link{summarize_rhos}}.}
#'     \item{rho_vcov}{Variance-covariance matrix for the estimates of
#'                      \code{rhos} (the change point locations), see 
#'                      \code{\link{measure_rho_vcov}}.}
#'     \item{eta_summary}{Summary table describing \code{ets} (the 
#'                        regressors), 
#'                        see \code{\link{summarize_etas}}.}
#'     \item{eta_vcov}{Variance-covariance matrix for the estimates of
#'                      \code{etas} (the regressors), see 
#'                      \code{\link{measure_eta_vcov}}.}
#'     \item{logLik}{Across-iteration average of log-likelihoods 
#'                    (\code{lls}).}
#'     \item{nparams}{Total number of parameters in the full model,
#'                    including the change point locations and regressors.}
#'     \item{AIC}{Penalized negative log-likelihood, based on 
#'                     \code{logLik} and \code{nparams}.}
#'   }
#'
#' @examples 
#' \donttest{
#'   data(rodents)
#'   document_term_table <- rodents$document_term_table
#'   document_covariate_table <- rodents$document_covariate_table
#'   LDA_models <- LDA_set(document_term_table, topics = 2)[[1]]
#'   data <- document_covariate_table
#'   data$gamma <- LDA_models@gamma
#'   weights <- document_weights(document_term_table)
#'   formula <- gamma ~ 1
#'   nchangepoints <- 1
#'   control <- TS_control()
#'   data <- data[order(data[,"newmoon"]), ]
#'   rho_dist <- est_changepoints(data, formula, nchangepoints, "newmoon", 
#'                                weights, control)
#'   eta_dist <- est_regressors(rho_dist, data, formula, "newmoon", weights, 
#'                              control)
#'   package_TS(data, formula, "newmoon", weights, control, rho_dist, 
#'              eta_dist)
#' }
#'
#' @export
#'
package_TS <- function(data, formula, timename, weights, control, rho_dist, 
                         eta_dist){

  check_formula(data, formula) 
  check_weights(weights)
  check_control(control)
  check_timename(data, timename)
  control <- do.call("TS_control", control)
  nchangepoints <- dim(rho_dist$cpts)[1]
  if (is.null(nchangepoints)){
    nchangepoints <- 0
    mod <- multinom_TS(data, formula, changepoints = NULL, timename, weights,
                       control)
    mod <- mod[[1]][[1]]
    lls <- as.numeric(logLik(mod))
    rhos <- NULL
  } else{
    lls <- rho_dist$lls[1, ]
    rhos <- t(array(rho_dist$cpts[ , 1, ], dim = dim(rho_dist$cpts)[c(1, 3)]))
  }

  ptMCMC_diagnostics <- diagnose_ptMCMC(rho_dist)
  rho_summary <- summarize_rhos(rhos, control)
  rho_vcov <- measure_rho_vcov(rhos)
  eta_summary <- summarize_etas(eta_dist, control)
  eta_vcov <- measure_eta_vcov(eta_dist)

  logLik <- mean(lls)
  ncoefs <- ncol(eta_dist)
  nparams <- nchangepoints + ncoefs 
  AIC <- -2 * logLik + 2 * nparams

  out <- list(data = data, formula = formula, nchangepoints = nchangepoints,
              timename = timename, weights = weights,
              control = control, lls = lls, rhos = rhos,
              etas = eta_dist, ptMCMC_diagnostics = ptMCMC_diagnostics,
              rho_summary = rho_summary, rho_vcov = rho_vcov,
              eta_summary = eta_summary, eta_vcov = eta_vcov,
              logLik = logLik, nparams = nparams, AIC = AIC)
  class(out) <- c("TS_fit", "list")
  to_hide <- c("data", "weights", "control", "lls", "rhos", "etas", 
               "rho_vcov", "eta_vcov")
  if (nchangepoints == 0){
    to_hide <- c(to_hide, "ptMCMC_diagnostics", "rho_summary")
  }
  attr(out, "hidden") <- to_hide
  out
}

#' @title Print a Time Series model fit
#'
#' @description Convenience function to print only the most important 
#'   components of a \code{TS_fit}-class object fit by 
#'   \code{\link{TS}}.
#'
#' @param x Class \code{TS_fit} object to be printed.
#'
#' @param ... Not used, simply included to maintain method compatibility.
#'
#' @return The non-hidden parts of \code{x} as a \code{list}.
#'
#' @examples 
#' \donttest{
#'   data(rodents)
#'   document_term_table <- rodents$document_term_table
#'   document_covariate_table <- rodents$document_covariate_table
#'   LDA_models <- LDA_set(document_term_table, topics = 2)[[1]]
#'   data <- document_covariate_table
#'   data$gamma <- LDA_models@gamma
#'   weights <- document_weights(document_term_table)
#'   TSmod <- TS(data, gamma ~ 1, nchangepoints = 1, "newmoon", weights)
#'   print(TSmod)
#' }
#'
#' @export
#'
print.TS_fit <- function(x, ...){
  hid <- attr(x, "hidden")
  notHid <- !names(x) %in% hid
  print(x[notHid])
}

#' @title Summarize the regressor (eta) distributions
#'
#' @description \code{summarize_etas} calculates summary statistics for each
#'   of the chunk-level regressors. 
#'   \cr \cr
#'   \code{measure_ets_vcov} generates the variance-covariance matrix for 
#'   the regressors.
#'
#' @param etas Matrix of regressors (columns) across iterations of the 
#'   ptMCMC (rows), as returned from \code{\link{est_regressors}}.
#'
#' @param control A \code{list} of parameters to control the fitting of the
#'   Time Series model including the parallel tempering Markov Chain 
#'   Monte Carlo (ptMCMC) controls. Values not input assume defaults set by 
#'   \code{\link{TS_control}}.
#'
#' @return \code{summarize_etas}: table of summary statistics for chunk-level
#'   regressors including mean, median, mode, posterior interval, standard
#'   deviation, MCMC error, autocorrelation, and effective sample size for 
#'   each regressor. \cr \cr
#'   \code{measure_eta_vcov}: variance-covariance matrix for chunk-level
#'   regressors.
#'
#' @examples
#'  etas <- matrix(rnorm(100), 50, 2)
#'  summarize_etas(etas)
#'  measure_eta_vcov(etas)
#' 
#' @export 
#'
summarize_etas <- function(etas, control = list()){
  check_control(control)
  control <- do.call("TS_control", control)
  if (!is.matrix(etas)){
    stop("etas should be a matrix")
  }
  prob <- control$summary_prob
  Mean <- round(apply(etas, 2, mean), 4)
  Median <- round(apply(etas, 2, median), 4)
  SD <- round(apply(etas, 2, sd), 4)
  MCMCerr <- round(SD / sqrt(nrow(etas)), 4)
  HPD <- HPDinterval(as.mcmc(etas), prob = prob)
  Lower <- round(HPD[ , "lower"], 4)
  Upper <- round(HPD[ , "upper"], 4)
  AC10 <- tryCatch(t(round(autocorr.diag(as.mcmc(etas), lag = 10), 4)),
                   error = function(x){"-"})
  ESS <- effectiveSize(etas)
  out <- data.frame(Mean, Median, Lower, Upper, SD, MCMCerr, AC10, ESS)
  colnames(out)[3:4] <- paste0(c("Lower_", "Upper_"), paste0(prob*100, "%"))
  colnames(out)[7] <- "AC10"
  rownames(out) <- colnames(etas)
  out
}

#' @rdname summarize_etas 
#'
#' @export
#'
measure_eta_vcov <- function(etas){
  if (!is.matrix(etas)){
    stop("expecting etas to be a matrix")
  }
  out <- var(etas)
  colnames(out) <- colnames(etas)
  rownames(out) <- colnames(etas)
  out
}

#' @title Summarize the rho distributions
#'
#' @description \code{summarize_rho} calculates summary statistics for each
#'   of the change point locations.
#'   \cr \cr
#'   \code{measure_rho_vcov} generates the variance-covariance matrix for the 
#'   change point locations.
#'
#' @param rhos Matrix of change point locations (columns) across iterations of 
#'   the ptMCMC (rows) or \code{NULL} if no change points are in the model,
#'   as returned from \code{\link{est_changepoints}}.
#'
#' @param control A \code{list} of parameters to control the fitting of the
#'   Time Series model including the parallel tempering Markov Chain 
#'   Monte Carlo (ptMCMC) controls. Values not input assume defaults set by 
#'   \code{\link{TS_control}}.
#'
#' @return \code{summarize_rhos}: table of summary statistics for change point
#'   locations including mean, median, mode, posterior interval, standard
#'   deviation, MCMC error, autocorrelation, and effective sample size for 
#'   each change point location. \cr \cr
#'   \code{measure_rho_vcov}: variance-covariance matrix for change 
#'   point locations.
#'
#' @examples
#'  rhos <- matrix(sample(80:100, 100, TRUE), 50, 2)
#'  summarize_rhos(rhos)
#'  measure_rho_vcov(rhos)
#' 
#' @export 
#'
summarize_rhos <- function(rhos, control = list()){
  check_control(control)
  control <- do.call("TS_control", control)
  if (is.null(rhos)) {
    return()
  }
  prob <- control$summary_prob
  Mean <- round(apply(rhos, 2, mean), 2)
  Median <- apply(rhos, 2, median)
  Mode <- apply(rhos, 2, modalvalue)
  SD <- round(apply(rhos, 2, sd), 2)
  MCMCerr <- round(SD / sqrt(nrow(rhos)), 4)
  HPD <- HPDinterval(as.mcmc(rhos), prob = prob)
  Lower <- HPD[ , "lower"]
  Upper <- HPD[ , "upper"]
  AC10 <- t(round(autocorr.diag(as.mcmc(rhos), lag = 10), 4))
  ESS <- effectiveSize(rhos)
  out <- data.frame(Mean, Median, Mode, Lower, Upper, SD, MCMCerr, AC10, ESS)
  colnames(out)[4:5] <- paste0(c("Lower_", "Upper_"), paste0(prob*100, "%"))
  colnames(out)[8] <- "AC10"
  rownames(out) <- sprintf("Changepoint_%d", seq_len(nrow(out)))
  out
}

#' @rdname summarize_rhos 
#'
#' @export
#'
measure_rho_vcov <- function(rhos){
  if (is.null(rhos)) {
    return()
  }
  if (!is.matrix(rhos)){
    stop("expecting rhos to be a matrix")
  }
  out <- var(rhos)
  colnames(out) <- sprintf("CP_%d", 1:dim(out)[1])
  rownames(out) <- sprintf("CP_%d", 1:dim(out)[2])
  out
}

#' @title Estimate the distribution of regressors, unconditional on the
#'   change point locations
#'
#' @description This function uses the marginal posterior distributions of
#'   the change point locations (estimated by \code{\link{est_changepoints}})
#'   in combination with the conditional (on the change point locations) 
#'   posterior distributions of the regressors (estimated by
#'   \code{\link{multinom_TS}}) to estimate the marginal posterior 
#'   distribution of the regressors, unconditional on the change point 
#'   locations.
#'
#' @details The general approach follows that of Western and Kleykamp
#'   (2004), although we note some important differences. Our regression
#'   models are fit independently for each chunk (segment of time), and 
#'   therefore the variance-covariance matrix for the full model 
#'   has \code{0} entries for covariances between regressors in different
#'   chunks of the time series. Further, because the regression model here
#'   is a standard (non-hierarchical) softmax (Ripley 1996, Venables and 
#'   Ripley 2002, Bishop 2006), there is no error term in the regression  
#'   (as there is in the normal model used by Western and Kleykamp 2004), 
#'   and so the posterior distribution used here is a multivariate normal,
#'   as opposed to a multivariate t, as used by Western and Kleykamp (2004).
#'
#' @param rho_dist List of saved data objects from the ptMCMC estimation of
#'   change point locations (unless \code{nchangepoints} is 0, then 
#'   \code{NULL}) returned from \code{\link{est_changepoints}}.
#'
#' @param data \code{data.frame} including [1] the time variable (indicated 
#'   in \code{timename}), [2] the predictor variables (required by
#'   \code{formula}) and [3], the multinomial response variable (indicated in
#'   \code{formula}) as verified by \code{\link{check_timename}} and 
#'   \code{\link{check_formula}}. Note that the response variables should be
#'   formatted as a \code{data.frame} object named as indicated by the 
#'   \code{response} entry in the \code{control} list, such as \code{gamma} 
#'   for a standard TS analysis on LDA output. 
#'
#' @param formula \code{\link[stats]{formula}} defining the regression between
#'   relationship the change points. Any 
#'   predictor variable included must also be a column in 
#'   \code{data} and any (multinomial) response variable must be a set of
#'   columns in \code{data}, as verified by \code{\link{check_formula}}.
#'
#' @param timename \code{character} element indicating the time variable
#'   used in the time series. 
#'
#' @param weights Optional class \code{numeric} vector of weights for each 
#'   document. Defaults to \code{NULL}, translating to an equal weight for
#'   each document. When using \code{multinom_TS} in a standard LDATS 
#'   analysis, it is advisable to weight the documents by their total size,
#'   as the result of \code{\link[topicmodels]{LDA}} is a matrix of 
#'   proportions, which does not account for size differences among documents.
#'   For most models, a scaling of the weights (so that the average is 1) is
#'   most appropriate, and this is accomplished using \code{document_weights}.
#'
#' @param control A \code{list} of parameters to control the fitting of the
#'   Time Series model including the parallel tempering Markov Chain 
#'   Monte Carlo (ptMCMC) controls. Values not input assume defaults set by 
#'   \code{\link{TS_control}}.
#'
#' @return \code{matrix} of draws (rows) from the marginal posteriors of the 
#'   coefficients across the segments (columns). 
#'
#' @references
#'   Bishop, C. M. 2006. \emph{Pattern Recognition and Machine Learning}. 
#'    Springer, New York, NY, USA.
#'
#'   Ripley, B. D. 1996. \emph{Pattern Recognition and Neural Networks}. 
#'   Cambridge University Press, Cambridge, UK.
#'
#'   Venables, W. N. and B. D. Ripley. 2002. \emph{Modern and Applied
#'   Statistics with S}. Fourth Edition. Springer, New York, NY, USA.
#'
#'   Western, B. and M. Kleykamp. 2004. A Bayesian change point model for 
#'   historical time series analysis. \emph{Political Analysis}
#'   \strong{12}:354-374.
#'   \href{https://www.cambridge.org/core/journals/political-analysis/article/abs/bayesian-change-point-model-for-historical-time-series-analysis/F7D2EDBBC211278EC6C6CB43FE170812}{link}.
#'
#' @examples 
#' \donttest{
#'   data(rodents)
#'   document_term_table <- rodents$document_term_table
#'   document_covariate_table <- rodents$document_covariate_table
#'   LDA_models <- LDA_set(document_term_table, topics = 2)[[1]]
#'   data <- document_covariate_table
#'   data$gamma <- LDA_models@gamma
#'   weights <- document_weights(document_term_table)
#'   formula <- gamma ~ 1
#'   nchangepoints <- 1
#'   control <- TS_control()
#'   data <- data[order(data[,"newmoon"]), ]
#'   rho_dist <- est_changepoints(data, formula, nchangepoints, "newmoon", 
#'                                weights, control)
#'   eta_dist <- est_regressors(rho_dist, data, formula, "newmoon", weights, 
#'                              control)
#' }
#'
#' @export
#'
est_regressors <- function(rho_dist, data, formula, timename, weights, 
                           control = list()){
  check_formula(data, formula)  
  check_weights(weights)
  check_control(control)
  control <- do.call("TS_control", control)
  if (!is.null(rho_dist)){
    if (any(names(rho_dist)[1:3] != c("cpts", "lls", "ids"))){
      stop("expecting rho_dist to have elements cpts, lls, ids")
    }
  }
  if (is.null(rho_dist)){
    mod <- multinom_TS(data, formula, changepoints = NULL, timename, weights, 
                       control)
    mod <- mod[[1]][[1]]
    mv <- as.vector(t(coef(mod)))
    vcv <- mirror_vcov(mod)
    eta <- rmvnorm(control$nit, mv, vcv)
    seg_names <- rep(1, ncol(vcv))
    coef_names <- colnames(vcv)
    colnames(eta) <- paste(seg_names, coef_names, sep = "_")
    return(eta)
  }

  focal_rho <- rho_dist$cpts[ , 1, ]
  nchangepts <- dim(rho_dist$cpts)[1]
  if (nchangepts == 1){
    collapsedrho <- focal_rho
  } else{
    collapsedrho <- apply(focal_rho, 2, paste, collapse = "_")
  }
  freq_r <- table(collapsedrho)
  unique_r <- names(freq_r)
  nr <- length(unique_r)
  n_topic <- ncol(data$gamma)
  n_covar <- length(attr(terms(formula), "term.labels"))
  n_eta_segment <- (n_topic - 1) * (n_covar + 1)
  n_changept <- dim(rho_dist$cpts)[1]
  n_segment <- n_changept + 1
  n_eta <- n_eta_segment * n_segment 
  eta <- matrix(NA, nrow = control$nit, ncol = n_eta)
  pbar <- prep_pbar(control, "eta", nr)

  for(i in 1:nr){
    update_pbar(pbar, control)
    cpts <- as.numeric(strsplit(unique_r[i], "_")[[1]])
    mods <- multinom_TS(data, formula, cpts, timename, weights, control)
    ndraws <- freq_r[i]
    colindex1 <- 1
    for(j in 1:n_segment){
      colindex2 <- colindex1 + n_eta_segment - 1
      seg_mod <- mods[[1]][[j]]
      mv <- as.vector(t(coef(seg_mod)))
      vcv <- mirror_vcov(seg_mod)
      drawn <- rmvnorm(ndraws, mv, vcv)    
      rows_in <- which(collapsedrho == unique_r[i])
      cols_in <- colindex1:colindex2
      eta[rows_in, cols_in] <- drawn
      colindex1 <- colindex2 + 1
    }
  }
  seg_names <- rep(1:n_segment, each = n_eta_segment)
  coef_names <- rep(colnames(vcv), n_segment)
  colnames(eta) <- paste(seg_names, coef_names, sep = "_")
  eta
}


#' @title Use ptMCMC to estimate the distribution of change point locations
#'
#' @description This function executes ptMCMC-based estimation of the 
#'   change point location distributions for multinomial Time Series analyses.
#'
#' @param data \code{data.frame} including [1] the time variable (indicated 
#'   in \code{timename}), [2] the predictor variables (required by
#'   \code{formula}) and [3], the multinomial response variable (indicated in
#'   \code{formula}) as verified by \code{\link{check_timename}} and 
#'   \code{\link{check_formula}}. Note that the response variables should be
#'   formatted as a \code{data.frame} object named as indicated by the 
#'   \code{response} entry in the \code{control} list, such as \code{gamma} 
#'   for a standard TS analysis on LDA output. 
#'
#' @param formula \code{\link[stats]{formula}} defining the regression between
#'   relationship the change points. Any 
#'   predictor variable included must also be a column in 
#'   \code{data} and any (multinomial) response variable must be a set of
#'   columns in \code{data}, as verified by \code{\link{check_formula}}.
#'
#' @param nchangepoints \code{integer} corresponding to the number of 
#'   change points to include in the model. 0 is a valid input (corresponding
#'   to no change points, so a singular time series model), and the current 
#'   implementation can reasonably include up to 6 change points. The 
#'   number of change points is used to dictate the segmentation of the 
#'   time series into chunks fit with separate models dictated by 
#'   \code{formula}.
#'
#' @param timename \code{character} element indicating the time variable
#'   used in the time series. 
#'
#' @param weights Optional class \code{numeric} vector of weights for each 
#'   document. Defaults to \code{NULL}, translating to an equal weight for
#'   each document. When using \code{multinom_TS} in a standard LDATS 
#'   analysis, it is advisable to weight the documents by their total size,
#'   as the result of \code{\link[topicmodels]{LDA}} is a matrix of 
#'   proportions, which does not account for size differences among documents.
#'   For most models, a scaling of the weights (so that the average is 1) is
#'   most appropriate, and this is accomplished using \code{document_weights}.
#'
#' @param control A \code{list} of parameters to control the fitting of the
#'   Time Series model including the parallel tempering Markov Chain 
#'   Monte Carlo (ptMCMC) controls. Values not input assume defaults set by 
#'   \code{\link{TS_control}}.
#'
#' @return List of saved data objects from the ptMCMC estimation of
#'   change point locations (unless \code{nchangepoints} is 0, then 
#'   \code{NULL} is returned).
#'
#' @examples 
#' \donttest{
#'   data(rodents)
#'   document_term_table <- rodents$document_term_table
#'   document_covariate_table <- rodents$document_covariate_table
#'   LDA_models <- LDA_set(document_term_table, topics = 2)[[1]]
#'   data <- document_covariate_table
#'   data$gamma <- LDA_models@gamma
#'   weights <- document_weights(document_term_table)
#'   formula <- gamma ~ 1
#'   nchangepoints <- 1
#'   control <- TS_control()
#'   data <- data[order(data[,"newmoon"]), ]
#'   rho_dist <- est_changepoints(data, formula, nchangepoints, "newmoon", 
#'                                weights, control)
#' }
#'
#' @export
#'
est_changepoints <- function(data, formula, nchangepoints, timename, weights, 
                             control = list()){
  check_TS_inputs(data, formula, nchangepoints, timename, weights, control)
  control <- do.call("TS_control", control)
  if (nchangepoints == 0){
    return(NULL)
  }
  saves <- prep_saves(nchangepoints, control)
  inputs <- prep_ptMCMC_inputs(data, formula, nchangepoints, timename, 
                               weights, control)
  cpts <- prep_cpts(data, formula, nchangepoints, timename, weights, control)
  ids <- prep_ids(control)
  pbar <- prep_pbar(control, "rho")

  for(i in 1:control$nit){
    update_pbar(pbar, control)
    steps <- step_chains(i, cpts, inputs)
    swaps <- swap_chains(steps, inputs, ids)
    saves <- update_saves(i, saves, steps, swaps)
    cpts <- update_cpts(cpts, swaps)
    ids <- update_ids(ids, swaps)
  }
  process_saves(saves, control)
}

#' @title Initialize and tick through the progress bar
#'
#' @description \code{prep_pbar} creates and \code{update_pbar} steps
#'   through the progress bars (if desired) in \code{\link{TS}}
#'
#' @param control A \code{list} of parameters to control the fitting of the
#'   Time Series model including the parallel tempering Markov Chain 
#'   Monte Carlo (ptMCMC) controls. Values not input assume defaults set by 
#'   \code{\link{TS_control}}. Of use here is \code{quiet} which is a
#'   a \code{logical} indicator of whether there should be information 
#'   (i.e. the progress bar) printed during the run or not. Default is 
#'   \code{TRUE}.
#'
#' @param bar_type "rho" (for change point locations) or "eta" (for 
#'   regressors).
#'
#' @param nr \code{integer} number of unique realizations, needed when
#'   \code{bar_type} = "eta".
#'
#' @param pbar The progress bar object returned from \code{prep_pbar}.
#'
#' @return \code{prep_pbar}: the initialized progress bar object. \cr \cr
#'   \code{update_pbar}: the ticked-forward \code{pbar}.
#'
#' @examples
#'   pb <- prep_pbar(control = list(nit = 2)); pb
#'   pb <- update_pbar(pb); pb
#'   pb <- update_pbar(pb); pb
#'
#' @export
#'
prep_pbar <- function(control = list(), bar_type = "rho", 
                      nr = NULL){
  check_control(control)
  control <- do.call("TS_control", control)
  if (!(bar_type %in% c("eta", "rho"))){
    stop("bar_type must be eta or rho")
  }
  if (!is.null(nr)){
    if (!is.numeric(nr) || any(nr %% 1 != 0)){
      stop("nr must be integer-valued")
    }
  }
  form <- "  [:bar] :percent eta: :eta"
  if (bar_type == "rho"){
    msg <- "  Estimating changepoint distribution"
    out <- progress_bar$new(form, control$nit, width = 60)
  }
  if (bar_type == "eta"){
    msg <- "  Estimating regressor distribution"
    out <- progress_bar$new(form, nr, width = 60)
  }
  messageq(msg, control$quiet)
  out
}

#' @rdname prep_pbar
#'
#' @export
#'
update_pbar <- function(pbar, control = list()){
  if (!("progress_bar" %in% class(pbar))){
    stop("pbar must be of class progress_bar")
  }
  check_control(control)
  control <- do.call("TS_control", control)
  if (control$quiet){
    return()
  }
  pbar$tick()
}


#' @title Check that a formula is proper
#' 
#' @description Check that \code{formula} is actually a 
#'   \code{\link[stats]{formula}} and that the
#'   response and predictor variables are all included in \code{data}.
#'   
#' @param formula \code{formula} to evaluate.
#'
#' @param data \code{data.frame} including [1] the time variable (indicated 
#'   in \code{timename}), [2] the predictor variables (required by
#'   \code{formula}) and [3], the multinomial response variable (indicated in
#'   \code{formula}) as verified by \code{\link{check_timename}} and 
#'   \code{\link{check_formula}}. Note that the response variables should be
#'   formatted as a \code{data.frame} object named as indicated by the 
#'   \code{response} entry in the \code{control} list, such as \code{gamma} 
#'   for a standard TS analysis on LDA output. 
#' 
#' @return An error message is thrown if \code{formula} is not proper,
#'   else \code{NULL}.
#' 
#' @examples 
#'   data(rodents)
#'   document_term_table <- rodents$document_term_table
#'   document_covariate_table <- rodents$document_covariate_table
#'   LDA_models <- LDA_set(document_term_table, topics = 2)[[1]]
#'   data <- document_covariate_table
#'   data$gamma <- LDA_models@gamma
#'   check_formula(data, gamma ~ 1)
#'
#' @export
#'
check_formula <- function(data, formula){

  if (!is(formula, "formula")){
    stop("formula does not contain a single formula")
  }

  respLoc <- attr(terms(formula), "response")
  if (respLoc == 0){
    stop("formula inputs should include response variable")
  }  

  resp <- as.character(attr(terms(formula), "variables"))[-1][respLoc]
  pred <- attr(terms(formula), "term.labels")
  if (!resp %in% colnames(data)){
    stop("formula includes response not present in data")
  }
  if (!all(pred %in% colnames(data))){
    misses <- pred[which(pred %in% colnames(data) == FALSE)]
    mis <- paste(misses, collapse = ", ")
    stop(paste0("formula includes predictors not present in data: ", mis))
  }
  return()
}

#' @title Create the controls list for the Time Series model
#'
#' @description This function provides a simple creation and definition of a
#'   list used to control the time series model fit occurring within 
#'   \code{\link{TS}}. 
#'
#' @param memoise \code{logical} indicator of whether the multinomial 
#'   functions should be memoised (via \code{\link[memoise]{memoise}}). 
#'   Memoisation happens to both \code{\link{multinom_TS}} and 
#'   \code{\link{multinom_TS_chunk}}.
#'
#' @param response \code{character} element indicating the response variable 
#'   used in the time series. 
#'
#' @param lambda \code{numeric} "weight" decay term used to set the prior
#'   on the regressors within each chunk-level model. Defaults to 0, 
#'   corresponding to a fully vague prior.
#'
#' @param measurer,selector Function names for use in evaluation of the TS
#'   models. \code{measurer} is used to create a value for each model
#'   and \code{selector} operates on the values to choose the model. 
#'
#' @param ntemps \code{integer} number of temperatures (chains) to use in the 
#'   ptMCMC algorithm.
#'
#' @param penultimate_temp Penultimate temperature in the ptMCMC sequence.
#'
#' @param ultimate_temp Ultimate temperature in the ptMCMC sequence.
#'
#' @param q Exponent controlling the ptMCMC temperature sequence from the 
#'   focal chain (reference with temperature = 1) to the penultimate chain. 0
#'   (default) implies a geometric sequence. 1 implies squaring before 
#'   exponentiating.
#'
#' @param nit \code{integer} number of iterations (steps) used in the ptMCMC
#'   algorithm.
#'
#' @param magnitude Average magnitude (defining a geometric distribution)
#'   for the proposed step size in the ptMCMC algorithm.
#'
#' @param quiet \code{logical} indicator of whether the model should run 
#'   quietly (if \code{FALSE}, a progress bar and notifications are printed).
#'
#' @param burnin \code{integer} number of iterations to remove from the 
#'   beginning of the ptMCMC algorithm.
#'
#' @param thin_frac Fraction of iterations to retain, must be \eqn{(0, 1]},
#'   and the default value of 1 represents no thinning.
#'
#' @param summary_prob Probability used for summarizing the posterior 
#'   distributions (via the highest posterior density interval, see
#'   \code{\link[coda]{HPDinterval}}).
#'
#' @param seed Input to \code{set.seed} for replication purposes.
#'
#' @return \code{list}, with named elements corresponding to the arguments.
#'
#' @examples
#'   TS_control()
#'
#' @export
#'
TS_control <- function(memoise = TRUE, response = "gamma", lambda = 0, 
                       measurer = AIC, selector = min, ntemps = 6, 
                       penultimate_temp = 2^6, ultimate_temp = 1e10, q = 0, 
                       nit = 1e4, magnitude = 12, quiet = FALSE, burnin = 0, 
                       thin_frac = 1, summary_prob = 0.95, seed = NULL){
  list(memoise = memoise, response = response, lambda = lambda, 
       measurer = measurer, selector = selector, ntemps = ntemps, 
       penultimate_temp = penultimate_temp, ultimate_temp = ultimate_temp, 
       q = q, nit = nit, magnitude = magnitude, quiet = quiet, 
       burnin = burnin, thin_frac = thin_frac, summary_prob = summary_prob,
       seed = seed)

}

#' @title Determine the log likelihood of a Time Series model
#'
#' @description Convenience function to extract and format the log likelihood
#'   of a \code{TS_fit}-class object fit by \code{\link{multinom_TS}}.
#'
#' @param object Class \code{TS_fit} object to be evaluated.
#'
#' @param ... Not used, simply included to maintain method compatibility.
#'
#' @return Log likelihood of the model \code{logLik}, also with \code{df}
#'   (degrees of freedom) and \code{nobs} (number of observations) values.
#'
#' @examples 
#' \donttest{
#'   data(rodents)
#'   document_term_table <- rodents$document_term_table
#'   document_covariate_table <- rodents$document_covariate_table
#'   LDA_models <- LDA_set(document_term_table, topics = 2)[[1]]
#'   data <- document_covariate_table
#'   data$gamma <- LDA_models@gamma
#'   weights <- document_weights(document_term_table)
#'   TSmod <- TS(data, gamma ~ 1, nchangepoints = 1, "newmoon", weights)
#'   logLik(TSmod)
#' }
#'
#' @export
#'
logLik.TS_fit <- function(object, ...){
  val <- object$logLik
  attr(val, "df") <- object$nparams
  attr(val, "nobs") <- nrow(object$data)
  class(val) <- "logLik"
  val
}
Any scripts or data that you put into this service are public.
LDATS documentation built on Sept. 19, 2023, 5:08 p.m.
rdrr.io home R language documentation Run R code online
CRAN packages Bioconductor packages R-Forge packages GitHub packages
Note that we can't provide technical support on individual packages. You should contact the package authors for that.
LDATS
Latent Dirichlet Allocation Coupled with Time Series Analyses

R/TS.R
In LDATS: Latent Dirichlet Allocation Coupled with Time Series Analyses

Defines functions logLik.TS_fit TS_control check_formula update_pbar prep_pbar est_changepoints est_regressors measure_rho_vcov summarize_rhos measure_eta_vcov summarize_etas print.TS_fit package_TS check_TS_inputs TS

Documented in check_formula check_TS_inputs est_changepoints est_regressors logLik.TS_fit measure_eta_vcov measure_rho_vcov package_TS prep_pbar print.TS_fit summarize_etas summarize_rhos TS TS_control update_pbar

Try the LDATS package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

LDATS Latent Dirichlet Allocation Coupled with Time Series Analyses

R/TS.R In LDATS: Latent Dirichlet Allocation Coupled with Time Series Analyses

Defines functions logLik.TS_fit TS_control check_formula update_pbar prep_pbar est_changepoints est_regressors measure_rho_vcov summarize_rhos measure_eta_vcov summarize_etas print.TS_fit package_TS check_TS_inputs TS

Documented in check_formula check_TS_inputs est_changepoints est_regressors logLik.TS_fit measure_eta_vcov measure_rho_vcov package_TS prep_pbar print.TS_fit summarize_etas summarize_rhos TS TS_control update_pbar

Try the LDATS package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

LDATS
Latent Dirichlet Allocation Coupled with Time Series Analyses

R/TS.R
In LDATS: Latent Dirichlet Allocation Coupled with Time Series Analyses