R/modsem_da.R

#' Interaction between latent variables using LMS and QML approaches
#'
#' @param model.syntax \code{lavaan} syntax
#'
#' @param data A dataframe with observed variables used in the model.
#'
#' @param method method to use:
#' \describe{
#'   \item{\code{"lms"}}{latent moderated structural equations (not passed to \code{lavaan}).}
#'   \item{\code{"qml"}}{quasi maximum likelihood estimation (not passed to \code{lavaan}).}
#' }
#'
#' @param verbose should estimation progress be shown
#'
#' @param optimize should starting parameters be optimized
#'
#' @param nodes number of quadrature nodes (points of integration) used in \code{lms};
#'   an increased number gives better estimates but slower computation. How many are needed depends on the complexity of the model.
#'   For simple models, somewhere between 16 and 24 nodes should be enough; for more complex models, higher numbers may be needed.
#'   For models where there is an interaction effect between an endogenous and exogenous variable,
#'   the number of nodes should be at least 32, but practically (e.g., ordinal/skewed data), more than 32 is recommended. In cases
#'   where data is non-normal, it might be better to use the \code{qml} approach instead.
#'   You can also consider setting \code{adaptive.quad = TRUE}.
#'
#' @param missing How should missing values be handled? If \code{"listwise"} (default), missing values
#'   are removed list-wise (alias: \code{"complete"} or \code{"casewise"}).
#'   If \code{"impute"}, missing values are imputed using \code{Amelia::amelia}.
#'   If \code{"fiml"} (alias: \code{"ml"} or \code{"direct"}), full information maximum
#'   likelihood (FIML) is used. FIML can be (very) computationally intensive.
#'
#' @param convergence.abs Absolute convergence criterion.
#'   Lower values give better estimates but slower computation. Not relevant when
#'   using the QML approach. For the LMS approach the EM-algorithm stops whenever
#'   the relative or absolute convergence criterion is reached.
#'
#' @param convergence.rel Relative convergence criterion.
#'   Lower values give better estimates but slower computation.
#'   For the LMS approach the EM-algorithm stops whenever
#'   the relative or absolute convergence criterion is reached.
#'
#' @param optimizer optimizer to use, can be either \code{"nlminb"} or \code{"L-BFGS-B"}. For LMS, \code{"nlminb"} is recommended.
#'   For QML, \code{"L-BFGS-B"} may be faster if there is a large number of iterations, but slower if there are few iterations.
#'
#' @param center.data should data be centered before fitting the model?
#'
#' @param standardize.data should data be scaled before fitting the model? This will be overridden
#'   by \code{standardize} if \code{standardize} is set to \code{TRUE}.
#'
#' @param standardize.out should output be standardized? Note that this will alter the relationships
#'   implied by parameter constraints, since parameters are scaled unevenly, even if they
#'   have the same label. This does not alter the estimation of the model, only the
#'   output.
#'
#' \strong{NOTE}: It is recommended that you estimate the model normally and then standardize the output using
#' \code{\link{standardize_model}}, \code{\link{standardized_estimates}} or \code{summary(<modsem_da-object>, standardize=TRUE)}.
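#'
#' A minimal sketch of post-hoc standardization (assuming \code{est} is a fitted
#' \code{modsem_da} object):
#'
#' \preformatted{
#' est_std <- standardize_model(est)        # standardized copy of the model object
#' std     <- standardized_estimates(est)   # table of standardized estimates
#' summary(est, standardize = TRUE)         # standardize only the printed output
#' }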
#'
#' @param mean.observed should the mean structure of the observed variables be estimated?
#'   This will be overridden by \code{standardize}, if \code{standardize} is set to \code{TRUE}.
#'
#' \strong{NOTE}: Not recommended unless you know what you are doing.
#'
#' @param standardize will standardize the data before fitting the model, remove the mean
#'   structure of the observed variables, and standardize the output. Note that \code{standardize.data},
#'   \code{mean.observed}, and \code{standardize.out} will be overridden by \code{standardize} if \code{standardize} is set to \code{TRUE}.
#'
#' \strong{NOTE}: It is recommended that you estimate the model normally and then standardize the output using
#'   \code{\link{standardized_estimates}}.
#'
#' @param double try to double the number of dimensions of integration used in LMS;
#' this will be extremely slow, but should be more similar to \code{mplus}.
#'
#' @param cov.syntax model syntax for implied covariance matrix of exogenous latent variables
#'  (see \code{vignette("interaction_two_etas", "modsem")}).
#'
#' @param calc.se should standard errors be computed? \strong{NOTE}: If \code{FALSE}, the information matrix will not be computed either.
#'
#' @param FIM should the Fisher information matrix be calculated using the observed or expected values? Must be either \code{"observed"} or \code{"expected"}.
#'
#' @param EFIM.S if the expected Fisher information matrix is computed, \code{EFIM.S} selects the number of Monte Carlo samples. Defaults to 100.
#'   \strong{NOTE}: This number should likely be increased for better estimates (e.g., 1000), but it might drastically increase computation time.
#'
#' @param OFIM.hessian Logical. If \code{TRUE} (default), standard errors are
#'   based on the negative Hessian (observed Fisher information).
#'   If \code{FALSE}, they come from the outer product
#'   of individual score vectors (OPG). For correctly specified models,
#'   these two matrices are asymptotically equivalent, yielding nearly identical
#'   standard errors in large samples. The Hessian usually shows smaller finite-sample
#'   variance (i.e., it is more stable), and is therefore the default.
#'
#'   Note that the Hessian is not always positive definite, and is more computationally
#'   expensive to calculate. The OPG should always be positive definite, and is much
#'   faster to compute. If the model is correctly specified and the sample size is large,
#'   the two should yield similar results, and switching to the OPG can save a
#'   lot of time. Note that the required sample size depends on the complexity of the model.
#'
#'   A large difference between Hessian and OPG suggests misspecification, and
#'   \code{robust.se = TRUE} should be set to obtain sandwich (robust) standard errors.
#'
#' @param EFIM.parametric should data for calculating the expected Fisher information matrix be
#'   simulated parametrically (simulated based on the assumptions and implied parameters
#'   from the model), or non-parametrically (stochastically sampled)? If you believe that
#'   normality assumptions are violated, \code{EFIM.parametric = FALSE} might be the better option.
#'
#' @param R.max Maximum population size (not sample size) used in the calculation of the expected
#'   Fisher information matrix.
#'
#' @param robust.se should robust standard errors be computed, using the sandwich estimator?
#'
#' @param max.iter maximum number of iterations.
#'
#' @param max.step maximum steps for the M-step in the EM algorithm (LMS).
#'
#' @param start starting parameters.
#'
#' @param epsilon finite difference for numerical derivatives.
#'
#' @param quad.range range in z-scores over which to perform the numerical integration in LMS
#'   when using quasi-adaptive Gauss-Hermite quadratures. By default \code{Inf}, such that \code{f(t)} is integrated from \code{-Inf} to \code{Inf},
#'   but this will likely be inefficient and pointless at a large number of nodes. Nodes outside
#'   \code{+/- quad.range} will be ignored.
#'
#' @param adaptive.quad should a quasi-adaptive quadrature be used? If \code{TRUE}, the quadrature nodes will be adapted to the data.
#'   If \code{FALSE}, the quadrature nodes will be fixed. Default is \code{FALSE}. The adaptive quadrature does not fit a separate
#'   quadrature to each participant, but instead tries to place more nodes where the posterior distribution is highest. Compared with a
#'   fixed Gauss-Hermite quadrature, this usually means that fewer nodes are placed at the tails of the distribution.
#'
#' @param adaptive.frequency How often should the quasi-adaptive quadrature be calculated? Defaults to 3, meaning
#'   that it is recalculated every third EM-iteration.
#'
#' @param adaptive.quad.tol Relative error tolerance for the quasi-adaptive quadrature. Defaults to \code{1e-12}.
#'
#' @param n.threads number of threads to use for parallel processing. If \code{NULL}, it will use <= 2 threads.
#'   If an integer is specified, it will use that number of threads (e.g., \code{n.threads = 4} will use 4 threads).
#'   If \code{"default"}, it will use the default number of threads (2).
#'   If \code{"max"}, it will use all available threads, \code{"min"} will use 1 thread.
#'
#' @param algorithm algorithm to use for the EM algorithm. Can be either \code{"EM"} or \code{"EMA"}.
#'   \code{"EM"} is the standard EM algorithm. \code{"EMA"} is an
#'   accelerated EM procedure that uses Quasi-Newton and Fisher Scoring
#'   optimization steps when needed. Default is \code{"EM"}.
#'
#' @param em.control a list of control parameters for the EM algorithm. See \code{\link{default_settings_da}} for defaults.
#'
#' @param ordered Variables to be treated as ordered. The scale of the ordinal variables
#'   is corrected for unequal intervals. The underlying continuous distributions
#'   are estimated using a Monte Carlo bootstrap approach, and the ordinal values are replaced with
#'   the expected values for each interval. Using \code{ordered=TRUE} should yield estimates
#'   which are more robust to unequal intervals in ordinal variables, i.e., the estimates
#'   should be more consistent and less biased.
#'
#' @param ordered.iter Maximum number of sampling iterations used to sample the underlying continuous distribution of the
#'   ordinal variables. The default is set to \code{100}.
#'
#' @param ordered.warmup Number of sampling iterations in the warmup phase.
#'
#' @param cluster Clusters used to compute standard errors robust to non-independence of observations. Must be paired with
#'   \code{robust.se = TRUE}.
#'
#' @param cr1s Logical; if \code{TRUE}, apply the CR1S small-sample correction factor
#'   to the cluster-robust variance estimator. The CR1S factor is
#'   \eqn{(G / (G - 1)) \cdot ((N - 1) / (N - q))}, where \eqn{G} is the number of
#'   clusters, \eqn{N} is the total number of observations, and \eqn{q} is the number
#'   of free parameters. This adjustment inflates standard errors to reduce the
#'   small-sample downward bias present in the basic cluster-robust (CR0) estimator,
#'   especially when \eqn{G} is small. If \code{FALSE}, the unadjusted CR0 estimator
#'   is used. Defaults to \code{TRUE}. Only relevant if \code{cluster} is specified.
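#'
#'   A minimal sketch of cluster-robust standard errors; \code{"school"} is a hypothetical
#'   grouping variable assumed to exist in \code{data}, and \code{m1} a model syntax string:
#'   \preformatted{
#' est <- modsem_da(m1, data = data, method = "lms",
#'                  cluster = "school", robust.se = TRUE, cr1s = TRUE)
#' }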
#'
#' @param rcs Should latent variable indicators be replaced with reliability-corrected
#'   single item indicators instead? See \code{\link{relcorr_single_item}}.
#'
#' @param rcs.choose Which latent variables should get their indicators replaced with
#'   reliability-corrected single items? It is passed to \code{\link{relcorr_single_item}}
#'   as the \code{choose} argument.
#'
#' @param rcs.scale.corrected Should reliability-corrected items be scale-corrected? If \code{TRUE}
#'   reliability-corrected single items are corrected for differences in factor loadings between
#'   the items. Default is \code{TRUE}.
#'
#' @param orthogonal.x If \code{TRUE}, all covariances among the exogenous latent variables are set to zero.
#'  Default is \code{FALSE}.
#'
#' @param orthogonal.y If \code{TRUE}, all covariances among the endogenous latent variables are set to zero.
#'  If \code{FALSE}, residual covariances are added between purely endogenous variables, i.e.,
#'  those that are not predicted by any other endogenous variable in the structural model.
#'  Default is \code{FALSE}.
#'
#' @param auto.fix.first If \code{TRUE}, the factor loading of the first indicator of
#'  a given latent variable is fixed to \code{1}. If \code{FALSE}, no loadings are fixed
#'  (automatically). Note that this might leave the model unidentified.
#'  Default is \code{TRUE}. \strong{NOTE}: this behaviour is overridden
#'  if the first loading is labelled, in which case it is treated as a free parameter instead. This
#'  differs from the default behaviour in \code{lavaan}.
#'
#' @param auto.fix.single If \code{TRUE}, the residual variance of
#'  an observed indicator is set to zero if it is the only indicator of a latent variable.
#'  If \code{FALSE}, the residual variance is not fixed to zero, and is treated as a free parameter
#'  of the model. Default is \code{TRUE}. \strong{NOTE}: this behaviour is overridden
#'  if the residual variance is labelled, in which case it is treated as a free parameter instead.
#'
#' @param auto.split.syntax Should the model syntax automatically be split into a
#'   linear and non-linear part? This is done by moving the structural model for
#'   linear endogenous variables (used in interaction terms) into the \code{cov.syntax}
#'   argument. This can potentially allow interactions between two endogenous variables
#'   given that both are linear (i.e., not affected by interaction terms). This is
#'   \code{FALSE} by default for the LMS approach.
#'   When using the QML approach, interaction effects between exogenous and endogenous
#'   variables can in some cases be biased if the model is not split beforehand.
#'   The default is therefore \code{TRUE} for the QML approach.
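#'
#'   A minimal sketch of splitting the syntax manually via \code{cov.syntax}
#'   (see \code{vignette("interaction_two_etas", "modsem")}); \code{main.syntax} and
#'   \code{mydata} are placeholders, and \code{Z} is a linear endogenous variable
#'   appearing in the interaction term \code{X:Z}:
#'   \preformatted{
#' # The linear structural equation for Z is moved out of the main model,
#' # into cov.syntax:
#' est <- modsem_da(main.syntax, data = mydata, method = "lms",
#'                  cov.syntax = "Z ~ X")
#' }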
#'
#' @param ... additional arguments to be passed to the estimation function.
#'
#' @return \code{modsem_da} object
#' @export
#'
#' @description
#' \code{modsem_da()} is a function for estimating interaction effects between latent variables
#' in structural equation models (SEMs) using distributional analytic (DA) approaches.
#' Methods for estimating interaction effects in SEMs can basically be split into
#' two frameworks:
#' 1. Product Indicator-based approaches (\code{"dblcent"}, \code{"rca"}, \code{"uca"},
#' \code{"ca"}, \code{"pind"})
#' 2. Distributionally based approaches (\code{"lms"}, \code{"qml"}).
#'
#' \code{modsem_da()} handles the latter, and can estimate models using both the QML and
#' LMS approaches.
#'
#' \strong{NOTE}: Run \code{\link{default_settings_da}} to see default arguments.
#'
#' @examples
#' library(modsem)
#' # For more examples, check README and/or GitHub.
#' # One interaction
#' m1 <- "
#'   # Outer Model
#'   X =~ x1 + x2 + x3
#'   Y =~ y1 + y2 + y3
#'   Z =~ z1 + z2 + z3
#'
#'   # Inner model
#'   Y ~ X + Z + X:Z
#' "
#'
#' \dontrun{
#' # QML Approach
#' est_qml <- modsem_da(m1, oneInt, method = "qml")
#' summary(est_qml)
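#'
#' # Standardized estimates (post-hoc; does not alter the estimation itself)
#' standardized_estimates(est_qml)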
#'
#' # Theory Of Planned Behavior
#' tpb <- "
#' # Outer Model (Based on Hagger et al., 2007)
#'   ATT =~ att1 + att2 + att3 + att4 + att5
#'   SN =~ sn1 + sn2
#'   PBC =~ pbc1 + pbc2 + pbc3
#'   INT =~ int1 + int2 + int3
#'   BEH =~ b1 + b2
#'
#' # Inner Model (Based on Steinmetz et al., 2011)
#'   INT ~ ATT + SN + PBC
#'   BEH ~ INT + PBC
#'   BEH ~ INT:PBC
#' "
#'
#' # LMS Approach
#' est_lms <- modsem_da(tpb, data = TPB, method = "lms")
#' summary(est_lms)
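#'
#' # LMS with more quadrature nodes and a quasi-adaptive quadrature
#' # (illustrative settings; run `default_settings_da()` to see the defaults)
#' est_lms32 <- modsem_da(tpb, data = TPB, method = "lms",
#'                        nodes = 32, adaptive.quad = TRUE)
#' summary(est_lms32)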
#' }
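#'
#' \dontrun{
#' # Illustrative sketches of other options, reusing `m1` from above:
#'
#' # FIML for missing data (oneInt itself has no missing values)
#' est_fiml <- modsem_da(m1, oneInt, method = "lms", missing = "fiml")
#'
#' # Reliability-corrected single items (see `relcorr_single_item()`)
#' est_rcs <- modsem_da(m1, oneInt, method = "qml", rcs = TRUE)
#' }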
modsem_da <- function(model.syntax = NULL,
                      data = NULL,
                      method = "lms",
                      verbose = NULL,
                      optimize = NULL,
                      nodes = NULL,
                      missing = NULL,
                      convergence.abs = NULL,
                      convergence.rel = NULL,
                      optimizer = NULL,
                      center.data = NULL,
                      standardize.data = NULL,
                      standardize.out = NULL,
                      standardize = NULL,
                      mean.observed = NULL,
                      cov.syntax = NULL,
                      double = NULL,
                      calc.se = NULL,
                      FIM = NULL,
                      EFIM.S = NULL,
                      OFIM.hessian = NULL,
                      EFIM.parametric = NULL,
                      robust.se = NULL,
                      R.max = NULL,
                      max.iter = NULL,
                      max.step = NULL,
                      start = NULL,
                      epsilon = NULL,
                      quad.range = NULL,
                      adaptive.quad = NULL,
                      adaptive.frequency = NULL,
                      adaptive.quad.tol = NULL,
                      n.threads = NULL,
                      algorithm = NULL,
                      em.control = NULL,
                      ordered = NULL,
                      ordered.iter = 100L,
                      ordered.warmup = 25L,
                      cluster = NULL,
                      cr1s = FALSE,
                      rcs = FALSE,
                      rcs.choose = NULL,
                      rcs.scale.corrected = TRUE,
                      orthogonal.x = NULL,
                      orthogonal.y = NULL,
                      auto.fix.first = NULL,
                      auto.fix.single = NULL,
                      auto.split.syntax = NULL,
                      ...) {
  if (is.null(model.syntax)) {
    stop2("No model.syntax provided")
  } else if (!is.character(model.syntax)) {
    stop2("The provided model syntax is not a string!")
  } else if (length(model.syntax) > 1) {
    stop2("The provided model syntax is not of length 1")
  }

  if (length(ordered) || any(sapply(data, FUN = is.ordered))) {
    out <- modsemOrderedScaleCorrection(
       model.syntax        = model.syntax,
       data                = data,
       method              = method,
       verbose             = verbose,
       iter                = ordered.iter,
       warmup              = ordered.warmup,
       optimize            = optimize,
       nodes               = nodes,
       missing             = missing,
       convergence.abs     = convergence.abs,
       convergence.rel     = convergence.rel,
       optimizer           = optimizer,
       center.data         = center.data,
       standardize.data    = standardize.data,
       standardize.out     = standardize.out,
       standardize         = standardize,
       mean.observed       = mean.observed,
       cov.syntax          = cov.syntax,
       double              = double,
       calc.se             = calc.se,
       FIM                 = FIM,
       EFIM.S              = EFIM.S,
       OFIM.hessian        = OFIM.hessian,
       EFIM.parametric     = EFIM.parametric,
       robust.se           = robust.se,
       R.max               = R.max,
       max.iter            = max.iter,
       max.step            = max.step,
       start               = start,
       epsilon             = epsilon,
       quad.range          = quad.range,
       adaptive.quad       = adaptive.quad,
       adaptive.frequency  = adaptive.frequency,
       adaptive.quad.tol   = adaptive.quad.tol,
       n.threads           = n.threads,
       algorithm           = algorithm,
       em.control          = em.control,
       ordered             = ordered,
       cluster             = cluster,
       cr1s                = cr1s,
       rcs                 = rcs,
       rcs.choose          = rcs.choose,
       rcs.scale.corrected = rcs.scale.corrected,
       orthogonal.x        = orthogonal.x,
       orthogonal.y        = orthogonal.y,
       auto.fix.first      = auto.fix.first,
       auto.fix.single     = auto.fix.single,
       auto.split.syntax   = auto.split.syntax,
       ...)

    return(out)
  }

  if (is.null(data)) {
    stop2("No data provided")
  } else if (!is.data.frame(data)) {
    data <- as.data.frame(data)
  }

  if (rcs) { # use reliability-corrected single items?
    corrected <- relcorr_single_item(
      syntax          = model.syntax,
      data            = data,
      choose          = rcs.choose,
      scale.corrected = rcs.scale.corrected,
      warn.lav        = FALSE
    )

    model.syntax <- corrected$syntax
    data         <- corrected$data
  }

  if ("convergence" %in% names(list(...))) {
    convergence.rel <- list(...)$convergence
    warning2("Argument 'convergence' is deprecated, use 'convergence.rel' instead.")
  }

  args <-
    getMethodSettingsDA(method,
      args =
        list(
          verbose            = verbose,
          optimize           = optimize,
          nodes              = nodes,
          convergence.abs    = convergence.abs,
          convergence.rel    = convergence.rel,
          optimizer          = optimizer,
          center.data        = center.data,
          standardize.data   = standardize.data,
          standardize.out    = standardize.out,
          standardize        = standardize,
          mean.observed      = mean.observed,
          double             = double,
          calc.se            = calc.se,
          FIM                = FIM,
          EFIM.S             = EFIM.S,
          OFIM.hessian       = OFIM.hessian,
          EFIM.parametric    = EFIM.parametric,
          robust.se          = robust.se,
          R.max              = R.max,
          max.iter           = max.iter,
          max.step           = max.step,
          epsilon            = epsilon,
          quad.range         = quad.range,
          adaptive.quad      = adaptive.quad,
          adaptive.frequency = adaptive.frequency,
          adaptive.quad.tol  = adaptive.quad.tol,
          n.threads          = n.threads,
          algorithm          = algorithm,
          em.control         = em.control,
          missing            = missing,
          orthogonal.x       = orthogonal.x,
          orthogonal.y       = orthogonal.y,
          auto.fix.first     = auto.fix.first,
          auto.fix.single    = auto.fix.single,
          auto.split.syntax  = auto.split.syntax,
          cr1s               = cr1s
        )
    )

  stopif(!method %in% c("lms", "qml"), "Method must be either 'lms' or 'qml'")

  if (args$center.data) {
    data <- lapplyDf(data, FUN = function(x) x - mean(x, na.rm = TRUE))
  }

  if (args$standardize.data) {
    data <- lapplyDf(data, FUN = scaleIfNumeric, scaleFactor = FALSE)
  }

  model <- specifyModelDA(model.syntax,
    data               = data,
    method             = method,
    m                  = args$nodes,
    cov.syntax         = cov.syntax,
    mean.observed      = args$mean.observed,
    double             = args$double,
    quad.range         = args$quad.range,
    adaptive.quad      = args$adaptive.quad,
    adaptive.frequency = args$adaptive.frequency,
    missing            = args$missing,
    orthogonal.x       = args$orthogonal.x,
    orthogonal.y       = args$orthogonal.y,
    auto.fix.first     = args$auto.fix.first,
    auto.fix.single    = args$auto.fix.single,
    auto.split.syntax  = args$auto.split.syntax,
    cluster            = cluster
  )

  if (args$optimize) {
    model <- tryCatch({
      result <- purrr::quietly(optimizeStartingParamsDA)(model, args = args)
      warnings <- result$warnings

      if (length(warnings)) {
        fwarnings <- paste0(
          paste0(seq_along(warnings), ". ", warnings),
          collapse = "\n"
        )

        warning2("warning when optimizing starting parameters:\n", fwarnings)
      }

      result$result

    }, error = function(e) {
      warning2("unable to optimize starting parameters:\n", e)
      model
    })
  }

  if (!is.null(start)) {
    checkStartingParams(start, model = model) # throws an error if something's wrong
    model$theta <- start
  }

  # We want to limit the number of threads available to OpenBLAS.
  # Depending on the OpenBLAS version, it might not be compatible with
  # OpenMP. If `n.blas > 1L` you might end up getting this message:
  #> OpenBLAS Warning : Detect OpenMP Loop and this application may hang.
  #>                    Please rebuild the library with USE_OPENMP=1 option.
  # We don't want to restrict OpenBLAS in other settings,
  # e.g., lavaan::sem, so we reset after the model has been estimated.
  setThreads(n = args$n.threads, n.blas = 1L)
  on.exit(resetThreads()) # clean up at end of function

  est <- tryCatch(switch(method,
    qml = estQml(model,
      verbose         = args$verbose,
      convergence     = args$convergence.rel,
      calc.se         = args$calc.se,
      FIM             = args$FIM,
      EFIM.S          = args$EFIM.S,
      OFIM.hessian    = args$OFIM.hessian,
      EFIM.parametric = args$EFIM.parametric,
      robust.se       = args$robust.se,
      max.iter        = args$max.iter,
      epsilon         = args$epsilon,
      optimizer       = args$optimizer,
      R.max           = args$R.max,
      cr1s            = args$cr1s,
      ...
    ),
    lms = emLms(model,
      verbose           = args$verbose,
      convergence.abs   = args$convergence.abs,
      convergence.rel   = args$convergence.rel,
      calc.se           = args$calc.se,
      FIM               = args$FIM,
      EFIM.S            = args$EFIM.S,
      OFIM.hessian      = args$OFIM.hessian,
      EFIM.parametric   = args$EFIM.parametric,
      robust.se         = args$robust.se,
      max.iter          = args$max.iter,
      max.step          = args$max.step,
      epsilon           = args$epsilon,
      optimizer         = args$optimizer,
      R.max             = args$R.max,
      em.control        = args$em.control,
      algorithm         = args$algorithm,
      adaptive.quad     = args$adaptive.quad,
      quad.range        = args$quad.range,
      adaptive.quad.tol = args$adaptive.quad.tol,
      nodes             = args$nodes,
      cr1s              = args$cr1s,
      ...
  )),
  error = function(e) {
    if (args$verbose) cat("\n")
    message <- paste0("modsem [%s]: Model estimation failed!\n",
                      "Message: %s")
    stop2(sprintf(message, method, e$message))
  })

  # Finalize the model object
  # Expected means and covariances
  est$expected.matrices <- tryCatch(
    calcExpectedMatricesDA(
      parTable = est$parTable,
      xis  = getXisModelDA(model), # taking both the main model and cov model into account
      etas = getEtasModelDA(model)  # taking both the main model and cov model into account
    ),
    error = function(e) {
      warning2("Failed to calculate expected matrices: ", e$message)
      NULL
    })

  # Arguments
  est$args <- args
  class(est) <- c("modsem_da", "modsem")

  # Check the results
  postCheckModel(est)

  # Return
  if (args$standardize.out) standardize_model(est) else est
}
