datawizard: Easy Data Wrangling and Statistical Transformations

Documented in degroup demean

#' Compute group-meaned and de-meaned variables
#'
#' @description
#'
#' `demean()` computes group- and de-meaned versions of a variable that can be
#' used in regression analysis to model the between- and within-subject effect
#' (person-mean centering or centering within clusters). `degroup()` is more
#' generic in terms of the centering-operation. While `demean()` always uses
#' mean-centering, `degroup()` can also use the mode or median for centering.
#'
#' @param x A data frame.
#' @param select Character vector (or formula) with names of variables to select
#'   that should be group- and de-meaned.
#' @param by Character vector (or formula) with the name of the variable that
#'   indicates the group- or cluster-ID. For cross-classified or nested designs,
#'   `by` can also identify two or more variables as group- or cluster-IDs. If
#'   the data is nested and should be treated as such, set `nested = TRUE`. Else,
#'   if `by` defines two or more variables and `nested = FALSE`, a cross-classified
#'   design is assumed. Note that `demean()` and `degroup()` can't handle a mix
#'   of nested and cross-classified designs in one model.
#'
#'   For nested designs, `by` can be:
#'   - a character vector with the name of the variable that indicates the
#'     levels, ordered from *highest* level to *lowest* (e.g.
#'     `by = c("L4", "L3", "L2")`.
#'   - a character vector with variable names in the format `by = "L4/L3/L2"`,
#'     where the levels are separated by `/`.
#'
#'   See also section _De-meaning for cross-classified designs_ and
#'   _De-meaning for nested designs_ below.
#' @param nested Logical, if `TRUE`, the data is treated as nested. If `FALSE`,
#'   the data is treated as cross-classified. Only applies if `by` contains more
#'   than one variable.
#' @param center Method for centering. `demean()` always performs
#'   mean-centering, while `degroup()` can use `center = "median"` or
#'   `center = "mode"` for median- or mode-centering, and also `"min"`
#'   or `"max"`.
#' @param suffix_demean,suffix_groupmean String value, will be appended to the
#'   names of the group-meaned and de-meaned variables of `x`. By default,
#'   de-meaned variables will be suffixed with `"_within"` and
#'   grouped-meaned variables with `"_between"`.
#' @param append Logical, if `TRUE` (default), the group- and de-meaned
#'   variables will be appended (column bind) to the original data `x`,
#'   thus returning both the original and the de-/group-meaned variables.
#' @param add_attributes Logical, if `TRUE`, the returned variables gain
#'   attributes to indicate the within- and between-effects. This is only
#'   relevant when printing `model_parameters()` - in such cases, the
#'   within- and between-effects are printed in separated blocks.
#' @inheritParams center
#'
#' @return
#' A data frame with the group-/de-meaned variables, which get the suffix
#' `"_between"` (for the group-meaned variable) and `"_within"` (for the
#' de-meaned variable) by default. For cross-classified or nested designs,
#' the name pattern of the group-meaned variables is the name of the centered
#' variable followed by the name of the variable that indicates the related
#' grouping level, e.g. `predictor_L3_between` and `predictor_L2_between`.
#'
#' @seealso If grand-mean centering (instead of centering within-clusters)
#'   is required, see [`center()`]. See [`performance::check_heterogeneity_bias()`]
#'   to check for heterogeneity bias.
#'
#' @section Heterogeneity Bias:
#'
#' Mixed models include different levels of sources of variability, i.e.
#' error terms at each level. When macro-indicators (or level-2 predictors,
#' or higher-level units, or more general: *group-level predictors that
#' **vary** within and across groups*) are included as fixed effects (i.e.
#' treated as covariate at level-1), the variance that is left unaccounted for
#' this covariate will be absorbed into the error terms of level-1 and level-2
#' (_Bafumi and Gelman 2006; Gelman and Hill 2007, Chapter 12.6._):
#' "Such covariates contain two parts: one that is specific to the higher-level
#' entity that does not vary between occasions, and one that represents the
#' difference between occasions, within higher-level entities" (_Bell et al. 2015_).
#' Hence, the error terms will be correlated with the covariate, which violates
#' one of the assumptions of mixed models (iid, independent and identically
#' distributed error terms). This bias is also called the *heterogeneity bias*
#' (_Bell et al. 2015_). To resolve this problem, level-2 predictors used as
#' (level-1) covariates should be separated into their "within" and "between"
#' effects by "de-meaning" and "group-meaning": After demeaning time-varying
#' predictors, "at the higher level, the mean term is no longer constrained by
#' Level 1 effects, so it is free to account for all the higher-level variance
#' associated with that variable" (_Bell et al. 2015_).
#'
#' @section Panel data and correlating fixed and group effects:
#'
#' `demean()` is intended to create group- and de-meaned variables for panel
#' regression models (fixed effects models), or for complex
#' random-effect-within-between models (see _Bell et al. 2015, 2018_), where
#' group-effects (random effects) and fixed effects correlate (see
#' _Bafumi and Gelman 2006_). This can happen, for instance, when analyzing
#' panel data, which can lead to *Heterogeneity Bias*. To control for correlating
#' predictors and group effects, it is recommended to include the group-meaned
#' and de-meaned version of *time-varying covariates* (and group-meaned version
#' of *time-invariant covariates* that are on a higher level, e.g. level-2
#' predictors) in the model. By this, one can fit complex multilevel models for
#' panel data, including time-varying predictors, time-invariant predictors and
#' random effects.
#'
#' @section Why mixed models are preferred over fixed effects models:
#'
#' A mixed models approach can model the causes of endogeneity explicitly
#' by including the (separated) within- and between-effects of time-varying
#' fixed effects and including time-constant fixed effects. Furthermore,
#' mixed models also include random effects, thus a mixed models approach
#' is superior to classic fixed-effects models, which lack information of
#' variation in the group-effects or between-subject effects. Furthermore,
#' fixed effects regression cannot include random slopes, which means that
#' fixed effects regressions are neglecting "cross-cluster differences in the
#' effects of lower-level controls (which) reduces the precision of estimated
#' context effects, resulting in unnecessarily wide confidence intervals and
#' low statistical power" (_Heisig et al. 2017_).
#'
#' @section Terminology:
#'
#' The group-meaned variable is simply the mean of an independent variable
#' within each group (or id-level or cluster) represented by `by`. It represents
#' the cluster-mean of an independent variable. The regression coefficient of a
#' group-meaned variable is the *between-subject-effect*. The de-meaned variable
#' is then the centered version of the group-meaned variable. De-meaning is
#' sometimes also called person-mean centering or centering within clusters.
#' The regression coefficient of a de-meaned variable represents the
#' *within-subject-effect*.
#'
#' @section De-meaning with continuous predictors:
#'
#' For continuous time-varying predictors, the recommendation is to include
#' both their de-meaned and group-meaned versions as fixed effects, but not
#' the raw (untransformed) time-varying predictors themselves. The de-meaned
#' predictor should also be included as random effect (random slope). In
#' regression models, the coefficient of the de-meaned predictors indicates
#' the within-subject effect, while the coefficient of the group-meaned
#' predictor indicates the between-subject effect.
#'
#' @section De-meaning with binary predictors:
#'
#' For binary time-varying predictors, there are two recommendations. First
#' is to include the raw (untransformed) binary predictor as fixed effect
#' only and the *de-meaned* variable as random effect (random slope).
#' The alternative would be to add the de-meaned version(s) of binary
#' time-varying covariates as additional fixed effect as well (instead of
#' adding it as random slope). Centering time-varying binary variables to
#' obtain within-effects (level 1) isn't necessary. They have a sensible
#' interpretation when left in the typical 0/1 format (_Hoffmann 2015,
#' chapter 8-2.I_). `demean()` will thus coerce categorical time-varying
#' predictors to numeric to compute the de- and group-meaned versions for
#' these variables, where the raw (untransformed) binary predictor and the
#' de-meaned version should be added to the model.
#'
#' @section De-meaning of factors with more than 2 levels:
#'
#' Factors with more than two levels are demeaned in two ways: first, these
#' are also converted to numeric and de-meaned; second, dummy variables
#' are created (binary, with 0/1 coding for each level) and these binary
#' dummy-variables are de-meaned in the same way (as described above).
#' Packages like **panelr** internally convert factors to dummies before
#' demeaning, so this behaviour can be mimicked here.
#'
#' @section De-meaning interaction terms:
#'
#' There are multiple ways to deal with interaction terms of within- and
#' between-effects.
#'
#' - A classical approach is to simply use the product term of the de-meaned
#'   variables (i.e. introducing the de-meaned variables as interaction term
#'   in the model formula, e.g. `y ~ x_within * time_within`). This approach,
#'   however, might be subject to bias (see _Giesselmann & Schmidt-Catran 2020_).
#'
#' - Another option is to first calculate the product term and then apply the
#'   de-meaning to it. This approach produces an estimator "that reflects
#'   unit-level differences of interacted variables whose moderators vary
#'   within units", which is desirable if *no* within interaction of
#'   two time-dependent variables is required. This is what `demean()` does
#'   internally when `select` contains interaction terms.
#'
#' - A third option, when the interaction should result in a genuine within
#'   estimator, is to "double de-mean" the interaction terms
#'   (_Giesselmann & Schmidt-Catran 2018_), however, this is currently
#'   not supported by `demean()`. If this is required, the `wmb()`
#'   function from the **panelr** package should be used.
#'
#' To de-mean interaction terms for within-between models, simply specify
#' the term as interaction for the `select`-argument, e.g. `select = "a*b"`
#' (see 'Examples').
#'
#' @section De-meaning for cross-classified designs:
#'
#' `demean()` can handle cross-classified designs, where the data has two or
#' more groups at the higher (i.e. second) level. In such cases, the
#' `by`-argument can identify two or more variables that represent the
#'  cross-classified group- or cluster-IDs. The de-meaned variables for
#' cross-classified designs are simply subtracting all group means from each
#' individual value, i.e. _fully cluster-mean-centering_ (see _Guo et al. 2024_
#' for details). Note that de-meaning for cross-classified designs is *not*
#' equivalent to de-meaning of nested data structures from models with three or
#' more levels. Set `nested = TRUE` to explicitly assume a nested design. For
#' cross-classified designs, de-meaning is supposed to work for models like
#' `y ~ x + (1|level3) + (1|level2)`, but *not* for models like
#' `y ~ x + (1|level3/level2)`. Note that `demean()` and `degroup()` can't
#' handle a mix of nested and cross-classified designs in one model.
#'
#' @section De-meaning for nested designs:
#'
#' _Brincks et al. (2017)_ have suggested an algorithm to center variables for
#' nested designs, which is implemented in `demean()`. For nested designs, set
#' `nested = TRUE` *and* specify the variables that indicate the different
#' levels in descending order in the `by` argument. E.g.,
#' `by = c("level4", "level3, "level2")` assumes a model like
#' `y ~ x + (1|level4/level3/level2)`. An alternative notation for the
#' `by`-argument would be `by = "level4/level3/level2"`, similar to the
#' formula notation.
#'
#' @section Analysing panel data with mixed models using lme4:
#'
#' A description of how to translate the formulas described in *Bell et al. 2018*
#' into R using `lmer()` from **lme4** can be found in
#' [this vignette](https://easystats.github.io/parameters/articles/demean.html).
#'
#' @references
#'
#'   - Bafumi J, Gelman A. 2006. Fitting Multilevel Models When Predictors
#'     and Group Effects Correlate. In. Philadelphia, PA: Annual meeting of the
#'     American Political Science Association.
#'
#'   - Bell A, Fairbrother M, Jones K. 2019. Fixed and Random Effects
#'     Models: Making an Informed Choice. Quality & Quantity (53); 1051-1074
#'
#'   - Bell A, Jones K. 2015. Explaining Fixed Effects: Random Effects
#'     Modeling of Time-Series Cross-Sectional and Panel Data. Political Science
#'     Research and Methods, 3(1), 133–153.
#'
#'   - Brincks, A. M., Enders, C. K., Llabre, M. M., Bulotsky-Shearer, R. J.,
#'     Prado, G., and Feaster, D. J. (2017). Centering Predictor Variables in
#'     Three-Level Contextual Models. Multivariate Behavioral Research, 52(2),
#'     149–163. https://doi.org/10.1080/00273171.2016.1256753
#'
#'   - Gelman A, Hill J. 2007. Data Analysis Using Regression and
#'     Multilevel/Hierarchical Models. Analytical Methods for Social Research.
#'     Cambridge, New York: Cambridge University Press
#'
#'   - Giesselmann M, Schmidt-Catran, AW. 2020. Interactions in fixed
#'     effects regression models. Sociological Methods & Research, 1–28.
#'     https://doi.org/10.1177/0049124120914934
#'
#'   - Guo Y, Dhaliwal J, Rights JD. 2024. Disaggregating level-specific effects
#'     in cross-classified multilevel models. Behavior Research Methods, 56(4),
#'     3023–3057.
#'
#'   - Heisig JP, Schaeffer M, Giesecke J. 2017. The Costs of Simplicity:
#'     Why Multilevel Models May Benefit from Accounting for Cross-Cluster
#'     Differences in the Effects of Controls. American Sociological Review 82
#'     (4): 796–827.
#'
#'   - Hoffman L. 2015. Longitudinal analysis: modeling within-person
#'     fluctuation and change. New York: Routledge
#'
#' @examples
#'
#' data(iris)
#' iris$ID <- sample(1:4, nrow(iris), replace = TRUE) # fake-ID
#' iris$binary <- as.factor(rbinom(150, 1, .35)) # binary variable
#'
#' x <- demean(iris, select = c("Sepal.Length", "Petal.Length"), by = "ID")
#' head(x)
#'
#' x <- demean(iris, select = c("Sepal.Length", "binary", "Species"), by = "ID")
#' head(x)
#'
#'
#' # demean interaction term x*y
#' dat <- data.frame(
#'   a = c(1, 2, 3, 4, 1, 2, 3, 4),
#'   x = c(4, 3, 3, 4, 1, 2, 1, 2),
#'   y = c(1, 2, 1, 2, 4, 3, 2, 1),
#'   ID = c(1, 2, 3, 1, 2, 3, 1, 2)
#' )
#' demean(dat, select = c("a", "x*y"), by = "ID")
#'
#' # or in formula-notation
#' demean(dat, select = ~ a + x * y, by = ~ID)
#'
#' @export
demean <- function(x,
                   select,
                   by,
                   nested = FALSE,
                   suffix_demean = "_within",
                   suffix_groupmean = "_between",
                   append = TRUE,
                   add_attributes = TRUE,
                   verbose = TRUE) {
  degroup(
    x = x,
    select = select,
    by = by,
    nested = nested,
    center = "mean",
    suffix_demean = suffix_demean,
    suffix_groupmean = suffix_groupmean,
    append = append,
    add_attributes = add_attributes,
    verbose = verbose
  )
}


#' @rdname demean
#' @export
degroup <- function(x,
                    select,
                    by,
                    nested = FALSE,
                    center = "mean",
                    suffix_demean = "_within",
                    suffix_groupmean = "_between",
                    append = TRUE,
                    add_attributes = TRUE,
                    verbose = TRUE) {
  # ugly tibbles again... but save original data frame
  original_data <- x
  x <- .coerce_to_dataframe(x)

  center <- match.arg(tolower(center), choices = c("mean", "median", "mode", "min", "max"))

  if (inherits(select, "formula")) {
    # formula to character, remove "~", split at "+". We don't use `all.vars()`
    # here because we want to keep the interaction terms as they are
    select <- trimws(unlist(
      strsplit(gsub("~", "", insight::safe_deparse(select), fixed = TRUE), "+", fixed = TRUE),
      use.names = FALSE
    ))
  }

  # handle different "by" options
  if (inherits(by, "formula")) {
    by <- all.vars(by)
  }

  # we also allow lme4-syntax here: if by = "L4/L3/L2", we assume a nested design
  if (length(by) == 1 && grepl("/", by, fixed = TRUE)) {
    by <- insight::trim_ws(unlist(strsplit(by, "/", fixed = TRUE), use.names = FALSE))
    nested <- TRUE
  }

  # identify interaction terms
  interactions_no <- select[!grepl("(\\*|\\:)", select)]
  interactions_yes <- select[grepl("(\\*|\\:)", select)]

  # if we have interaction terms that should be de-meaned, calculate the product
  # of the terms first, then demean the product
  if (length(interactions_yes)) {
    interaction_terms <- lapply(strsplit(interactions_yes, "*", fixed = TRUE), trimws)
    product <- lapply(interaction_terms, function(i) do.call(`*`, x[, i]))
    new_dat <- as.data.frame(stats::setNames(product, gsub("\\s", "", gsub("*", "_", interactions_yes, fixed = TRUE))))
    x <- cbind(x, new_dat)
    select <- c(interactions_no, colnames(new_dat))
  }

  # check if all variables are present
  not_found <- setdiff(c(select, by), colnames(x))

  if (length(not_found)) {
    insight::format_error(
      paste0(
        "Variable",
        ifelse(length(not_found) > 1, "s ", " "),
        text_concatenate(not_found, enclose = "\""),
        ifelse(length(not_found) > 1, " were", " was"),
        " not found in the dataset."
      ),
      .misspelled_string(colnames(x), not_found, "Possibly misspelled or not yet defined?")
    )
  }

  # get data to demean...
  dat <- x[, c(select, by)]


  # find categorical predictors that are coded as factors
  categorical_predictors <- vapply(dat[select], is.factor, FUN.VALUE = logical(1L))

  # convert binary predictors to numeric
  if (any(categorical_predictors)) {
    # convert categorical to numeric, and then demean
    dat[select[categorical_predictors]] <- lapply(
      dat[select[categorical_predictors]],
      function(i) as.numeric(i) - 1
    )
    # convert categorical to dummy, and demean each binary dummy
    for (i in select[categorical_predictors]) {
      if (nlevels(x[[i]]) > 2) {
        for (j in levels(x[[i]])) {
          # create vector with zeros
          f <- rep(0, nrow(x))
          # for each matching level, set dummy to 1
          f[x[[i]] == j] <- 1
          dummy <- data.frame(f)
          # colnames = variable name + factor level
          # also add new dummy variables to "select"
          colnames(dummy) <- sprintf("%s_%s", i, j)
          select <- c(select, sprintf("%s_%s", i, j))
          # add to data
          dat <- cbind(dat, dummy)
        }
      }
    }
    # tell user...
    if (isTRUE(verbose)) {
      insight::format_alert(
        paste0(
          "Categorical predictors (",
          toString(names(categorical_predictors)[categorical_predictors]),
          ") have been coerced to numeric values to compute de- and group-meaned variables.\n"
        )
      )
    }
  }


  # group variables, then calculate the mean-value
  # for variables within each group (the group means). assign
  # mean values to a vector of same length as the data

  gm_fun <- switch(center,
    mode = function(.gm) distribution_mode(stats::na.omit(.gm)),
    median = function(.gm) stats::median(.gm, na.rm = TRUE),
    min = function(.gm) min(.gm, na.rm = TRUE),
    max = function(.gm) max(.gm, na.rm = TRUE),
    function(.gm) mean(.gm, na.rm = TRUE)
  )

  # we allow disaggregating level-specific effects for cross-classified multilevel
  # models (see Guo et al. 2024). Two levels should work as proposed by the authors,
  # more levels also already work, but need to check the formula from the paper
  # and validate results

  if (length(by) == 1) {
    # simple case: one level
    group_means_list <- lapply(select, function(i) {
      stats::ave(dat[[i]], dat[[by]], FUN = gm_fun)
    })
    names(group_means_list) <- select
    # create de-meaned variables by subtracting the group mean from each individual value
    person_means_list <- lapply(select, function(i) dat[[i]] - group_means_list[[i]])
  } else if (nested) {
    # nested design: by > 1, nested is explicitly set to TRUE
    # We want:
    # L3_between = xbar(k)
    # L2_between = xbar(j,k) - xbar(k)
    # L1_within = x(ijk) - xbar(jk)
    # , where
    # x(ijk) is the individual value / variable that is measured on level 1
    # xbar(k) <- ave(x_ijk, L3, FUN = mean), the group mean of the variable at highest level
    # xbar(jk) <- ave(x_ijk, L3, L2, FUN = mean), the group mean of the variable at second level
    group_means_list <- lapply(select, function(i) {
      out <- lapply(seq_along(by), function(k) {
        dat$higher_levels <- do.call(paste, c(dat[by[1:k]], list(sep = "_")))
        stats::ave(dat[[i]], dat$higher_levels, FUN = gm_fun)
      })
      # subtract mean of higher level from lower level
      for (j in 2:length(by)) {
        out[[j]] <- out[[j]] - out[[j - 1]]
      }
      names(out) <- paste0(select, "_", by)
      out
    })
    # create de-meaned variables by subtracting the group mean from each individual value
    person_means_list <- lapply(
      # seq_along(select),
      # function(i) dat[[select[i]]] - group_means_list[[i]][[length(by)]]
      select,
      function(i) {
        dat$higher_levels <- do.call(paste, c(dat[by], list(sep = "_")))
        dat[[i]] - stats::ave(dat[[i]], dat$higher_levels, FUN = gm_fun)
      }
    )
  } else {
    # cross-classified design: by > 1
    group_means_list <- lapply(by, function(j) {
      out <- lapply(select, function(i) {
        stats::ave(dat[[i]], dat[[j]], FUN = gm_fun)
      })
      names(out) <- paste0(select, "_", j)
      out
    })
    # de-meaned variables for cross-classified design is simply subtracting
    # all group means from each individual value
    person_means_list <- lapply(seq_along(select), function(i) {
      sum_group_means <- do.call(`+`, lapply(group_means_list, function(j) j[[i]]))
      dat[[select[i]]] - sum_group_means
    })
  }

  # preserve names
  names(person_means_list) <- select

  # convert to data frame and add suffix to column names

  group_means <- as.data.frame(group_means_list)
  person_means <- as.data.frame(person_means_list)

  colnames(person_means) <- sprintf("%s%s", colnames(person_means), suffix_demean)
  colnames(group_means) <- sprintf("%s%s", colnames(group_means), suffix_groupmean)

  if (isTRUE(add_attributes)) {
    person_means[] <- lapply(person_means, function(i) {
      attr(i, "within-effect") <- TRUE
      i
    })
    group_means[] <- lapply(group_means, function(i) {
      attr(i, "between-effect") <- TRUE
      i
    })
  }

  # between and within effects
  out <- cbind(group_means, person_means)

  # append to original data?
  if (isTRUE(append)) {
    # check for unique column names
    duplicated_columns <- intersect(colnames(out), colnames(original_data))
    if (length(duplicated_columns)) {
      insight::format_error(paste0(
        "One or more of the centered variables already exist in the orignal data frame: ", # nolint
        text_concatenate(duplicated_columns, enclose = "`"),
        ". Please rename the affected variable(s) in the original data, or use the arguments `suffix_demean` and `suffix_groupmean` to rename the centered variables." # nolint
      ))
    }
    out <- cbind(original_data, out)
  }

  out
}


#' @rdname demean
#' @export
detrend <- degroup