library(knitr) knitr::opts_chunk$set( package.startup.message = FALSE, dpi = 300, collapse = TRUE, comment = "#>", warning = FALSE, message = TRUE ) options( knitr.kable.NA = "", digits = 2, width = 800, modelbased_join_dots = FALSE, modelbased_select = "minimal" ) pkgs <- c("lme4", "marginaleffects", "collapse", "Formula", "palmerpenguins") successfully_loaded <- suppressPackageStartupMessages(vapply(pkgs, requireNamespace, FUN.VALUE = logical(1L), quietly = TRUE)) can_evaluate <- all(successfully_loaded) if (can_evaluate) { knitr::opts_chunk$set(eval = TRUE) junk <- suppressMessages(suppressPackageStartupMessages(vapply(pkgs, require, FUN.VALUE = logical(1L), quietly = TRUE, character.only = TRUE))) } else { knitr::opts_chunk$set(eval = FALSE) }
Mixed models, with their ability to account for hierarchical or clustered data, offer powerful tools for understanding complex relationships. A key aspect of interpreting these models is understanding estimated marginal means (EMMs), which represent the predicted outcome for specific groups or conditions, while holding other variables constant. However, calculating EMMs in mixed models is not always straightforward, and the results can vary depending on the approach taken.
One crucial distinction is between conditional and marginal predictions (or effects). Conditional predictions are specific to a particular level of the random effect (e.g., the predicted outcome for a specific individual in a study). Marginal predictions, on the other hand, average over the random effects, providing an overall estimate of the effect in the population. This is a crucial difference, as the marginal effect is often the quantity of interest when we want to generalize to the population [@heiss_marginal_2022].
Based on the definitions from Heiss (2022), we can say:
This conditional vs. marginal distinction applies to any sort of hierarchical structure in multilevel models:
- Conditional effect = the effect of a variable in an average cluster (i.e., group-specific, subject-specific or cluster-specific effect, or an average or a typical cluster)
- Marginal effect = effect of a variable across clusters on average (i.e., global/population-level effect, or clusters on average).
When working with mixed models, the modelbased
package offers flexibility in calculating EMMs through different backends. The "marginaleffects"
backend and the "emmeans"
backend employ different underlying methodologies. marginaleffects typically focuses on marginal predictions by averaging over the random effects, while emmeans provides conditional predictions. Consequently, the EMMs obtained using these two backends may differ. As we will see, conditional and marginal predictions for fixed effects focal predictors...
estimate = "average"
); by imbalanced we mean not equally distributed levels.Note:
If you request unit-level predictions - that is, predictions specific to the individual levels of the random effects (achieved by including random effect variables in
by
) - conditional and marginal predictions will also differ in linear mixed models. We will not delve into unit-level predictions in this vignette.
In essence, the choice of backend and the understanding of whether we are looking at conditional or marginal predictions are critical for correctly interpreting the results of mixed models. Carefully considering the research question and the nature of the random effects will guide the selection of the appropriate approach.
Technical Notes:
For
backend = "marginaleffects"
, there.form
argument is set toNULL
for mixed models by default, to calculate marginal predictions. You can use for instancere.form = NA
in yourestimate_means()
call to change the default value (NA
will produce conditional predictions).By default, both backends calculate predictions for a balanced data grid representing all combinations of focal predictor levels (specified in
by
). This represents a "typical" observation based on the data grid and is useful for comparing groups. Settingestimate = "average"
can be useful to calculate the average expected outcome from those observations from the sample at hand, however, this option is only available forbackend = "marginaleffects"
.
This vignette shows some examples to demonstrate where results are similar and where they differ.
This section demonstrates the calculation of estimated marginal means (EMMs) in a linear mixed model using balanced data. We'll use the sleepstudy
dataset from the lme4 package.
In this example, we fit a linear mixed model predicting Reaction
based on Days
, with random intercepts and slopes for Subject
. We then calculate the EMMs for Days
using both the default "marginaleffects"
backend and the "emmeans"
backend. Because the data is balanced and we have a linear, the results are similar (exception: we find small differences for the confidence intervals, but ignore this for now...).
library(modelbased) data(sleepstudy, package = "lme4") # for later, create a slightly imbalanced distributed predictor set.seed(1234) sleepstudy$x <- as.factor(sample.int(3, nrow(sleepstudy), replace = TRUE)) model <- lme4::lmer(Reaction ~ Days + (1 + Days | Subject), data = sleepstudy) # default, marginaleffects backend (marginal predictions) # in this case, same result as for conditional predictions estimate_means(model, "Days") # emmeans backend, always conditional predictions estimate_means(model, "Days", backend = "emmeans")
Here, we calculate the marginal predictions for the Days
variable using the estimate = "average"
argument. In this case, because the model is linear and the data is balanced, the predictions are again the same as the previous results, i.e., averaging across all observations makes no difference here.
# marginal predictions, averaged across all observations, # same as conditional predictions above estimate_means(model, "Days", estimate = "average")
This section explores the impact of imbalanced data on EMM calculations in linear mixed models. We'll use the penguins
dataset, which has imbalanced groups that we use as higher-level unit, as well as imbalanced predictors. Since we have still a linear mixed model, marginal and conditional predictions are still similar.
data(penguins, package = "palmerpenguins") model <- lme4::lmer(bill_length_mm ~ sex + island + (1 | species), data = penguins) # marginal predictions estimate_means(model, "sex") # conditional predictions estimate_means(model, "sex", backend = "emmeans")
Since we have imbalanced data, results for "average" predictions will differ from the results above, because both the "emmeans"
backend and the default estimate
setting for the "marginaleffects"
backend produce "data grid based" predictions, which do not average across all observations (see also the documentation of the estimate
argument for further details).
Furthermore, for imbalanced data, conditional and marginal predictions will differ when they are averaged across all observations.
# average marginal predictions estimate_means(model, "sex", estimate = "average") # average conditional predictions estimate_means(model, "sex", estimate = "average", re.form = NA)
The generalized linear mixed model (GLMM) example, using a Poisson distribution to model fish counts, demonstrates the substantial differences between marginal and conditional predictions in non-linear models. When examining the effect of the camper
variable, marginal predictions yield higher estimated means (1.26 for camper = 0
and 3.21 for camper = 1
) compared to conditional predictions (using re.form = NA
or the "emmeans"
backend), which are lower (0.66 for camper = 0
and 1.68 for camper = 1
).
data("fish", package = "insight") model <- lme4::glmer( count ~ child + camper + (1 | persons), data = fish, family = poisson() ) # marginal predictions estimate_means(model, "camper") # conditional predictions estimate_means(model, "camper", re.form = NA) # conditional predictions estimate_means(model, "camper", backend = "emmeans")
Furthermore, when using estimate = "average"
to average across all observations, the differences between marginal and conditional predictions persist, with marginal predictions showing higher means (1.52 and 4.54) than conditional predictions (1.20 and 3.19), but both results differ from the predictions for "typical" observations based on the default estimate
option (or "emmeans"
backend), as shown in the previous output.
# average marginal predictions estimate_means(model, "camper", estimate = "average") # average conditional predictions estimate_means(model, "camper", estimate = "average", re.form = NA)
This highlights that in GLMMs, the choice between marginal and conditional predictions significantly impacts the estimated means and, consequently, the interpretation of the model's results. In this case, the marginal predictions are higher than the conditional predictions.
This vignette has demonstrated the nuances of calculating estimated marginal means (EMMs) in mixed models, highlighting the critical distinction between conditional and marginal predictions. While conditional predictions focus on specific levels of random effects, marginal predictions average over them, providing a population-level perspective. We've seen that in linear mixed models with balanced data, these two approaches often yield similar results. However, in generalized linear mixed models (GLMMs), the differences between conditional and marginal predictions can be substantial.
Furthermore, the default approach of calculating predictions for a balanced data grid, while useful for comparing groups, does not necessarily reflect the actual distribution of observations in the sample. For imbalanced data, this can yield significantly different results even in linear mixed models in comparison to average predictions (using estimate = "average"
).
Therefore, we recommend combining marginal predictions (using the default backend = "marginaleffects"
) with averaging across all observations (by setting estimate = "average"
) when working with mixed models. This approach effectively incorporates the variation inherent in the random effects (higher-level units) and provides EMMs that more accurately reflect the overall patterns observed within the actual sample data. By averaging over the observed data, we obtain a more robust and representative estimate of the population-level effects, making it a preferred strategy for interpreting mixed model results.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.