knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  rows.print = 25
)

# several packages are required to build this vignette
library(broom.helpers)
if (
  !.assert_package("gtsummary", boolean = TRUE) ||
    !.assert_package("ggstats", boolean = TRUE) ||
    !.assert_package("margins", boolean = TRUE) ||
    !.assert_package("emmeans", boolean = TRUE) ||
    !.assert_package("effects", boolean = TRUE) ||
    !.assert_package("marginaleffects", boolean = TRUE) ||
    !.assert_package("ggeffects", boolean = TRUE) ||
    !.assert_package("dplyr", boolean = TRUE) ||
    !.assert_package("scales", boolean = TRUE)
) {
  knitr::opts_chunk$set(eval = FALSE)
}

Terminology

The overall idea of "marginal effects" is to provide tools to better interpret the results of a model by estimating several quantities at the margins. However, it has been implemented in many different ways by different packages, and there is a whole family of quasi-synonyms for the idea of "marginal effects": statistical effects, marginal effects, marginal means, contrasts, marginal slopes, conditional effects, conditional marginal effects, marginal effects at the mean, and many other similarly-named ideas.

In {broom.helpers}, we tried to adopt a terminology consistent with the {marginaleffects} package, first released in September 2021, and with Andrew Heiss' Marginalia blog post published in May 2022.

Adjusted Predictions correspond to the outcome predicted by a fitted model, on a specified scale, for a given combination of values of the predictor variables, such as their observed values, their means, or factor levels (a.k.a. a "reference grid"). When predictions are averaged according to a specific regressor, we refer to Marginal Predictions.

Marginal Contrasts refer to a comparison (e.g. a difference) of the outcome for a certain regressor, considering "meaningful" or "typical" values for the other predictors (at the mean/mode, at custom values, averaged over observed values...). Contrasts can be computed for categorical variables (e.g. the difference between two specific levels) or for continuous variables (the change in the outcome for a certain change of the regressor).

Marginal Effects / Slopes are defined for continuous variables as a partial derivative (slope) of the regression equation with respect to a regressor of interest. Put differently, the marginal effect is the slope of the prediction function, measured at a specific value of the regressor of interest. In scientific practice, the marginal effects fall in the same toolbox as the marginal contrasts.

Marginal Means are adjusted predictions of a model, averaged across a "reference grid" of categorical predictors. They are similar to marginal predictions, but with subtle differences.

{broom.helpers} embeds several custom tidiers to compute such quantities and to return a tibble compatible with tidy_plus_plus() and all other {broom.helpers} tidy_*() functions. It is therefore possible to produce nicely formatted tables with gtsummary::tbl_regression() or forest plots with ggstats::ggcoef_model().

Data preparation

Let's consider the trial dataset from the {gtsummary} package and build a logistic regression model with two categorical predictors (trt and stage) and two continuous predictors (marker and age). We will include an interaction between trt and marker and polynomial terms for age (i.e. age and age^2).

library(broom.helpers)
library(gtsummary)
library(dplyr)
d <- trial |>
  filter(complete.cases(response, trt, marker, stage, age))

mod <- glm(
  response ~ trt * marker + stage + poly(age, 2),
  data = d,
  family = binomial
)
mod |>
  tbl_regression(
    exponentiate = TRUE,
    label = list(age = "Age in years")
  ) |>
  bold_labels()

Marginal Predictions

Marginal Predictions at the Mean

A first approach to better understand / interpret the model consists in predicting the outcome at different values of a regressor of interest, holding the other regressors at "typical values". The estimates are then easier to interpret, as they are expressed on the scale of the outcome (here, for a binary logistic regression, as probabilities). The differences observed between the predictions at different modalities depend only on the "effect" of that regressor, since the other regressors are fixed at the same "typical values". However, not all packages use the same definition of "typical values".

The {effects} approach

The {effects} package offers an effects::Effect() function to compute marginal predictions at typical values. Although the function is named Effect(), the produced estimates are marginal predictions according to the terminology presented at the beginning of this vignette.

library(effects, quietly = TRUE)
e <- Effect("stage", mod)
e
plot(e)

To understand which "typical values" are used by effects::Effect(), let's have a look at the model matrix generated by the package and used for predictions.

e$model.matrix

The other continuous regressors are set to their observed means while the other categorical regressors are weighted according to their observed proportions. In effect, an artificial "average" individual is created, with mean age and mean marker level, partly receiving Drug A and partly Drug B. We then predict the probability of response if this individual were in stage T1, T2, T3 or T4.
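
To make these "typical values" explicit, here is a minimal sketch (using base R and {dplyr}, already loaded above) computing the means and the observed proportions that the model matrix above relies on; the column names are purely illustrative.

d |>
  summarise(
    mean_age = mean(age),       # typical value used for age
    mean_marker = mean(marker)  # typical value used for marker
  )
prop.table(table(d$trt))        # weights used for trt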

For a continuous variable, effects::Effect() will consider several values of the regressor (based on the range of observed values) to estimate marginal predictions at these different values.

e2 <- Effect("age", mod)
e2
plot(e2)

The effects::allEffects() function builds marginal predictions for all regressors, taking into account any interactions in the model.

allEffects(mod)
plot(allEffects(mod))

It is also possible to generate similar plots with ggeffects::ggeffect(). Please note that ggeffects::ggeffect() will consider, by default, only individual variables from the model and not existing interactions.

mod |>
  ggeffects::ggeffect() |>
  lapply(plot) |>
  patchwork::wrap_plots()

To generate a tibble of these results formatted in a way that can be used with tidy_plus_plus() and other {broom.helpers} tidy_*() helpers, {broom.helpers} provides a tidy_all_effects() tidier.

tidy_all_effects(mod)

It is therefore very easy to produce a nicely formatted table with gtsummary::tbl_regression() or a forest plot with ggstats::ggcoef_model().

mod |>
  tbl_regression(
    tidy_fun = tidy_all_effects,
    estimate_fun = scales::label_percent(accuracy = .1)
  ) |>
  bold_labels()
ggstats::ggcoef_model(
  mod,
  tidy_fun = tidy_all_effects,
  vline = FALSE
)

The {marginaleffects} approach at the Mean

The {marginaleffects} package makes it possible to compute marginal predictions "at the mean", i.e. by considering the mean of the other continuous regressors and the mode (i.e. the most frequently observed modality) of categorical regressors. For that, we call marginaleffects::predictions() with newdata = "mean".

library(marginaleffects)
predictions(
  mod,
  variables = "stage",
  newdata = "mean",
  by = "stage"
)

Four "mean individuals" were generated, with just the value of stage being different from one individual to the other, before predicting the probability of response.

For a continuous variable, predictions will be made, by default, at Tukey's five numbers, i.e. the minimum, the first quartile, the median, the third quartile and the maximum.

predictions(
  mod,
  variables = "age",
  newdata = "mean",
  by = "age"
)
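
As a quick check, Tukey's five numbers for age, i.e. the values at which the predictions above were computed, can be obtained with fivenum().

fivenum(d$age)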

{broom.helpers} provides a global tidier tidy_marginal_predictions() to compute the marginal predictions for each variable or combination of variables before stacking them in a single tibble. You should specify newdata = "mean" to get marginal predictions at the mean. By default, like effects::allEffects(), it considers all higher-order combinations of variables (as identified with model_list_higher_order_variables()).

mod |>
  model_list_higher_order_variables()
mod |>
  tbl_regression(
    tidy_fun = tidy_marginal_predictions,
    newdata = "mean",
    estimate_fun = scales::label_percent(accuracy = .1),
    label = list(age = "Age in years")
  ) |>
  modify_column_hide("p.value") |>
  bold_labels()

Simply specify variables_list = "no_interaction" to compute marginal predictions for each individual variable without considering existing interactions.

mod |>
  tbl_regression(
    tidy_fun = tidy_marginal_predictions,
    variables_list = "no_interaction",
    newdata = "mean",
    estimate_fun = scales::label_percent(accuracy = .1),
    label = list(age = "Age in years")
  ) |>
  modify_column_hide("p.value") |>
  bold_labels()

{broom.helpers} also includes plot_marginal_predictions() to generate a list of plots visualizing all marginal predictions. Use patchwork::wrap_plots() to combine all plots together.

p <- mod |>
  plot_marginal_predictions(newdata = "mean") |>
  patchwork::wrap_plots() &
  ggplot2::scale_y_continuous(
    labels = scales::label_percent(),
    limits = c(-0.2, 1)
  )
p[[2]] <- p[[2]] + ggplot2::xlab("Age in years")
p + patchwork::plot_annotation(
  title = "Marginal Predictions at the Mean"
)
p <- mod |>
  plot_marginal_predictions(
    "no_interaction",
    newdata = "mean"
  ) |>
  patchwork::wrap_plots() &
  ggplot2::scale_y_continuous(
    labels = scales::label_percent(),
    limits = c(-0.2, 1)
  )
p[[4]] <- p[[4]] + ggplot2::xlab("Age in years")
p + patchwork::plot_annotation(
  title = "Marginal Predictions at the Mean"
)

Alternatively, you can use ggstats::ggcoef_model(), using tidy_args to pass arguments to broom.helpers::tidy_marginal_predictions().

ggstats::ggcoef_model(
  mod,
  tidy_fun = tidy_marginal_predictions,
  tidy_args = list(newdata = "mean", variables_list = "no_interaction"),
  vline = FALSE,
  show_p_values = FALSE,
  signif_stars = FALSE,
  significance = NULL,
  variable_labels = c(age = "Age in years")
)

Average Marginal Predictions

Instead of averaging observed values to generate "typical observations" before predicting the outcome, an alternative consists in predicting the outcome for all observed values and then averaging the results.

More precisely, the purpose is to adopt a counterfactual approach. Let's take an example and consider d, the observed data used to estimate the model. We can make a copy of this dataset in which all variables are identical, except that all individuals are considered to have received Drug A. Similarly, we can generate a dataset where all individuals would have received Drug B.

dA <- d |>
  mutate(trt = "Drug A")
dB <- d |>
  mutate(trt = "Drug B")

We can now predict the outcome for all observations in dA and then compute the average, and similarly with dB.

predict(mod, newdata = dA, type = "response") |> mean()
predict(mod, newdata = dB, type = "response") |> mean()

We then obtain Average Marginal Predictions for trt. The same results could be computed with marginaleffects::avg_predictions(). Note that the counterfactual approach corresponds to the default behavior when no value is provided to newdata.

avg_predictions(mod, variables = "trt", by = "trt", type = "response")

Important: since version 0.10.0 of {marginaleffects}, we have to add type = "response" to get this result: for glm models, predictions are then made on the response scale before being averaged. If type is not specified, predictions are made on the link scale, averaged, and then back-transformed to the response scale. Thus, the average prediction may not be exactly identical to the average of the predictions.

avg_predictions(mod, variables = "trt", by = "trt")
b <- binomial()
predict(mod, newdata = dA, type = "link") |>
  mean() |>
  b$linkinv()
predict(mod, newdata = dB, type = "link") |>
  mean() |>
  b$linkinv()

We can use tidy_marginal_predictions() to get average marginal predictions for all variables and plot_marginal_predictions() for a visual representation.

mod |>
  tbl_regression(
    tidy_fun = tidy_marginal_predictions,
    type = "response",
    variables_list = "no_interaction",
    estimate_fun = scales::label_percent(accuracy = .1),
    label = list(age = "Age in years")
  ) |>
  modify_column_hide("p.value") |>
  bold_labels()
mod |>
  tbl_regression(
    tidy_fun = tidy_marginal_predictions,
    type = "response",
    estimate_fun = scales::label_percent(accuracy = .1),
    label = list(age = "Age in years")
  ) |>
  modify_column_hide("p.value") |>
  bold_labels()
p <- plot_marginal_predictions(mod, type = "response") |>
  patchwork::wrap_plots(ncol = 2) &
  ggplot2::scale_y_continuous(
    labels = scales::label_percent(),
    limits = c(-0.2, 1)
  )
p[[2]] <- p[[2]] + ggplot2::xlab("Age in years")
p + patchwork::plot_annotation(
  title = "Average Marginal Predictions"
)

Marginal Means and Marginal Predictions at Marginal Means

The {emmeans} package adopted, by default, another approach based on marginal means or estimated marginal means (a.k.a. emmeans).

It considers a grid of predictors with all combinations of the observed modalities of the categorical variables, while fixing continuous variables at their means.

Let's call marginaleffects::predictions() with newdata = "marginalmeans".

pred <- predictions(mod, newdata = "marginalmeans")
pred

As we can see, pred contains 8 rows, one for each combination of trt (2 modalities) and stage (4 modalities). age is fixed at its mean (mean(d$age)), as is marker.
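
As an illustrative check, the same "balanced" grid can be rebuilt by hand with expand.grid(): all combinations of the categorical regressors, continuous regressors fixed at their means (again assuming stage is stored as a factor in d).

grid <- expand.grid(
  trt = unique(d$trt),       # both treatment arms
  stage = levels(d$stage),   # all four stages
  marker = mean(d$marker),   # continuous regressors at their means
  age = mean(d$age)
)
predict(mod, newdata = grid, type = "response")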

Let's compute the average predictions for each value of stage.

pred |>
  group_by(stage) |>
  summarise(mean(estimate))

We can check that we obtain the same estimates as with emmeans::emmeans().

emmeans::emmeans(mod, "stage", type = "response")

These estimates could be computed, for each categorical variable, with marginaleffects::predictions() using datagrid(grid_type = "balanced")[^1].

[^1]: The function marginaleffects::marginalmeans() is now deprecated.

predictions(mod,
  by = "trt",
  newdata = datagrid(grid_type = "balanced")
)
predictions(mod,
  by = "stage",
  newdata = datagrid(grid_type = "balanced")
)

Marginal means are defined only for categorical variables. However, we can define marginal predictions at marginal means for both continuous and categorical variables, calling tidy_marginal_predictions() with the option newdata = "marginalmeans". For categorical variables, marginal predictions at marginal means will be equal to marginal means.

mod |>
  tbl_regression(
    tidy_fun = tidy_marginal_predictions,
    newdata = "marginalmeans",
    variables_list = "no_interaction",
    estimate_fun = scales::label_percent(accuracy = .1),
    label = list(age = "Age in years")
  ) |>
  modify_column_hide("p.value") |>
  bold_labels()

Alternative approaches

Marginal Predictions at the Median

They are similar to marginal predictions at the mean, except that continuous variables are fixed at the median of observed values (and categorical variables at their mode). Simply use newdata = "median".

mod |>
  tbl_regression(
    tidy_fun = tidy_marginal_predictions,
    newdata = "median",
    variables_list = "no_interaction",
    estimate_fun = scales::label_percent(accuracy = .1),
    label = list(age = "Age in years")
  ) |>
  modify_column_hide("p.value") |>
  bold_labels()

The ggeffects::ggpredict() approach

The {ggeffects} package offers a ggeffects::ggpredict() function which generates marginal predictions at the mean of continuous variables and at the first modality (used as the reference) of categorical variables. {broom.helpers} provides a tidy_ggpredict() tidier.
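
For illustration, here is a hedged sketch of the grid ggeffects::ggpredict() relies on for stage: continuous regressors at their means and the other categorical regressor (trt) at its reference level, assumed here to be "Drug A".

grid <- data.frame(
  stage = levels(d$stage),   # one row per stage level
  trt = "Drug A",            # reference (first) level of trt
  marker = mean(d$marker),   # continuous regressors at their means
  age = mean(d$age)
)
predict(mod, newdata = grid, type = "response")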

mod |>
  tbl_regression(
    tidy_fun = tidy_ggpredict,
    estimate_fun = scales::label_percent(accuracy = .1),
    label = list(age = "Age in years")
  ) |>
  bold_labels()
mod |>
  ggeffects::ggpredict() |>
  plot() |>
  patchwork::wrap_plots()

Marginal Contrasts

Now that we have a way to estimate marginal predictions, we can easily compute marginal contrasts, i.e. differences between marginal predictions.

Average Marginal Contrasts

Let's consider first a categorical variable, e.g. stage. Average Marginal Predictions are obtained with marginaleffects::avg_predictions().

pred <- avg_predictions(mod, variables = "stage", by = "stage", type = "response")
pred

The contrast between "T2" and "T1" is simply the difference between the two adjusted predictions:

pred$estimate[2] - pred$estimate[1]

The marginaleffects::avg_comparisons() function computes all differences between adjusted predictions.

comp <- avg_comparisons(mod, variables = "stage")
comp

Note: in fact, avg_comparisons() computes the contrasts for each observed value before averaging them. By construction, this is equivalent to the difference of the average marginal predictions.

As the contrasts have been averaged over the observed values, we can call them average marginal contrasts.
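
As a hedged manual check of, say, the "T2 - T1" row above, we can predict the outcome for every observation as if it were in stage T2, then as if it were in stage T1, and average the differences (the counterfactual approach described earlier).

d_T1 <- d |> mutate(stage = factor("T1", levels = levels(d$stage)))
d_T2 <- d |> mutate(stage = factor("T2", levels = levels(d$stage)))
mean(
  predict(mod, newdata = d_T2, type = "response") -
    predict(mod, newdata = d_T1, type = "response")
)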

By default, each modality is contrasted with the first one taken as a reference.

avg_comparisons(mod, variables = "stage")

Other types of contrasts could be specified using the variables argument.

avg_comparisons(mod, variables = list(stage = "sequential"))
avg_comparisons(mod, variables = list(stage = "pairwise"))

Let's consider a continuous variable:

avg_comparisons(mod, variables = "age")

By default, marginaleffects::avg_comparisons() computes, for each observed value, the effect of increasing age by one unit (comparing adjusted predictions when the regressor is equal to its observed value minus 0.5 and its observed value plus 0.5). It is possible to compute a contrast for another gap, for example the average difference for an increase of 10 years:

avg_comparisons(mod, variables = list(age = 10))
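
As a rough manual sketch of the default one-unit contrast described above, we can predict the outcome at the observed age minus 0.5 and plus 0.5 for every observation and average the differences; the result should be close to the age row of avg_comparisons(mod, variables = "age").

d_lo <- d |> mutate(age = age - 0.5)
d_hi <- d |> mutate(age = age + 0.5)
mean(
  predict(mod, newdata = d_hi, type = "response") -
    predict(mod, newdata = d_lo, type = "response")
)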

Contrasts for all individual predictors could be easily obtained:

avg_comparisons(mod)

It should be noted that column names are not consistent with other tidiers used by {broom.helpers}. Therefore, a comparisons object should not be passed directly to tidy_plus_plus(). Instead, you should use broom.helpers::tidy_avg_comparisons().

tidy_avg_comparisons(mod)

This custom tidier is compatible with tidy_plus_plus() and the suite of other functions provided by {broom.helpers}.

mod |>
  tidy_plus_plus(tidy_fun = tidy_avg_comparisons)

A nicely formatted table can therefore be generated with gtsummary::tbl_regression().

mod |>
  tbl_regression(
    tidy_fun = tidy_avg_comparisons,
    estimate_fun = scales::label_percent(style_positive = "plus"),
    label = list(age = "Age in years")
  ) |>
  bold_labels()

Similarly, a forest plot could be produced with ggstats::ggcoef_model().

ggstats::ggcoef_model(
  mod,
  tidy_fun = tidy_avg_comparisons,
  variable_labels = c(age = "Age in years")
) +
  ggplot2::scale_x_continuous(
    labels = scales::label_percent(style_positive = "plus")
  )

Marginal Contrasts at the Mean

Instead of computing contrasts for each observed value before averaging, another approach consists of considering a hypothetical individual whose characteristics correspond to the "average", before predicting the outcome and computing contrasts.

This can be achieved with {marginaleffects} by using newdata = "mean". In that case, it will consider an individual whose continuous predictors are equal to the mean of observed values and whose categorical predictors are set to the mode (i.e. the most frequent value) of the observed values.

pred <- predictions(mod, variables = "trt", newdata = "mean")
pred
pred$estimate[2] - pred$estimate[1]
comparisons(mod, variables = "trt", newdata = "mean")

The newdata argument can be passed to tidy_avg_comparisons(), tidy_plus_plus() or gtsummary::tbl_regression().

mod |>
  tbl_regression(
    tidy_fun = tidy_avg_comparisons,
    newdata = "mean",
    estimate_fun = scales::label_percent(style_positive = "plus"),
    label = list(age = "Age in years")
  ) |>
  bold_labels()

For ggstats::ggcoef_model(), use tidy_args to pass newdata = "mean".

mod |>
  ggstats::ggcoef_model(
    tidy_fun = tidy_avg_comparisons,
    tidy_args = list(newdata = "mean"),
    variable_labels = c(age = "Age in years")
  ) +
  ggplot2::scale_x_continuous(
    labels = scales::label_percent(style_positive = "plus")
  )

Alternative approaches

Other assumptions, such as "marginalmeans" or "median", could be defined using newdata. See the documentation of marginaleffects::comparisons().

mod |>
  tbl_regression(
    tidy_fun = tidy_avg_comparisons,
    newdata = "marginalmeans",
    estimate_fun = scales::label_percent(style_positive = "plus"),
    label = list(age = "Age in years")
  ) |>
  bold_labels()
mod |>
  ggstats::ggcoef_model(
    tidy_fun = tidy_avg_comparisons,
    tidy_args = list(newdata = "marginalmeans"),
    variable_labels = c(age = "Age in years")
  ) +
  ggplot2::scale_x_continuous(
    labels = scales::label_percent(style_positive = "plus")
  )

Dealing with interactions

In our model, we defined an interaction between trt and marker. Therefore, we could be interested in computing the contrast of marker for each value of trt.

avg_comparisons(
  mod,
  variables = list(marker = 1),
  newdata = datagrid(
    trt = unique,
    grid_type = "counterfactual"
  ),
  by = "trt"
)

Alternatively, it is possible to compute "cross-contrasts" showing what happens when both marker and trt change.

avg_comparisons(
  mod,
  variables = list(marker = 1, trt = "reference"),
  cross = TRUE
)

The tidier tidy_marginal_contrasts() allows computing contrasts for several combinations of variables directly and stacking all the results in a single tibble.

mod |>
  tbl_regression(
    tidy_fun = tidy_marginal_contrasts,
    estimate_fun = scales::label_percent(style_positive = "plus"),
    label = list(age = "Age in years")
  ) |>
  bold_labels()
ggstats::ggcoef_model(
  mod,
  tidy_fun = tidy_marginal_contrasts,
  variable_labels = c(age = "Age in years")
)

By default, when there is an interaction, contrasts are computed for the last variable of the interaction according to the different values of the first variables (if one of these variables is continuous, Tukey's five numbers are used).

The option variables_list = "cross" could be used to get "cross-contrasts" for interactions.

mod |>
  tbl_regression(
    tidy_fun = tidy_marginal_contrasts,
    variables_list = "cross",
    estimate_fun = scales::label_percent(style_positive = "plus"),
    label = list(age = "Age in years")
  ) |>
  bold_labels()

The option variables_list = "no_interaction" could be used to get the average marginal contrasts for each variable without considering interactions.

mod |>
  tbl_regression(
    tidy_fun = tidy_marginal_contrasts,
    variables_list = "no_interaction",
    estimate_fun = scales::label_percent(style_positive = "plus"),
    label = list(age = "Age in years")
  ) |>
  bold_labels()
ggstats::ggcoef_model(
  mod,
  tidy_fun = tidy_marginal_contrasts,
  tidy_args = list(variables_list = "no_interaction"),
  variable_labels = c(age = "Age in years")
)

As before, to display marginal contrasts at the mean, indicate newdata = "mean". For more information on how to customize the combination of variables, see the documentation and examples of tidy_marginal_contrasts().

Marginal Effects / Marginal Slopes

Marginal effects are similar to marginal contrasts with a subtle difference. For a continuous regressor, a marginal contrast could be seen as a difference while a marginal effect is a partial derivative. Put differently, the marginal effect of a continuous regressor $x$ is the slope of the prediction function $y$, measured at a specific value of $x$, i.e. ${\partial y}/{\partial x}$.

Marginal effects are expressed according to the scale of the model and represent the expected change on the outcome for an increase of one unit of the regressor.
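
As an illustration of the derivative definition, here is a hedged finite-difference sketch of the slope of the predicted probability with respect to marker, evaluated at a single, purely illustrative observation (mean marker and age, trt = "Drug A", stage = "T1").

eps <- 1e-5
x0 <- data.frame(
  trt = "Drug A",
  marker = mean(d$marker),
  age = mean(d$age),
  stage = factor("T1", levels = levels(d$stage))
)
x1 <- x0
x1$marker <- x1$marker + eps
# numerical slope: change in predicted probability per unit of marker
(predict(mod, newdata = x1, type = "response") -
    predict(mod, newdata = x0, type = "response")) / eps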

By definition, marginal effects are not defined for categorical variables, marginal contrasts being reported instead.

Like marginal contrasts, several approaches exist to compute marginal effects. For more details, see the dedicated vignette of the {marginaleffects} package.

Average Marginal Effects (AME)

A marginal effect is computed for each observed value before being averaged, using marginaleffects::avg_slopes().

avg_slopes(mod)
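
As a rough numerical check, the average marginal effect of marker can be approximated by a finite-difference slope computed for each observation and then averaged; the result should be close to the marker row above.

eps <- 1e-5
d_eps <- d |> mutate(marker = marker + eps)
mean(
  (predict(mod, newdata = d_eps, type = "response") -
     predict(mod, newdata = d, type = "response")) / eps
)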

Column names are not consistent with other tidiers used by {broom.helpers}. Use tidy_avg_slopes() instead.

mod |>
  tbl_regression(
    tidy_fun = tidy_avg_slopes,
    estimate_fun = scales::label_percent(style_positive = "plus"),
    label = list(age = "Age in years")
  ) |>
  bold_labels()
mod |>
  ggstats::ggcoef_model(
    tidy_fun = tidy_avg_slopes,
    variable_labels = c(age = "Age in years")
  ) +
  ggplot2::scale_x_continuous(
    labels = scales::label_percent(style_positive = "plus")
  )

Please note that for categorical variables, marginal contrasts are returned.

The same results could be obtained with the margins::margins() function, inspired by Stata's margins command. As margins::margins() is not compatible with stats::poly(), we will rewrite our model, replacing poly(age, 2) with age + I(age^2) (note that writing age^2 directly in an R formula would not create a quadratic term).

mod_alt <- glm(
  response ~ trt * marker + stage + age + I(age^2),
  data = d,
  family = binomial
)
margins::margins(mod_alt) |> broom::tidy()

For {broom.helpers}, {gtsummary} or {ggstats}, use tidy_margins().

mod_alt |>
  tbl_regression(
    tidy_fun = tidy_margins,
    estimate_fun = scales::label_percent(style_positive = "plus")
  ) |>
  bold_labels()

Marginal Effects at the Mean (MEM)

For marginal effects at the mean[^2], simply use newdata = "mean".

[^2]: More precisely, {marginaleffects} uses the mean of continuous variables and the mode of categorical variables.

mod |>
  tbl_regression(
    tidy_fun = tidy_avg_slopes,
    newdata = "mean",
    estimate_fun = scales::label_percent(style_positive = "plus"),
    label = list(age = "Age in years")
  ) |>
  bold_labels()
mod |>
  ggstats::ggcoef_model(
    tidy_fun = tidy_avg_slopes,
    tidy_args = list(newdata = "mean"),
    variable_labels = c(age = "Age in years")
  ) +
  ggplot2::scale_x_continuous(
    labels = scales::label_percent(style_positive = "plus")
  )

Marginal Effects at Marginal Means

Simply use newdata = "marginalmeans".

mod |>
  tbl_regression(
    tidy_fun = tidy_avg_slopes,
    newdata = "marginalmeans",
    estimate_fun = scales::label_percent(style_positive = "plus"),
    label = list(age = "Age in years")
  ) |>
  bold_labels()
mod |>
  ggstats::ggcoef_model(
    tidy_fun = tidy_avg_slopes,
    tidy_args = list(newdata = "marginalmeans"),
    variable_labels = c(age = "Age in years")
  ) +
  ggplot2::scale_x_continuous(
    labels = scales::label_percent(style_positive = "plus")
  )

Further readings


