View source: R/estimate_means.R
estimate_means | R Documentation |
Estimate the average value of the response variable at each factor level or
representative value, i.e., at values defined in a "data grid" or
"reference grid". For plotting, check the examples in
visualisation_recipe(). See also other related functions such as
estimate_contrasts() and estimate_slopes().
estimate_means(
model,
by = "auto",
predict = NULL,
ci = 0.95,
estimate = NULL,
transform = NULL,
keep_iterations = FALSE,
backend = NULL,
verbose = TRUE,
...
)
model: A statistical model.

by: The (focal) predictor variable(s) at which to evaluate the desired
effect / mean / contrasts. Other predictors of the model that are not
included here will be collapsed and "averaged" over (the effect will be
estimated across them).

predict: Specifies the scale (or distributional parameter) of the
predictions. See also section Predictions on different scales.

ci: Confidence Interval (CI) level. Defaults to 0.95 (95%).

estimate: The method used to compute the estimates. You can set a default
option for the estimate argument via options(modelbased_estimate = <string>)
(see the global options described below).

transform: A function applied to predictions and confidence intervals
to (back-) transform results, which can be useful in case the regression
model has a transformed response variable (e.g., a log-transformed outcome,
where transform = exp maps estimates back to the original scale).

keep_iterations: If TRUE, all iterations (draws) of bootstrapped or
Bayesian models are kept. Defaults to FALSE.

backend: Whether to use "emmeans" or "marginaleffects" to compute the
estimates. You can set a default backend via
options(modelbased_backend = <string>) (see the global options described
below).

verbose: Use FALSE to silence messages and warnings.

...: Other arguments passed to the underlying functions, for instance to
insight::get_datagrid() (e.g., length or range) or to functions from the
backend package.
The estimate_slopes(), estimate_means() and estimate_contrasts()
functions form a group, as they are all based on marginal
estimations (estimations based on a model). All three are built on the
emmeans or marginaleffects package (depending on the backend
argument), so reading their documentation (for instance emmeans::emmeans(),
emmeans::emtrends() or the respective package websites) is
recommended to understand the idea behind these types of procedures.
Model-based predictions are the basis for all that follows. Indeed,
the first thing to understand is how models can be used to make predictions
(see estimate_link()). This corresponds to the predicted response (or
"outcome variable") given specific values of the predictors (i.e.,
given a specific data configuration). This is why the concept of a reference
grid is so important for direct predictions.
Marginal "means", obtained via estimate_means(), are an extension
of such predictions, allowing one to "average" (collapse) some of the
predictors to obtain the average response value at a specific configuration
of predictors. This is typically used when some of the predictors of
interest are factors. Indeed, the parameters of the model will usually give
you the intercept value and then the "effect" of each factor level (how
different it is from the intercept). Marginal means can directly give you
the mean value of the response variable at all the levels of a factor.
Moreover, they can also be used to control for, or average over, predictors,
which is useful in the case of multiple predictors with or without
interactions.
Marginal contrasts, obtained via estimate_contrasts(), are
themselves an extension of marginal means, in that they allow one to
investigate the difference (i.e., the contrast) between the marginal means.
This is, again, often used to get all pairwise differences between all
levels of a factor. It also works for continuous predictors; for instance,
one could be interested in whether the difference at two extremes of a
continuous predictor is significant.
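A minimal sketch of such contrasts, assuming the iris data (the model and variable choices are purely illustrative):

```r
library(modelbased)

# Model with a factor (Species) and a continuous covariate
model <- lm(Petal.Length ~ Species + Sepal.Width, data = iris)

# All pairwise differences between the marginal means of the Species levels
estimate_contrasts(model, contrast = "Species")

# Difference between two "extremes" of a continuous predictor
estimate_contrasts(model, contrast = "Sepal.Width = c(2, 4)")
```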
Finally, marginal effects, obtained via estimate_slopes(), are
different in that their focus is not on values of the response variable, but
on the model's parameters. The idea is to assess the effect of a predictor
at a specific configuration of the other predictors. This is relevant in the
case of interactions or non-linear relationships, when the effect of a
predictor variable changes depending on the other predictors. Moreover,
these effects can also be "averaged" over other predictors, to obtain, for
instance, the "general trend" of a predictor over different factor levels.
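As a sketch (again assuming the iris data, with an illustrative interaction model), the slope of a continuous predictor can be estimated within each factor level, or averaged over them:

```r
library(modelbased)

# Interaction model: the effect of Sepal.Width may differ across Species
model <- lm(Petal.Length ~ Species * Sepal.Width, data = iris)

# Slope of Sepal.Width within each Species level
estimate_slopes(model, trend = "Sepal.Width", by = "Species")

# "General trend": slope of Sepal.Width averaged over Species
estimate_slopes(model, trend = "Sepal.Width")
```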
Example: Let's imagine the following model lm(y ~ condition * x), where
condition is a factor with 3 levels A, B and C, and x a continuous
variable (like age, for example). One idea is to see how this model
performs, and to compare the actual response y to the one predicted by the
model (using estimate_expectation()). Another idea is to evaluate the
average value of the response at each of the condition's levels (using
estimate_means()), which can be useful to visualize them. Another
possibility is to evaluate the difference between these levels (using
estimate_contrasts()). Finally, one could also estimate the effect of x
averaged over all conditions, or instead within each condition (using
estimate_slopes()).
A data frame of estimated marginal means.
To define representative values for focal predictors (specified in by
,
contrast
, and trend
), you can use several methods. These values are
internally generated by insight::get_datagrid()
, so consult its
documentation for more details.
You can directly specify values as strings or lists for by, contrast,
and trend.

For numeric focal predictors, use examples like by = "gear = c(4, 8)",
by = list(gear = c(4, 8)) or by = "gear = 5:10".

For factor or character predictors, use
by = "Species = c('setosa', 'virginica')" or
by = list(Species = c('setosa', 'virginica')).

You can use "shortcuts" within square brackets, such as
by = "Sepal.Width = [sd]" or by = "Sepal.Width = [fivenum]".
For numeric focal predictors, if no representative values are specified,
length
and range
control the number and type of representative values:
length
determines how many equally spaced values are generated.
range
specifies the type of values, like "range"
or "sd"
.
length
and range
apply to all numeric focal predictors.
If you have multiple numeric predictors, length
and range
can accept
multiple elements, one for each predictor.
For integer variables, only values that appear in the data will be included
in the data grid, independently of the length argument. This behaviour
can be changed by setting protect_integers = FALSE, which will then treat
integer variables as numeric (and possibly produce fractions).
See also this vignette for some examples.
The predict argument allows generating predictions on different scales of
the response variable. The "link" option does not apply to all models, and
usually not to Gaussian models. "link" will leave the values on the scale of
the linear predictor. "response" (or NULL) will transform them to the scale
of the response variable. Thus, for a logistic model, "link" will give
estimates expressed in log-odds (probabilities on the logit scale) and
"response" in terms of probabilities.
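For a logistic model, the difference between the two scales can be sketched as follows (assuming the mtcars data; the model is illustrative):

```r
library(modelbased)

# Logistic regression: probability of a V-shaped engine by car weight
model <- glm(vs ~ wt, data = mtcars, family = binomial())

# Estimates in log-odds (scale of the linear predictor)
estimate_means(model, by = "wt", predict = "link")

# Estimates as probabilities (scale of the response)
estimate_means(model, by = "wt", predict = "response")
```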
To predict distributional parameters (called "dpar" in other packages), for
instance when using complex formulae in brms
models, the predict
argument
can take the value of the parameter you want to estimate, for instance
"sigma"
, "kappa"
, etc.
"response"
and "inverse_link"
both return predictions on the response
scale, however, "response"
first calculates predictions on the response
scale for each observation and then aggregates them by groups or levels
defined in by
. "inverse_link"
first calculates predictions on the link
scale for each observation, then aggregates them by groups or levels defined
in by
, and finally back-transforms the predictions to the response scale.
Both approaches have advantages and disadvantages. "response" usually
produces less biased predictions, but confidence intervals might be outside
reasonable bounds (e.g., they can be negative for count data). The
"inverse_link" approach is more robust in terms of confidence intervals,
but might produce biased predictions. However, you can try setting
bias_correction = TRUE to adjust for this bias.
In particular for mixed models, using "response"
is recommended, because
averaging across random effects groups is then more accurate.
modelbased_backend: options(modelbased_backend = <string>) will set a
default value for the backend argument and can be used to set the package
used by default to calculate marginal means. Can be "marginaleffects" or
"emmeans".
modelbased_estimate
: options(modelbased_estimate = <string>)
will
set a default value for the estimate
argument.
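For instance, such defaults could be set at the top of a script (the "typical" value for modelbased_estimate is an assumption here; consult the documentation of the estimate argument for valid values):

```r
# Use the emmeans backend by default for all modelbased calls
options(modelbased_backend = "emmeans")

# Default method for the `estimate` argument ("typical" is assumed here)
options(modelbased_estimate = "typical")
```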
Chatton, A., and Rohrer, J. M. (2024). The Causal Cookbook: Recipes for Propensity Scores, G-Computation, and Doubly Robust Standardization. Advances in Methods and Practices in Psychological Science, 7(1). doi:10.1177/25152459241236149
Dickerman, B. A., and Hernán, M. A. (2020). Counterfactual Prediction Is Not Only for Causal Inference. European Journal of Epidemiology, 35(7), 615-617. doi:10.1007/s10654-020-00659-8
Heiss, A. (2022). Marginal and conditional effects for GLMMs with marginaleffects. Andrew Heiss. doi:10.59350/xwnfm-x1827
library(modelbased)
# Frequentist models
# -------------------
model <- lm(Petal.Length ~ Sepal.Width * Species, data = iris)
estimate_means(model)
# the `length` argument is passed to `insight::get_datagrid()` and modulates
# the number of representative values to return for numeric predictors
estimate_means(model, by = c("Species", "Sepal.Width"), length = 2)
# an alternative way to set up your data grid is to specify the values directly
estimate_means(model, by = c("Species", "Sepal.Width = c(2, 4)"))
# or use one of the many predefined "tokens" that help you create a useful
# data grid - to learn more about creating data grids, see help in
# `?insight::get_datagrid`.
estimate_means(model, by = c("Species", "Sepal.Width = [fivenum]"))
## Not run:
# same for factors: filter by specific levels
estimate_means(model, by = "Species = c('versicolor', 'setosa')")
estimate_means(model, by = c("Species", "Sepal.Width = 0"))
# estimate marginal average of response at values for numeric predictor
estimate_means(model, by = "Sepal.Width", length = 5)
estimate_means(model, by = "Sepal.Width = c(2, 4)")
# or provide the definition of the data grid as list
estimate_means(
model,
by = list(Sepal.Width = c(2, 4), Species = c("versicolor", "setosa"))
)
# Methods that can be applied to it:
means <- estimate_means(model, by = c("Species", "Sepal.Width = 0"))
plot(means) # which runs visualisation_recipe()
standardize(means)
# grids for numeric predictors, combine range and length
model <- lm(Sepal.Length ~ Sepal.Width * Petal.Length, data = iris)
# create a "grid": value range for first numeric predictor, mean +/-1 SD
# for remaining numeric predictors.
estimate_means(model, c("Sepal.Width", "Petal.Length"), range = "grid")
# range from minimum to maximum spread over four values,
# and mean +/- 1 SD (a total of three values)
estimate_means(
model,
by = c("Sepal.Width", "Petal.Length"),
range = c("range", "sd"),
length = c(4, 3)
)
data <- iris
data$Petal.Length_factor <- ifelse(data$Petal.Length < 4.2, "A", "B")
model <- lme4::lmer(
Petal.Length ~ Sepal.Width + Species + (1 | Petal.Length_factor),
data = data
)
estimate_means(model)
estimate_means(model, by = "Sepal.Width", length = 3)
## End(Not run)