estimate_expectation: Model-based predictions
In easystats/estimate: Estimation of Model-Based Predictions, Contrasts and Means

Description

After fitting a model, it is useful to generate model-based estimates of the response variable for different combinations of predictor values. Such estimates can be used to make inferences about relationships between variables, to make predictions about individual cases, or to compare the predicted values against the observed data.

The modelbased package includes 4 related functions that mostly differ in their default arguments (in particular, data and predict):

estimate_prediction(data = NULL, predict = "prediction", ...)
estimate_expectation(data = NULL, predict = "expectation", ...)
estimate_relation(data = "grid", predict = "expectation", ...)
estimate_link(data = "grid", predict = "link", ...)

While they are all based on model-based predictions (using insight::get_predicted()), they differ in the type of predictions they make by default. For instance, estimate_prediction() and estimate_expectation() return predictions for the original data used to fit the model, while estimate_relation() and estimate_link() return predictions on a data grid created with insight::get_datagrid(). Similarly, estimate_link() returns predictions on the link scale, while the others return predictions on the response scale. Note that the relevance of these differences depends on the model family (for instance, for linear models, estimate_relation() is equivalent to estimate_link(), since there is no difference between the link scale and the response scale).
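The role of the defaults can be sketched with a small linear model. This is an illustrative example, not part of the package's own examples; it assumes the output contains a `Predicted` column and that passing data = "grid" to estimate_expectation() reproduces the default behaviour of estimate_relation(), as the signatures above suggest.

```r
library(modelbased)

model <- lm(mpg ~ wt, data = mtcars)

# estimate_relation() defaults to data = "grid"; giving estimate_expectation()
# the same value should yield the same reference-grid predictions.
grid_default <- estimate_relation(model)
grid_explicit <- estimate_expectation(model, data = "grid")

# With the default data = NULL, predictions are made for the original rows.
fitted_rows <- estimate_expectation(model)

nrow(grid_default) # number of grid points
nrow(fitted_rows)  # number of observations used to fit the model
```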

Note that you can run plot() on the output of these functions to get some visual insights (see the plotting examples).

See the sections below for details about the different possibilities.

Usage

estimate_expectation(
  model,
  data = NULL,
  by = NULL,
  predict = "expectation",
  ci = 0.95,
  transform = NULL,
  keep_iterations = FALSE,
  ...
)

estimate_link(
  model,
  data = "grid",
  by = NULL,
  predict = "link",
  ci = 0.95,
  transform = NULL,
  keep_iterations = FALSE,
  ...
)

estimate_prediction(
  model,
  data = NULL,
  by = NULL,
  predict = "prediction",
  ci = 0.95,
  transform = NULL,
  keep_iterations = FALSE,
  ...
)

estimate_relation(
  model,
  data = "grid",
  by = NULL,
  predict = "expectation",
  ci = 0.95,
  transform = NULL,
  keep_iterations = FALSE,
  ...
)

Arguments

`model`	A statistical model.
`data`	A data frame containing the model's predictors, used to estimate the response. If `NULL`, the model's data is used. If `"grid"`, a reference grid is created (through `insight::get_datagrid()`).
`by`	The predictor variable(s) at which to estimate the response. Other predictors of the model that are not included here will be set to their mean value (for numeric predictors), reference level (for factors) or mode (other types). The `by` argument will be used to create a data grid via `insight::get_datagrid()`, which will then be used as `data` argument. Thus, you cannot specify both `data` and `by`, but only one of these two arguments.
`predict`	This parameter controls what is predicted (and gets internally passed to `insight::get_predicted()`). In most cases, you don't need to care about it: it is changed automatically according to the different predicting functions (i.e., `estimate_expectation()`, `estimate_prediction()`, `estimate_link()` or `estimate_relation()`). The only time you might be interested in manually changing it is to estimate other distributional parameters (called "dpar" in other packages) - for instance when using complex formulae in `brms` models. The `predict` argument can then be set to the parameter you want to estimate, for instance `"sigma"`, `"kappa"`, etc. Note that the distinction between `"expectation"`, `"link"` and `"prediction"` does not then apply (as you are directly predicting the value of some distributional parameter), and the corresponding functions will then only differ in the default value of their `data` argument.
`ci`	Confidence Interval (CI) level. Defaults to `0.95` (95%).
`transform`	A function applied to predictions and confidence intervals to (back-) transform results, which can be useful in case the regression model has a transformed response variable (e.g., `lm(log(y) ~ x)`). Can also be `TRUE`, in which case `insight::get_transformation()` is called to determine the appropriate transformation-function. Note that no standard errors are returned when transformations are applied.
`keep_iterations`	If `TRUE`, will keep all iterations (draws) of bootstrapped or Bayesian models. They will be added as additional columns named `iter_1`, `iter_2`, and so on. If `keep_iterations` is a positive number, only as many columns as indicated in `keep_iterations` will be added to the output. You can reshape them to a long format by running `bayestestR::reshape_iterations()`.
`...`	You can add all the additional control arguments from `insight::get_datagrid()` (used when `data = "grid"`) and `insight::get_predicted()`.
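A few of these arguments can be combined as in the following sketch. It is a hedged illustration rather than a reference: the model formulas are chosen only for demonstration, and the column names `Predicted`, `CI_low` and `CI_high` are assumed to be present in the output.

```r
library(modelbased)

model <- lm(mpg ~ wt + hp, data = mtcars)

# `by` builds a data grid over wt; hp is held at its mean.
by_wt <- estimate_expectation(model, by = "wt")

# `ci` changes the level of the uncertainty interval.
by_wt_90 <- estimate_expectation(model, by = "wt", ci = 0.90)

# `transform` back-transforms predictions and intervals, here for a model
# fitted to a log-transformed response.
log_model <- lm(log(mpg) ~ wt, data = mtcars)
back <- estimate_expectation(log_model, by = "wt", transform = exp)
```

Because `by` and `data` both define where predictions are made, only one of them is supplied in each call.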

Value

A data frame of predicted values and uncertainty intervals, with class "estimate_predicted". Methods for visualisation_recipe() and plot() are available.

Expected (average) values

The most important way that various types of response estimates differ is in terms of what quantity is being estimated and the meaning of the uncertainty intervals. The major choices are expected values for uncertainty in the regression line and predicted values for uncertainty in the individual case predictions.

Expected values refer to the fitted regression line - the estimated average response value (i.e., the "expectation") for individuals with specific predictor values. For example, in a linear model y = 2 + 3x + 4z + e, the estimated average y for individuals with x = 1 and z = 2 is 2 + 3(1) + 4(2) = 13.

For expected values, uncertainty intervals refer to uncertainty in the estimated conditional average (where might the true regression line actually fall?). Uncertainty intervals for expected values are also called "confidence intervals".

Expected values and their uncertainty intervals are useful for describing the relationship between variables and for describing how precisely a model has been estimated.
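The expectation for the linear model example above, and a confidence interval around it, can be reproduced with base R. This is a minimal sketch: the simulated data, the seed and the sample size are assumptions made purely for illustration.

```r
# The expectation is the linear predictor evaluated at the chosen values.
coefs <- c(intercept = 2, x = 3, z = 4)
new_case <- c(intercept = 1, x = 1, z = 2)
expected_y <- sum(coefs * new_case)
expected_y

# Simulate data from the same model, refit it, and ask for the confidence
# interval of the conditional average at x = 1, z = 2.
set.seed(1)
d <- data.frame(x = rnorm(200), z = rnorm(200))
d$y <- 2 + 3 * d$x + 4 * d$z + rnorm(200)
fit <- lm(y ~ x + z, data = d)
ci <- predict(fit, newdata = data.frame(x = 1, z = 2), interval = "confidence")
ci
```

The interval describes uncertainty about the average response at these predictor values, not about where a single new observation would fall.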

For generalized linear models, expected values are reported on one of two scales:

The link scale refers to the scale of the fitted regression line, after transformation by the link function. For example, for a logistic regression (logit binomial) model, the link scale gives expected log-odds. For a log-link Poisson model, the link scale gives the expected log-count.
The response scale refers to the original scale of the response variable (i.e., without any link function transformation). Expected values on the link scale are back-transformed to the original response variable metric (e.g., expected probabilities for binomial models, expected counts for Poisson models).
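The relation between the two scales can be checked with base R for a logistic regression. The sketch below uses stats::predict() rather than modelbased, and the predictor values are arbitrary.

```r
model <- glm(vs ~ wt, data = mtcars, family = binomial())
new_data <- data.frame(wt = c(2, 3, 4))

# Link scale: expected log-odds.
log_odds <- predict(model, newdata = new_data, type = "link")

# Response scale: expected probabilities.
prob <- predict(model, newdata = new_data, type = "response")

# Applying the inverse link (the logistic function) to the log-odds
# recovers the probabilities.
all.equal(plogis(log_odds), prob)
```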

Individual case predictions

In contrast to expected values, predicted values refer to predictions for individual cases. Predicted values are also called "posterior predictions" or "posterior predictive draws".

For predicted values, uncertainty intervals refer to uncertainty in the individual response values for each case (where might any single case actually fall?). Uncertainty intervals for predicted values are also called "prediction intervals" or "posterior predictive intervals".

Predicted values and their uncertainty intervals are useful for forecasting the range of values that might be observed in new data, for making decisions about individual cases, and for checking if model predictions are reasonable ("posterior predictive checks").

Predicted values and intervals are always on the scale of the original response variable (not the link scale).
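The difference between the two kinds of interval can be seen with base R for a simple linear model. This is an illustrative sketch using stats::predict(); the value wt = 3 is arbitrary.

```r
model <- lm(mpg ~ wt, data = mtcars)
new_data <- data.frame(wt = 3)

# Uncertainty about the average mpg of cars with wt = 3.
conf <- predict(model, newdata = new_data, interval = "confidence")

# Uncertainty about the mpg of a single new car with wt = 3.
pred <- predict(model, newdata = new_data, interval = "prediction")

# Same point estimate, but the prediction interval is wider because it
# also includes the residual variability of individual cases.
rbind(confidence = conf[1, ], prediction = pred[1, ])
```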

Functions for estimating predicted values and uncertainty

modelbased provides 4 functions for generating model-based response estimates and their uncertainty:

estimate_expectation():
- Generates expected values (conditional average) on the response scale.
- The uncertainty interval is a confidence interval.
- By default, values are computed using the data used to fit the model.
estimate_link():
- Generates expected values (conditional average) on the link scale.
- The uncertainty interval is a confidence interval.
- By default, values are computed using a reference grid spanning the observed range of predictor values (see insight::get_datagrid()).
estimate_prediction():
- Generates predicted values (for individual cases) on the response scale.
- The uncertainty interval is a prediction interval.
- By default, values are computed using the data used to fit the model.
estimate_relation():
- Like estimate_expectation().
- Useful for visualizing a model.
- Generates expected values (conditional average) on the response scale.
- The uncertainty interval is a confidence interval.
- By default, values are computed using a reference grid spanning the observed range of predictor values (see insight::get_datagrid()).
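The differences listed above can be explored side by side. The following is a hedged sketch that assumes the output columns are named `Predicted`, `CI_low` and `CI_high`, and that estimate_link() and estimate_relation() build the same default grid for the same model.

```r
library(modelbased)

# Link scale vs. response scale for a logistic model.
logit_model <- glm(vs ~ wt, data = mtcars, family = "binomial")
on_link <- estimate_link(logit_model)
on_response <- estimate_relation(logit_model)

# Back-transforming the link-scale estimates should match the response scale.
all.equal(plogis(as.numeric(on_link$Predicted)), as.numeric(on_response$Predicted))

# Confidence vs. prediction intervals for a linear model.
lin_model <- lm(mpg ~ wt, data = mtcars)
expectation <- estimate_expectation(lin_model)
prediction <- estimate_prediction(lin_model)

# Interval widths for the first few observations.
head(expectation$CI_high - expectation$CI_low)
head(prediction$CI_high - prediction$CI_low)
```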

Data for predictions

If data = NULL, values are estimated using the data used to fit the model. If data = "grid", values are computed using a reference grid spanning the observed range of predictor values with insight::get_datagrid(). This can be useful for model visualization. The number of predictor values used for each variable can be controlled with the length argument. data can also be a data frame containing columns with names matching the model frame (see insight::get_data()). This can be used to generate model predictions for specific combinations of predictor values.
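The options for data can be sketched as follows. The grid size and the predictor values are arbitrary choices for illustration, and the `Predicted` column name is assumed.

```r
library(modelbased)

model <- lm(mpg ~ wt, data = mtcars)

# Reference grid with 5 values of wt (length is passed on to
# insight::get_datagrid()).
coarse <- estimate_relation(model, length = 5)

# Predictions for specific, user-supplied predictor values.
custom <- estimate_expectation(model, data = data.frame(wt = c(2, 3, 4)))

nrow(coarse)
custom
```

The column names in the supplied data frame need to match the predictors in the model, here `wt`.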

Note

These functions are built on top of insight::get_predicted() and correspond to different specifications of its parameters. It may be useful to read its documentation, in particular the description of the predict argument for additional details on the difference between expected vs. predicted values and link vs. response scales.

Additional control parameters can be used to control results from insight::get_datagrid() (when data = "grid") and from insight::get_predicted() (the function used internally to compute predictions).

For plotting, check the examples in visualisation_recipe(). Also check out the vignettes and README for further examples, tutorials and use cases.

Examples


library(modelbased)

# Linear Models
model <- lm(mpg ~ wt, data = mtcars)

# Get expected values with confidence intervals for the original data
# (see insight::get_predicted)
estimate_expectation(model)

# Get expected values with confidence intervals on a reference grid
pred <- estimate_relation(model)
pred

# Visualisation (see visualisation_recipe())
plot(pred)

# Standardize predictions
pred <- estimate_relation(lm(mpg ~ wt + am, data = mtcars))
z <- standardize(pred, include_response = FALSE)
z
unstandardize(z, include_response = FALSE)

# Logistic Models
model <- glm(vs ~ wt, data = mtcars, family = "binomial")
estimate_expectation(model)
estimate_relation(model)

# Mixed models
model <- lme4::lmer(mpg ~ wt + (1 | gear), data = mtcars)
estimate_expectation(model)
estimate_relation(model)

# Bayesian models

model <- suppressWarnings(rstanarm::stan_glm(
  mpg ~ wt,
  data = mtcars, refresh = 0, iter = 200
))
estimate_expectation(model)
estimate_relation(model)

easystats/estimate documentation built on April 5, 2025, 1:36 p.m.