knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(regressinator) library(ggplot2) library(broom)
The basic linear regression model assumes that $$ Y = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p + \epsilon, $$ where $\epsilon$ has mean zero and variance $\sigma^2$. The regressors $X_1, \dots, X_p$ might be the original predictors, or could be transformations and combinations of them, such as interactions or polynomials. The assumptions on the distribution of $\epsilon$ are used to obtain the variance of $\hat \beta$ and hence are important for inference, while the assumption of a linear relationship determines whether our mean function is accurate.
For these examples, we'll consider a simple multivariate setting where the population relationship is nonlinear and we misspecify the model:
nonlinear_pop <- population( x1 = predictor(runif, min = 1, max = 8), x2 = predictor(runif, min = 4, max = 12), y = response(0.7 + 0.8 * x1**2 + 1.2 * x2, family = gaussian(), error_scale = 4.0) ) nonlinear_data <- sample_x(nonlinear_pop, n = 100) |> sample_y() fit <- lm(y ~ x1 + x2, data = nonlinear_data)
We'll use broom::augment()
to get diagnostic information from the fit in a
standardized data frame, making it easy to use ggplot2 to produce different
diagnostic plots.
Residual plots can detect misspecification of the mean function. When we plot the residuals against any linear combination of the regressors, they should have mean zero and constant variance. If they do not, the model may be incorrect.
One such linear combination is the fitted values $\hat Y$. We can obtain the
residuals and fitted values from broom::augment()
:
augment(fit) |> ggplot(aes(x = .fitted, y = .resid)) + geom_point() + geom_smooth(se = FALSE) + labs(x = "Fitted value", y = "Residual")
There is a suspicious trend here, but it is hard to tell what specifically is wrong from the model from this. It would be more helpful to know which predictor is modeled incorrectly.
To plot residuals versus the regressors, the regressinator provides
augment_longer()
, which is like augment()
but converts the data to long form
with one row per predictor per observation, making it easy to facet the
residuals:
augment_longer(fit) |> ggplot(aes(x = .predictor_value, y = .resid)) + geom_point() + geom_smooth(se = FALSE) + facet_wrap(vars(.predictor_name), scales = "free_x") + labs(x = "Predictor", y = "Residual")
The trend in the residuals against x1
makes it clear that something is wrong
with our specification of x1
in the model, which is correct: we modeled it as
linear when it should be quadratic.
Detecting misspecification is a matter of judgment, as the residuals are random
and trends can appear even when the model is well-specified. Using
model_lineup()
, we can compare the true residual plots to several where the
model is correctly specified. Each row gives the plots for one simulation, and
one of the fives rows (at random) is the true residual plots:
model_lineup(fit, fn = augment_longer, n = 5) |> ggplot(aes(x = .predictor_value, y = .resid)) + geom_point() + geom_smooth(se = FALSE) + facet_grid(rows = vars(.sample), cols = vars(.predictor_name), scales = "free_x") + labs(x = "Predictor", y = "Residual")
This gives a sense of the variation to be expected when the model is well-specified. It should still be easy to spot the row containing the plots for our misspecified model.
The partial_residuals()
function fetches partial residuals in a convenient
data frame format. Partial residuals are defined in terms of the original
predictors, not the regressors; we can think of a partial residual plot for a
particular predictor as showing the relationship between that predictor and the
response, after "adjusting for" the other predictors according to our fitted
model. See the function documentation for references on the use and
interpretation of partial residuals.
Here the black line gives the fitted predictor effects (i.e. the model estimate
of the relationship), while the blue line smooths the partial residuals. We can
see that for x1
, the blue line deviates systematically from the black line:
partial_residuals(fit) |> ggplot(aes(x = .predictor_value, y = .partial_resid)) + geom_point() + # partial residuals geom_smooth(se = FALSE) + # smoothed residuals geom_line(aes(x = .predictor_value, y = .predictor_effect)) + # effects facet_wrap(vars(.predictor_name), scales = "free") + labs(x = "Predictor", y = "Partial residual")
The shape of the blue line approximates the shape of the true relationship
between x1
and the response (provided the rest of the model is
well-specified!), so we can use its shape to determine how to change our model.
We can again use model_lineup()
to see how these compare to the partial
residuals when the model is correctly specified:
model_lineup(fit, partial_residuals, n = 5) |> ggplot(aes(x = .predictor_value, y = .partial_resid)) + geom_point() + geom_smooth(se = FALSE) + geom_line(aes(x = .predictor_value, y = .predictor_effect)) + facet_grid(rows = vars(.sample), cols = vars(.predictor_name), scales = "free") + labs(x = "Predictor", y = "Partial residual")
In the plots where the model is correctly specified, the blue and black lines coincide, making it easy to spot the misspecified model.
Partial residuals can also be used to detect interaction terms that should be included in the model. For example, let's define a population in which $X_2$ is a factor variable with two levels, and the slope and intercept of the relationship between $X_1$ and $Y$ depend on the value of $X_2$.
intercepts <- c( "foo" = -0.3, "bar" = 1.7 ) slopes <- c( "foo" = 1.8, "bar" = -0.4 ) interact_pop <- population( x1 = predictor(runif, min = 1, max = 8), x2 = predictor(rfactor, levels = c("foo", "bar")), y = response(by_level(x2, intercepts) + by_level(x2, slopes) * x1, family = gaussian(), error_scale = 4.0) ) interact_data <- sample_x(interact_pop, n = 100) |> sample_y() no_interact_fit <- lm(y ~ x1 + x2, data = interact_data)
We have fit a model that does not account for the interaction, allowing a different intercept for the two groups but forcing the slope of $X_1$ to be the same. If we are concerned the interaction may be present, we can make the conventional partial residual plot, but with the points colored by the value of $X_2$:
partial_residuals(no_interact_fit) |> ggplot(aes(x = .predictor_value, y = .partial_resid, color = x2)) + geom_point() + # partial residuals geom_smooth(se = FALSE) + # smoothed residuals geom_line(aes(x = .predictor_value, y = .predictor_effect, color = NULL)) + # effects facet_wrap(vars(.predictor_name), scales = "free") + labs(x = "Predictor", y = "Partial residual")
Notice the very distinct smoothing lines: the foo
group appears to have a much
steeper slope than the bar
group. This is in fact true in the population. This
may alert us that while our average slope is accurate, and the relationship
appears reasonably linear, the slope may differ for the groups. We could, of
course, produce a lineup to assess if the slope difference is significant, but
it appears fairly large here.
We can now fit a model with the interaction term and generate partial residuals again:
interact_fit <- lm(y ~ x1 * x2, data = interact_data) partial_residuals(interact_fit) |> ggplot(aes(x = .predictor_value, y = .partial_resid, color = x2)) + geom_point() + # partial residuals geom_smooth(se = FALSE) + # smoothed residuals geom_line(aes(x = .predictor_value, y = .predictor_effect, color = NULL)) + # effects facet_wrap(vars(.predictor_name), scales = "free") + labs(x = "Predictor", y = "Partial residual")
In this plot, both groups appear to have the same linear trend, and so adding the interaction appears to have been a good idea.
Note the positive slope in this plot. In the calculation of partial residuals
for $X_1$, all other predictors are set to 0 or their baseline level, so $X_2$
is set to foo
. Hence the overall slope is approximately 1.8, even for points
with $X_2 = \text{bar}$, which can be confusing.
The Cook's distance for an observation represents how much the model fitted values would change if that observation were removed, scaled by the model's mean squared error. A Cook's distance of 1 is often considered a cutoff for a highly influential observation. Returning to our original model fit to nonlinear data, we can produce the Cook's distances:
augment(fit) |> ggplot(aes(x = seq_along(.cooksd), y = .cooksd)) + geom_col() + labs(x = "Row index", y = "Cook's distance")
Note that the Cook's distance measures changes in fitted values. If several regressors are collinear, a small change in an observation may change $\hat \beta$ dramatically but not change the fitted values much. (It can also be interpreted as the Mahalanobis distance between the fitted $\hat \beta$ with and without each observation, where the Mahalanobis distance takes into account the variance of $\hat \beta$; collinear regressors will tend to have high variance in their parameter estimates, so while dropping a covariate may change their entries in $\hat \beta$, this will not cause a great increase in the Mahalanobis distance.)
Q-Q plots are used to see the distribution of the standardized residuals, and compare it to a normal distribution:
augment(fit) |> ggplot(aes(sample = .std.resid)) + geom_qq() + geom_qq_line() + labs(title = "Normal Q-Q plot of standardized residuals", x = "Theoretical quantiles", y = "Observed quantiles")
It is difficult for novices to judge normality from a Q-Q plot, so a lineup can be helpful:
model_lineup(fit) |> ggplot(aes(sample = .std.resid)) + geom_qq() + geom_qq_line() + facet_wrap(vars(.sample)) + labs(title = "Normal Q-Q plot of standardized residuals", x = "Theoretical quantiles", y = "Observed quantiles")
The true Q-Q plot does not stand out here, so we have little evidence for systematically non-normal residuals.
Let's illustrate the use of these diagnostics on a real dataset: the Palmer Penguins data, giving measurements of 344 penguins in the Palmer Archipelago in Antarctica.
library(palmerpenguins)
Let's consider building a model of bill length as a function of flipper length and species:
penguin_1 <- lm(bill_length_mm ~ flipper_length_mm + species, data = penguins)
This model allows a different intercept per species, but not a different slope. Let's plot the partial residuals for flipper length:
partial_residuals(penguin_1, flipper_length_mm) |> ggplot(aes(x = .predictor_value, y = .partial_resid)) + geom_point(aes(color = species)) + geom_smooth(aes(color = species), se = FALSE) + geom_line(aes(y = .predictor_effect)) + labs(x = "Flipper length (mm)", y = "Partial residual", color = "Species")
Here we can see that gentoo penguins tend to have longer flippers than Adélie and chinstrap penguins. Looking at the partial residuals in each group, there appear to be varying slopes: there is a consistent upward trend in the residuals for gentoo penguins relative to the overall trend, for instance. This suggests that perhaps there is an interaction, and the slope of the relationship differs for each species.
Let's fit a new model with an interaction.
penguin_2 <- lm(bill_depth_mm ~ flipper_length_mm * species, data = penguins)
Now the residuals do not seem to have separate trends in each group:
partial_residuals(penguin_2, flipper_length_mm) |> ggplot(aes(x = .predictor_value, y = .partial_resid)) + geom_point(aes(color = species)) + geom_smooth(aes(color = species), se = FALSE) + geom_line(aes(y = .predictor_effect)) + labs(x = "Flipper length (mm)", y = "Partial residual", color = "Species")
We can now proceed to other diagnostics, such as checking for equal variance and normality of the residuals. Let's make a Q-Q plot of the standardized residuals for each species from this fit:
augment(penguin_2) |> ggplot(aes(sample = .std.resid)) + geom_qq() + geom_qq_line() + facet_wrap(vars(species)) + labs(title = "Normal Q-Q plot of standardized residuals", x = "Theoretical quantiles", y = "Observed quantiles")
The residuals appear reasonably normally distributed, but notice the slopes are different for each species, indicating they have different variances. The residual variance appears to be largest for Adélie penguins and smallest for gentoo penguins. If we were using this model purely for prediction, this would not be a concern; but if we intend to conduct tests or produce confidence intervals, this may be an issue.
There are no obvious major outliers either in the partial residual plots or residual Q-Q plots, so we have no reason to suspect there are highly influential observations. Nonetheless, we can check the Cook's distances:
augment(penguin_2) |> ggplot(aes(x = seq_along(.cooksd), y = .cooksd)) + geom_col() + labs(x = "Row index", y = "Cook's distance")
This looks fine. Even the most influential point still has a very small Cook's distance.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.