check.resid | R Documentation |
This function performs residual diagnostics for linear models estimated with
the lm() function. It helps detect nonlinearity (partial residual or
component-plus-residual plots), nonconstant error variance (predicted values
vs. residuals plot), and non-normality of residuals (Q-Q plot and histogram
with density plot).
check.resid(model, type = c("linear", "homo", "normal"),
resid = c("unstand", "stand", "student"),
point.shape = 21, point.fill = "gray80", point.size = 1,
line1 = TRUE, line2 = TRUE,
line.type1 = "solid", line.type2 = "dashed",
line.width1 = 1, line.width2 = 1,
line.color1 = "#0072B2", line.color2 = "#D55E00",
bar.width = NULL, bar.n = 30, bar.color = "black",
bar.fill = "gray95", strip.size = 11,
label.size = 10, axis.size = 10,
xlimits = NULL, ylimits = NULL,
xbreaks = ggplot2::waiver(), ybreaks = ggplot2::waiver(),
check = TRUE, plot = TRUE)
model |
a fitted model of class |
type |
a character string specifying the type of the plot, i.e.,
|
resid |
a character string specifying the type of residual used for
the partial (component-plus-residual) plots or Q-Q plot and
histogram, i.e., |
point.shape |
a numeric value for specifying the argument |
point.fill |
a character string or numeric value for specifying the argument |
point.size |
a numeric value for specifying the argument |
line1 |
logical: if |
line2 |
logical: if |
line.type1 |
a character string or numeric value for specifying the argument
|
line.type2 |
a character string or numeric value for specifying the argument
|
line.width1 |
a numeric value for specifying the argument |
line.width2 |
a numeric value for specifying the argument |
line.color1 |
a character string or numeric value for specifying the argument
|
line.color2 |
a character string or numeric value for specifying the argument
|
bar.width |
a numeric value for specifying the argument |
bar.n |
a numeric value for specifying the argument |
bar.color |
a character string or numeric value for specifying the argument
|
bar.fill |
a character string or numeric value for specifying the argument
|
strip.size |
a numeric value for specifying the argument |
label.size |
a numeric value for specifying the argument |
axis.size |
a numeric value for specifying the argument |
xlimits |
a numeric value for specifying the argument |
ylimits |
a numeric value for specifying the argument |
xbreaks |
a numeric value for specifying the argument |
ybreaks |
a numeric value for specifying the argument |
check |
logical: if |
plot |
logical: if |
The violation of the assumption of linearity implies that the model cannot
accurately capture the systematic pattern of the relationship between the
outcome and predictor variables. In other words, the specified regression
surface does not accurately represent the relationship between the conditional
mean values of Y and the Xs. That means the average error E(\varepsilon) is
not 0 at every point on the regression surface (Fox, 2016).
In multiple regression, plotting the outcome variable Y against each predictor
variable X can be misleading because it does not reflect the partial
relationship between Y and X (i.e., statistically controlling for the other
Xs), but rather the marginal relationship between Y and X (i.e., ignoring the
other Xs). Partial residual plots or component-plus-residual plots should be
used to detect nonlinearity in multiple regression. The partial residual for
the jth predictor variable is defined as
e_i^{(j)} = b_jX_{ij} + e_i
The linear component of the partial relationship between Y and X_j is added
back to the least-squares residuals, which may include an unmodeled nonlinear
component. Then, the partial residual e_i^{(j)} is plotted against the
predictor variable X_j. Nonlinearity may become apparent when a non-parametric
regression smoother is applied.
By default, the function plots the partial residuals against each predictor and adds a linear regression line and a loess smooth line to each partial residual plot.
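As a rough illustration of the partial residual definition above, the following base R sketch computes e_i^{(j)} = b_jX_{ij} + e_i by hand for one predictor. It uses the airquality model from the examples on this page; the object names (pres.wind, pres.ref, slope) are illustrative only, and this is not necessarily how check.resid() computes its plot data internally.

```r
# Fit the model used in the examples (rows with missing values are dropped)
mod <- lm(Ozone ~ Solar.R + Wind + Temp, data = airquality)

X <- model.frame(mod)   # data actually used for fitting
e <- residuals(mod)     # least-squares residuals e_i
b <- coef(mod)          # regression coefficients b_j

# Partial residual for the predictor Wind: b_j * X_ij + e_i
pres.wind <- b["Wind"] * X$Wind + e

# residuals(..., type = "partial") returns the same quantity, except that
# the linear component is centered at its mean
pres.ref <- residuals(mod, type = "partial")[, "Wind"]
pres.wind.centered <- pres.wind - mean(b["Wind"] * X$Wind)
all.equal(unname(pres.wind.centered), unname(pres.ref))

# Regressing the partial residuals on the predictor recovers the slope b_j
slope <- coef(lm(pres.wind ~ X$Wind))[2]
all.equal(unname(slope), unname(b["Wind"]))
```

Because the least-squares line through the partial residuals has slope b_j, any systematic departure of a smoother from that line points to nonlinearity in the partial relationship.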
The violation of the assumption of constant error variance, often referred to as heteroscedasticity, implies that the variance of the outcome variable around the regression surface is not the same at every point on the regression surface (Fox, 2016).
Plotting residuals against the outcome variable Y instead of the predicted
values \hat{Y} is not recommended because Y = \hat{Y} + e. Consequently, the
linear correlation between the outcome variable Y and the residuals e is
\sqrt{1 - R^2}, where R is the multiple correlation coefficient. In contrast,
plotting residuals against the predicted values \hat{Y} makes it much easier
to detect nonconstant error variance because the correlation between \hat{Y}
and e is 0. Note that the least-squares residuals generally have unequal
variances, Var(e_i) = \sigma^2 (1 - h_i), where h_i is the leverage of
observation i, even if the errors have constant variance \sigma^2. The
studentized residuals e^*_i, however, have constant variance under the
assumptions of the regression model. Residuals are studentized by dividing
them by \hat{\sigma}_{(-i)} \sqrt{1 - h_i}, where \hat{\sigma}^2_{(-i)} is the
estimate of \sigma^2 obtained after deleting the ith observation, and h_i is
the leverage of observation i (Meuleman et al., 2015).
By default, the function plots the studentized residuals against the predicted values. It also draws a horizontal line at 0, a loess smooth line for all residuals, and separate loess smooth lines for positive and negative residuals.
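The relationships described above can be checked numerically. The following base R sketch, using the airquality model from the examples, verifies the two correlations and reproduces the studentized residuals by hand from the leverages and the leave-one-out estimates of the residual standard deviation. Object names such as cor.ye and e.stud are illustrative, and the sketch makes no claim about the internals of check.resid().

```r
mod <- lm(Ozone ~ Solar.R + Wind + Temp, data = airquality)

y    <- model.response(model.frame(mod))  # outcome values used for fitting
yhat <- fitted(mod)                       # predicted values
e    <- residuals(mod)                    # least-squares residuals
r2   <- summary(mod)$r.squared            # squared multiple correlation

# Correlation of residuals with Y is sqrt(1 - R^2); with Y-hat it is 0
cor.ye    <- cor(y, e)
cor.yhate <- cor(yhat, e)
all.equal(cor.ye, sqrt(1 - r2))

# Studentized residuals: e_i / (sigma_(-i) * sqrt(1 - h_i))
h         <- hatvalues(mod)         # leverages h_i
sigma.del <- influence(mod)$sigma   # residual SD estimate with case i deleted
e.stud    <- e / (sigma.del * sqrt(1 - h))
all.equal(unname(e.stud), unname(rstudent(mod)))
```

The nonzero correlation between Y and e is why a plot of residuals against the observed outcome shows a trend even when the model is correctly specified.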
Statistical inference under the violation of the assumption of normally distributed errors is approximately valid in all but small samples. However, the efficiency of least squares is not robust because the least-squares estimator is the most efficient unbiased estimator only when the errors are normally distributed. For instance, when error distributions have heavy tails, the least-squares estimator becomes much less efficient than robust estimators. In addition, heavy-tailed error distributions give rise to outliers, and highly skewed error distributions compromise the interpretation of conditional means because the mean is not an accurate measure of central tendency in a highly skewed distribution. Moreover, a multimodal error distribution suggests the omission of one or more discrete explanatory variables that naturally divide the data into groups (Fox, 2016).
By default, the function draws a Q-Q plot and a histogram with a density plot
of the unstandardized residuals. Note that studentized residuals follow a
t-distribution with n - k - 2 degrees of freedom, where n is the sample size
and k is the number of predictors. However, the normal and t-distributions are
nearly identical unless the sample size is small. Moreover, even if the model
is correct, the studentized residuals are not an independent random sample
from t_{n - k - 2}. Residuals are correlated with each other depending on the
configuration of the predictor values. The correlation is generally negligible
unless the sample size is small.
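To make the comparison between the normal and the t reference distribution concrete, the following base R sketch computes the degrees of freedom n - k - 2 for the airquality model from the examples and compares the theoretical quantiles of both distributions. The names q.t, q.norm, and max.diff are illustrative, and the optional plot is a plain base R Q-Q plot, not the ggplot2 output of check.resid().

```r
mod <- lm(Ozone ~ Solar.R + Wind + Temp, data = airquality)

n  <- nobs(mod)               # sample size after dropping missing values
k  <- length(coef(mod)) - 1   # number of predictors (intercept excluded)
df <- n - k - 2               # degrees of freedom for studentized residuals

# Theoretical quantiles under the t and the normal reference distribution
p      <- ppoints(n)
q.t    <- qt(p, df = df)
q.norm <- qnorm(p)

# Largest absolute difference between the two sets of quantiles
max.diff <- max(abs(q.t - q.norm))

# Q-Q plot of the studentized residuals against t quantiles
# (drawn only in an interactive session)
if (interactive()) {
  plot(q.t, sort(rstudent(mod)),
       xlab = "Theoretical t quantiles", ylab = "Studentized residuals")
  abline(0, 1)
}
```

With a sample of this size, the two sets of quantiles differ only slightly, and only in the extreme tails, so the choice of reference distribution has little practical effect on the Q-Q plot.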
Returns an object of class misty.object, which is a list with the following
entries:
call |
function call |
type |
type of analysis |
model |
model specified in |
plotdat |
data frame used for the plot |
args |
specification of function arguments |
plot |
ggplot2 object for plotting the residuals |
Takuya Yanagida takuya.yanagida@univie.ac.at
Fox, J. (2016). Applied regression analysis and generalized linear models (3rd ed.). Sage Publications, Inc.
Meuleman, B., Loosveldt, G., & Emonds, V. (2015). Regression analysis: Assumptions and diagnostics. In H. Best & C. Wolf (Eds.), The SAGE handbook of regression analysis and causal inference (pp. 83-110). Sage.
check.collin, check.outlier
## Not run:
#-------------------------------------------------------------------------------
# Residual diagnostics for a linear model
mod <- lm(Ozone ~ Solar.R + Wind + Temp, data = airquality)
# Example 1: Partial (component-plus-residual) plots
check.resid(mod, type = "linear")
# Example 2: Predicted values vs. residuals plot
check.resid(mod, type = "homo")
# Example 3: Q-Q plot and histogram with density plot
check.resid(mod, type = "normal")
#-------------------------------------------------------------------------------
# Extract data and ggplot2 object
object <- check.resid(mod, type = "linear", plot = FALSE)
# Data frame
object$plotdat
# ggplot object
object$plot
## End(Not run)