library(LM2GLMM)
library(lmtest)
knitr::opts_chunk$set(cache = TRUE, cache.path = "./cache_knitr/LM_assumptions/",
                      fig.path = "./fig_knitr/LM_assumptions/", fig.align = "center",
                      fig.width = 4, fig.asp = 1)
options(width = 200)
set.seed(1L)
r .emo("goal")
r .emo("info")
r .emo("info")
The mean of the response variable must be a linear combination of the parameters and the predictor variables, with no systematic dependence on any omitted terms.
Departures from linearity can have many causes and can create all kinds of problems.
r .emo("practice")
r .emo("nerd")
But using this method is not advised as results can be unreliable...
r .emo("info")
set.seed(1L)
Alien0 <- data.frame(humans_eaten = sample(1:100))
Alien0$size <- 10 + 50 * Alien0$humans_eaten - 0.02 * (Alien0$humans_eaten^2) +
  rnorm(100, sd = 5)
mod0a <- lm(size ~ humans_eaten, data = Alien0)
coef(mod0a)
r .emo("info")
plot(size ~ humans_eaten, data = Alien0, pch = 3, col = "red")
abline(mod0a, col = "blue", lwd = 2)
r .emo("practice")
Detection: plot the residuals against each quantitative variable
plot(residuals(mod0a) ~ model.matrix(mod0a)[, "humans_eaten"])
abline(h = 0, col = "red", lty = 2)
r .emo("info")
mod0b <- lm(size ~ poly(humans_eaten, 2, raw = TRUE), data = Alien0) summary(mod0b)$coef
Note: now that the model formula is correct, we obtain sensible estimates.
r .emo("info")
poison$treat <- factor(poison$Treatment)
poison$poison <- factor(poison$Poison)
fit_poison <- lm(Time ~ poison + treat, data = poison)
plot(residuals(fit_poison) ~ fitted(fit_poison),
     xlab = "fitted values", ylab = "residuals")
abline(h = 0, col = "red", lty = 2)
Note: plotting the residuals against the fitted values is also a good way to detect the issue.
r .emo("info")
r .emo("info")
E.g. here there are no quantitative predictors.
r .emo("practice")
car::boxCox(fit_poison) ## makes profile of logLik as a function of lambda
r .emo("practice")
summary(bc <- car::powerTransform(fit_poison))
car::testTransform(bc, lambda = -1) ## specifically compare estimated lambda to -1 (inverse transformation)
Note: we will consider -1 instead of r round(bc$lambda[[1]], 2) as it is close enough and easier to interpret!
r .emo("practice")
fit_poison_bc <- update(fit_poison, car::bcPower(Time, lambda = -1) ~ .)
plot(residuals(fit_poison_bc) ~ fitted(fit_poison_bc),
     xlab = "fitted values", ylab = "residuals")
abline(h = 0, col = "red", lty = 2)
Note: that looks better r .emo("party")
r .emo("practice")
data.for.pred <- expand.grid(treat = levels(poison$treat), poison = levels(poison$poison))
(pred <- cbind(data.for.pred,
               predict(fit_poison_bc, newdata = data.for.pred, interval = "confidence")))
These predicted mean response values are expressed on the Box-Cox scale. In this case, that scale represents the survival rate (see next slide), but you can always get back to the original scale if you need to:
lambda <- -1
(pred$fit * lambda + 1)^(1/lambda)
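As a quick sanity check of the back-transformation formula, here is a self-contained sketch on a single hypothetical value (base R only; with $\lambda = -1$ the Box-Cox scale is the survival rate discussed on the next slide):

```r
lambda <- -1
t_death <- 4                               ## hypothetical time to death
(bc <- (t_death^lambda - 1) / lambda)      ## forward Box-Cox: 0.75 (= survival rate)
(t_back <- (bc * lambda + 1)^(1/lambda))   ## back-transformation recovers 4
```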
r .emo("info")
Relationship between the Box-Cox with $\lambda = -1$ and survival (in discrete time):
With $t$ the time to death and $s$ the survival rate, we have:
$$
\begin{array}{rcl}
\texttt{new.y} & = & \frac{t^{-1} - 1}{-1}\\
& = & 1 - t^{-1}\\
& = & 1 - \frac{1}{t}\\
& = & s
\end{array}
$$
Proof r .emo("proof")
: if the probability of survival is 0.75, how long do individuals live?
$$t = \frac{1}{1-s} = \frac{1}{1 - 0.75} = \frac{1}{0.25} = 4$$
Or by simulation:
set.seed(123)
mean(replicate(10000, which(rbinom(n = 100, size = 1, prob = 0.25) == 1)[1]))
r .emo("info")
summary(fit_poison) ## original (misspecified) model
r .emo("info")
summary(fit_poison_bc) ## BoxCoxed model
r .emo("info")
it does not always solve the issues (e.g. for count data, a GLM is usually best)
it does not work in the presence of null or negative values; you can first shift the response, e.g.
data$y <- data$y + abs(min(data$y)) + 1
or use car::bcnPower()
r .emo("info")
The design matrix must have full rank. That means that the number of parameters to be estimated must be equal to the number of independent columns (i.e. rank) of the design matrix.
Caused by having fewer data points than parameters (n < p) or by linear dependence between the column vectors of the design matrix. In such cases, some parameters cannot be estimated.
r .emo("info")
set.seed(1L)
n <- 3
Alien <- data.frame(humans_eaten = 1:n,
                    flowers_eaten = round(runif(n, min = 1, max = 15)),
                    cactus_eaten = round(runif(n, min = 1, max = 10)))
Alien$size <- rnorm(n = nrow(Alien),
                    mean = 50 + 0.2 * Alien$humans_eaten + 0.9 * Alien$flowers_eaten +
                           0.1 * Alien$cactus_eaten,
                    sd = sqrt(25))
fit_alien1a <- lm(size ~ cactus_eaten + humans_eaten + flowers_eaten, data = Alien)
coef(fit_alien1a)
fit_alien1b <- lm(size ~ humans_eaten + flowers_eaten + cactus_eaten, data = Alien)
coef(fit_alien1b)
r .emo("practice")
set.seed(1L)
Alien2 <- simulate_Aliens()
Alien2$half_humans_eaten <- 0.5 * Alien2$humans_eaten
fit_alien2 <- lm(size ~ humans_eaten + half_humans_eaten, data = Alien2)
coef(fit_alien2)
det(crossprod(model.matrix(fit_alien2))) ## when det(XTX) = 0, XTX has no inverse!
fit_alien2$rank == ncol(model.matrix(fit_alien2))
r .emo("practice")
set.seed(1L)
Alien3 <- data.frame(humans_eaten = 1:12,
                     flowers_eaten = round(runif(12, min = 1, max = 15)),
                     cactus_eaten = 0)
Alien3$food_units <- 1.2 * Alien3$humans_eaten + 0.6 * Alien3$flowers_eaten
Alien3$size <- rnorm(n = 12, mean = 50 + 1 * Alien3$food_units, sd = sqrt(25))
fit_alien3 <- lm(size ~ food_units + humans_eaten + flowers_eaten + cactus_eaten,
                 data = Alien3)
coef(fit_alien3)
caret::findLinearCombos(model.matrix(fit_alien3)) ## Tip: helps to see what creates the issue
r .emo("info")
summary(fit_US <- lm(Rape ~ Assault + Murder, data = USArrests))$coef
summary(fit_US2 <- lm(Rape ~ Murder, data = USArrests))$coef
Note: it is a problem for causal inference, but not so much for prediction.
r .emo("info")
Because of strong correlation(s) between the regressors:
pairs(USArrests)
r .emo("practice")
cor(model.matrix(fit_US)) ## direct measure of correlation in the design matrix
cov2cor(vcov(fit_US))     ## direct measure of correlation between parameter estimates
Note: in more complex models, the numbers do not necessarily match, so it is good practice to check both matrices.
r .emo("practice")
car::vif(fit_US)
R2 <- summary(lm(Assault ~ Murder, data = USArrests))$r.squared ## works too with more variables
1/(1 - R2)
Note: the VIF is computed from the predictors of fit_US only; the response (here rape) plays no role.
r .emo("practice")
pca <- prcomp(~ Assault + Murder, data = USArrests, scale. = TRUE)
USArrests$PC1 <- pca$x[, 1]
summary(fit_US3 <- lm(Rape ~ PC1, data = USArrests))
r .emo("info")
The predictor variables are assumed to be fixed, i.e. measured without error.
Measurement error in the predictors is the main cause of violation. Violations can bias both the estimates and the tests.
r .emo("info")
set.seed(1L)
Alien4 <- simulate_Aliens(100)
summary(lm(size ~ humans_eaten, data = Alien4))$coef
Alien4$humans_eaten_err <- Alien4$humans_eaten + rnorm(nrow(Alien4), sd = 10)
summary(lm(size ~ humans_eaten_err, data = Alien4))$coef
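The downward bias seen here is predictable: with classical measurement error, the OLS slope shrinks towards 0 by the reliability ratio var(x) / (var(x) + var(error)). A self-contained sketch with hypothetical numbers (not the Alien data):

```r
set.seed(1L)
n <- 10000
x <- rnorm(n, sd = 2)                 ## true predictor (variance 4)
y <- 1 + 0.5 * x + rnorm(n)           ## true slope = 0.5
x_err <- x + rnorm(n, sd = 2)         ## predictor measured with error (error variance 4)
slope_obs <- unname(coef(lm(y ~ x_err))["x_err"])
slope_obs                             ## attenuated slope, close to 0.25
0.5 * var(x) / (var(x) + 4)           ## expected: true slope times the reliability ratio
```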
sem
r .emo("practice")
Alien5 <- Alien4 Alien5$human_eaten <- NULL ## we remove the original variable eqns <- sem::specifyEquations(text = " size = alpha*Intercept + slope*humans_eaten humans_eaten = 1*humans_eaten_err V(size) = sigma V(humans_eaten) = 1 V(humans_eaten_err) = 1 ") fit_sem <- sem::sem(eqns, data = Alien5, raw = TRUE, fixed.x = "Intercept") summary(fit_sem, analytic.se = FALSE)$coef ## use analytic.se = FALSE otherwise considers variance known
Note: other solutions are possible using mixed models.
r .emo("info")
The errors (not the residuals) are uncorrelated: $\text{cov}(\epsilon_i, \epsilon_j) = 0$, with $i \neq j$.
A lack of independence (serial autocorrelation) in the errors can appear if there is a departure from linearity, if data have been sampled non-randomly (e.g. spatial or temporal series), or if there is an overarching structure (e.g. repeated measures within individuals, families, species, ...). The lack of independence increases the risk of false positives (sometimes a lot).
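To see why independence matters, here is a small self-contained simulation (a hypothetical setup, not the data used below): the true slope is 0, the errors follow an AR(1) process, and we count how often the slope test rejects at the 5% level.

```r
set.seed(1L)
## no true effect (slope = 0), but errors are serially autocorrelated
pvals <- replicate(500, {
  x <- 1:50
  e <- as.numeric(arima.sim(model = list(ar = 0.8), n = 50))
  y <- 1 + 0 * x + e
  summary(lm(y ~ x))$coef["x", "Pr(>|t|)"]
})
(fp_rate <- mean(pvals < 0.05))  ## well above the nominal 5%
```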
r .emo("practice")
We can use the Durbin-Watson test: the D-W statistic ranges within [0; 4], with 2 indicating no serial autocorrelation.
set.seed(1L)
car::durbinWatsonTest(modConv <- lm(fconvict ~ tfr + partic + degrees + mconvict,
                                    data = Hartnagel), max.lag = 3)
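The lag-1 D-W statistic can be recomputed by hand: it is the sum of squared differences between successive residuals divided by the residual sum of squares, and it is approximately 2(1 - r1), where r1 is the lag-1 autocorrelation of the residuals. A self-contained sketch (toy data, not the Hartnagel model):

```r
set.seed(1L)
x <- 1:50
y <- 1 + 0.2 * x + as.numeric(arima.sim(model = list(ar = 0.5), n = 50))
fit <- lm(y ~ x)
e <- residuals(fit)
(DW <- sum(diff(e)^2) / sum(e^2))                ## matches lmtest::dwtest(fit)$statistic
2 * (1 - sum(e[-1] * e[-length(e)]) / sum(e^2))  ## approximately the same: 2 * (1 - r1)
```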
r .emo("practice")
We can also compute the partial autocorrelations for the residual series:
pacf(residuals(modConv))
Note: mind that the CI plotted here is very approximate.
r .emo("practice")
lmtest::dwtest(modConv, order.by = modConv$model$degrees)
Note: as when investigating linearity, it is good practice to order the residuals by each predictor in turn.
r .emo("practice")
plot(residuals(modConv) ~ fitted(modConv))
abline(h = 0, lty = 2, col = "red")
r .emo("practice")
plot(residuals(modConv) ~ Hartnagel$year, type = "o")
abline(h = 0, lty = 2, col = "red")
r .emo("info")
The variance of the error (not residuals) is constant: $\text{var}(\epsilon_j) = \sigma^2$ for all $j$. With matrix notation: if $\epsilon^\text{T}$ is the vector of all $\epsilon_j$, then we assume $\text{cov}(\epsilon, \epsilon) = \sigma^2I_n$, where $I_n$ is the $n \times n$ identity matrix.
Heteros(c/k)edasticity can emerge when there is a mean-variance relationship, when observations are not independent, or when the reaction norm changes according to the treatment. It can create both false positives and false negatives.
r .emo("practice")
set.seed(1L)
Alien5 <- simulate_Aliens(N = 100)
Alien5$eggs <- rpois(100, lambda = 2 + 1 * Alien5$humans_eaten) ## data generation is NOT gaussian!
fit_alien5 <- lm(eggs ~ humans_eaten, data = Alien5)
bptest(fit_alien5)
Notes:
r .emo("practice")
Residuals must be standardized as raw residuals always have some (minor) heteroscedasticity.
plot(abs(rstandard(fit_alien5)) ~ fitted(fit_alien5))
Note: here again it is good to plot against all quantitative predictors, fitted values, and as the data come.
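For reference, rstandard() simply divides each raw residual by its estimated standard deviation, sigma_hat * sqrt(1 - h_ii), which corrects for the fact that high-leverage points mechanically have smaller raw residuals. A self-contained check on toy data (not the model above):

```r
set.seed(1L)
d <- data.frame(x = 1:20)
d$y <- d$x + rnorm(20)
fit <- lm(y ~ x, data = d)
sigma_hat <- summary(fit)$sigma
r_manual <- residuals(fit) / (sigma_hat * sqrt(1 - hatvalues(fit)))
all.equal(unname(r_manual), unname(rstandard(fit)))  ## TRUE
```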
r .emo("nerd")
vcov(fit_alien5)
car::hccm(fit_alien5) ## corrects the covariance matrix of parameter estimates
estimates <- coef(fit_alien5)
std.errors <- sqrt(diag(car::hccm(fit_alien5)))
t.values <- estimates / std.errors
p.values <- 2 * pt(abs(t.values), df = fit_alien5$df.residual, lower.tail = FALSE)
cbind(estimates, std.errors, t.values, p.values)
r .emo("nerd")
Same using car::Anova():
car::Anova(fit_alien5, white.adjust = TRUE) ## vcov = hccm
35.2490945224257^2 ## t^2 from previous slide
r .emo("info")
The errors (not the residuals) should be normally distributed.
The distribution of the residuals can be skewed; this is often caused by the presence of outliers and/or by a data-generating process that is far from normal (e.g. Poisson, Binomial...).
r .emo("practice")
Example: the Lilliefors (Kolmogorov-Smirnov) test for normality, Shapiro-Wilk Normality Test...
nortest::lillie.test(residuals(fit_poison)) ## stat = 0 when normal
shapiro.test(residuals(fit_poison)) ## stat = 1 when normal
r .emo("practice")
r .emo("party")
par(mfrow = c(2, 2))
plot(fit_poison)
r .emo("info")
Outliers are observations that do not seem to belong with the others.
A few very deviant points can strongly influence all your estimations.
It depends... but never trash them blindly.
r .emo("warn")
If you have very good reasons to take them out, do mention it in the paper!
r .emo("practice")
Note: to identify a given point on the plot, call with(Davis, identify(height, weight, row.names(Davis))), then click close to the point and press escape.
r .emo("info")
influence.measures(fit_davis) ## In the next slides we will see all that in detail!
r .emo("info")
A regression outlier is an observation that has an unusual value of the dependent variable Y, conditional on X.
Regression outliers may not look like outliers on any Y or X variable considered separately.
The leverage quantifies how unusual one row of the design matrix is.
A high leverage is not necessarily a bad thing.
An observation is influential if it strongly influences the predicted values.
A high influence is not necessarily a bad thing.
We do not want high influence caused by high leverage!
r .emo("practice")
This is done by extracting the diagonal elements of the hat matrix: the hat values.
Recall: $\widehat{Y} = HY$
head(sort(hatvalues(fit_davis), decreasing = TRUE))
Note r .emo("nerd"): for some computations and plots the leverages are rescaled as $\frac{h_{i,i}}{1-h_{i,i}}$, but it does not change the reasoning.
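The hat values can also be recomputed from scratch as the diagonal of $H = X(X^\text{T}X)^{-1}X^\text{T}$; their sum (the trace of $H$) equals the number of parameters. A self-contained check on toy data (not the Davis example):

```r
set.seed(1L)
d <- data.frame(x = rnorm(20))
d$y <- 1 + 2 * d$x + rnorm(20)
fit <- lm(y ~ x, data = d)
X <- model.matrix(fit)
H <- X %*% solve(crossprod(X)) %*% t(X)             ## hat matrix: Y_hat = H %*% Y
all.equal(unname(diag(H)), unname(hatvalues(fit)))  ## TRUE
sum(diag(H))                                        ## = 2, the number of parameters
```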
r .emo("practice")
plot(fit_davis, which = 5)
r .emo("practice")
head(sort(dffits(fit_davis), decreasing = TRUE))
head(sort(cooks.distance(fit_davis), decreasing = TRUE))
Notes:
r .emo("nerd")
plot(dffits(fit_davis2), cooks.distance(fit_davis2), xlab = "DFFITS", ylab = "Cook's distance")
r .emo("practice")
par(mfrow = c(1, 3))
plot(fit_davis, which = 4:6)
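Cook's distance ties the previous diagnostics together: it can be recomputed from the standardized residuals and the leverages as $D_i = \frac{r_i^2}{p} \times \frac{h_{i,i}}{1-h_{i,i}}$, with $p$ the number of parameters. A self-contained check on toy data (not the Davis example):

```r
set.seed(1L)
d <- data.frame(x = rnorm(20))
d$y <- 1 + 2 * d$x + rnorm(20)
fit <- lm(y ~ x, data = d)
p <- length(coef(fit))
h <- hatvalues(fit)
D_manual <- rstandard(fit)^2 * h / (p * (1 - h))          ## Cook's distance by hand
all.equal(unname(D_manual), unname(cooks.distance(fit)))  ## TRUE
```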
r .emo("practice")
fit_UK <- lm(height ~ sex * milk, data = UK[1:20, ])
par(mfrow = c(1, 3))
plot(fit_UK, which = 4:6)
r .emo("practice")
fit_UK2 <- lm(height ~ sex * milk, data = UK)
par(mfrow = c(1, 3))
plot(fit_UK2, which = 4:6)
r .emo("practice")
head(dfbeta(fit_davis), n = 3)
coef(fit_davis) - coef(update(fit_davis, data = Davis[-1, ]))
head(dfbetas(fit_davis), n = 3) ## same in SE units of the coef
r .emo("practice")
head(sort(covratio(fit_davis), decreasing = TRUE))
det(vcov(update(fit_davis, data = Davis[-19, ]))) / det(vcov(fit_davis))
r .emo("practice")
Note: stars flag candidate outliers, but do check them carefully: they will almost always flag some observations, even when these are not problematic for your analysis.
influence.measures(fit_davis) ## stars are just there to attract your attention, there are no proper tests!
r .emo("practice")
Tip: if your dataset is large, you can select the rows flagged with the stars as follows:
influence_results <- influence.measures(fit_davis)
influence_results$infmat[rowSums(influence_results$is.inf) > 0, , drop = FALSE]
Note: drop = FALSE prevents the output from being automatically turned into a vector when there is only one flagged outlier; such a vector would lose the row name and with it the identity of the outlier.
r .emo("goal")