library(LM2GLMM) library(car) knitr::opts_chunk$set(cache = FALSE, fig.align = "center", fig.width = 6, fig.height = 6, cache.path = "./cache_knitr/LM_practice_ex/", fig.path = "./fig_knitr/LM_practice_ex/")
In this vignette I illustrate the key steps required to solve the exercises. By no means I am trying to provide a detailed report of the analysis of each dataset. Any such report may need additional analyses and results would have to be written in good English, not in telegraphic style or lines of codes.
r .emo("goal")
Test whether diet significantly affects the weights of six weeks old chickens.
r .emo("goal")
Compare the effect of different feed supplement types.
r .emo("goal")
Plot mean effect for each feed supplement types showing 95% confidence intervals.
We always start by looking at the data we have:
str(chickwts) plot(weight ~ feed, data = chickwts)
I don't know much about chicken but the values seem right.
We look at the distribution of chicken among the sources of feed:
table(chickwts$feed)
We check if there is any missing data:
any(is.na(chickwts))
Great! There is no missing data.
We fit the model:
(fit_chick <- lm(weight ~ feed, data = chickwts))
Let's have a look at the residuals of the model:
par(mfrow = c(2, 2)) plot(fit_chick)
Everything looks very good. The errors should thus be independent, homoscedastic and normally distributed.
Let's check potential outliers:
infl_res <- influence.measures(fit_chick) infl_res$infmat[rowSums(infl_res$is.inf) > 0, , drop = FALSE] plot(fit_chick, which = 4)
Nothing too bad here, just one observation (54) a little different from the rest in its group (meatmeal):
chickwts[chickwts$feed == "meatmeal", ]
Notes:
anova(fit_chick)
The diet seems to influence significantly the growth of chicken (F = r round(anova(fit_chick)$F[1], 2)
, df1 = r anova(fit_chick)$Df[1]
, df2 = r anova(fit_chick)$Df[2]
, p-value < 0.001).
pred <- predict(fit_chick, newdata = data.frame(feed = levels(chickwts$feed)), interval = "confidence") pred <- as.data.frame(pred) pred$feed <- levels(chickwts$feed) rownames(pred) <- pred$feed ## for automatic names in barplot pred
Let's plot this:
plot(pred[, "fit"] ~ I(1:nrow(pred) + 0.05), axes = FALSE, pch = 20, ylim = range(fit_chick$model$weight), ylab = "Chicken weight (gr)", xlab = "Diet") axis(1, 1:nrow(pred), labels = rownames(pred)) axis(2) box() for (row in 1:nrow(pred)) { arrows(x0 = row + 0.05, y0 = pred[row, "lwr"], x1 = row + 0.05, y1 = pred[row, "upr"], code = 3, angle = 90, length = 0.1) diet <- levels(chickwts$feed)[row] weights.to.plot <- chickwts$weight[chickwts$feed == diet] points(rep(row, length(weights.to.plot)) - 0.05, weights.to.plot, col = "grey") }
The plot shows the predicted mean weight of chicken for each diet (black) with associated 95% confidence intervals. The observed weights of the chicken included in the analysis are represented by grey dots.
To make the plot even better we could have sorted the diet by their effects and separating points by adding some jitter so that points overlapping on top of each others becomes more spread out.
We can also use {ggplot2} if you prefer:
library(ggplot2) ggplot() + geom_point(aes(y = fit, x = forcats::fct_reorder(feed, fit, .desc = TRUE)), data = pred) + geom_errorbar(aes( x = forcats::fct_reorder(feed, fit, .desc = TRUE), min = lwr, ymax = upr), , data = pred, width = 0.2) + geom_jitter(aes(y = weight, x = feed), height = 0, width = 0.1, shape = 1, colour = "blue", data = chickwts, inherit.aes = FALSE) + labs(x = "Diet", y = "Weight of 6 weeks old chicken (gr)") + theme_bw()
summary(posthoc <- multcomp::glht(fit_chick, linfct = multcomp::mcp(feed = "Tukey"))) par(oma = c(0, 8, 0, 0)) plot(posthoc)
Instead of comparing all diets to each others, we can also choose specific comparisons so as to save some statistical power:
library(multcomp) comp <- glht(model = fit_chick, linfct = mcp(feed = c("sunflower - casein == 0", "sunflower - horsebean == 0", "sunflower - linseed == 0", "sunflower - meatmeal == 0", "sunflower - soybean == 0"))) summary(comp) old_par <- par(mar = c(5, 10, 4, 2)) plot(comp) par(old_par)
We clean the session for the next exercise:
rm(list = ls())
r .emo("goal")
Find out if the acid concentration influences the loss of ammonia by the plants.
str(stackloss) summary(stackloss) pairs(stackloss) any(is.na(stackloss))
We will consider all predictors since water and air flow seem to be strong determinants of the loss of ammonia in plants.
fit_stack <- lm(stack.loss ~ ., data = stackloss)
Note: the .
is a shortcut to include all predictors at once.
Let’s examine the linearity assumption by plotting the residuals against each column of the design matrix (but the intercept):
for (col in 2:ncol(model.matrix(fit_stack))) { plot(rstandard(fit_stack) ~ model.matrix(fit_stack)[, col]) title(main = colnames(model.matrix(fit_stack))[col]) }
It looks messy but there is no obvious pattern suggesting a particular transformation of the predictors.
cor(model.matrix(fit_stack)[ , -1]) ## we discard the intercept for that cov2cor(vcov(fit_stack)) vif(fit_stack)
There are some multicollinearity problems but it does not seem to impact much the effect of acidity which is our focus. Note that the very strong correlation between slopes and the intercept are to be expected, as well as the one between a variable and itself.
We will assume that no there is no measurement error and will focus on the analysis of the residuals of the model:
par(mfrow = c(2, 2)) plot(fit_stack)
That is not fantastic but it is not specifically worse than what we may get under the assumption of LM with so few data. We can check this by simulating data under the assumption of LM and comparing the plots we get:
par(mfrow = c(2, 2)) plot(update(fit_stack, as.matrix(simulate(fit_stack)) ~ .)) ## simulation 1 plot(update(fit_stack, as.matrix(simulate(fit_stack)) ~ .)) ## simulation 2
It does not look particularly better.
We can also use the package {DHARMa} which compares your residuals to many simulations under the data generating process hypothesised (more details in GLM course):
DHARMa::simulateResiduals(fit_stack, plot = TRUE)
It is hard to be sure due to the size of the dataset but we will consider the assumptions fulfilled.
Because the model involves several predictors, before anything else we need to make sure that the overall model offers a higher goodness of fit than the null model:
fit_stack_H0 <- update(fit_stack, . ~ 1) anova(fit_stack_H0, fit_stack)
All good: the model predicts the data significantly better than the null model.
summary(fit_stack)$coef["Acid.Conc.", ]
The acid seems to have no significant influence on the loss of ammonia (t = r round(summary(fit_stack)$coef["Acid.Conc.", 3], 2)
, df = r fit_stack$df.residual
, p-value = r round( summary(fit_stack)$coef["Acid.Conc.", 4], 2)
).
Let's plot the effect very simply using the package {effects}.
plot(effects::predictorEffect("Acid.Conc.", mod = fit_stack))
Because we are specifically interested in the effect of the acid and that this effect may depend on the other predictors, we could check if considering interactions between the acidity and other predictors would improve the goodness of fit of the model:
fit_stack_int <- lm(stack.loss ~ Acid.Conc. * (Air.Flow + Water.Temp), data = stackloss) anova(fit_stack, fit_stack_int) vif(fit_stack_int)
The goodness of fit is not significantly better and there are huge multicollinearity issues, so let's not go there.
We clean the session for the next exercise:
rm(list = ls())
We clean the session for the next exercise:
rm(list = ls())
r .emo("goal")
Plot the influence of potential determinants of fertility.
r .emo("goal")
Test the effect of different predictors on fertility.
str(swiss) summary(swiss) pairs(swiss) any(is.na(swiss))
fit_swiss <- lm(Fertility ~ ., data = swiss)
Let's examine the linearity assumption by plotting the (standardised) residuals against each column of the design matrix (but the intercept):
for (col in 2:ncol(model.matrix(fit_swiss))) { plot(rstandard(fit_swiss) ~ model.matrix(fit_swiss)[, col]) title(main = colnames(model.matrix(fit_swiss))[col]) }
No obvious non-linear pattern emerges.
We can look at the correlation between the column vectors of the model matrix:
cor(model.matrix(fit_swiss)[, -1])
We can also look at the correlation between parameter estimates:
cov2cor(vcov(fit_swiss))
There are some noticeable correlations between predictors and between parameter estimates. Let's double check the extent to which multicollinearity could be an issue by looking at the variance inflation factors:
vif(fit_swiss)
The situation is not ideal but nothing looks catastrophic. Let's see if the situation gets better once the variable Examination
is omitted:
fit_swiss_bis <- update(fit_swiss, . ~ . - Examination) vif(fit_swiss_bis)
The situation is much better so we will study predictions stemming from the fit with Examination
and from the fit without it, to examine the robustness of our results.
Because we do not know on how many individuals the percentages have been based, we will assume that they did so on large sample and that this assumption is fulfilled.
par(mfrow = c(2, 2)) plot(fit_swiss) plot(fit_swiss_bis)
Both model fits look alright (even if not super great).
Because the model involves several predictors, before anything else we need to make sure that the overall model offers a higher goodness of fit than the null model:
fit_swiss_H0 <- update(fit_swiss, . ~ 1) anova(fit_swiss_H0, fit_swiss) anova(fit_swiss_H0, fit_swiss_bis)
Both of our models fit the data clearly better than the null model.
Because this is a simple LM, you can use the function predictorEffects()
from the package {effects} to do that automatically:
plot(effects::predictorEffects(fit_swiss)) plot(effects::predictorEffects(fit_swiss_bis))
Many other packages can be used to draw similar plots (although details on how predictions are computed vary). One example is {visreg}:
par(mfrow = c(2, 2)) plot(visreg::visreg(fit_swiss_bis))
Both models predict the same overall trends. To illustrate more precisely the difference in predicted values between the two models, one would have to compute the predictions by hand and plot them. Let's do that to illustrate how the prediction for the mean effect of education differ between models:
data.for.pred <- data.frame("Agriculture" = mean(swiss$Agriculture), "Examination" = mean(swiss$Examination), "Education" = seq(min(swiss$Education), max(swiss$Education), length = 30), "Catholic" = mean(swiss$Catholic), "Infant.Mortality" = mean(swiss$Infant.Mortality) ) pred1 <- predict(fit_swiss, newdata = data.for.pred, interval = "confidence") pred2 <- predict(fit_swiss_bis, newdata = data.for.pred, interval = "confidence") pred1_df_temp <- as.data.frame(pred1) ## to turn the matrix into a data frame pred2_df_temp <- as.data.frame(pred2) ## to turn the matrix into a data frame pred1_df <- cbind(data.for.pred, pred1_df_temp) ## we add the new data pred2_df <- cbind(data.for.pred, pred2_df_temp) ## we add the new data
plot(fit ~ Education, data = pred1_df, col = "blue", lwd = 2, type = "l", ylab = "Predicted Standardized Fertility (+/- 95% CI)", xlab = "Education (%)") points(lwr ~ Education, data = pred1_df, col = "blue", lwd = 2, lty = 2, type = "l") points(upr ~ Education, data = pred1_df, col = "blue", lwd = 2, lty = 2, type = "l") points(fit ~ Education, data = pred2_df, col = "orange", lwd = 2, type = "l") points(lwr ~ Education, data = pred2_df, col = "orange", lwd = 2, lty = 2, type = "l") points(upr ~ Education, data = pred2_df, col = "orange", lwd = 2, lty = 2, type = "l") legend("topright", fill = c("blue", "orange"), legend = c("fit_swiss", "fit_swiss_bis"), title = "Model:", bty = "n")
Note that while predicting the effect of Education
we set all other covariates to their mean although other choices are possible.
Same with {ggplot2}:
full_pred <- rbind(cbind(pred1_df, mod = "fit_swiss"), cbind(pred2_df, mod = "fit_swiss_bis")) ## we combine all the data full_pred ggplot(full_pred) + aes(fill = mod, x = Education) + geom_ribbon(aes(ymax = upr, ymin = lwr), alpha = 0.3) + geom_line(aes(y = fit, colour = mod)) + labs(y = "Predicted Standardized Fertility (+/- 95% CI)", x = "Education (%)") + theme_minimal()
We clean the session for the next exercise:
rm(list = ls())
r .emo("goal")
Test whether insecticide type significantly affects the number of insects in agricultural experimental plots.
r .emo("goal")
What is the mean number of insects we would predict to find on a plot treated with insecticide C?
str(InsectSprays) range(InsectSprays$count) table(InsectSprays$spray) plot(count ~ spray, data = InsectSprays) any(is.na(InsectSprays))
(fit_insect <- lm(count ~ spray, data = InsectSprays))
As for the dataset chickwts
, because we only study the effect of one qualitative predictor, we expect linearity and no multicollinearity, and that the values for the predictor are fixed (as assumed). We can thus directly turn to the analysis of the prediction errors:
par(mfrow = c(2, 2)) plot(fit_insect)
That does not look so good. There seem to be some heteroscedasticity in the data. Let's test this possibility:
lmtest::bptest(fit_insect)
The Breusch-Pagan test shows that we can reject the assumption of homoscedasticity. One option would be to try to use GLM but we have not yet seen this, so let's try to reduce the problems and apply a LM fit anyhow. The first thing top try to solve the problems is to use a power transformation of the response variable.
For the Box Cox transformation to work, we need all the values to be strictly positive, but we saw that the minimum value is zero, so let's start by adding one to every count and let's refit the model.
InsectSprays$count_bis <- InsectSprays$count + 1 fit_insect_bis <- update(fit_insect, count_bis ~ .)
We can now start the Box Cox analysis:
car::boxCox(fit_insect_bis) summary(bc <- car::powerTransform(fit_insect_bis))
The estimated value for lambda is close to 1/3 which corresponds to a cube root exponent to the observation but this is not particularly meaningful so we will stick to the best estimate for lambda which is r bc$lambda
and stored in the object bc$lambda
.
Let's fit the model on the transformed data and display the new diagnostic plots:
InsectSprays$count_bc <- car::bcPower(InsectSprays$count_bis, bc$lambda) fit_insect_bc <- update(fit_insect_bis, count_bc ~ .) par(mfrow = c(2, 2)) plot(fit_insect_bc)
That looks fantastic! The heteroscedasticity is gone:
lmtest::bptest(fit_insect_bc)
The lack of dependence should not be an issue (although the original paper where these data have been published showed that the experimental design was not as simple as the dataset here suggests), but let's check anyhow:
lmtest::dwtest(fit_insect_bc) car::durbinWatsonTest(fit_insect_bc)
Independence is not rejected, irrespective of whether doing a unilateral or bilateral test.
The normality looked good on the plots but we could also test it:
nortest::lillie.test(residuals(fit_insect_bc)) shapiro.test(residuals(fit_insect_bc))
Again, no issue here.
Note: once again, because there is only one predictor, the following test also corresponds to testing the fitted model against the fit of a null model.
anova(fit_insect_bc) ## again, same here as car::Anova as only one predictor
The type of insecticide seems to influence significantly the number of insects counted (F = r round(anova(fit_insect_bc)$F[1], 2)
, df1 = r anova(fit_insect_bc)$Df[1]
, df2 = r anova(fit_insect_bc)$Df[2]
, p-value < 0.001).
(meanC_BC <- predict(fit_insect_bc, newdata = data.frame(spray = "C"), interval = "confidence"))
This corresponds to the number of insects + 1 on the Box Cox scale. Thus we need to transform this result back on the original scale:
(meanC <- ((meanC_BC * bc$lambda) + 1)^(1/bc$lambda) - 1) car::bcnPowerInverse(meanC_BC, lambda = bc$lambda, gamma = 0) - 1 ## same thing with less maths
We thus expect to find on average (95%CI) r round(meanC[1, "fit"], 2)
(r round(meanC[1, "lwr"], 2)
-- r round(meanC[1, "upr"], 2)
) insects on agricultural experimental units treated with the insecticide C.
Let's use this exercise to illustrate how you can "easily" plot the effect of the different sprays using {effects} despite the BoxCox transformation:
f.inverse <- function(x) car::bcnPowerInverse(x, lambda = bc$lambda, gamma = 0) - 1 predict_spray <- effects::predictorEffect("spray", fit_insect_bc) plot(predict_spray, axes = list(y = list(transform = list(inverse = f.inverse), type = "response", lab = "Count (UnBoxCoxed)", lim = c(0, 25)), grid = TRUE))
That also works with {visreg}:
visreg::visreg(fit_insect_bc, trans = f.inverse, ylab = "Count (UnBoxCoxed)", gg = TRUE) + scale_y_continuous(limits = c(0, 25)) + theme_bw()
For GLM, we will see later that {effects} and {visreg} have issues, but these packages work fine for LM. They do differ however about what they assume for non focal effect (more on that during the GLM course), although here it has no effect since we only have considered a single predictor.
We clean the session for the next exercise:
rm(list = ls())
r .emo("goal")
Predict the proportion of chickens larger than 300g for each feed supplements.
fit_chick <- lm(weight ~ feed, data = chickwts)
We have already studied this model fit above, so we won't do it again here.
We start by predicting the weight of chicken of all type of feed:
(pred2 <- predict(fit_chick, newdata = data.frame(feed = levels(chickwts$feed)), interval = "prediction", se.fit = TRUE))
Note: we need the se.fit
since we use it to compute the distribution of future observations.
Such distribution is gaussian under the assumptions of LM, as a mean predicted by pred2
and SD corresponding to the square root of the response variance:
(se.pred <- sqrt(pred2$se.fit^2 + pred2$residual.scale^2))
We can thus use the distribution function of the normal distribution to get what we want:
freq_more300 <- pnorm(q = 300, mean = pred2$fit[, "fit"], sd = se.pred, lower.tail = FALSE) output <- as.data.frame(cbind(pred2$fit, "%>300gr" = round(freq_more300, 2))) output$feed <- levels(chickwts$feed) output
We can plot the outcome like this:
barplot(sort(output$`%>300gr`), names.arg = output$feed[order(output$`%>300gr`)], ylim = c(0, 1), ylab = "Predicted proportion of chicken larger than 300 gr")
or using {ggplot2} and another tidy package ({forcats}) to reorganise the order of the factor levels:
output$feed2 <- forcats::fct_reorder(output$feed, output$`%>300gr`) ggplot(output) + geom_col(aes(x = feed2, y = `%>300gr`)) + labs(y = "Predicted proportion of chicken larger than 300 gr") + theme_minimal()
We clean the session for the next exercise:
rm(list = ls())
r .emo("goal")
What is the allometric exponent for the growth of brain size with body mass?
r .emo("goal")
Does this exponent agree with the usual expectation of 2/3?
r .emo("goal")
How large do we expect the brain of a 1kg animal to be?
r .emo("goal")
Rank organisms by relative brain size (i.e. controlled for body size)
r .emo("goal")
Repeat the analysis with the dataset Animals
head(mammals) str(mammals) plot(brain ~ body, data = mammals) plot(log(brain) ~ log(body), data = mammals, pch = 20, col = "blue") text(log(mammals$body), log(mammals$brain) + 0.4, labels = rownames(mammals), cex = 0.8)
This is the kind of things for which knowing {ggplot2} and its companion packages {ggrepel} and {scales} helps a little:
ggplot(mammals) + aes(y = brain, x = body, label = rownames(mammals)) + geom_point() + #geom_smooth() + ## to show regression smoother ggrepel::geom_text_repel(max.overlaps = Inf) + scale_x_continuous(trans = "log10", labels = scales::label_number()) + scale_y_continuous(trans = "log10", labels = scales::label_number()) + theme_minimal()
mammals$log_brain <- log(mammals$brain) mammals$log_body <- log(mammals$body)
Note: you can use any log transformation, it will make no difference for the allometry coefficient and for the untransformed fitted values (as long as you untransform the data correctly and as long as you use the same transformation for x
and y
).
fit <- lm(log_brain ~ log_body, data = mammals) plot(fit)
There seem to be some outliers but otherwise the diagnostics look good.
So let's examine the outliers in more details:
im <- influence.measures(fit) im$infmat[rowSums(im$is.inf) > 0, , drop = FALSE] ## extract candidate outliers plot(fit, which = 4)
Humans seem to be a little problematic because they exert a strong effect on the slope (check column dfb.lg_b
) and thus on the estimate of the allometry relationship... so let's refit the model without them:
summary(fit) ## fit with humans fit_nohuman <- lm(log_brain ~ log_body, data = mammals[rownames(mammals) != "Human", ]) summary(fit_nohuman) ## fit without humans plot(fit_nohuman) im2 <- influence.measures(fit_nohuman) im2$infmat[rowSums(im2$is.inf) > 0, , drop = FALSE] plot(fit_nohuman, which = 4)
The fit diagnostics are still OK without humans, so we could use either model, or even better: both!
Let's use three equivalent ways to test the allometry coefficient:
confint(fit)
0.6666... is not in the confidence interval, so the estimate differs from 2/3.
(t <- (fit$coefficients[2] - 2/3) / sqrt(diag(vcov(fit))[2])) 2*pt(abs(t), df = fit$df.residual, lower.tail = FALSE)
Indeed, the null hypothesis of an allometry coefficient of 2/3 is rejected.
fit2 <- lm(log_brain ~ 1 + offset(2/3*log_body), data = mammals) anova(fit2, fit)
Same p-value, as expected.
We can repeat the same on the fit without humans, just to be sure our results are robust to this outlier:
confint(fit_nohuman) t <- (fit_nohuman$coefficients[2] - 2/3) / sqrt(diag(vcov(fit_nohuman))[2]) 2*pt(abs(t), df = fit_nohuman$df.residual, lower.tail = FALSE) fit2_nohuman <- lm(log_brain ~ 1 + offset(2/3*log_body), data = mammals[rownames(mammals) != "Human", ]) anova(fit2_nohuman, fit_nohuman)
Same results, good!
We can plot the predictions from the models:
plot(log_brain ~ log_body, data = mammals, col = "blue") text(mammals$log_body, mammals$log_brain + 0.5, labels = rownames(mammals), cex = 0.8) abline(fit, col = "blue", lwd = 2) abline(fit_nohuman, col = "blue", lwd = 2, lty = 2) abline(coef(fit2), 2/3, col = "red", lwd = 2) legend("topleft", lty = c(1, 1, 2), lwd = c(2, 2, 2), col = c("blue", "red", "blue"), legend = c("slope estimated", "slope = 2/3", "slope estimated (no human)"))
p <- predict(fit_nohuman, newdata = data.frame(log_body = log(1))) exp(p)
An animal of 1 kg is predicted to have a brain size of r round(exp(p), 2)
grams.
NOTE: This is the same as the intercept because log(1) = 0
.
exp(coefficients(fit_nohuman)["(Intercept)"])
matrix(names(sort(resid(fit), decreasing = TRUE)), ncol = 1)
We clean the session for the next exercise:
rm(list = ls())
r .emo("goal")
What is the allometric exponent for the growth of brain size with body mass?
r .emo("goal")
Does this exponent agree with the usual expectation of 2/3?
r .emo("goal")
How large do we expect the brain of a 1kg animal to be?
r .emo("goal")
Rank organisms by relative brain size (i.e. controlled for body size)
head(Animals) str(Animals) plot(brain ~ body, data = Animals) plot(log(brain) ~ log(body), data = Animals, col = "blue", ylim = c(min(log(brain)), max(log(brain)) + 0.5)) text(log(Animals$body), log(Animals$brain) + 0.5, labels = rownames(Animals), cex = 0.8)
Animals$log_brain <- log(Animals$brain) Animals$log_body <- log(Animals$body)
fit <- lm(log_brain ~ log_body, data = Animals) plot(fit) im <- influence.measures(fit) im$infmat[rowSums(im$is.inf) > 0, , drop = FALSE] ## extract candidate outliers plot(fit, which = 4)
Here the biggest problem comes from dinosaurs, which do not seem to follow the same relationship than other animals.
Animals_nodino <- Animals[!rownames(Animals) %in% c("Triceratops", "Brachiosaurus", "Dipliodocus"), ] fit_nodino <- lm(log_brain ~ log_body, data = Animals_nodino) plot(fit_nodino) im <- influence.measures(fit_nodino) im$infmat[rowSums(im$is.inf) > 0, , drop = FALSE] ## extract candidate outliers plot(fit_nodino, which = 4)
That is not great, but much better!
This time we will use a package to test the allometry coefficient.
summary(multcomp::glht(fit, linfct = c("log_body = 0.66666666667")))
Note: you cannot write 2/3
r .emo("broken")
plot(log_brain ~ log_body, data = Animals, col = NULL) text(x = Animals$log_body, y = Animals$log_brain, labels = rownames(Animals), cex = 0.8, col = ifelse( rownames(Animals) %in% c("Triceratops", "Brachiosaurus", "Dipliodocus"), "red", "black")) p <- predict(fit, newdata = data.frame("log_body" = range(log(Animals$body)))) p_nodino <- predict(fit_nodino, newdata = data.frame("log_body" = range(log(Animals$body)))) points(x = range(Animals$log_body), y = p, col = "red", type = "l", lty = 2) points(x = range(Animals$log_body), y = p_nodino, col = "blue", type = "l")
p <- predict(fit_nodino, newdata = data.frame(log_body = log(1))) exp(p)
An animal of 1kg is predicted to have a brain size of r round(exp(p), 2)
grams.
NOTE: Again, this is the same as the intercept because log(1) = 0
.
exp(coefficients(fit_nodino)["(Intercept)"])
matrix(names(sort(resid(fit), decreasing = TRUE)), ncol = 1)
We clean the session for the next exercise:
rm(list = ls())
r .emo("goal")
Compare the approximation of the volume of wood given by $\text{Volume} = c\times\text{Height}\times\text{Girth}^2$ (with $c$ to be estimated) to the approximation of a tree trunk as a cylinder.
str(trees) summary(trees) pairs(trees) any(is.na(trees))
A first solution is to create a new predictor containing the value of $\text{Height}\times\text{Girth}^2$ and then fit a model without intercept which will directly estimate $c$.
trees$HG2 <- trees$Height*trees$Girth^2 fit_trees1 <- lm(Volume ~ HG2 - 1, data = trees)
There is no need to check our first 3 assumptions as the model structure is constrained, but we should check the assumptions concerning the errors.
par(mfrow = c(2, 2)) plot(fit_trees1)
Another solution is to use a transformation of the data based on logarithms.
Indeed, $\text{Volume} = c \times \text{Height} \times \text{Girth}^2$ is not linear but we can turn it into a linear expression by using logs:
$$\text{log(Volume)} = \text{log}(c) + \text{log(Height)} + 2 \times \text{log(Girth)}$$
We will thus substitute variables by their logs and try to fit the following model:
$$ \text{log_Volume} = k + 1 \times \text{log_Height} + 2 \times \text{log_Girth} + \epsilon$$
with $k = \text{log}(c)$. In this LM only $k$ must be estimated and the other slopes are considered as fixed. We thus fit the model as follow:
trees$log_Volume <- log(trees$Volume) trees$log_Height <- log(trees$Height) trees$log_Girth <- log(trees$Girth) fit_trees2 <- lm(log_Volume ~ 1 + offset(log_Height + 2 * log_Girth), data = trees)
par(mfrow = c(2, 2)) plot(fit_trees2)
It looks alright, let's not forget that we only have r nrow(trees)
rows in the datasets.
It is not clear how to compare the model since they are not nested and don't use the same response.
We will here use the r-squared although this is not a method to be recommended in general.
cor(predict(fit_trees1), trees$Volume)^2 cor(exp(predict(fit_trees2)), trees$Volume)^2
The goodness of fits look identical.
The first model fit gives us directly an estimate for $c$:
coef(fit_trees1) confint(fit_trees1)
The second fit gives us an estimate of $k$ (not $c$), which is:
coef(fit_trees2) confint(fit_trees2)
Because $k$ corresponds to the log of $c$, we must raise the obtained value to the exponential to obtain the estimation of $c$:
exp(coef(fit_trees2)) exp(confint(fit_trees2))
We compare the two modelling approaches:
coef(fit_trees1)[[1]] / exp(coef(fit_trees2))[[1]]
Note: both estimates are very close.
The volume of a cylinder is the height times the surface of a circle, so $h \pi r^2$ with $r$ the radius or $h\pi(d/2)^2$ with $d$ the diameter. The girth is here the diameter as mentioned in the help file and not the perimeter.
In the formula, all units should be the same but here the diameter is in inches but the height is measured in feet and the volume in cubic feet. This thus lead to the volume of the tree as a cylinder being approximated by $\text{Height} \times \pi \times \left( \text{conv}\text{in_to_ft} \text{Radius} \right)^2$. Since one foot is 12 inches, $\text{conv}\text{in_to_ft} = 1/12$, leading to $\text{Volume} = \text{Height} \times \pi \times \left(\frac{\text{Girth}}{24}\right)^2$.
trees$HPG2 <- trees$Height*pi*(trees$Girth/24)^2 fit_trees3 <- lm(Volume ~ -1 + offset(HPG2), data = trees)
We can compare both models explicitly:
anova(fit_trees3, fit_trees1)
fit_trees1
is a much better fit of the data than fig_trees3
Let us plot the differences in prediction between models:
girth.for.pred <- seq(min(trees$Girth), max(trees$Girth), length = 4) height.for.pred <- seq(min(trees$Height), max(trees$Height), length = 10) data.for.pred <- expand.grid(Girth = girth.for.pred, Height = height.for.pred) data.for.pred$HG2 <- data.for.pred$Height*data.for.pred$Girth^2 data.for.pred$HPG2 <- data.for.pred$Height*pi*(data.for.pred$Girth/24)^2 pred1 <- predict(fit_trees1, newdata = data.for.pred) pred2 <- predict(fit_trees3, newdata = data.for.pred) data.for.plot <- as.data.frame(cbind(data.for.pred, fit1 = pred1, fit2 = pred2)) plot(Volume ~ Height, data = trees, cex = 5*trees$Girth/max(trees$Girth), pch = 21, ylim = range(c(data.for.plot$fit, data.for.plot$fit2, trees$Volume)), ylab = "Volume (cubic feet)", xlab = "Height (meters)", col = "brown", bg = "yellow", lwd = 3) for (girth_class in 1:length(unique(girth.for.pred))) { with(subset(data.for.plot, data.for.plot$Girth == unique(girth.for.pred)[girth_class]), { points(fit1 ~ Height, type = "l", lwd = girth_class) points(fit2 ~ Height, type = "l", lwd = girth_class, col = "red") }) }
The plot shows the influence of height on the mean volume for 4 specific girths (r round(girth.for.pred, 2)
inches). Thicker lines represent trees with larger girth. The black lines depict the predictions from the model fit_trees
and the red ones are for the model fit_trees_cylinder
. The raw data are represented by the brown and yellow dots. The diameter of these dots is proportional to the actual measured girth.
That the approximation of the volume of trees as cylinders over-estimate the actual volume may be due to the fact that the diameter of trees goes down as you climb up the tree...
We clean the session:
rm(list = ls())
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.