library(LM2GLMM)
library(car)
knitr::opts_chunk$set(cache = FALSE, fig.align = "center", fig.width = 6, fig.height = 6,
                      cache.path = "./cache_knitr/LM_practice_ex/", fig.path = "./fig_knitr/LM_practice_ex/")

Disclaimer

In this vignette I illustrate the key steps required to solve the exercises. By no means am I trying to provide a detailed report of the analysis of each dataset. Any such report would likely require additional analyses, and the results would have to be written in proper English, not in telegraphic style or lines of code.


Dataset: chickwts -- part 1 (difficulty = 0/5)

r .emo("goal") Test whether diet significantly affects the weights of six weeks old chickens.

r .emo("goal") Compare the effect of different feed supplement types.

r .emo("goal") Plot mean effect for each feed supplement types showing 95% confidence intervals.

Exploring the data

We always start by looking at the data we have:

str(chickwts)
plot(weight ~ feed, data = chickwts)

I don't know much about chickens, but the values seem plausible.

We look at the distribution of chickens among the feed types:

table(chickwts$feed)

We check if there is any missing data:

any(is.na(chickwts))

Great! There is no missing data.

Fitting the model

We fit the model:

(fit_chick <- lm(weight ~ feed, data = chickwts))

Checking the assumptions

Let's have a look at the residuals of the model:

par(mfrow = c(2, 2))
plot(fit_chick)

Everything looks very good. The errors should thus be independent, homoscedastic and normally distributed.
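
If we wanted formal tests to complement these plots, here is a small sketch using tests that reappear later in this vignette (not required here, since the plots already look fine):

lmtest::bptest(fit_chick)           ## Breusch-Pagan test of homoscedasticity
shapiro.test(residuals(fit_chick))  ## Shapiro-Wilk test of normality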

Let's check potential outliers:

infl_res <- influence.measures(fit_chick)
infl_res$infmat[rowSums(infl_res$is.inf) > 0, , drop = FALSE]
plot(fit_chick, which = 4)

Nothing too bad here, just one observation (54) that is a little different from the rest of its group (meatmeal):

chickwts[chickwts$feed == "meatmeal", ]

Testing the effect of diet

Note: because there is only one predictor, anova() gives the same result here as car::Anova().

anova(fit_chick)

The diet seems to significantly influence the weight of chickens (F = r round(anova(fit_chick)$F[1], 2), df1 = r anova(fit_chick)$Df[1], df2 = r anova(fit_chick)$Df[2], p-value < 0.001).

Predicting the mean effect of each diet

pred <- predict(fit_chick, newdata = data.frame(feed = levels(chickwts$feed)),
                interval = "confidence")
pred <- as.data.frame(pred)
pred$feed <- levels(chickwts$feed)
rownames(pred) <- pred$feed ## for automatic names in barplot
pred

Let's plot this:

plot(pred[, "fit"] ~ I(1:nrow(pred) + 0.05), axes = FALSE, pch = 20,
     ylim = range(fit_chick$model$weight),
     ylab = "Chicken weight (gr)", xlab = "Diet")
axis(1, 1:nrow(pred), labels = rownames(pred))
axis(2)
box()
for (row in 1:nrow(pred)) {
  arrows(x0 = row + 0.05, y0 = pred[row, "lwr"], x1 = row + 0.05, y1 = pred[row, "upr"],
         code = 3, angle = 90, length = 0.1)
  diet <- levels(chickwts$feed)[row]
  weights.to.plot <- chickwts$weight[chickwts$feed == diet]
  points(rep(row, length(weights.to.plot)) - 0.05, weights.to.plot, col = "grey")
  }

The plot shows the predicted mean weight of chicken for each diet (black) with associated 95% confidence intervals. The observed weights of the chicken included in the analysis are represented by grey dots.

To make the plot even better, we could have sorted the diets by their effects and added some jitter so that points overlapping on top of each other become more spread out; a sketch of both improvements follows.
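
For instance, here is a minimal base-R sketch of these two improvements, reusing the objects pred and chickwts from above:

pred_sorted <- pred[order(pred$fit, decreasing = TRUE), ]  ## diets sorted by predicted mean
plot(pred_sorted$fit ~ I(1:nrow(pred_sorted)), axes = FALSE, pch = 20,
     ylim = range(chickwts$weight), ylab = "Chicken weight (gr)", xlab = "Diet")
axis(1, 1:nrow(pred_sorted), labels = pred_sorted$feed)
axis(2)
box()
for (row in 1:nrow(pred_sorted)) {
  arrows(x0 = row, y0 = pred_sorted[row, "lwr"], x1 = row, y1 = pred_sorted[row, "upr"],
         code = 3, angle = 90, length = 0.1)
  w <- chickwts$weight[chickwts$feed == pred_sorted$feed[row]]
  points(jitter(rep(row, length(w)), amount = 0.1), w, col = "grey")  ## jittered raw data
}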

We can also use {ggplot2} if you prefer:

library(ggplot2)
ggplot() +
  geom_point(aes(y = fit, x = forcats::fct_reorder(feed, fit, .desc = TRUE)), data = pred) +
  geom_errorbar(aes(x = forcats::fct_reorder(feed, fit, .desc = TRUE), ymin = lwr, ymax = upr), data = pred, width = 0.2) +
  geom_jitter(aes(y = weight, x = feed),
              height = 0,
              width = 0.1,
              shape = 1, colour = "blue",
              data = chickwts, inherit.aes = FALSE) +
  labs(x = "Diet", y = "Weight of 6 weeks old chicken (gr)") +
  theme_bw()

Testing which diets differ from each other

summary(posthoc <- multcomp::glht(fit_chick, linfct = multcomp::mcp(feed = "Tukey")))
par(oma = c(0, 8, 0, 0))
plot(posthoc)

Instead of comparing all diets to each other, we can also choose specific comparisons, so as to save some statistical power:

library(multcomp)
comp <- glht(model = fit_chick,
             linfct = mcp(feed = c("sunflower - casein == 0",
                                   "sunflower - horsebean == 0",
                                   "sunflower - linseed == 0",
                                   "sunflower - meatmeal == 0",
                                   "sunflower - soybean == 0")))
summary(comp)
old_par <- par(mar = c(5, 10, 4, 2))
plot(comp)
par(old_par)

Clean up

We clean the session for the next exercise:

rm(list = ls())


Dataset: stackloss (difficulty = 1/5)

r .emo("goal") Find out if the acid concentration influences the loss of ammonia by the plants.

Exploring the data

str(stackloss)
summary(stackloss)
pairs(stackloss)
any(is.na(stackloss))

Fitting the model

We will consider all predictors, since air flow and water temperature seem to be strong determinants of the loss of ammonia by the plants.

fit_stack <- lm(stack.loss ~ ., data = stackloss)

Note: the . is a shortcut to include all predictors at once.
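
For instance, here the call above is equivalent to spelling out all the predictors, as this quick check shows:

fit_stack_explicit <- lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., data = stackloss)
all.equal(coef(fit_stack), coef(fit_stack_explicit))  ## TRUE: same fit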

Checking the assumptions

Linearity

Let's examine the linearity assumption by plotting the residuals against each column of the design matrix (except the intercept):

for (col in 2:ncol(model.matrix(fit_stack))) {
  plot(rstandard(fit_stack) ~ model.matrix(fit_stack)[, col])
  title(main = colnames(model.matrix(fit_stack))[col])
}

It looks messy but there is no obvious pattern suggesting a particular transformation of the predictors.

Multicollinearity

cor(model.matrix(fit_stack)[ , -1]) ## we discard the intercept for that
cov2cor(vcov(fit_stack))
vif(fit_stack)

There are some multicollinearity problems, but they do not seem to much impact the effect of acidity, which is our focus. Note that the very strong correlations between the slopes and the intercept are to be expected, as is the correlation between a variable and itself.
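
As an aside, here is a small sketch (not needed for the exercise) showing that centring the predictors removes the intercept-slope correlations, while the slope-slope correlations remain:

fit_stack_centred <- lm(stack.loss ~ scale(Air.Flow, scale = FALSE) +
                          scale(Water.Temp, scale = FALSE) +
                          scale(Acid.Conc., scale = FALSE), data = stackloss)
round(cov2cor(vcov(fit_stack_centred)), 2)  ## intercept-slope correlations are now (essentially) 0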

Other assumptions

We will assume that there is no measurement error and will focus on the analysis of the residuals of the model:

par(mfrow = c(2, 2))
plot(fit_stack)

That is not fantastic, but it is not specifically worse than what we may get under the assumptions of LM with so few data. We can check this by simulating data under the assumptions of LM and comparing the plots we get:

par(mfrow = c(2, 2))
plot(update(fit_stack, as.matrix(simulate(fit_stack)) ~ .))  ## simulation 1
plot(update(fit_stack, as.matrix(simulate(fit_stack)) ~ .))  ## simulation 2

It does not look particularly better.

We can also use the package {DHARMa}, which compares your residuals to many simulations under the hypothesised data-generating process (more details in the GLM course):

DHARMa::simulateResiduals(fit_stack, plot = TRUE)

It is hard to be sure given the small size of the dataset, but we will consider the assumptions fulfilled.

Comparison to the null model

Because the model involves several predictors, before anything else we need to make sure that the overall model offers a higher goodness of fit than the null model:

fit_stack_H0 <- update(fit_stack, . ~ 1)
anova(fit_stack_H0, fit_stack)

All good: the model predicts the data significantly better than the null model.

Testing the effect of the acid concentration

summary(fit_stack)$coef["Acid.Conc.", ]

The acid concentration seems to have no significant influence on the loss of ammonia (t = r round(summary(fit_stack)$coef["Acid.Conc.", 3], 2), df = r fit_stack$df.residual, p-value = r round(summary(fit_stack)$coef["Acid.Conc.", 4], 2)).

Let's plot the effect very simply using the package {effects}.

plot(effects::predictorEffect("Acid.Conc.", mod = fit_stack))

Considering interactions?

Because we are specifically interested in the effect of the acid concentration, and because this effect may depend on the other predictors, we could check whether considering interactions between acidity and the other predictors improves the goodness of fit of the model:

fit_stack_int <- lm(stack.loss ~ Acid.Conc. * (Air.Flow + Water.Temp), data = stackloss)
anova(fit_stack, fit_stack_int)
vif(fit_stack_int)

The goodness of fit is not significantly better and there are huge multicollinearity issues, so let's not go there.

Clean up

We clean the session for the next exercise:

rm(list = ls())


Dataset: swiss (difficulty = 1/5)

r .emo("goal") Plot the influence of potential determinants of fertility.

r .emo("goal") Test the effect of different predictors on fertility.

Exploring the data

str(swiss)
summary(swiss)
pairs(swiss)
any(is.na(swiss))

Fitting the model

fit_swiss <- lm(Fertility ~ ., data = swiss)

Checking the assumptions

Linearity

Let's examine the linearity assumption by plotting the (standardised) residuals against each column of the design matrix (except the intercept):

for (col in 2:ncol(model.matrix(fit_swiss))) {
  plot(rstandard(fit_swiss) ~ model.matrix(fit_swiss)[, col])
  title(main = colnames(model.matrix(fit_swiss))[col])
}

No obvious non-linear pattern emerges.

Multicollinearity

We can look at the correlation between the column vectors of the model matrix:

cor(model.matrix(fit_swiss)[, -1])

We can also look at the correlation between parameter estimates:

cov2cor(vcov(fit_swiss))

There are some noticeable correlations between predictors and between parameter estimates. Let's double check the extent to which multicollinearity could be an issue by looking at the variance inflation factors:

vif(fit_swiss)

The situation is not ideal but nothing looks catastrophic. Let's see if the situation gets better once the variable Examination is omitted:

fit_swiss_bis <- update(fit_swiss, . ~ . - Examination)
vif(fit_swiss_bis)

The situation is much better, so we will study predictions stemming from both the fit with Examination and the fit without it, to examine the robustness of our results.

Fixed values for predictor

Because we do not know how many individuals the percentages are based on, we will assume that they were computed on large samples, so that this assumption is fulfilled.

Errors

par(mfrow = c(2, 2))
plot(fit_swiss)
plot(fit_swiss_bis)

Both model fits look alright (even if not super great).

Comparison to the null model

Because the model involves several predictors, before anything else we need to make sure that the overall model offers a higher goodness of fit than the null model:

fit_swiss_H0 <- update(fit_swiss, . ~ 1)
anova(fit_swiss_H0, fit_swiss)
anova(fit_swiss_H0, fit_swiss_bis)

Both of our models fit the data clearly better than the null model.

Plotting the effect of the potential determinants of fertility

Because this is a simple LM, you can use the function predictorEffects() from the package {effects} to do that automatically:

plot(effects::predictorEffects(fit_swiss))
plot(effects::predictorEffects(fit_swiss_bis))

Many other packages can be used to draw similar plots (although details on how predictions are computed vary). One example is {visreg}:

par(mfrow = c(2, 2))
plot(visreg::visreg(fit_swiss_bis))

Both models predict the same overall trends. To illustrate more precisely how the predicted values differ between the two models, one would have to compute the predictions by hand and plot them. Let's do that to show how the predictions for the mean effect of education differ between models:

data.for.pred <- data.frame("Agriculture" = mean(swiss$Agriculture), 
                            "Examination" = mean(swiss$Examination),
                            "Education" = seq(min(swiss$Education), max(swiss$Education), length = 30),
                            "Catholic" = mean(swiss$Catholic),
                            "Infant.Mortality" = mean(swiss$Infant.Mortality)
                            )
pred1 <- predict(fit_swiss, newdata = data.for.pred, interval = "confidence")
pred2 <- predict(fit_swiss_bis, newdata = data.for.pred, interval = "confidence")
pred1_df_temp <- as.data.frame(pred1) ## to turn the matrix into a data frame
pred2_df_temp <- as.data.frame(pred2) ## to turn the matrix into a data frame
pred1_df <- cbind(data.for.pred, pred1_df_temp)  ## we add the new data
pred2_df <- cbind(data.for.pred, pred2_df_temp)  ## we add the new data
plot(fit ~ Education, data = pred1_df, col = "blue", lwd = 2, type = "l",
     ylab = "Predicted Standardized Fertility (+/- 95% CI)", xlab = "Education (%)")
points(lwr ~ Education, data = pred1_df,
       col = "blue", lwd = 2, lty = 2, type = "l")
points(upr ~ Education, data = pred1_df,
       col = "blue", lwd = 2, lty = 2, type = "l")

points(fit ~ Education, data = pred2_df,
       col = "orange", lwd = 2, type = "l")
points(lwr ~ Education, data = pred2_df,
       col = "orange", lwd = 2, lty = 2, type = "l")
points(upr ~ Education, data = pred2_df,
       col = "orange", lwd = 2, lty = 2, type = "l")

legend("topright", fill = c("blue", "orange"),
       legend = c("fit_swiss", "fit_swiss_bis"), title = "Model:", bty = "n")

Note that while predicting the effect of Education we set all other covariates to their means, although other choices are possible (see the sketch after the {ggplot2} version below).

Same with {ggplot2}:

full_pred <- rbind(cbind(pred1_df, mod = "fit_swiss"),
                   cbind(pred2_df, mod = "fit_swiss_bis")) ## we combine all the data
full_pred
ggplot(full_pred) +
  aes(fill = mod, x = Education) +
  geom_ribbon(aes(ymax = upr, ymin = lwr), alpha = 0.3) +
  geom_line(aes(y = fit, colour = mod)) +
  labs(y = "Predicted Standardized Fertility (+/- 95% CI)", x = "Education (%)") +
  theme_minimal()
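
As noted above, setting the non-focal covariates to their means is only one possible choice. Here is a quick sketch of the same predictions with covariates set to their medians instead (any other meaningful reference values would work the same way):

data.for.pred.med <- data.frame("Agriculture" = median(swiss$Agriculture),
                                "Examination" = median(swiss$Examination),
                                "Education" = seq(min(swiss$Education), max(swiss$Education), length = 30),
                                "Catholic" = median(swiss$Catholic),
                                "Infant.Mortality" = median(swiss$Infant.Mortality))
head(predict(fit_swiss, newdata = data.for.pred.med, interval = "confidence"))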

Clean up

We clean the session for the next exercise:

rm(list = ls())


Dataset: InsectSprays (difficulty = 2/5)

r .emo("goal") Test whether insecticide type significantly affects the number of insects in agricultural experimental plots.

r .emo("goal") What is the mean number of insects we would predict to find on a plot treated with insecticide C?

Exploring the data

str(InsectSprays)
range(InsectSprays$count)
table(InsectSprays$spray)
plot(count ~ spray, data = InsectSprays)
any(is.na(InsectSprays))

Fitting the model

(fit_insect <- lm(count ~ spray, data = InsectSprays))

Checking the assumptions

As for the dataset chickwts, because we only study the effect of one qualitative predictor, we expect linearity and no multicollinearity, and the values of the predictor are fixed (as assumed). We can thus turn directly to the analysis of the prediction errors:

par(mfrow = c(2, 2))
plot(fit_insect)

That does not look so good: there seems to be some heteroscedasticity in the data. Let's test this possibility:

lmtest::bptest(fit_insect)

The Breusch-Pagan test shows that we can reject the assumption of homoscedasticity. One option would be to use a GLM, but we have not seen those yet, so let's try to reduce the problem and apply an LM fit anyhow. The first thing to try is a power transformation of the response variable.

Box Cox

For the Box Cox transformation to work, we need all the values to be strictly positive, but we saw that the minimum value is zero. So let's start by adding one to every count and refit the model.

InsectSprays$count_bis <- InsectSprays$count + 1
fit_insect_bis <- update(fit_insect, count_bis ~ .)

We can now start the Box Cox analysis:

car::boxCox(fit_insect_bis)
summary(bc <- car::powerTransform(fit_insect_bis))

The estimated value for lambda is close to 1/3, which would correspond to taking the cube root of the observations, but since this value is not particularly meaningful we will stick to the best estimate for lambda, which is r bc$lambda and is stored in the object bc$lambda.

Let's fit the model on the transformed data and display the new diagnostic plots:

InsectSprays$count_bc <- car::bcPower(InsectSprays$count_bis, bc$lambda)
fit_insect_bc <- update(fit_insect_bis, count_bc ~ .)
par(mfrow = c(2, 2))
plot(fit_insect_bc)

That looks fantastic! The heteroscedasticity is gone:

lmtest::bptest(fit_insect_bc)

Non-independence should not be an issue (although the original paper in which these data were published shows that the experimental design was not as simple as the dataset here suggests), but let's check anyhow:

lmtest::dwtest(fit_insect_bc)
car::durbinWatsonTest(fit_insect_bc)

Independence is not rejected, irrespective of whether we perform a one-sided or a two-sided test.
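
Note that lmtest::dwtest tests against positive autocorrelation by default; here is a sketch of the explicit two-sided version:

lmtest::dwtest(fit_insect_bc, alternative = "two.sided")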

Normality looked good on the plots, but we can also test it:

nortest::lillie.test(residuals(fit_insect_bc))
shapiro.test(residuals(fit_insect_bc))

Again, no issue here.

Testing the effect of insecticides

Note: once again, because there is only one predictor, the following test also corresponds to testing the fitted model against the fit of a null model.

anova(fit_insect_bc)  ## again, same here as car::Anova as only one predictor
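
We can verify this equivalence explicitly (a quick sketch):

fit_insect_bc_H0 <- update(fit_insect_bc, . ~ 1)
anova(fit_insect_bc_H0, fit_insect_bc)  ## same F-test as anova(fit_insect_bc)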

The type of insecticide seems to significantly influence the number of insects counted (F = r round(anova(fit_insect_bc)$F[1], 2), df1 = r anova(fit_insect_bc)$Df[1], df2 = r anova(fit_insect_bc)$Df[2], p-value < 0.001).

Predicting the effect of spray C

(meanC_BC <- predict(fit_insect_bc, newdata = data.frame(spray = "C"), interval = "confidence"))

This prediction corresponds to the number of insects + 1 on the Box Cox scale. We thus need to transform this result back onto the original scale:

(meanC <- ((meanC_BC * bc$lambda) + 1)^(1/bc$lambda) - 1)
car::bcnPowerInverse(meanC_BC, lambda = bc$lambda, gamma = 0) - 1  ## same thing with less maths

We thus expect to find on average (95%CI) r round(meanC[1, "fit"], 2) (r round(meanC[1, "lwr"], 2) -- r round(meanC[1, "upr"], 2)) insects on agricultural experimental units treated with the insecticide C.

Plotting the effect of all sprays

Let's use this exercise to illustrate how you can "easily" plot the effect of the different sprays using {effects} despite the BoxCox transformation:

f.inverse <- function(x) car::bcnPowerInverse(x, lambda = bc$lambda, gamma = 0) - 1
predict_spray <- effects::predictorEffect("spray", fit_insect_bc)
plot(predict_spray,
     axes = list(y = list(transform = list(inverse = f.inverse),
                      type = "response", lab = "Count (UnBoxCoxed)",
                      lim = c(0, 25)), grid = TRUE))

That also works with {visreg}:

visreg::visreg(fit_insect_bc, trans = f.inverse, ylab = "Count (UnBoxCoxed)", gg = TRUE) +
  scale_y_continuous(limits = c(0, 25)) +
  theme_bw()

For GLM, we will see later that {effects} and {visreg} have issues, but these packages work fine for LM. They do differ, however, in what they assume for non-focal effects (more on that during the GLM course), although here this has no consequence since we only considered a single predictor.

Clean up

We clean the session for the next exercise:

rm(list = ls())


Dataset: chickwts -- part 2 (difficulty = 3/5)

r .emo("goal") Predict the proportion of chickens larger than 300g for each feed supplements.

Predicting the proportion of chickens heavier than 300 grams

fit_chick <- lm(weight ~ feed, data = chickwts)

We have already studied this model fit above, so we won't do it again here.

We start by predicting the weight of chickens for each type of feed:

(pred2 <- predict(fit_chick,
                  newdata = data.frame(feed = levels(chickwts$feed)),
                  interval = "prediction",
                  se.fit = TRUE))

Note: we need the se.fit since we use it to compute the distribution of future observations.

Under the assumptions of LM, such a distribution is Gaussian, with a mean given by the predictions stored in pred2 and an SD corresponding to the square root of the sum of the variance of the predicted means and the residual variance:

(se.pred <- sqrt(pred2$se.fit^2 + pred2$residual.scale^2))
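
As a sanity check (a small sketch, not required), the prediction interval bounds returned above can be reconstructed from se.pred:

all.equal(pred2$fit[, "lwr"],
          pred2$fit[, "fit"] + qt(0.025, df = pred2$df) * se.pred)  ## TRUE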

We can thus use the distribution function of the normal distribution to get what we want:

freq_more300 <- pnorm(q = 300,
                      mean = pred2$fit[, "fit"],
                      sd = se.pred, lower.tail = FALSE)

output <- as.data.frame(cbind(pred2$fit, "%>300gr" = round(freq_more300, 2)))

output$feed <- levels(chickwts$feed)

output

Plotting the proportion of chickens heavier than 300 grams

We can plot the outcome like this:

barplot(sort(output$`%>300gr`), names.arg = output$feed[order(output$`%>300gr`)],
        ylim = c(0, 1), ylab = "Predicted proportion of chicken larger than 300 gr")

or using {ggplot2} and another tidyverse package ({forcats}) to reorder the factor levels:

output$feed2 <- forcats::fct_reorder(output$feed, output$`%>300gr`)

ggplot(output) +
    geom_col(aes(x = feed2, y = `%>300gr`)) +
    labs(y = "Predicted proportion of chicken larger than 300 gr") +
    theme_minimal()

Clean up

We clean the session for the next exercise:

rm(list = ls())


Dataset: mammals (difficulty = 3/5)

r .emo("goal") What is the allometric exponent for the growth of brain size with body mass?

r .emo("goal") Does this exponent agree with the usual expectation of 2/3?

r .emo("goal") How large do we expect the brain of a 1kg animal to be?

r .emo("goal") Rank organisms by relative brain size (i.e. controlled for body size)

r .emo("goal") Repeat the analysis with the dataset Animals

Exploring the data

head(mammals)
str(mammals)
plot(brain ~ body, data = mammals)
plot(log(brain) ~ log(body), data = mammals, pch = 20, col = "blue")
text(log(mammals$body), log(mammals$brain) + 0.4,
     labels = rownames(mammals), cex = 0.8)

This is the kind of thing for which knowing {ggplot2} and its companion packages {ggrepel} and {scales} helps a little:

ggplot(mammals) +
  aes(y = brain, x = body, label =  rownames(mammals)) +
  geom_point() +
  #geom_smooth() + ## to show regression smoother
  ggrepel::geom_text_repel(max.overlaps = Inf) +
  scale_x_continuous(trans = "log10", labels = scales::label_number()) +
  scale_y_continuous(trans = "log10", labels = scales::label_number()) +
  theme_minimal()

Reshaping the data

mammals$log_brain <- log(mammals$brain)
mammals$log_body <-  log(mammals$body)

Note: you can use a log of any base; it will make no difference for the allometry coefficient or for the back-transformed fitted values (as long as you back-transform the data correctly and use the same transformation for x and y).
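
A quick sketch checking that claim with natural and base-10 logs:

fit_ln  <- lm(log(brain) ~ log(body), data = mammals)
fit_l10 <- lm(log10(brain) ~ log10(body), data = mammals)
c(coef(fit_ln)[2], coef(fit_l10)[2])                ## identical allometric exponents
all.equal(exp(fitted(fit_ln)), 10^fitted(fit_l10))  ## identical fitted brain sizes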

Fitting the model

fit <- lm(log_brain ~ log_body, data = mammals)
plot(fit)

There seem to be some outliers but otherwise the diagnostics look good.

So let's examine the outliers in more detail:

im <- influence.measures(fit)
im$infmat[rowSums(im$is.inf) > 0, , drop = FALSE] ## extract candidate outliers
plot(fit, which = 4)

Humans seem to be a little problematic because they exert a strong effect on the slope (check the column dfb.lg_b) and thus on the estimate of the allometric relationship... so let's refit the model without them:

summary(fit) ## fit with humans
fit_nohuman <- lm(log_brain ~ log_body, data = mammals[rownames(mammals) != "Human", ])
summary(fit_nohuman) ## fit without humans
plot(fit_nohuman)
im2 <- influence.measures(fit_nohuman)
im2$infmat[rowSums(im2$is.inf) > 0, , drop = FALSE]
plot(fit_nohuman, which = 4)

The fit diagnostics are still OK without humans, so we could use either model, or even better: both!

Comparing to an allometry coefficient of 2/3

Let's use three equivalent ways to test the allometry coefficient:

confint(fit)

0.6666... is not in the confidence interval, so the estimate differs significantly from 2/3.

(t <- (fit$coefficients[2] - 2/3) / sqrt(diag(vcov(fit))[2]))
2*pt(abs(t), df = fit$df.residual, lower.tail = FALSE)

Indeed, the null hypothesis of an allometry coefficient of 2/3 is rejected.

fit2 <- lm(log_brain ~ 1 + offset(2/3*log_body), data = mammals)
anova(fit2, fit)

Same p-value, as expected.

We can repeat the same on the fit without humans, just to be sure our results are robust to this outlier:

confint(fit_nohuman)
t <- (fit_nohuman$coefficients[2] - 2/3) / sqrt(diag(vcov(fit_nohuman))[2])
2*pt(abs(t), df = fit_nohuman$df.residual, lower.tail = FALSE)
fit2_nohuman <- lm(log_brain ~ 1 + offset(2/3*log_body), data = mammals[rownames(mammals) != "Human", ])
anova(fit2_nohuman, fit_nohuman)

Same results, good!

Plotting the observed and expected allometric relationships

We can plot the predictions from the models:

plot(log_brain ~ log_body, data = mammals, col = "blue")
text(mammals$log_body, mammals$log_brain + 0.5,
     labels = rownames(mammals), cex = 0.8)
abline(fit, col = "blue", lwd = 2)
abline(fit_nohuman, col = "blue", lwd = 2, lty = 2)
abline(coef(fit2), 2/3, col = "red", lwd = 2)
legend("topleft", lty = c(1, 1, 2), lwd = c(2, 2, 2), col = c("blue", "red", "blue"),
       legend = c("slope estimated", "slope = 2/3", "slope estimated (no human)"))

Prediction of the brain size for an animal of 1kg

p <- predict(fit_nohuman, newdata = data.frame(log_body = log(1)))
exp(p)

An animal of 1 kg is predicted to have a brain size of r round(exp(p), 2) grams.

NOTE: This is the same as the intercept because log(1) = 0.

exp(coefficients(fit_nohuman)["(Intercept)"])

Let us rank the organisms by brain size controlling for body size

matrix(names(sort(resid(fit), decreasing = TRUE)), ncol = 1)

Clean up

We clean the session for the next exercise:

rm(list = ls())


Dataset: Animals (difficulty = 3/5)

r .emo("goal") What is the allometric exponent for the growth of brain size with body mass?

r .emo("goal") Does this exponent agree with the usual expectation of 2/3?

r .emo("goal") How large do we expect the brain of a 1kg animal to be?

r .emo("goal") Rank organisms by relative brain size (i.e. controlled for body size)

Exploring the data

head(Animals)
str(Animals)
plot(brain ~ body, data = Animals)
plot(log(brain) ~ log(body), data = Animals, col = "blue",
     ylim = c(min(log(Animals$brain)), max(log(Animals$brain)) + 0.5))
text(log(Animals$body), log(Animals$brain) + 0.5,
     labels = rownames(Animals), cex = 0.8)

Reshaping the data

Animals$log_brain <- log(Animals$brain)
Animals$log_body <-  log(Animals$body)

Fitting the model

fit <- lm(log_brain ~ log_body, data = Animals)
plot(fit)
im <- influence.measures(fit)
im$infmat[rowSums(im$is.inf) > 0, , drop = FALSE] ## extract candidate outliers
plot(fit, which = 4)

Here the biggest problem comes from the dinosaurs, which do not seem to follow the same relationship as the other animals.

Animals_nodino <- Animals[!rownames(Animals) %in%
                          c("Triceratops", "Brachiosaurus", "Dipliodocus"), ]
fit_nodino <- lm(log_brain ~ log_body, data = Animals_nodino)
plot(fit_nodino)
im <- influence.measures(fit_nodino)
im$infmat[rowSums(im$is.inf) > 0, , drop = FALSE] ## extract candidate outliers
plot(fit_nodino, which = 4)

That is not great, but much better!

Comparing to an allometry coefficient of 2/3

This time we will use a package to test the allometry coefficient.

summary(multcomp::glht(fit, linfct = c("log_body = 0.66666666667")))

Note: you cannot write 2/3 r .emo("broken")
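
A possible workaround (a small convenience sketch): build the string with sprintf() so that R computes the digits for you:

summary(multcomp::glht(fit, linfct = sprintf("log_body = %.12f", 2/3)))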

Plotting the observed and expected allometric relationships

plot(log_brain ~ log_body, data = Animals, type = "n") ## empty plot; the labels are added below
text(x = Animals$log_body, y = Animals$log_brain,
     labels = rownames(Animals), cex = 0.8,
     col = ifelse( rownames(Animals) %in% c("Triceratops", "Brachiosaurus", "Dipliodocus"),
                   "red", "black"))
p <- predict(fit, newdata = data.frame("log_body" = range(log(Animals$body))))
p_nodino <- predict(fit_nodino, newdata = data.frame("log_body" = range(log(Animals$body))))
points(x = range(Animals$log_body), y = p, col = "red", type = "l", lty = 2)
points(x = range(Animals$log_body), y = p_nodino, col = "blue", type = "l")

Prediction of the brain size for an animal of 1kg

p <- predict(fit_nodino, newdata = data.frame(log_body = log(1)))
exp(p)

An animal of 1kg is predicted to have a brain size of r round(exp(p), 2) grams.

NOTE: Again, this is the same as the intercept because log(1) = 0.

exp(coefficients(fit_nodino)["(Intercept)"])

Let us rank the organisms by brain size controlling for body size

matrix(names(sort(resid(fit), decreasing = TRUE)), ncol = 1)

Clean up

We clean the session for the next exercise:

rm(list = ls())


Dataset: trees (difficulty = 4/5)

r .emo("goal") Compare the approximation of the volume of wood given by $\text{Volume} = c\times\text{Height}\times\text{Girth}^2$ (with $c$ to be estimated) to the approximation of a tree trunk as a cylinder.

Exploring the data

str(trees)
summary(trees)
pairs(trees)
any(is.na(trees))

Fitting the model (Solution 1)

A first solution is to create a new predictor containing the value of $\text{Height}\times\text{Girth}^2$ and then fit a model without an intercept, which will directly estimate $c$.

trees$HG2 <- trees$Height*trees$Girth^2
fit_trees1 <- lm(Volume ~ HG2 - 1, data = trees)

Checking the assumptions

There is no need to check our first 3 assumptions as the model structure is constrained, but we should check the assumptions concerning the errors.

par(mfrow = c(2, 2))
plot(fit_trees1)

Fitting the model (Solution 2)

Another solution is to use a transformation of the data based on logarithms.

Indeed, $\text{Volume} = c \times \text{Height} \times \text{Girth}^2$ is not linear but we can turn it into a linear expression by using logs:

$$\text{log(Volume)} = \text{log}(c) + \text{log(Height)} + 2 \times \text{log(Girth)}$$

We will thus substitute variables by their logs and try to fit the following model:

$$ \text{log_Volume} = k + 1 \times \text{log_Height} + 2 \times \text{log_Girth} + \epsilon$$

with $k = \text{log}(c)$. In this LM only $k$ must be estimated; the other slopes are considered fixed. We thus fit the model as follows:

trees$log_Volume <- log(trees$Volume)
trees$log_Height <- log(trees$Height)
trees$log_Girth <- log(trees$Girth)
fit_trees2 <- lm(log_Volume ~ 1 + offset(log_Height + 2 * log_Girth), data = trees)

Checking the assumptions

par(mfrow = c(2, 2))
plot(fit_trees2)

It looks alright; let's not forget that we only have r nrow(trees) rows in the dataset.

Let us compare how each model fits the data

It is not clear how to compare the models since they are not nested and do not use the same response.

We will use the R-squared here, although this is not a method to be recommended in general.

cor(predict(fit_trees1), trees$Volume)^2
cor(exp(predict(fit_trees2)), trees$Volume)^2

The goodness of fit looks identical for the two models.
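
Another rough yardstick (a sketch, with the same caveat as for the R-squared): the root mean squared error of the predictions on the original scale:

sqrt(mean((predict(fit_trees1) - trees$Volume)^2))       ## RMSE of solution 1
sqrt(mean((exp(predict(fit_trees2)) - trees$Volume)^2))  ## RMSE of solution 2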

Estimation of the parameter $c$

The first model fit gives us directly an estimate for $c$:

coef(fit_trees1)
confint(fit_trees1)

The second fit gives us an estimate of $k$ (not $c$), which is:

coef(fit_trees2)
confint(fit_trees2)

Because $k$ corresponds to the log of $c$, we must exponentiate the obtained values to obtain the estimate of $c$:

exp(coef(fit_trees2))
exp(confint(fit_trees2))

We compare the two modelling approaches:

coef(fit_trees1)[[1]] / exp(coef(fit_trees2))[[1]]

Note: both estimates are very close.

Comparison to the approximation of a tree trunk by a cylinder

The volume of a cylinder is the height times the area of a circle, so $h \pi r^2$ with $r$ the radius, or $h\pi(d/2)^2$ with $d$ the diameter. The girth here is the diameter, as mentioned in the help file, and not the perimeter.

In the formula all units should be the same, but here the diameter is in inches while the height is measured in feet and the volume in cubic feet. The volume of a tree approximated as a cylinder is thus $\text{Volume} = \text{Height} \times \pi \times \left(\text{conv}_{\text{in-to-ft}} \times \frac{\text{Girth}}{2}\right)^2$. Since one foot is 12 inches, $\text{conv}_{\text{in-to-ft}} = 1/12$, leading to $\text{Volume} = \text{Height} \times \pi \times \left(\frac{\text{Girth}}{24}\right)^2$.

trees$HPG2 <- trees$Height*pi*(trees$Girth/24)^2
fit_trees3 <- lm(Volume ~ -1 + offset(HPG2), data = trees)

We can compare both models explicitly:

anova(fit_trees3, fit_trees1)

fit_trees1 fits the data much better than fit_trees3.

Let us plot the differences in prediction between models:

girth.for.pred <- seq(min(trees$Girth), max(trees$Girth), length = 4)
height.for.pred <- seq(min(trees$Height), max(trees$Height), length = 10)
data.for.pred <- expand.grid(Girth = girth.for.pred, Height = height.for.pred)

data.for.pred$HG2 <- data.for.pred$Height*data.for.pred$Girth^2
data.for.pred$HPG2 <- data.for.pred$Height*pi*(data.for.pred$Girth/24)^2

pred1 <- predict(fit_trees1, newdata = data.for.pred)
pred2 <- predict(fit_trees3, newdata = data.for.pred)

data.for.plot <- as.data.frame(cbind(data.for.pred, fit1 = pred1, fit2 = pred2))

plot(Volume ~ Height, data = trees, cex = 5*trees$Girth/max(trees$Girth), pch = 21,
     ylim = range(c(data.for.plot$fit1, data.for.plot$fit2, trees$Volume)),
     ylab = "Volume (cubic feet)", xlab = "Height (feet)", col = "brown", bg = "yellow", lwd = 3)

for (girth_class in 1:length(unique(girth.for.pred))) {
  with(subset(data.for.plot,
              data.for.plot$Girth == unique(girth.for.pred)[girth_class]), {
    points(fit1 ~ Height, type = "l", lwd = girth_class)
    points(fit2 ~ Height, type = "l", lwd = girth_class, col = "red")
  })
}

The plot shows the influence of height on the mean volume for 4 specific girths (r round(girth.for.pred, 2) inches). Thicker lines represent trees with larger girth. The black lines depict the predictions from the model fit_trees1 and the red ones those from the model fit_trees3. The raw data are represented by the brown and yellow dots; the diameter of these dots is proportional to the actual measured girth.

That approximating the volume of trees as cylinders over-estimates the actual volume may be due to the fact that the diameter of the trunk decreases as you go up the tree...

Clean up

We clean the session:

rm(list = ls())



