library(learnr)
library(gradethis)

tutorial_options(exercise.timelimit = 5, exercise.checker = gradethis::grade_learnr)

knitr::opts_chunk$set(echo = FALSE)

Data Generation

In this activity, we'll be exploring log transformations of the response and the predictors in the context of linear regression.

To begin, let's use the function below to generate a dataset of $Y$, $X_1$, $X_2$, and $X_3$.

generate.data = function(n, p=3){
  X = 5 + matrix(rnorm(p*n), n)      # n observations of p predictors, centered at 5
  beta = runif(p+1, -1, 1)           # intercept and slopes, drawn uniformly on (-1, 1)
  epsilon = rnorm(n)                 # Gaussian noise (added inside the exponential)
  Y = exp(beta[1] + X %*% beta[-1] + epsilon) ## NOTE THIS LINE!!
  data.frame(Y, X)
}

Look at this function carefully.
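If you'd like to see what it produces, a quick call like the following (the seed here is arbitrary, used only to make this illustration reproducible) prints the first few rows of a small simulated dataset:

set.seed(1)   # arbitrary seed, for illustration only
head(generate.data(10))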

quiz(caption="Question 1",
  question("Based on the data generation function, which of the following is the correct model?",
    answer("$Y \\sim \\beta_0 + \\beta_1 X_1 + \\beta_2 X_2 + \\beta_3 X_3$"),
    answer("$\\log(Y) \\sim \\beta_0 + \\beta_1 X_1 + \\beta_2 X_2 + \\beta_3 X_3$",correct=TRUE),
    answer("$Y \\sim \\beta_0 + \\beta_1 \\log(X_1) + \\beta_2 \\log(X_2) + \\beta_3 \\log(X_3)$"),
    answer("$\\log(Y) \\sim \\beta_0 + \\beta_1 \\log(X_1) + \\beta_2 \\log(X_2) + \\beta_3 \\log(X_3)$"),
    allow_retry = TRUE
  )
)
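To connect the code to the model: the flagged line computes $Y = \exp(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \varepsilon)$, so taking logs of both sides gives $\log(Y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \varepsilon$. In other words, the linear model with Gaussian errors holds for $\log(Y)$ with the untransformed predictors.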

QQplot

Performing Linear Regression

The following code performs linear regression on the data after making various log transformations.

set.seed(20200706)
n = 250
dat = generate.data(n)
## The four candidate models, in order:
##   1: Y ~ X1 + X2 + X3
##   2: log(Y) ~ X1 + X2 + X3
##   3: Y ~ log(X1) + log(X2) + log(X3)
##   4: log(Y) ~ log(X1) + log(X2) + log(X3)
formulae = lapply(
  c('Y~.', 
    'log(Y)~.',
    paste0('Y ~', paste(paste0('log(X',1:3,')'),collapse='+')),
    paste0('log(Y) ~', paste(paste0('log(X',1:3,')'),collapse='+'))), 
  as.formula)
all.the.models = lapply(formulae, function(x) lm(x, data=dat))
all.the.models

Run this code and observe the output. You should see 4 different sets of coefficients from 4 different linear regression models. Now look back at the code carefully and determine which model is the correct one.

quiz(caption="Question 2",
  question("Which of the sets of coefficients was computed from the true model?",
    answer("Model 1"),
    answer("Model 2",correct=TRUE),
    answer("Model 3"),
    answer("Model 4"),
    allow_retry = TRUE
  )
)

QQ-Plots

The following code creates QQ-plots of the residuals from each of the four models.

## Base R version
#par(mfrow = c(2,2))
#for(i in 1:4){
#  qqnorm(residuals(all.the.models[[i]]))
#  qqline(residuals(all.the.models[[i]]))
#}
## ggplot2 version: collect the residuals from all four models, then facet by model
library(tidyverse)
resids = as_tibble(
  sapply(all.the.models, residuals), .name_repair = ~paste0("model", 1:4))
resids %>% pivot_longer(everything()) %>%
  ggplot(aes(sample = value)) + geom_qq() + geom_qq_line() + 
  facet_wrap(~name, nrow = 2, scales = 'free_y')

Look carefully at these QQ-plots. What do you see? What would you expect to see?

quiz(caption="Question 3",
  question("Which of the QQplots appear ok? (select ALL that apply)",
    answer("Model 1"),
    answer("Model 2",correct=TRUE),
    answer("Model 3"),
    answer("Model 4",correct=TRUE),
    allow_retry = TRUE
  )
)

We would expect the residuals to follow the line fairly closely, and Models 2 and 4 look fine. Taking logs of the $X$'s has little effect here, so even though the QQ-plots for Models 2 and 4 (and likewise for Models 1 and 3) are technically different, they are nearly impossible to tell apart.
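To see just how similar Models 2 and 4 are, here is a quick sketch (reusing all.the.models from above) that compares their residuals directly; because the $X$'s are concentrated around 5, $\log(X)$ is roughly a linear function of $X$ over this range, so the two fits are nearly the same.

r2 = residuals(all.the.models[[2]])
r4 = residuals(all.the.models[[4]])
cor(r2, r4)                                     # should be very close to 1
plot(r2, r4, xlab = "Model 2 residuals", ylab = "Model 4 residuals")
abline(0, 1)                                    # points hugging this line => nearly identical residuals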

Hat Values

Computing the Hat Values

The following code computes the hat values, $h_{ii}$, for each of the models.

hats <- sapply(all.the.models, hatvalues)
colnames(hats) <- c("Model 1", "Model 2", "Model 3", "Model 4")
pairs(hats)

Examine the hat values for the four models. What do you notice?

Models 1 and 2 use the same $X$, so their hat values are identical. The same is true of Models 3 and 4, which both use the log-transformed $X$. Comparing the untransformed models (1 and 2) with the log-transformed ones (3 and 4), the hat values are nearly the same, except that they diverge a bit for larger values. Remember, the hat values are the diagonals of $X(X' X)^{-1} X'$, so as long as $X$ doesn't change much, the hat values don't change much. The log transformation changes $X$ very little for small values, but has more of an impact on larger ones.
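To tie this to the formula, here is a minimal sketch that builds the hat matrix for Model 1 directly from its design matrix and checks the diagonal against hatvalues():

X1 = model.matrix(all.the.models[[1]])       # n x (p+1) design matrix, including the intercept column
H  = X1 %*% solve(t(X1) %*% X1) %*% t(X1)    # hat matrix X (X'X)^{-1} X'
all.equal(unname(diag(H)), unname(hatvalues(all.the.models[[1]])))   # should be TRUE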

Cross-Validation

Residuals

The following code performs leave-one-out cross-validation for each of the four models.

## Leave-one-out CV via the shortcut formula for linear models:
## CV = (1/n) * sum_i ( e_i / (1 - h_ii) )^2
cv.lm = function(mdl) mean(residuals(mdl)^2 / (1-hatvalues(mdl))^2)
sapply(all.the.models, cv.lm)
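The shortcut inside cv.lm avoids refitting the model $n$ times. As a sanity check (a sketch only, not part of the original lab code), the value for Model 1 should agree with an explicit leave-one-out loop:

## Explicit leave-one-out CV for Model 1: slow, but a direct check of the shortcut
errs = numeric(n)
for (i in 1:n) {
  fit = lm(Y ~ ., data = dat[-i, ])                          # refit without observation i
  errs[i] = dat$Y[i] - predict(fit, newdata = dat[i, , drop = FALSE])
}
mean(errs^2)                  # explicit LOO CV
cv.lm(all.the.models[[1]])    # shortcut value; should match up to rounding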

Consider Models 1 and 2. In these two cases, what is residuals(mdl) measuring? Think about how the log transformation changes the scale of those residuals.

For Model 1, residuals(mdl) is $y - \hat{y}$. For Model 2, it is $\log(y) - \widehat{\log(y)}$, i.e., the fitted values are on the log scale. So in that case residuals(mdl) $= \log(y) - \log(\hat{y}) = \log(y/\hat{y})$.
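A quick sketch of this point, using all.the.models from above:

m2 = all.the.models[[2]]
head(residuals(m2))                 # log(Y_i) minus the fitted value of log(Y_i)
head(log(dat$Y) - fitted(m2))       # the same thing, computed by hand
head(dat$Y / exp(fitted(m2)))       # = exp(residuals(m2)): a multiplicative (relative) error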

Is it reasonable to compare the CV values for models 1 and 3 with those of models 2 and 4? Why or why not?

quiz(caption="Question 4",
  question("Is it reasonable to compare the CV values for models 1 and 3 with those of models 2 and 4? Why or why not?",
    answer("Yes, Models 2 and 4 have a much lower CV value and are thus better."),
    answer("Yes, Models 1 and 3 have a much higher CV value and are thus better."),
    answer("No, the residuals are on different scales so it is unfair to compare", correct=TRUE, message="One is measuring absolute errors, while the other is more like percent errors."),
    answer("Yes, performing cross-validation corrects for any mismatch in scaling."),
    allow_retry = TRUE
  )
)

Selecting a Model

How should we decide which model to use?

This is a subtle issue without a single correct answer, in light of the previous question: the CV values measure different things. We first have to decide how we want to measure errors. If we care about relative errors (as in percent errors), we should compare the cv.lm values for Models 2 and 4, and recalculate relative errors for Models 1 and 3. If we care about absolute errors (the distance on the number line between predictions and observations), we should compare the cv.lm values for Models 1 and 3 with recalculated values for Models 2 and 4.
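One way to make that comparison concrete is sketched below (the helper name loocv.original.scale is just illustrative, not part of the lab code): compute leave-one-out predictions for every model, back-transform the log-$Y$ models' predictions with exp() so all errors live on the original scale of $Y$, and then compare mean squared errors. Keep in mind that exponentiating a log-scale prediction is a slightly biased predictor of $Y$ itself, so treat the result as a rough comparison.

## Leave-one-out squared error on the original Y scale, for all four models (sketch)
loocv.original.scale = function(form, data, log.response = FALSE) {
  n = nrow(data)
  errs = numeric(n)
  for (i in 1:n) {
    fit  = lm(form, data = data[-i, ])
    pred = predict(fit, newdata = data[i, , drop = FALSE])
    if (log.response) pred = exp(pred)        # back-transform predictions of log(Y)
    errs[i] = data$Y[i] - pred
  }
  mean(errs^2)
}
mapply(loocv.original.scale,
       form = formulae,
       log.response = c(FALSE, TRUE, FALSE, TRUE),   # Models 2 and 4 have log(Y) as the response
       MoreArgs = list(data = dat))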


