Instructions

  1. Rename this document with your student ID (not the 10-digit number; your IU username, e.g. dajmcdon). Include your buddy in the author field if you are working together.
  2. I have given you code to generate data and fit 4 different models to the data. You should run through the code line by line in the console.
  3. Discuss the questions with your neighbors. Write down answers.

Generate data and fit models

generate.data = function(n, p = 3){
  X = 5 + matrix(rnorm(p * n), n)  # n x p design matrix, centered at 5
  beta = runif(p + 1, -1, 1)       # intercept plus p slopes
  epsilon = rnorm(n)
  Y = exp(beta[1] + X %*% beta[-1] + epsilon) ## NOTE THIS LINE!!
  data.frame(Y, X)
}
set.seed(20200213)
n = 250
dat = generate.data(n)
formulae = lapply(
  c('Y~.', 
    'log(Y)~.',
    paste0('Y ~', paste(paste0('log(X',1:3,')'),collapse='+')),
    paste0('log(Y) ~', paste(paste0('log(X',1:3,')'),collapse='+'))), 
  as.formula)
all.the.models = lapply(formulae, function(x) lm(x, data=dat))

Make QQ plots

## Base R version
#par(mfrow = c(2,2))
#for(i in 1:4){
#  qqnorm(residuals(all.the.models[[i]]))
#  qqline(residuals(all.the.models[[i]]))
#}
library(tidyverse)
resids = as_tibble(
  sapply(all.the.models, residuals), .name_repair = ~paste0("model",1:4))
resids %>% pivot_longer(everything()) %>%
  ggplot(aes(sample=value)) + geom_qq() + geom_qq_line() + 
  facet_wrap(~name, 2, scales = 'free_y')

Calculate CV

## Leave-one-out CV via the identity e_i / (1 - h_i) = y_i - yhat_{-i}
## for least squares: no refitting required.
cv.lm = function(mdl) mean(residuals(mdl)^2 / (1-hatvalues(mdl))^2)
sapply(all.the.models, cv.lm)
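
As a sanity check (a sketch, not part of the lab): for least squares the shortcut inside cv.lm is exact, so it matches brute-force leave-one-out refitting. Illustrated for model 1:

```r
loo.brute = sapply(seq_len(n), function(i) {
  fit = lm(formulae[[1]], data = dat[-i, ])     # refit without observation i
  dat$Y[i] - predict(fit, newdata = dat[i, ])   # held-out prediction error
})
all.equal(mean(loo.brute^2), cv.lm(all.the.models[[1]]))  # TRUE (exact identity)
```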

Questions to answer

(Note: we discussed these during class, so these answers are terse.)

  1. Which of the 4 models is the correct one? Model 2: the data were generated so that $\log(Y)$ is linear in $X_1, X_2, X_3$ with Gaussian noise.

  2. What do you notice in the Q-Q plots? Which ones look ok? Why?

Models 2 and 4 look ok: their residuals are on the log scale, where the noise really is Gaussian. Taking logs of $x$ has little effect here (the columns of $X$ are centered at 5, well away from 0, so $\log(x)$ is nearly linear in $x$), so even though the qq-plots for models 2 and 4 differ slightly, it's nearly impossible to tell them apart.
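
A quick sketch to check this claim: since each column of $X$ is roughly $N(5,1)$, $\log$ is almost linear over the range of the data, so each column is almost perfectly correlated with its own log.

```r
## Each column of X is bounded well away from 0, so log() is nearly linear
## over its range; these correlations should all be very close to 1.
sapply(paste0("X", 1:3), function(v) cor(dat[[v]], log(dat[[v]])))
```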

  3. Examine the hatvalues for the 4 different models. What do you notice?

sapply(all.the.models, hatvalues) %>%
  as_tibble(.name_repair = ~paste0("model",1:4)) %>%
  select(model1, model3) %>%
  ggplot(aes(model1,model3)) + geom_point(color="blue")

They are nearly the same, except they diverge a bit for larger values. The hat values are the diagonal of $X(X' X)^{-1} X'$. So as long as $X$ doesn't change much, the hat values don't change much. As mentioned above, the log transformation makes these change slightly, but not much.
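
We can check the hat-matrix claim directly for model 1 (a quick sketch; model.matrix recovers the design matrix the model was fit with):

```r
Xmat = model.matrix(all.the.models[[1]])
Hmat = Xmat %*% solve(crossprod(Xmat)) %*% t(Xmat)  # X (X'X)^{-1} X'
all.equal(unname(diag(Hmat)), unname(hatvalues(all.the.models[[1]])))  # TRUE
```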

  4. Consider models 1 and 2. In these two cases, what is residuals(mdl) doing? Think about how the log transformation affects these two things.

For model 1, residuals(mdl) returns $y - \hat{y}$. For model 2, it returns $\log(y) - \widehat{\log(y)}$, where the fitted values are on the log scale. So in this case residuals(mdl) is $\log(y) - \log(\hat{y}) = \log(y/\hat{y})$: something like a relative error rather than an absolute one.
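
This is easy to verify directly (a quick sketch; both comparisons should come back TRUE):

```r
m1 = all.the.models[[1]]
m2 = all.the.models[[2]]
## Model 1: residuals are on the original scale of Y
all.equal(unname(residuals(m1)), unname(dat$Y - fitted(m1)))
## Model 2: residuals are on the log scale
all.equal(unname(residuals(m2)), unname(log(dat$Y) - fitted(m2)))
```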

  5. Is it reasonable to compare the CV values for models 1 and 3 with those of models 2 and 4? Why or why not?

No. They are on different scales: models 1 and 3 accumulate squared errors in $y$ (absolute errors), while models 2 and 4 accumulate squared errors in $\log(y)$, which behave more like percent errors.

  6. How should we decide which model to use? Note: This is a subtle issue without a correct answer in light of the previous question.

The CV values measure different things, so first we have to decide how we want to measure prediction error. If relatively (as in % errors), we should compare the values from cv.lm on models 2 and 4, and recalculate relative errors for models 1 and 3. If absolutely (as in the distance on the number line between predictions and observations), we should compare the values from cv.lm on models 1 and 3, and recalculate absolute errors for models 2 and 4 by back-transforming their predictions to the original scale.
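
One way to sketch the second option (absolute errors on the original $Y$ scale for all four models), reusing the same leave-one-out identity as cv.lm. Note that exp() of a log-scale prediction is a naive back-transform; it ignores the lognormal bias correction $\exp(\sigma^2/2)$, so treat this as a rough comparison rather than a definitive one.

```r
loo.resid = function(mdl) residuals(mdl) / (1 - hatvalues(mdl))
## Models 1 and 3 already predict Y, so their LOO errors are on the Y scale
abs.cv.13 = sapply(all.the.models[c(1, 3)], function(m) mean(loo.resid(m)^2))
## Models 2 and 4 predict log(Y); back-transform the LOO predictions with exp()
abs.cv.24 = sapply(all.the.models[c(2, 4)], function(m) {
  yhat.loo = exp(log(dat$Y) - loo.resid(m))  # LOO prediction on the Y scale
  mean((dat$Y - yhat.loo)^2)
})
c(abs.cv.13, abs.cv.24)
```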



dajmcdon/ubc-stat406-labs documentation built on Aug. 18, 2020, 1:23 p.m.