Instructions

  1. Rename this document with your student ID (not the 10-digit number; your IU username, e.g. dajmcdon). Include your buddy in the author field if you are working together.
  2. I have given you code to generate data and fit 4 different models to the data. You should run through the code line by line in the console.
  3. Discuss the questions with your neighbors. Write down answers.

Generate data and fit models

generate.data = function(n, p = 3){
  X = 5 + matrix(rnorm(p * n), n)  # n x p design matrix, centered at 5
  beta = runif(p + 1, -1, 1)       # intercept plus p slopes
  epsilon = rnorm(n)
  Y = exp(beta[1] + X %*% beta[-1] + epsilon) ## NOTE THIS LINE!!
  data.frame(Y, X)
}
set.seed(20200213)
n = 250
dat = generate.data(n)
formulae = lapply(
  c('Y~.', 
    'log(Y)~.',
    paste0('Y ~', paste(paste0('log(X',1:3,')'),collapse='+')),
    paste0('log(Y) ~', paste(paste0('log(X',1:3,')'),collapse='+'))), 
  as.formula)
all.the.models = lapply(formulae, function(x) lm(x, data=dat))

Make QQ plots

## Base R version
#par(mfrow = c(2,2))
#for(i in 1:4){
#  qqnorm(residuals(all.the.models[[i]]))
#  qqline(residuals(all.the.models[[i]]))
#}
library(tidyverse)
resids = as_tibble(
  sapply(all.the.models, residuals), .name_repair = ~paste0("model",1:4))
resids %>% pivot_longer(everything()) %>%
  ggplot(aes(sample=value)) + geom_qq() + geom_qq_line() + 
  facet_wrap(~name, 2, scales = 'free_y')

Calculate CV

## Leave-one-out CV via the identity e_i / (1 - h_i) = y_i - yhat_{-i}
## for least squares: no refitting required.
cv.lm = function(mdl) mean(residuals(mdl)^2 / (1-hatvalues(mdl))^2)
sapply(all.the.models, cv.lm)
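
As a sanity check (a sketch, not part of the lab): for least squares the shortcut inside cv.lm is exact, so it matches brute-force leave-one-out refitting. Illustrated for model 1:

```r
loo.brute = sapply(seq_len(n), function(i) {
  fit = lm(formulae[[1]], data = dat[-i, ])     # refit without observation i
  dat$Y[i] - predict(fit, newdata = dat[i, ])   # held-out prediction error
})
all.equal(mean(loo.brute^2), cv.lm(all.the.models[[1]]))  # TRUE (exact identity)
```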

Questions to answer

(Note: we discussed these during class, so these answers are terse.)

  1. Which of the 4 models is the correct one? Model 2: the data were generated so that $\log(Y)$ is linear in $X_1, X_2, X_3$ with Gaussian noise.

  2. What do you notice in the Q-Q plots? Which ones look ok? Why?

Models 2 and 4 look ok: their residuals are on the log scale, where the noise really is Gaussian. Taking logs of $x$ has little effect here (the columns of $X$ are centered at 5, well away from 0, so $\log(x)$ is nearly linear in $x$), so even though the qq-plots for models 2 and 4 differ slightly, it's nearly impossible to tell them apart.
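
A quick sketch to check this claim: since each column of $X$ is roughly $N(5,1)$, $\log$ is almost linear over the range of the data, so each column is almost perfectly correlated with its own log.

```r
## Each column of X is bounded well away from 0, so log() is nearly linear
## over its range; these correlations should all be very close to 1.
sapply(paste0("X", 1:3), function(v) cor(dat[[v]], log(dat[[v]])))
```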

  3. Examine the hatvalues for the 4 different models. What do you notice?

sapply(all.the.models, hatvalues) %>%
  as_tibble(.name_repair = ~paste0("model",1:4)) %>%
  select(model1, model3) %>%
  ggplot(aes(model1,model3)) + geom_point(color="blue")

They are nearly the same, except they diverge a bit for larger values. The hat values are the diagonal of $X(X' X)^{-1} X'$. So as long as $X$ doesn't change much, the hat values don't change much. As mentioned above, the log transformation makes these change slightly, but not much.
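
We can check the hat-matrix claim directly for model 1 (a quick sketch; model.matrix recovers the design matrix the model was fit with):

```r
Xmat = model.matrix(all.the.models[[1]])
Hmat = Xmat %*% solve(crossprod(Xmat)) %*% t(Xmat)  # X (X'X)^{-1} X'
all.equal(unname(diag(Hmat)), unname(hatvalues(all.the.models[[1]])))  # TRUE
```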

  4. Consider models 1 and 2. In these two cases, what is residuals(mdl) doing? Think about how the log transformation affects these two things.

For model 1, residuals(mdl) returns $y - \hat{y}$. For model 2, it returns $\log(y) - \widehat{\log(y)}$, where the fitted values are on the log scale. So in this case residuals(mdl) is $\log(y) - \log(\hat{y}) = \log(y/\hat{y})$: something like a relative error rather than an absolute one.
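
This is easy to verify directly (a quick sketch; both comparisons should come back TRUE):

```r
m1 = all.the.models[[1]]
m2 = all.the.models[[2]]
## Model 1: residuals are on the original scale of Y
all.equal(unname(residuals(m1)), unname(dat$Y - fitted(m1)))
## Model 2: residuals are on the log scale
all.equal(unname(residuals(m2)), unname(log(dat$Y) - fitted(m2)))
```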

  5. Is it reasonable to compare the CV values for models 1 and 3 with those of models 2 and 4? Why or why not?

No. They are on different scales: models 1 and 3 accumulate squared errors in $y$ (absolute errors), while models 2 and 4 accumulate squared errors in $\log(y)$, which behave more like percent errors.

  6. How should we decide which model to use? Note: This is a subtle issue without a correct answer in light of the previous question.

The CV values measure different things, so first we have to decide how we want to measure prediction error. If relatively (as in % errors), we should compare the values from cv.lm on models 2 and 4, and recalculate relative errors for models 1 and 3. If absolutely (as in the distance on the number line between predictions and observations), we should compare the values from cv.lm on models 1 and 3, and recalculate absolute errors for models 2 and 4 by back-transforming their predictions to the original scale.
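
One way to sketch the second option (absolute errors on the original $Y$ scale for all four models), reusing the same leave-one-out identity as cv.lm. Note that exp() of a log-scale prediction is a naive back-transform; it ignores the lognormal bias correction $\exp(\sigma^2/2)$, so treat this as a rough comparison rather than a definitive one.

```r
loo.resid = function(mdl) residuals(mdl) / (1 - hatvalues(mdl))
## Models 1 and 3 already predict Y, so their LOO errors are on the Y scale
abs.cv.13 = sapply(all.the.models[c(1, 3)], function(m) mean(loo.resid(m)^2))
## Models 2 and 4 predict log(Y); back-transform the LOO predictions with exp()
abs.cv.24 = sapply(all.the.models[c(2, 4)], function(m) {
  yhat.loo = exp(log(dat$Y) - loo.resid(m))  # LOO prediction on the Y scale
  mean((dat$Y - yhat.loo)^2)
})
c(abs.cv.13, abs.cv.24)
```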



dajmcdon/ubc-stat406-labs documentation built on Aug. 18, 2020, 1:23 p.m.