dajmcdon
). Include your buddy in the author field if you are working together.
generate.data = function(n, p=3){
  X = 5 + matrix(rnorm(p*n), n)  # n x p predictors, each centered at 5
  beta = c(runif(p+1, -1,1))     # intercept plus p slopes
  epsilon = rnorm(n)
  Y = exp(beta[1] + X %*% beta[-1] + epsilon) ## NOTE THIS LINE!!
  data.frame(Y,X)
}
set.seed(20200213)
n = 250
dat = generate.data(n)
formulae = lapply(
  c('Y~.',
    'log(Y)~.',
    paste0('Y ~', paste(paste0('log(X',1:3,')'),collapse='+')),
    paste0('log(Y) ~', paste(paste0('log(X',1:3,')'),collapse='+'))),
  as.formula)
all.the.models = lapply(formulae, function(x) lm(x, data=dat))
## Base R version
#par(mfrow = c(2,2))
#for(i in 1:4){
#  qqnorm(residuals(all.the.models[[i]]))
#  qqline(residuals(all.the.models[[i]]))
#}
library(tidyverse)
resids = as_tibble(
  sapply(all.the.models, residuals),
  .name_repair = ~paste0("model",1:4))
resids %>%
  pivot_longer(everything()) %>%
  ggplot(aes(sample=value)) +
  geom_qq() +
  geom_qq_line() +
  facet_wrap(~name, 2, scales = 'free_y')
cv.lm = function(mdl) mean(residuals(mdl)^2 / (1-hatvalues(mdl))^2)
sapply(all.the.models, cv.lm)
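The one-line cv.lm above relies on the leave-one-out shortcut for linear models: each residual is divided by one minus its leverage, so no refitting is needed. The sketch below checks that shortcut against a brute-force loop on a small simulated data set. The data frame `toy`, its columns, and the object names are made up for this illustration and are not part of the activity.

```r
# Simulated data; all names here are illustrative, not from the activity.
set.seed(1)
n <- 50
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 1 + 2 * toy$x1 - toy$x2 + rnorm(n)

fit <- lm(y ~ x1 + x2, data = toy)

# Shortcut: squared residuals inflated by 1 / (1 - h_ii)^2, then averaged.
loo_shortcut <- mean(residuals(fit)^2 / (1 - hatvalues(fit))^2)

# Brute force: refit n times, each time predicting the held-out row.
loo_errors <- sapply(seq_len(n), function(i) {
  fit_i <- lm(y ~ x1 + x2, data = toy[-i, ])
  toy$y[i] - predict(fit_i, newdata = toy[i, , drop = FALSE])
})
loo_brute <- mean(loo_errors^2)

# The two should agree up to floating-point error.
all.equal(loo_shortcut, loo_brute)
```

Note that both versions measure error on whatever scale the left-hand side of the formula uses, which is worth keeping in mind when reading the output of sapply(all.the.models, cv.lm).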
Which of the 4 models is the correct one?
What do you notice in the Q-Q plots? Which ones look ok? Why?
Examine the hatvalues for the 4 different models. What do you notice?
Consider models 1 and 2. In these two cases, what is residuals(mdl) doing? Think about how the log transformation affects these two things.
Is it reasonable to compare the CV values for models 1 and 3 with those of models 2 and 4? Why or why not?
How should we decide which model to use? Note: This is a subtle issue without a correct answer in light of the previous question.
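As one possible way to explore the last two questions (not a recommended answer), the sketch below computes leave-one-out error on the original scale of the response for both an untransformed and a log-response model, by back-transforming predictions before taking the difference. The helper `loo_original_scale`, the data frame `toy`, and the response column name `y` are assumptions made for this illustration. The plain exp() back-transform is a simplification that ignores retransformation bias.

```r
# Simulated data; names are illustrative, not from the activity.
set.seed(3)
n <- 60
toy <- data.frame(x = rnorm(n))
toy$y <- exp(1 + 0.5 * toy$x + rnorm(n, sd = 0.3))

# Leave-one-out mean squared error measured on the scale of `y`.
# `backtransform` maps model predictions back to that scale
# (identity for an untransformed response, exp for a log response).
# Assumes the response column in `data` is named `y`.
loo_original_scale <- function(formula, data, backtransform = identity) {
  errs <- sapply(seq_len(nrow(data)), function(i) {
    fit_i <- lm(formula, data = data[-i, , drop = FALSE])
    pred <- backtransform(predict(fit_i, newdata = data[i, , drop = FALSE]))
    data$y[i] - pred
  })
  mean(errs^2)
}

cv_raw <- loo_original_scale(y ~ x, toy)
cv_log <- loo_original_scale(log(y) ~ x, toy, backtransform = exp)
c(raw = cv_raw, log = cv_log)
```

Because both numbers are now squared errors in units of y, they can be placed side by side. Whether that is the right scale to judge the models on is exactly the judgment call the question asks about.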