dajmcdon
Include your buddy in the author field if you are working together.

```r
generate.data = function(n, p = 3) {
  X = 5 + matrix(rnorm(p * n), n)   # predictors centered at 5, away from 0
  beta = runif(p + 1, -1, 1)        # intercept and slopes
  epsilon = rnorm(n)
  Y = exp(beta[1] + X %*% beta[-1] + epsilon) ## NOTE THIS LINE!!
  data.frame(Y, X)
}
set.seed(20200213)
n = 250
dat = generate.data(n)
formulae = lapply(
  c('Y ~ .', 'log(Y) ~ .',
    paste0('Y ~ ', paste(paste0('log(X', 1:3, ')'), collapse = '+')),
    paste0('log(Y) ~ ', paste(paste0('log(X', 1:3, ')'), collapse = '+'))),
  as.formula)
all.the.models = lapply(formulae, function(x) lm(x, data = dat))
```
```r
## Base R version
# par(mfrow = c(2, 2))
# for (i in 1:4) {
#   qqnorm(residuals(all.the.models[[i]]))
#   qqline(residuals(all.the.models[[i]]))
# }
library(tidyverse)
resids = as_tibble(
  sapply(all.the.models, residuals),
  .name_repair = ~paste0("model", 1:4))
resids %>%
  pivot_longer(everything()) %>%
  ggplot(aes(sample = value)) +
  geom_qq() +
  geom_qq_line() +
  facet_wrap(~name, 2, scales = 'free_y')
```
```r
# Leave-one-out CV via the shortcut formula: no refitting required
cv.lm = function(mdl) mean(residuals(mdl)^2 / (1 - hatvalues(mdl))^2)
sapply(all.the.models, cv.lm)
```
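As a sanity check (not part of the assignment), the shortcut formula can be verified against brute-force leave-one-out refitting on a small simulated dataset; the data and names (`d`, `m`, `loo.brute`) below are made up for illustration.

```r
# Sketch: verify the LOO-CV shortcut against brute-force refitting
# (illustrative data, not the assignment's dat)
set.seed(1)
d = data.frame(x = rnorm(50))
d$y = 1 + 2 * d$x + rnorm(50)
m = lm(y ~ x, data = d)

# Shortcut: the i-th LOO squared error is (e_i / (1 - h_i))^2
cv.shortcut = mean((residuals(m) / (1 - hatvalues(m)))^2)

# Brute force: refit n times, each time holding out one observation
loo.brute = sapply(seq_len(nrow(d)), function(i) {
  fit = lm(y ~ x, data = d[-i, ])
  (d$y[i] - predict(fit, newdata = d[i, ]))^2
})
all.equal(cv.shortcut, mean(loo.brute))  # TRUE
```

The identity is exact for least squares, which is why `cv.lm` above never has to refit the model.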
(Note: we discussed these during class, so these answers are terse.)
Which of the 4 models is the correct one? Model 2.
What do you notice in the Q-Q plots? Which ones look ok? Why?
Models 2 and 4 look ok, because the true model has Gaussian noise on the log scale. Since the $X$'s are centered at 5, away from zero, taking logs of $x$ doesn't have much effect, so even though the two Q-Q plots differ, they're nearly impossible to tell apart.
```r
sapply(all.the.models, hatvalues) %>%
  as_tibble(.name_repair = ~paste0("model", 1:4)) %>%
  select(model1, model3) %>%
  ggplot(aes(model1, model3)) +
  geom_point(color = "blue")
```
They are nearly the same, diverging a bit only for the largest values. The hat values are the diagonal of $X(X' X)^{-1} X'$, so as long as $X$ doesn't change much, neither do the hat values. As noted above, the log transformation changes $X$ only slightly.
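To make the connection concrete, the diagonal of $X(X' X)^{-1} X'$ can be computed by hand and checked against `hatvalues()`; this is a sketch with made-up data, not the assignment's `dat`.

```r
# Sketch: hat values are the diagonal of X (X'X)^{-1} X'
# (illustrative data, not part of the original solution)
set.seed(2)
X = cbind(1, matrix(rnorm(30), 10, 3))   # design matrix with intercept column
H = X %*% solve(t(X) %*% X) %*% t(X)     # the hat matrix
y = rnorm(10)
m = lm(y ~ X[, -1])                      # same design matrix as X
all.equal(unname(hatvalues(m)), diag(H)) # TRUE
```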
What is `residuals(mdl)` doing? Think about how the log transformation affects these two things.

For model 1, `residuals(mdl)` $= y - \hat{y}$. For model 2, `residuals(mdl)` $= \log(y) - \hat{y}$, where $\hat{y}$ is on the log scale. So really, in this case, `residuals(mdl)` $= \log(y) - \log(\hat{y}) = \log(y/\hat{y})$.
No. They are on different scales. One is measuring absolute errors, while the other is more like percent errors.
The CV scores are measuring different things, so it is important to decide how we want to measure errors. If the answer is relatively (as in % errors), we should compare the values of `cv.lm` on models 2 and 4, and recalculate relative errors for models 1 and 3. If the answer is absolutely (as in the distance on the number line between predictions and observations), we should compare the values of `cv.lm` on models 1 and 3 with recalculated values for models 2 and 4.
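One way to put all the models on a common relative-error footing is sketched below; `rel.cv` is a made-up helper, not part of the original solution. It uses the LOO identity $\hat{y}_{(-i)} = \hat{y}_i - h_i e_i/(1-h_i)$ to get leave-one-out predictions without refitting, and exponentiates predictions from the log-scale fits back to the original scale.

```r
# Sketch: mean squared *relative* LOO error for any of the fits.
# rel.cv is a made-up helper; set log.scale = TRUE for the log(Y) models.
rel.cv = function(mdl, y, log.scale = FALSE) {
  h = hatvalues(mdl)
  loo.pred = fitted(mdl) - residuals(mdl) * h / (1 - h)  # LOO predictions
  if (log.scale) loo.pred = exp(loo.pred)  # back-transform log-scale fits
  mean(((y - loo.pred) / y)^2)             # mean squared relative error
}

# tiny illustration with data shaped like the assignment's
set.seed(3)
x = 5 + rnorm(100)
y = exp(1 + 0.5 * x + rnorm(100))
m.raw = lm(y ~ x)        # analogous to models 1 and 3
m.log = lm(log(y) ~ x)   # analogous to models 2 and 4
c(raw = rel.cv(m.raw, y), log = rel.cv(m.log, y, log.scale = TRUE))
```

For this problem one would call `rel.cv` on each element of `all.the.models` with `y = dat$Y`, passing `log.scale = TRUE` for models 2 and 4.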