knitr::opts_chunk$set(echo = TRUE, fig.align = "center",fig.width = 6)
On page 627 of MS we have the account of the cost of a road construction awarded using the sealed bid system.
We will first investigate the predictor variables.
We will remove the STATUS
variable since it is qualitative and we wish to assess quantitative correlation using cor
.
library(Intro2MLR) library(dplyr) data(flag) nms <- names(flag) cflag <- cor(flag) cflag
This is helpful in many ways -- we note that some of the quantitative variables are highly correlated and that the response COST
is correlated to a number of predictors.
as.data.frame(cflag) %>% filter(COST>0.7 & COST < 1)
Even better we can make a correlation plot
library(corrplot) corrplot(cflag)
We will remove DAYSEST
and retain DOTEST
. We need to investigate the relationship of the response to the qualitative variable STATUS
library(ggplot2) flag <- flag %>% mutate(STATUS <- factor(STATUS)) g <- ggplot(flag, aes(x = STATUS, y = COST)) + geom_point(aes(col = COST)) g
The above plot shows that the mean cost changes as the levels (0 -> 1) change. We will therefore include the STATUS
qualitative variable.
We will trial the following model
$$ COST \sim DOTEST + STATUS + STATUS:DOTEST $$
ylm <- lm(COST ~ DOTEST + STATUS + STATUS:DOTEST, data = flag) summary(ylm)
We will need to investigate the validity of the model before progressing - the model as it is shows adequacy.
Check on normality $H_0: \epsilon \sim N$
library(s20x) normcheck(ylm, shapiro.wilk = TRUE)
The residuals give evidence against normality.
plot(ylm, which = 1)
This is sad! Notice the fan!!!
We will try transforming the response.
ylm1 <- lm(log(COST) ~ DOTEST + STATUS + STATUS:DOTEST, data = flag) plot(ylm1, which = 1)
This worked on the fan but now we have a trend in the residuals.
We will try transforming the predictor
ylm2 <- lm(log(COST) ~ log(DOTEST) + STATUS + STATUS:I(log(DOTEST)), data = flag) plot(ylm2, which = 1)
As can be seen by the plot of residuals versus fitted values there is no trend and points are evenly scattered about the horizontal axis.
Cool!!
We may find that the model needs to be simplified now.
summary(ylm2)
Sure enough we can remove the interaction term
ylm3 <- lm(log(COST) ~ log(DOTEST) + STATUS , data = flag)
Now we will recheck normality
normcheck(ylm3, shapiro.wilk = TRUE)
The QQ plot and Normal plot alongside are better. We still have evidence against the Normal assumption but the model has improved.
Because we have transformed the response we will need to back transform our predictions to get answers on the original scale.
summary(ylm3)
From the output we can write:
$$\widehat{log(COST)} = -0.147 + 1.01 log(DOTEST) + 0.217 STATUS$$
Suppose we wish to predict the road construction cost when the DOT estimate is \$370,000 this means DOTEST=370
and contract is fixed STATUS=1
Then you can use the above equation or use predict
ans = predict(ylm3, data.frame(DOTEST=370, STATUS = 1), interval = "none") ans
This is on the log scale so on the original scale we will have
exp(ans)
So the cost for a fixed road contract with DOT estimate of \$370,000 will be \$r exp(ans)
thousand dollars.
This is not that satisfying statistically however, since a point estimate without and interval is not even half the story.
The prediction interval for a new response is given below with the transformed 95\% interval.
ans1 = predict(ylm3, data.frame(DOTEST=370, STATUS = 1), interval = "prediction") ans1 exp(ans1)
You should understand the subdivision of the TSS in terms of component RSS differences in sub models.
anova(ylm3) r<-residuals(ylm3) sum(r^2)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.