This is Lab 9 on Data Mining. It is a simple introduction to the practicalities of linear regression.
First we load the `datamining` package and the data for the lab.

```r
library(datamining)
load('lab9.rda')
```
The data set is 196 weeks of grocery sales for a store. The variables are:
| Variable | Description                                    |
|----------|------------------------------------------------|
| Price.1  | DOLE PINEAPPLE ORANG 64 OZ                     |
| Price.2  | FIVE ALIVE CTRUS BEV 64 OZ                     |
| Price.3  | HH FRUIT PUNCH 64 OZ                           |
| Price.4  | HH ORANGE JUICE 64 OZ                          |
| Price.5  | MIN MAID O J CALCIUM 64 OZ                     |
| Price.6  | MIN MAID O J PLASTIC 96 OZ                     |
| Price.7  | MM PULP FREE OJ 64 OZ                          |
| Price.8  | SUNNY DELIGHT FLA CI 64 OZ                     |
| Price.9  | TREE FRESH O J REG 64 OZ                       |
| Price.10 | TROP PURE PRM HOMEST 64 OZ                     |
| Price.11 | TROP SB HOMESTYLE OJ 64 OZ                     |
| Sold.4   | Number of units sold for HH ORANGE JUICE 64 OZ |
It is stored in a data frame called `x`. We look at the first few rows of the data frame using the `head` function.
```r
head(x)
```
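As a quick sanity check (an addition to the lab code, assuming `x` has been loaded as above), we might also confirm the dimensions and look at the raw distribution of the response:

```r
# Sketch: basic checks on the data frame
dim(x)            # expect 196 rows (weeks) and 12 columns
summary(x$Sold.4) # distribution of weekly units sold
any(is.na(x))     # check for missing values
```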
First we make a copy of `x` with the response moved to the first column of the data frame. Then we log-transform `Sold.4` so that its distribution is less skewed. Finally we standardise all variables to have zero mean and unit variance using the `scale` function.
```r
xx <- x[, c(ncol(x), 1:(ncol(x) - 1))]  # move the response to the first column
hist(xx$Sold.4)
xx$Sold.4 <- log(xx$Sold.4)             # log-transform the response
xx <- as.data.frame(scale(xx))          # standardise every column
hist(xx$Sold.4)
```
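As a small supplementary check (not part of the original lab, assuming `xx` was built as above), we can verify that every column now has mean roughly zero and standard deviation one:

```r
# Sketch: confirm the standardisation worked
round(colMeans(xx), 6)       # all means should be numerically zero
round(apply(xx, 2, sd), 6)   # all standard deviations should be one
```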
Let us construct a linear model to predict `Sold.4` as a function of all the price variables.

```r
fit <- lm(Sold.4 ~ ., data = xx)
summary(fit)
```
Then we plot each predictor against the residuals of this model.

```r
predict_plot(fit, xx)
```
From the p-values, the important predictors appear to be `Price.4`, `Price.5`, `Price.7` and `Price.11`. The plots of residuals confirm that there are no anomalies in the fit.
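To read the p-values off programmatically rather than from the printed summary (a small addition, assuming the full model `fit` from above), one could do something like:

```r
# Sketch: extract the coefficient p-values and sort them
pvals <- summary(fit)$coefficients[-1, "Pr(>|t|)"]  # drop the intercept row
sort(pvals)
```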
We pick one of the important predictors from the last step and include it in a new model.
```r
fit_4 <- lm(Sold.4 ~ Price.4, data = xx)
summary(fit_4)
```

```r
predict_plot(fit_4, xx)
```
The same predictors as before seem important, with `Price.11` the most important.
Next we add `Price.11` to the model.

```r
fit_4_11 <- lm(Sold.4 ~ Price.4 + Price.11, data = xx)
summary(fit_4_11)
```

```r
predict_plot(fit_4_11, xx)
```
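An equivalent way to extend an existing fit (an aside, not part of the original lab) is `update`, which refits the model with a modified formula; this assumes `fit_4` from above.

```r
# Sketch: add a term to an existing model with update()
fit_4_11_alt <- update(fit_4, . ~ . + Price.11)
coef(fit_4_11_alt)  # same coefficients as fit_4_11
```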
We keep adding predictors until no more predictors seem useful.

```r
fit_4_11_5_7 <- lm(Sold.4 ~ Price.4 + Price.11 + Price.5 + Price.7, data = xx)
summary(fit_4_11_5_7)
```
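To check that each added term gives a worthwhile reduction in the residual sum of squares (a supplementary check, assuming the nested models fitted above), we could compare them with `anova`:

```r
# Sketch: compare the nested models with F-tests
anova(fit_4, fit_4_11, fit_4_11_5_7)
```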
Starting again from the model with only `Price.4`, we use `step` to add predictors automatically.

```r
fit_step <- step(fit_4, scope = formula(xx))
```
The automatically selected model is the same as the one we built by hand.
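To see what `step` actually did (an extra inspection step, assuming `fit_step` from above), the selected formula and the trace of AIC values are stored in the returned object:

```r
# Sketch: inspect the result of the stepwise search
formula(fit_step)  # the selected model
fit_step$anova     # the sequence of steps taken and their AIC values
```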
Finally we make a partial residual plot for the model chosen by `step`.

```r
predict_plot(fit_step, partial=TRUE)
```
Residual and partial residual plots are also available via the `termplot` function in the `stats` package.

```r
library(stats)  # stats is attached by default; shown here for completeness
par(mfrow = c(2, 2))
termplot(fit_step, partial.resid = TRUE, ask = FALSE)
par(mfrow = c(1, 1))
```
Note that `predict_plot` shows the partial residuals for the response `Sold.4`, whereas `termplot` puts the partial residuals for each predictor's term on the vertical axis.
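For reference (an addition to the lab, assuming the fitted model `fit_step`), the per-term partial residuals that `termplot` plots can also be obtained directly from the fitted model:

```r
# Sketch: partial residuals for each term, as used by termplot
pres <- residuals(fit_step, type = "partial")
head(pres)  # one column per term in the model
```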