Introduction

This is Lab 9 on Data Mining. It is a simple introduction to the practicalities of linear regression.

First we load the datamining package and the data for the lab.

library(datamining)
load('lab9.rda')

The data set

The data set is 196 weeks of grocery sales for a store. The variables are:

| Variable | Description | |----------|------------------------------------------------| | Price.1 | DOLE PINEAPPLE ORANG 64 OZ | | Price.2 | FIVE ALIVE CTRUS BEV 64 OZ | | Price.3 | HH FRUIT PUNCH 64 OZ | | Price.4 | HH ORANGE JUICE 64 OZ | | Price.5 | MIN MAID O J CALCIUM 64 OZ | | Price.6 | MIN MAID O J PLASTIC 96 OZ | | Price.7 | MM PULP FREE OJ 64 OZ | | Price.8 | SUNNY DELIGHT FLA CI 64 OZ | | Price.9 | TREE FRESH O J REG 64 OZ | | Price.10 | TROP PURE PRM HOMEST 64 OZ | | Price.l1 | TROP SB HOMESTYLE OJ 64 OZ | | Sold.4 | Number of units sold for HH ORANGE JUICE 64 OZ |

It is stored in a data.frame called x. We look at the first few rows of the data.frame using the head function.

head(x)

Standardising

First we make a copy of x and put the response in the first column of the data.frame. Then we transform the variable Sold.4 so that its distribution is more evenly distributed. Next we standardise all variables to have zero mean and unit variance using the scale function.

xx <- x[, c(ncol(x), 1:ncol(x)-1)]
hist(xx$Sold.4)
xx$Sold.4 <- log(xx$Sold.4)
xx <- as.data.frame(scale(xx))
hist(xx$Sold.4)

Adding predictors

Let us construct a linear model to predict Sold.4 as a function of Price.

fit <- lm(Sold.4 ~ ., data=xx)
summary(fit)

Then we plot all the predictors versus the residuals of this model.

predict_plot(fit, xx)

From the p-values, the important predictors appear to be Price.4, Price.5, Price.7 and Price.11. The plots of residuals confirm that there are no anomalies in the fit.

We pick one of the important predictors from the last step and include it in a new model.

fit_4 <- lm(Sold.4 ~ Price.4, data=xx)
summary(fit_4)
predict_plot(fit_4, xx)

The same predictors as before seem important with Price.11 the most important.

Next we add Price.11 to the model.

fit_4_11 <- lm(Sold.4 ~ Price.4 + Price.11, data=xx)
summary(fit_4_11)
predict_plot(fit_4_11, xx)

We keep adding predictors until no more predictors seem useful.

fit_4_11_5_7 <- lm(Sold.4 ~ Price.4 + Price.11 + Price.5 + Price.7, data=xx)
summary(fit_4_11_5_7)

Automatic selection

Starting again from a model with only Price.4, we use step to add predictors.

fit_step <- step(fit_4, formula(xx))

The automatically generated model is the same as that generated manually.

Finally we make a partial residual plot for the model from step.

predict_plot(fit_step, partial=TRUE)

Residual and partial residual plots are also available in the function termplot in the stats package.

library(stats)
par(mfrow = c(2, 2))
termplot(fit_step, partial.resid = TRUE, ask = FALSE)
par(mfrow = c(1, 1))

Note that the partial residuals for Sold.4 are shown using predict_plot whereas termplot uses the partial residuals for each predictor on the vertical axis.



paulemms/datamining documentation built on March 1, 2023, 4:01 p.m.