```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
```r
library(linreg)
library(caret)
library(mlbench)
```
The updated version of the `linreg` package provides a new function called `ridgereg()` that performs ridge regression. We are delighted to present `ridgereg()` in this vignette, and we will demonstrate its functionality with an example: building a predictive model for the `BostonHousing` data from the `mlbench` package. For the model training process we will use the `caret` package. Both packages are loaded in our workspace above.

Now let's get started!
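As a quick orientation, here is a minimal sketch of a direct `ridgereg()` call, matching the `formula`/`data`/`lambda` interface used later in this vignette (the predictors and the `lambda` value below are arbitrary and chosen only for illustration):

```r
# Minimal ridgereg() call; lambda = 1 is an arbitrary illustrative value
data("BostonHousing", package = "mlbench")
fit <- ridgereg(formula = medv ~ crim + rm + lstat,
                data    = BostonHousing,
                lambda  = 1)
```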
First, the `BostonHousing` data is divided into a training set and a test set using the `caret` package, with $80\%$ of the observations used for training and the remaining $20\%$ for testing. The response variable is `medv` (the median value of owner-occupied homes, in USD 1000's).
data("BostonHousing") colnames(BostonHousing) <- make.names(colnames(BostonHousing)) set.seed(123) trainIndex <- createDataPartition(BostonHousing$medv, p = .8, list = FALSE, times = 1) train_data <- BostonHousing[ trainIndex,] colnames(train_data) <- make.names(colnames(train_data)) test_data <- BostonHousing[-trainIndex,] print(length(test_data[,1]))
After dividing the data into a training and a test set, we fit two linear regression models to the `BostonHousing` training data using the `caret` package: one with all continuous variables as predictors, and one that keeps only the most useful predictors via forward selection. The `leaps` package is required to perform forward selection.
```r
library(leaps)

# Linear regression with all predictors (plus an rm:lstat interaction)
lin <- train(
  medv ~ crim + zn + age + indus + chas + nox + rm + dis + rad +
    tax + ptratio + b + lstat + rm:lstat,
  data = train_data,
  method = "lm"
)

# Linear regression with forward selection
lin_forward <- train(
  medv ~ crim + zn + age + indus + chas + nox + rm + dis + rad +
    tax + ptratio + b + lstat + rm:lstat,
  data = train_data,
  method = "leapForward",
  tuneGrid = data.frame(nvmax = 1:(ncol(train_data) - 1))
)
```
We then evaluate the performance of the two linear regression models on the training data. Note that they obtain similar values for the root mean squared error (RMSE), $R^2$, and the mean absolute error (MAE).
```r
lin_pred <- predict(lin, train_data)
postResample(pred = lin_pred, obs = train_data$medv)
```
```r
lin_forward_pred <- predict(lin_forward, train_data)
postResample(pred = lin_forward_pred, obs = train_data$medv)
```
Next, we fit a ridge regression model with the `ridgereg()` function for different values of $\lambda$. Repeated 10-fold cross-validation on the training set is used to find the best value for the hyperparameter $\lambda$.
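For intuition, ridge regression adds a penalty term $\lambda$ to the normal equations of ordinary least squares. Assuming `ridgereg()` follows the standard formulation (the exact scaling and normalization conventions depend on the `linreg` implementation), the coefficient estimates are

$$\hat{\beta}_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y,$$

so $\lambda = 0$ reproduces ordinary least squares, while larger values of $\lambda$ shrink the coefficients toward zero.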
```r
# Wrap ridgereg() as a custom model so it can be tuned with train()
modelInfo <- list(
  label      = "Ridge Regression",
  library    = "linreg",
  type       = "Regression",
  parameters = data.frame(parameter = "lambda",
                          class     = "numeric",
                          label     = "Hyperparameter"),
  grid = function(x, y, len = NULL, search = "grid") {
    # Candidate lambda values to evaluate during cross-validation
    expand.grid(lambda = seq(0, 3, length = 10))
  },
  fit = function(x, y, wts, param, lev, last, classProbs, ...) {
    # ridgereg() requires a data frame containing predictors and response
    dat <- if (is.data.frame(x)) x else as.data.frame(x)
    dat$medv <- y
    ridgereg(
      formula = medv ~ crim + zn + age + indus + chas + nox + rm +
        dis + rad + tax + ptratio + b + lstat,
      data   = dat,
      lambda = param$lambda
    )
  },
  predict = function(modelFit, newdata, submodels = NULL) {
    newdata <- as.data.frame(newdata)
    predict(modelFit, newdata)
  },
  loop   = NULL,
  prob   = NULL,
  levels = NULL
)

set.seed(123)
rr <- train(
  x = train_data[, names(train_data) != "medv"],
  y = train_data$medv,
  method = modelInfo,
  trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10)
)
rr
```
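Finally, with the tuned model in hand, we can check its performance on the held-out test set. This is a minimal sketch, assuming that `predict()` on the returned `train` object dispatches to our custom `predict` function as usual in `caret`:

```r
# Evaluate the cross-validated ridge model on the 20% hold-out set
rr_pred <- predict(rr, newdata = test_data)
postResample(pred = rr_pred, obs = test_data$medv)
```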