In yw0817/BIOS625_HW4: Linear Regression Model and Stepwise Feature Selection

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

install.packages("bench", repos="http://cran.us.r-project.org")
install.packages("MASS", repos="http://cran.us.r-project.org")
library(LinReg.Select)

LinearReg

When fitting a linear regression model, it will also automatically generate a summary for the result of the linear regression:

my.lm = LinearReg(mpg ~ cyl + disp + hp, data = mtcars)

Some features, like residuals, fitted values, R square, adjusted R square, etc., can also be retrieved from the function:

plot(my.lm$fitted.values, my.lm$residuals)
my.lm$`Multiple R-squared`
my.lm$`Adjusted R-squared`

The function can also take interaction terms or terms with power:

my.lm2 = LinearReg(mpg ~ cyl + disp + hp + cyl*disp + I(hp^2), data = mtcars)

To fit linear regression model without intercept:

my.lm3 = LinearReg(mpg ~ -1 + cyl + disp + hp + cyl*disp, data = mtcars)

To check the result of my linear regression function with the reference function lm:

ref = lm(mpg ~ cyl + disp + hp, data = mtcars)
my.result = LinearReg(mpg ~ cyl + disp + hp, data = mtcars)

all.equal(ref$coefficients, my.result$coefficients)
all.equal(ref$fitted.values, my.result$fitted.values)
all.equal(ref$residuals, my.result$residuals)
all.equal(summary(ref)$r.squared, my.result$`Multiple R-squared`)
all.equal(summary(ref)$adj.r.squared, my.result$`Adjusted R-squared`)

The results of coefficients, fitted values, residuals, multiple R square, and adjusted R square are all the same.

To compare the efficiency of my linear regression function with reference function lm:

result1 = bench::mark(summary(lm(mpg ~ cyl + disp + hp, data = mtcars))$r.squared, LinearReg(mpg ~ cyl + disp + hp, data = mtcars)$`Multiple R-squared`)
print(result1)

result2 = bench::mark(summary(lm(mpg ~ cyl + disp + hp, data = mtcars))$adj.r.squared, LinearReg(mpg ~ cyl + disp + hp, data = mtcars)$`Adjusted R-squared`)
print(result2)

They are both pretty fast but the reference function lm is faster than my linear regression function.

StepSelect

Backward Selection

To conduct backward selection:

full_lm = lm(mpg ~ ., data = mtcars)
StepSelect(full_lm, direction = "backward")

k is the penalty used in AIC, the default is 2 but it can be defined by users:

n = dim(mtcars)[1]
StepSelect(full_lm, direction = "backward", k = log(n))

Users can also trace the whole process by turning on the trace option:

StepSelect(full_lm, direction = "backward", k = log(n), trace = TRUE)

To check the result of my backward selection function with the reference function stepAIC:

ref = MASS::stepAIC(full_lm, direction = "backward", k = log(n), trace = FALSE)
my = StepSelect(full_lm, direction = "backward", k = log(n), trace = FALSE)
all.equal(ref$coefficients, my$coefficients)
all.equal(ref$call$formula, my$call$formula)

Comparing the final model and model coefficients, my backward selection function is the same as the reference function in R, stepAIC.

To compare the efficiency of my backward selection function with stepAIC function:

result = bench::mark(MASS::stepAIC(full_lm, direction = "backward", k = log(n), trace = FALSE)$coefficients, StepSelect(full_lm, direction = "backward", k = log(n), trace = FALSE)$coefficients, )
print(result)

The reference function is only 20ms faster than my backward selection function.

Forward Selection

To conduct forward selection:

StepSelect(full_lm, direction = "forward")

Similarly, users can also define k and trace:

StepSelect(full_lm, direction = "forward", k = log(n), trace = TRUE)

To check the result of my backward selection function with the reference function stepAIC:

m = lm(mpg ~ 1, data = mtcars)
ref2 = MASS::stepAIC(m, direction="forward", k = log(n), scope=list(lower=m, upper=full_lm), trace = FALSE)
my2 = StepSelect(full_lm, direction = "forward", k = log(n), trace = FALSE)
all.equal(ref2$coefficients, my2$coefficients)
all.equal(ref2$call$formula, my2$call$formula)

Comparing the coefficients and final model from forward selection, my function is the same as the reference function stepAIC.

To compare the efficiency of my forward selection function with stepAIC function:

result = bench::mark(MASS::stepAIC(m, direction="forward", k = log(n), scope=list(lower=m, upper=full_lm), trace = FALSE)$coefficients, my2 = StepSelect(full_lm, direction = "forward", k = log(n), trace = FALSE)$coefficients)
print(result)