Variable Selection Methods"

library(olsrr)
library(ggplot2)
library(gridExtra)
library(nortest)
library(goftest)

Introduction

Variable selection refers to the process of choosing the most relevant variables to include in a regression model. They help to improve model performance and avoid over fitting.

Before we explore stepwise selection methods, let us take a quick look at all/best subset regression. As they evaluate every possible variable combination, these methods are computationally intensive and may crash your system if used with a large set of variables. We have included them in the package purely for educational purpose.

All Possible Regression

All subset regression tests all possible subsets of the set of potential independent variables. If there are K potential independent variables (besides the constant), then there are $2^{k}$ distinct subsets of them to be tested. For example, if you have 10 candidate independent variables, the number of subsets to be tested is $2^{10}$, which is 1024, and if you have 20 candidate variables, the number is $2^{20}$, which is more than one million.

model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
ols_step_all_possible(model)

Best Subset Regression

Select the subset of predictors that do the best at meeting some well-defined objective criterion, such as having the largest R2 value or the smallest MSE, Mallow's Cp or AIC.

model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
ols_step_best_subset(model)

Stepwise Selection

Stepwise regression is a method of fitting regression models that involves the iterative selection of independent variables to use in a model. It can be achieved through forward selection, backward elimination, or a combination of both methods. The forward selection approach starts with no variables and adds each new variable incrementally, testing for statistical significance, while the backward elimination method begins with a full model and then removes the least statistically significant variables one at a time.

Model

We will use the below model throughout this article except in the case of hierarchical selection. You can learn more about the data here.

model <- lm(y ~ ., data = surgical)
summary(model)

Model specification

Irrespective of the stepwise method being used, we have to specify the full model i.e. all the variabels/predictors under consideration as olsrr extracts the candidate variables for selection/elimination from the model specified.

Forward selection
# stepwise forward regression
ols_step_forward_p(model)
Backward elimination
# stepwise backward regression
ols_step_backward_p(model)

Criteria

The criteria for selecting variables may be one of the following:

Include/exclude variables

We can force variables to be included or excluded from the model at all stages of variable selection. The variables may be specified either by name or position in the model specified.

By name
ols_step_forward_p(model, include = c("age", "alc_mod"))
By index
ols_step_forward_p(model, include = c(5, 7))

Standardized output

All stepwise selection methods display standard output which includes:

# adjusted r-square 
ols_step_forward_adj_r2(model)

Visualization

Use the plot() method to visualize variable selection. It will display how the variable selection criteria changes at each step of the selection process along with the variable selected.

# adjusted r-square 
k <- ols_step_forward_adj_r2(model)
plot(k)

Verbose output

To view the detailed regression output at each stage of variable selection/elimination, set details to TRUE. It will display the following information at each step:

# adjusted r-square 
ols_step_forward_adj_r2(model, details = TRUE)

Progress

To view the progress in the variable selection procedure, set progress to TRUE. It will display the variable being selected/eliminated at each step until there are no more candidate variables left.

# adjusted r-square 
ols_step_forward_adj_r2(model, progress = TRUE)

Hierarchical selection

When using p values as the criterion for selecting/eliminating variables, we can enable hierarchical selection. In this method, the search for the most significant variable is restricted to the next available variable. In the below example, as liver_test does not meet the threshold for selection, none of the variables after liver_test are considered for further selection i.e. the stepwise selection ends as soon as it comes across a variable that does not meet the selection threshold. You can learn more about hierachichal selection here.

# hierarchical selection
m <- lm(y ~ bcs + alc_heavy + pindex + enzyme_test + liver_test + age + gender + alc_mod, data = surgical)
ols_step_forward_p(m, 0.1, hierarchical = TRUE)


Try the olsrr package in your browser

Any scripts or data that you put into this service are public.

olsrr documentation built on May 29, 2024, 12:35 p.m.