library(olsrr)
library(ggplot2)
library(gridExtra)
library(nortest)
library(goftest)
Variable selection refers to the process of choosing the most relevant variables to include in a regression model. It helps to improve model performance and avoid overfitting.
Before we explore stepwise selection methods, let us take a quick look at all/best subset regression. As they evaluate every possible variable combination, these methods are computationally intensive and may crash your system if used with a large set of variables. We have included them in the package purely for educational purposes.
All subset regression tests all possible subsets of the set of potential independent variables. If there are $k$ potential independent variables (besides the constant), then there are $2^{k}$ distinct subsets of them to be tested. For example, if you have 10 candidate independent variables, the number of subsets to be tested is $2^{10}$, which is 1024, and if you have 20 candidate variables, the number is $2^{20}$, which is more than one million.
model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
ols_step_all_possible(model)
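To make the subset count concrete, here is a minimal base R sketch that enumerates the subsets by hand for the same four predictors. It is an illustration only and not how olsrr is implemented; the names `predictors`, `subsets` and `results` are chosen for this example. The $2^{k}$ count includes the intercept-only model, so the sketch fits the $2^{4} - 1 = 15$ non-empty subsets.

```r
# Candidate predictors from the model above
predictors <- c("disp", "hp", "wt", "qsec")

# All non-empty subsets: every combination of size 1, 2, 3 and 4
subsets <- unlist(
  lapply(seq_along(predictors), function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit one linear model per subset
fits <- lapply(subsets, function(vars) {
  lm(reformulate(vars, response = "mpg"), data = mtcars)
})

# Collect a simple comparison table
results <- data.frame(
  predictors = sapply(subsets, paste, collapse = " + "),
  n          = lengths(subsets),
  adj_r2     = sapply(fits, function(f) summary(f)$adj.r.squared),
  stringsAsFactors = FALSE
)

length(subsets)                    # number of models fitted
results[order(-results$adj_r2), ]  # models ranked by adjusted R-squared
```

With 20 predictors the same loop would fit more than one million models, which is why this approach does not scale.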
Best subset regression selects the subset of predictors that does best at meeting some well-defined objective criterion, such as having the largest $R^2$ value or the smallest MSE, Mallows' $C_p$ or AIC.
model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
ols_step_best_subset(model)
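The idea can be sketched in base R as follows, assuming $R^2$ as the objective criterion and picking the best model of each size. This is a hand-rolled illustration rather than olsrr's implementation, and `best_by_size` is a name made up for this example.

```r
predictors <- c("disp", "hp", "wt", "qsec")

# For each model size k, fit every combination of k predictors
# and keep the one with the largest R-squared
best_by_size <- do.call(rbind, lapply(seq_along(predictors), function(k) {
  combos <- combn(predictors, k, simplify = FALSE)
  r2 <- sapply(combos, function(vars) {
    summary(lm(reformulate(vars, response = "mpg"), data = mtcars))$r.squared
  })
  data.frame(
    n          = k,
    predictors = paste(combos[[which.max(r2)]], collapse = " "),
    r2         = max(r2),
    stringsAsFactors = FALSE
  )
}))

best_by_size
```

Because $R^2$ never decreases when a predictor is added, it can only compare models of the same size; criteria such as adjusted $R^2$, Mallows' $C_p$ or AIC are needed to compare across sizes.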
Stepwise regression is a method of fitting regression models that involves the iterative selection of independent variables to use in a model. It can be achieved through forward selection, backward elimination, or a combination of both methods. The forward selection approach starts with no variables and adds each new variable incrementally, testing for statistical significance, while the backward elimination method begins with a full model and then removes the least statistically significant variables one at a time.
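The forward selection approach described above can be sketched in a few lines of base R. This is a simplified illustration under stated assumptions: the helper `forward_p()` and its `p_enter` threshold are hypothetical names for this example, all predictors are assumed numeric, and olsrr's own procedure may differ in detail.

```r
# Forward selection by p value: start with no predictors and, at each step,
# add the candidate with the smallest p value, provided it is below p_enter.
forward_p <- function(data, response, candidates, p_enter = 0.3) {
  selected <- character(0)
  repeat {
    remaining <- setdiff(candidates, selected)
    if (length(remaining) == 0) break

    # p value of each remaining candidate when added to the current model
    pvals <- sapply(remaining, function(v) {
      fit <- lm(reformulate(c(selected, v), response = response), data = data)
      coef(summary(fit))[v, "Pr(>|t|)"]
    })

    # stop when no candidate meets the entry threshold
    if (min(pvals) > p_enter) break
    selected <- c(selected, names(which.min(pvals)))
  }
  selected
}

selected_vars <- forward_p(mtcars, "mpg", c("disp", "hp", "wt", "qsec"))
selected_vars
```

Backward elimination follows the mirror-image loop: start from the full model and repeatedly drop the predictor with the largest p value until every remaining one is below a removal threshold.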
We will use the model below throughout this article except in the case of hierarchical selection. You can learn more about the data here.
model <- lm(y ~ ., data = surgical)
summary(model)
Irrespective of the stepwise method being used, we have to specify the full model, i.e., all the variables/predictors under consideration, as olsrr extracts the candidate variables for selection/elimination from the model specified.
# stepwise forward regression
ols_step_forward_p(model)
# stepwise backward regression
ols_step_backward_p(model)
The criteria for selecting variables may be one of the following:
We can force variables to be included or excluded from the model at all stages of variable selection. The variables may be specified either by name or position in the model specified.
ols_step_forward_p(model, include = c("age", "alc_mod"))
ols_step_forward_p(model, include = c(5, 7))
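If you are unsure which position corresponds to which variable, you can list the terms of a fitted model in base R. The sketch below uses the `mtcars` model from earlier and assumes that positions follow the order of the terms in the model specified; `term_names` is a name chosen for this example.

```r
model_mtcars <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)

# Predictor names in the order they appear in the model
term_names <- attr(terms(model_mtcars), "term.labels")
term_names

# Names found at positions 2 and 3
term_names[c(2, 3)]
```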
All stepwise selection methods display standard output which includes:
# adjusted r-square
ols_step_forward_adj_r2(model)
Use the plot() method to visualize variable selection. It will display how the variable selection criterion changes at each step of the selection process along with the variable selected.
# adjusted r-square
k <- ols_step_forward_adj_r2(model)
plot(k)
To view the detailed regression output at each stage of variable selection/elimination, set details to TRUE. It will display the following information at each step:
# adjusted r-square
ols_step_forward_adj_r2(model, details = TRUE)
To view the progress in the variable selection procedure, set progress to TRUE. It will display the variable being selected/eliminated at each step until there are no more candidate variables left.
# adjusted r-square
ols_step_forward_adj_r2(model, progress = TRUE)
When using p values as the criterion for selecting/eliminating variables, we can enable hierarchical selection. In this method, the search for the most significant variable is restricted to the next available variable. In the example below, as liver_test does not meet the threshold for selection, none of the variables after liver_test are considered for further selection, i.e., the stepwise selection ends as soon as it comes across a variable that does not meet the selection threshold. You can learn more about hierarchical selection here.
# hierarchical selection
m <- lm(y ~ bcs + alc_heavy + pindex + enzyme_test + liver_test + age +
          gender + alc_mod, data = surgical)
ols_step_forward_p(m, 0.1, hierarchical = TRUE)
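One reading of the hierarchical rule described above can be sketched in base R as follows. The helper `hierarchical_forward()` and its `p_enter` argument are hypothetical names for this illustration, the built-in `mtcars` data stands in for `surgical`, and olsrr's actual implementation may differ.

```r
# Hierarchical forward selection: consider variables strictly in the order
# given and stop at the first one that fails the entry threshold.
hierarchical_forward <- function(data, response, ordered_vars, p_enter = 0.1) {
  selected <- character(0)
  for (v in ordered_vars) {
    fit <- lm(reformulate(c(selected, v), response = response), data = data)
    p <- coef(summary(fit))[v, "Pr(>|t|)"]
    if (p > p_enter) break  # later variables are never tested
    selected <- c(selected, v)
  }
  selected
}

chosen <- hierarchical_forward(mtcars, "mpg", c("wt", "hp", "disp", "qsec"))
chosen
```

With this ordering, wt and hp enter the model, disp fails the 0.1 threshold, and qsec is therefore never tested. The order in which the variables are written in the model formula matters for this reason.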