```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
APproximated Exhaustive Search (APES) is a variable selection method for Generalised Linear Models (GLMs). The main motivation behind this work is to improve the speed of exhaustive variable selection for GLMs. The accompanying paper is Wang et al. (2019).
Suppose we have simulated data with 500 rows and 20 predictor variables, and we have fitted a logistic regression model against a binary response variable, `y`. This is what we term a "full" model, i.e. a GLM that uses all available predictor variables in the linear part of the model. Because we generated the data ourselves, we know that some of these variables are the true data-generating variables, and looking at the model summary in this example, we see that these variables happen to have p-values less than 0.05. So far, so good! However, we can also expect some variables to fall below the 0.05 cut-off purely by chance. In that case, is it still reasonable to include these variables in our final selected model?
```r
library(APES)

set.seed(123)
n = 500   # number of observations
p = 20    # number of predictor variables
k = 1:p
beta = c(1, -1, rep(0, p - 2))             # only X1 and X2 are true signals
x = matrix(rnorm(n * p), ncol = p)
colnames(x) = paste0("X", 1:p)
y = rbinom(n = n, size = 1, prob = expit(x %*% beta))  # expit = inverse logit
data = data.frame(y, x)

## Fitting a full model
model = glm(y ~ ., data = data, family = "binomial")
summary(model)
```
The question of which variable(s) a practitioner should select is very common. One of the most intuitive variable selection methods is to enumerate all possible combinations of variables and select the best corresponding model using a fit statistic. This is known as exhaustive variable selection. In total, there are $2^p$ possibilities to explore, with $p$ being the number of predictors. Hence, in performing an exhaustive variable selection for this data, we will be looking through $2^{20} = 1,048,576$ models! If each model takes one thousandth of a second to compute, then this exhaustive variable selection will take about 17 minutes to run --- seems too long!
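As a quick back-of-the-envelope check on that run time (the one-millisecond fit time is purely a hypothetical figure for illustration):

```r
# Number of candidate models for p = 20 predictors, and the approximate
# total run time assuming a hypothetical 1 ms per model fit.
p = 20
n_models = 2^p        # 1,048,576 candidate models
n_models / 1000 / 60  # roughly 17.5 minutes
```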
At its core, APES is designed to speed up this variable selection procedure. APES first converts the full logistic model into a linear model. The reason for doing this is that exhaustive variable selection can be performed computationally much faster in the linear model space. In addition, linear models can benefit from best-subset algorithms, which can search for the best linear model without searching through all $2^p$ candidate models. By default, APES uses the leaps-and-bounds algorithm (from the `leaps` package) and mixed integer optimisation (from the `bestsubset` package) as the best-subset algorithms of choice. After obtaining a set of best-fit linear models, APES converts the results back into logistic models and shows the best model of each size. While the word "best" can be subjective depending on the data context, in our example we define it as the model with the smallest information criterion value, the most common choices being the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC).
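For reference, both criteria are available in base R, so we can compute them for the full logistic model fitted earlier:

```r
# Information criteria of the full model; smaller values are better.
# AIC = -2 * logLik + 2 * k and BIC = -2 * logLik + log(n) * k,
# where k is the number of estimated parameters.
AIC(model)
BIC(model)
```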
It is important to note that APES only returns the best approximation to an exhaustive search. However, through extensive simulations in our paper, we have demonstrated that APES tends to return an informative model close to that of a genuine exhaustive search.
The `APES` package is designed to provide a user-friendly interface for performing APES variable selection. The main function in the `APES` package is the `apes` function, which accepts a `glm` object in `R`. This is typically the "full model" with all available predictors fitted.
```r
apes_result = apes(model = model)
class(apes_result)
```
The `apes` function returns an object of class `apes`. By default, printing it shows the models selected by the AIC and the BIC, along with the time taken. There are extra computed results in the `apes` object which a user might be interested in exploring further.
```r
print(apes_result)
names(apes_result)
```
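To get a fuller picture of everything stored in the returned object, base R's `str` gives a one-level overview (this assumes nothing about APES internals beyond the object being list-like):

```r
# Top-level structure of the apes object; each named component can be
# accessed with the usual $ syntax, e.g. apes_result$apes_model_df.
str(apes_result, max.level = 1)
```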
The most important output is `apes_model_df`, a `data.frame`/`tibble` of the best model of each model size computed by APES.
```r
apes_result$apes_model_df
```
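To go from this per-size table to a single selected model, one option is to pick the row minimising your preferred criterion. Note that `apes_mle_bic` below is an assumed column name used for illustration; check `names(apes_result$apes_model_df)` for the actual columns in your version of the package:

```r
# Hypothetical sketch: keep the row with the smallest BIC-type value.
# "apes_mle_bic" is an assumed column name; verify against your output.
df = apes_result$apes_model_df
df[which.min(df$apes_mle_bic), ]
```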
By default, `apes` will always include the intercept term in the variable selection, and thus the selected models are always of size two (one intercept plus one variable) and above. You may also be interested in plotting `apes_result` using the generic function `plot`:
```r
plot(apes_result)
```
The example above is a simple simulation to illustrate the basic theory underlying APES and the `apes` object. Please see the birth weight data example and the diabetes data example, where we demonstrate the performance of `APES` using bootstrap sampling to assess model stability.
```r
sessionInfo()
```