Introduction

Feature selection or feature pruning is the removal of some proposed columns or potential explanatory variables from a model. It can be an important step in data science, where the available data columns are often not pre-screened as suitable for a given project. In fact, one could say that what a statistician calls an explanatory variable has already been filtered for problem appropriateness; you sometimes don't see variable selection in statistics because statistical stories often start after the variable qualification chapter!

We recommend incorporating mild feature pruning into your machine learning and modeling workflows. It is empirically known that feature selection can help (for example here and here). It is also known that one can introduce problems if feature selection is done improperly (for example, Freedman's paradox); these issues are avoidable by using out-of-sample methods, cross-validation methods, or differential privacy. We have also always felt a bit exposed here, as feature selection seems unjustified in standard explanations of regression. One feels that if a coefficient were meant to be zero, the fitting procedure would have set it to zero. Under this misapprehension, stepping in and removing some variables feels unjustified.

Regardless of intuition or feelings, it is a fair question: is variable selection a natural, justifiable part of modeling? Or is it something that is already done (and therefore redundant)? Or is it something that is deliberately not done, for important reasons (such as avoiding damaging bias)?

In this note we will show that feature selection is in fact an obvious, justified step when using a sufficiently sophisticated model of regression. This note is long, as it works through many tiny elementary steps. However it ends with a big point: variable selection is justified. It appears naturally in the right variation of Bayesian regression. You should select variables, using your preferred methodology. And you shouldn't feel bad about doing so.

Bayesian Regression in Action

To work towards variable selection/pruning, let's work a specific Bayesian regression example in R. We are going to use the direct generative method described in the long version of this article, instead of a package such as rstan or lme4, so that every elementary step of the procedure is explicit.

First we prepare our R environment.

library(wrapr)
library(ggplot2)
source('fns.R')

set.seed(2020)

cl <- NULL
# cl <- parallel::makeCluster(parallel::detectCores())

The source to fns.R and the mathematical derivations of all the steps can be found in the long version of this article here.

Now we define our functions. The first generates our proposed parameters beta_0 and beta_1. The idea is that the function generates the parameter proposals according to their prior distributions, so picking a row uniformly at random from the stored data.frame is itself a draw from those priors.

generate_params <- function(
  ...,  # force arguments to bind by name
  n_sample = 100000,
  prior_mean_beta_0 = 0,
  prior_sd_beta_0 = 1,
  prior_mean_beta_1 = 0,
  prior_sd_beta_1 = 1,
  prior_explanatory_prob = 0.5) {
  stop_if_dot_args(
    substitute(list(...)), 
    "generate_params")
  params <- data.frame(
    beta_1_explanatory = rbinom(
      n = n_sample, 
      size = 1, 
      prob = prior_explanatory_prob),
    beta_0 = rnorm(
      n = n_sample, 
      mean = prior_mean_beta_0, 
      sd = prior_sd_beta_0),
    beta_1 = rnorm(
      n = n_sample, 
      mean = prior_mean_beta_1, 
      sd = prior_sd_beta_1),
    prior_weight = 1/n_sample
  )
  return(params)
}

beta_1_explanatory is a random variable that is always 0 or 1. We will use it later to perform variable selection. For now it is not used.

Now we define functions to compute P[data | parameters]. We want to know P[parameters | data], but we know from Bayes' Law that if we can estimate P[data | parameters], then we can derive the desired P[parameters | data].

# per setting parameter calculation function
mk_fun_p_data_given_params <- function(params, data, sd_noise) {
    force(params)
    force(data)
    force(sd_noise)
    function(i) {
      pi <- params[i, , drop = FALSE]
      ei <- data$y - (pi$beta_1 * data$x  + pi$beta_0)
      sum(dnorm(ei, mean=0, sd = sd_noise, log = TRUE))
    }
}

mk_fun_p_data_given_params realizes the following probability model: parameters are plausible if the residuals data$y - (pi$beta_1 * data$x + pi$beta_0) are small. All of the definitions and math are there to justify this block of code. mk_fun_p_data_given_params embodies the generative model illustrated in the following dependency diagram. The only important bit is the statement ei <- data$y - (pi$beta_1 * data$x + pi$beta_0).
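
As a tiny illustration (toy values invented here for exposition, not part of the article's experiments): the factory returns a function of a row index i that scores how plausible the data is under that row's parameters.

toy_params <- data.frame(beta_0 = 0.3, beta_1 = 0.7)
toy_data <- data.frame(x = c(-1, 0, 1))
toy_data$y <- 0.7 * toy_data$x + 0.3  # data generated exactly by the toy parameters
f <- mk_fun_p_data_given_params(toy_params, toy_data, sd_noise = 1)
f(1)   # sum of log dnorm() terms; about -2.76 here, larger means more plausible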

With this model defined we can run some experiments.

Experiment 1: Standard Bayesian Inference with beta_1 explanatory.

We can now run our first experiment.

First we generate our observed data d$x and d$y.

n_example <- 100
actual_b1 <- 0.7
actual_b0 <- 0.3

d <- data.frame(x = rnorm(n_example))
d$y <- actual_b1 * d$x + actual_b0 + rnorm(nrow(d), mean = 0, sd = 1)

Notice d$y is 0.7 * d$x + 0.3 plus mean-zero, standard-deviation-one noise. A good fitting procedure should recover the coefficients beta_0 = 0.3 and beta_1 = 0.7.

Our Bayesian fitting procedure is as follows.

First: propose many possible values of the fit coefficients beta_0, beta_1, each with probability according to the assumed prior distributions.

params <- generate_params()

Second: compute the posterior plausibility of all of the coefficient proposals with respect to the observed data. This is just us using Bayes' Law to criticize all of the proposed coefficient combinations. The least criticized is considered the best.

params$posterior_weight <- p_data_given_params(
  params = params,
  data = d,
  fn_factory = mk_fun_p_data_given_params,
  cl = cl)
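
p_data_given_params itself is defined in fns.R. As a rough sketch of what such a function might do under this model (an assumption about its behavior, not the actual source): evaluate the unscaled log-likelihood of each proposed parameter row, then convert to normalized posterior weights via Bayes' Law. The real function also accepts a cluster cl for parallel evaluation, which this sketch ignores.

# sketch only; the actual p_data_given_params lives in fns.R
p_data_given_params_sketch <- function(params, data, fn_factory,
                                       cl = NULL, sd_noise = 1) {
  f <- fn_factory(params, data, sd_noise)
  # unscaled log P[data | parameters] for each proposed parameter row
  log_lik <- vapply(seq_len(nrow(params)), f, numeric(1))
  # Bayes' Law: posterior weight proportional to prior weight times likelihood;
  # subtract max(log_lik) before exponentiating for numerical stability
  w <- params$prior_weight * exp(log_lik - max(log_lik))
  w / sum(w)   # normalize so the weights sum to 1
}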

This sampling or inference process can be summarized with the following diagram.

In Bayesian methods: having access to a description of our samples from the posterior distribution is pretty much the same as inferring the parameters or solving the problem. This may seem alien or weird; but it is why we consider the problem solved if we can get an estimate or approximation of the posterior distribution.

And we now have our coefficient estimates.

For beta_0 we have:

plot_parameter(
  params, 'beta_0', 
  subtitle = 'regular model, beta_1 explanatory')

The above plot has two facets or panels. In each facet the x-axis shows possible values for the unknown parameter beta_0, and the height is the density at each value; the area under the curve is 1, so the area above any interval of x-values is the probability of the parameter lying in that interval. The top facet is the assumed distribution of beta_0 before looking at the data, or "the prior". Ideally this is just the unit normal we plugged into the model as a modeling assumption; the variation here is because we are using only a finite sample. The bottom facet is the estimated distribution of beta_0 after looking at the data, or the "posterior". Parameter values that the observed data is inconsistent with are greatly punished. As we can see, the posterior distribution is concentrated not too far from the true value used to generate the data. The model has roughly inferred the coefficients or parameters from the data, even though these are not directly observed.
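
A small aside (not in the original code): because the posterior is represented as a weighted sample, point summaries such as the posterior mean and an approximate 90% credible interval for beta_0 can be read off directly. This is only a sketch; plot_parameter from fns.R presumably computes something similar when drawing the posterior facet.

# posterior mean of beta_0 from the weighted sample
post_mean_b0 <- sum(params$beta_0 * params$posterior_weight) /
  sum(params$posterior_weight)
# weighted quantiles via the cumulative posterior weight
ord <- order(params$beta_0)
cw <- cumsum(params$posterior_weight[ord]) / sum(params$posterior_weight)
c(mean = post_mean_b0,
  lower_90 = params$beta_0[ord][min(which(cw >= 0.05))],
  upper_90 = params$beta_0[ord][min(which(cw >= 0.95))])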

And for beta_1 we have:

plot_parameter(
  params, 'beta_1', 
  subtitle = 'regular model, beta_1 explanatory')

Notice the posterior distributions are peaked near the unknown true values of the coefficients. The distribution of the estimates represents the remaining uncertainty in the estimate given the prior assumptions and amount of data. If we had more data, these distributions would tighten up considerably.

Experiment 2: Standard Bayesian Inference with beta_1 not explanatory.

Let's try the Bayesian fitting again. However, this time let's not have a beta_1 * d$x term influence the observed d$y.

n_example <- 100
actual_b0 <- 0.3

d <- data.frame(x = rnorm(n_example))
d$y <- actual_b0 + rnorm(nrow(d), mean = 0, sd = 1)

Notice this time we did not include beta_1 in how the data was generated. beta_1 is still a proposed modeling parameter, so it is up to the inference method to figure out that the true coefficient is zero.

params <- generate_params()
params$posterior_weight <- p_data_given_params(
  params = params,
  data = d,
  fn_factory = mk_fun_p_data_given_params,
  cl = cl)
plot_parameter(
  params, 'beta_0', 
  subtitle = 'regular model, beta_1 not explanatory')
plot_parameter(
  params, 'beta_1', 
  subtitle = 'regular model, beta_1 not explanatory')

And here is where treatments of regression usually stop. The inference correctly worked out that beta_1 is not likely to be any large value.

However, the model almost never says beta_1 = 0! It gets values close to the right answer, but it cannot in fact write down the correct answer. That is, it cannot reliably set beta_1 = 0 to indicate that the variable x was thought to be irrelevant or non-explanatory. This modeling procedure in fact cannot truly fit the above situation, as the above situation requires variable selection. We gave the model a column that wasn't involved in the outcome, and instead of inferring that the column was not involved it says the involvement is likely small.
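
One way to see this directly from the weighted sample (a small aside, not in the original write-up): the continuous prior never proposes beta_1 exactly equal to zero, so no posterior weight can sit exactly at zero either.

# no proposed beta_1 is exactly zero under the continuous prior ...
mean(params$beta_1 == 0)                              # 0
# ... so the posterior cannot put any mass exactly on zero
sum(params$posterior_weight[params$beta_1 == 0])      # 0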

Let's draw a few examples according to the posterior weighting, giving us access to the posterior distribution. We might use any one of them as a "point estimate" of the inference or fitting.

row_ids <- sample.int(
  length(params$posterior_weight), 
  size = 5, replace = TRUE, 
  prob = params$posterior_weight)

params[row_ids, qc(beta_0, beta_1)] %.>%
  knitr::kable(.)

Usually we see beta_1 not exactly equal to zero. That isn't what we want.

Now, standard frequentist fitting (such as lm) does use distributional assumptions on the residuals to tag the non-explanatory coefficient as non-significant. This isn't all the way to variable pruning, but it is sometimes used as a pruning signal in methods such as step-wise regression.

Let's take a look at that.

model <- lm(y ~ x, data = d)
summary(model)

A non-negligible "Pr(>|t|)" on the coefficient for x is the warning that x may not be truly explanatory. The reported standard errors take a role analogous to the posterior distributions we plotted earlier. (Nota bene: Bayesian credible intervals and frequentist confidence intervals represent different kinds of uncertainty, so they are not in fact measuring the same thing.)
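
If one wanted to use this as a crude pruning signal programmatically, the p-value can be pulled out of the lm summary (a sketch; a fixed 0.05 threshold is only an illustrative choice, not a recommendation):

# extract the p-value for the coefficient on x
p_x <- summary(model)$coefficients["x", "Pr(>|t|)"]
p_x > 0.05   # TRUE suggests x is not obviously explanatory at this threshold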

Bayesian Variable Selection

We have shown the common fallacy, and we can now show a solution: variable selection.

First we adjust our P[data | parameters] function to include a new unobserved parameter called beta_1_explanatory. beta_1_explanatory is always zero or one. The change in code is: we now compute the residual or error as data$y - (pi$beta_1_explanatory * pi$beta_1 * data$x + pi$beta_0). If beta_1_explanatory is 1, then our P[data | parameters] function works as before. However, when beta_1_explanatory is 0, the error model changes to data$y - pi$beta_0; beta_1_explanatory == 0 stomps out the pi$beta_1 * data$x term.

We arbitrarily start beta_1_explanatory with a simple prior: it is 0 half of the time and 1 half of the time. We hope that after inference it has a posterior distribution that is nearly always 1 (indicating the data is more plausible when beta_1 is not zero, and x is in fact an explanatory variable) or nearly always 0 (indicating that the data is more plausible when beta_1 is zero, and x is not in fact an explanatory variable). Note this prior is actually easy to get right: just find what fraction of the variables you tried on similar previous problems turned out to be useful, and use that as the prior estimate of the acceptance rate. However, with enough data we are not sensitive to initial settings away from zero and one.
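
For example, if on similar past projects roughly 20% of candidate columns turned out to be genuinely useful (an illustrative figure, not used in the experiments below), that experience could be encoded directly through the prior:

# hypothetical informed prior: about 20% of proposed columns expected useful
params_informed <- generate_params(prior_explanatory_prob = 0.2)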

We define a new unscaled log-probability of error calculation: mk_fun_p_data_given_params_with_conditional.

# per setting parameter calculation function with beta_1_explanatory 0/1 parameter
mk_fun_p_data_given_params_with_conditional <- function(params, data, sd_noise) {
    force(params)
    force(data)
    force(sd_noise)
    function(i) {
      pi <- params[i, , drop = FALSE]
      ei <- data$y - (pi$beta_1_explanatory * pi$beta_1 * data$x  + pi$beta_0)
      sum(dnorm(ei, mean=0, sd = sd_noise, log = TRUE))
    }
}

Essentially pi$beta_1_explanatory * pi$beta_1 is a "zero inflated" parameter. (Note: traditionally in zero-inflated models it is observable outcomes, rather than unobservable parameters, that are zero inflated.) Our new model is more powerful, as it can now put an atomic (non-negligible) mass exactly on zero.
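
To make the "atomic mass exactly on zero" concrete, here is a small illustration (not part of the main workflow) of the implied prior on the effective slope beta_1_explanatory * beta_1:

# the effective slope prior is a 50/50 mixture of a point mass at 0 and N(0, 1)
prior_slope <- rbinom(100000, size = 1, prob = 0.5) * rnorm(100000)
mean(prior_slope == 0)   # roughly 0.5: real probability mass sits exactly at zero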

This new code instantiates the critique implied by the following graphical model.

Again, the only important line of code is the ei <- data$y - (pi$beta_1_explanatory * pi$beta_1 * data$x + pi$beta_0) statement. All other code is just in service of this model specifying line.

Let's see how these are different using this model for inference on the same data.

Experiment 3: Conditional Bayesian Inference with beta_1 not explanatory.

Let's re-run our example where x is not explanatory and we should infer beta_1 = 0. For our software this is the same as inferring beta_1_explanatory * beta_1 = 0 or beta_1_explanatory = 0.

We set up the data and observed relation of y to x.

n_example <- 100
actual_b0 <- 0.3

d <- data.frame(x = rnorm(n_example))
d$y <- actual_b0 + rnorm(nrow(d), mean = 0, sd = 1)

params <- generate_params()
params$posterior_weight <- p_data_given_params(
  params = params,
  data = d,
  fn_factory = mk_fun_p_data_given_params_with_conditional,
  cl = cl)

And we plot what we came for: the prior and posterior distributions of the explanatory parameter.

plot_parameter(
  params, 'beta_1_explanatory', 
  subtitle = 'conditional model, beta_1 not explanatory',
  use_hist = TRUE)

It worked! The model says the data is more consistent with x not being a variable. Note: this is not a frequentist setup, so we have not done something silly or impossible such as "accepting the null hypothesis." Strictly we have only said: the data is plausible under the assumption that x is not an explanatory variable. Also note, this model is of a type forbidden by Stan: "Zero inflation does not work for continuous distributions in Stan because of issues with derivatives; in particular, there is no way to add a point mass to a continuous distribution, such as zero-inflating a normal as a regression coefficient prior." (ref).
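
One way to read this off the weighted sample directly (a small aside, not in the original code) is to total the posterior weight on proposals where x is marked explanatory:

# posterior probability that x is explanatory; expected to be well below 0.5 here
sum(params$posterior_weight[params$beta_1_explanatory == 1]) /
  sum(params$posterior_weight)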

We can of course also plot the inferred distributions of the remaining variables.

plot_parameter(
  params, 'beta_0', 
  subtitle = 'conditional model, beta_1 not explanatory')

Keep in mind that beta_1 isn't as interesting in this model formulation as beta_1_explanatory * beta_1, so the distribution of beta_1 isn't as important as the distribution of beta_1_explanatory * beta_1.

params$beta_1_explanatory <- params$beta_1_explanatory * params$beta_1
plot_parameter(
  params, 'beta_1_explanatory', 
  subtitle = 'conditional model, beta_1 not explanatory',
  use_hist = TRUE,
  bins = 100)

Let's draw a few examples according to the posterior weighting. We might use any one of them as a "point estimate" of the inference or fitting.

row_ids <- sample.int(
  length(params$posterior_weight), 
  size = 5, replace = TRUE, 
  prob = params$posterior_weight)

params[row_ids, qc(beta_0, beta_1_explanatory, beta_1)] %.>%
  knitr::kable(.)

Usually we see beta_1_explanatory exactly equal to zero. That is correct.

Experiment 4: Conditional Bayesian Inference with beta_1 explanatory.

We saw in the last experiment that if we started with some mass of the distribution of beta_1_explanatory * beta_1 at zero, then after inference for a non-explanatory variable a lot of the mass of beta_1_explanatory * beta_1 is at zero.

If a variable is in fact explanatory, then we want to see both beta_1_explanatory move its distribution to 1 and for beta_1_explanatory * beta_1 to move to the right value.

Let's try that experiment.

n_example <- 100
actual_b1 <- 0.7
actual_b0 <- 0.3

d <- data.frame(x = rnorm(n_example))
d$y <- actual_b1 * d$x + actual_b0 + rnorm(nrow(d), mean = 0, sd = 1)

params <- generate_params()
params$posterior_weight <- p_data_given_params(
  params = params,
  data = d,
  fn_factory = mk_fun_p_data_given_params_with_conditional,
  cl = cl)
plot_parameter(
  params, 'beta_1_explanatory', 
  subtitle = 'conditional model, beta_1 explanatory', 
  use_hist = TRUE)

This looks good: in this case the posterior distribution for beta_1_explanatory is more concentrated towards 1, indicating the data has evidence for using the associated variable x.

Let's look at our main (composite) parameter of interest beta_1_explanatory * beta_1.

params$beta_1_explanatory <- params$beta_1_explanatory * params$beta_1
plot_parameter(
  params, 'beta_1_explanatory', 
  subtitle = 'conditional model, beta_1 * beta_1_explanatory',
  use_hist = TRUE,
  bins = 100)

The joint estimate of beta_1_explanatory * beta_1 is similar to the unobserved true model that generated the observed data (d$y <- actual_b1 * d$x + actual_b0 + rnorm(nrow(d), mean = 0, sd = 1)).

The procedure worked. The inference recovered the parameters that generated the data. This is going to be slightly less statistically efficient than a standard regression, as the inference was forced to consider extra cases. But the procedure seems sound.
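
As a quick numeric check (an aside, using the composite column computed above), the posterior mean of the composite slope should land near the generating value actual_b1 = 0.7:

# posterior mean of beta_1_explanatory * beta_1 from the weighted sample
sum(params$beta_1_explanatory * params$posterior_weight) /
  sum(params$posterior_weight)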

# cleanup
if(!is.null(cl)) {
  parallel::stopCluster(cl)
  cl <- NULL
}

Conclusion

Variable selection falls out naturally from the right sort of Bayesian model. If a column is not explanatory, correct fitting procedures can have a high probability of excluding it exactly. If you are using columns that are not all strenuously pre-vetted for your problem, you should probably introduce some variable selection into your data science or machine learning workflow. Identifying that a column may have no effect is not the same as saying a column has at most a small effect; thinking in terms of mere averages or point estimates is sometimes not rich enough to distinguish these two concepts.
