A beneficial property of the lasso penalty is that it shrinks coefficients to zero. Less beneficial is that in the process, the lasso tends to overshrink large coefficients. It has been argued that the lasso should "be considered as a variable screener rather than a model selector" ([@su2017false]). There appears a trade-off between prediction and selection: The penalty parameter value optimal for variable selection will overshrink the large coefficients, making it suboptimal for prediction. The penalty parameter value optimal for prediction ("lambda.min") selects too many variables. The relaxed lasso and adaptive lasso have been proposed to improve upon this problem.
When fitting prediction rule ensembles (PREs), the high false-positive selection rate of the lasso may be a nuisance, because often we want to obtain a sparse set of rules and linear terms. This vignette aims to show how the relaxed and adaptive lassos may deliver sparser ensembles, while retaining (relatively) high predictive accuracy.
An excellent discussion of consistency, predictive performance and selection accuracy of the lasso and its variations is provided in the master's thesis of [@Kirkland2014].
The relaxed lasso was originally proposed by [@meinshausen2007relaxed]. Investigations of [@su2017false] provide insight on why the relaxed lasso is beneficial. [@hastie2017extended] propose a simplified version of the relaxed lasso, which is implemented in package glmnet
and can be employed in package pre
. [@hastie2017extended] find that "best subset selection generally performing better in high signal-to-noise (SNR) ratio regimes, and the lasso better in low SNR regimes" and that "the relaxed lasso [...] is the overall winner, performing just about as well as the lasso in low SNR scenarios, and as well as best subset selection in high SNR scenarios". Function pre
supports use of the relaxed lasso through passing of argument relax
. A short introduction to the relaxed lasso is provided in glmnet
vignette "The Relaxed lasso", accessible in R
by typing vignette("relax", "glmnet")
.
The adaptive lasso has been proposed by [@hui2006the]. It applies positive weighting factors to the lasso penalty to control the bias through shrinking coefficients with weights inversely proportional to their size. It thus aims to shrink small coefficients more and large coefficients less. It requires an initial estimate of the coefficients, for which OLS (if $N > p$) or ridge (if $N < p$) estimation is usually employed, to obtain a vector of weights. These weights can then be used to scale the predictor matrix, or to scale the penalty; both approaches have the same effect.
Function pre
allows for adaptive lasso estimation through specification of argument ad.alpha
. It first uses ridge regression for computing penalty weights. In principle, any elastic net solution can be used, but use of ridge is recommended. Other solutions can be used by specifying the ad.alpha
and ad.penalty
arguments. Lasso regression can be used by specifying ad.alpha = 1
and OLS can be used by specifying ad.penalty = 0
. Next, the inverse of the absolute values of the estimated coefficients are supplied as penalty factors to the cv.glmnet
function. For the initial and final estimates, the same fold assignments are used in the cross validation.
It should not come as a surprise that a combination has also been proposed: The relaxed adaptive lasso ([@zhang2022relaxed]). It can easily be employed through specifying both ad.alpha = 0
and relax = TRUE
.
library("pre")
We fit a PRE to predict Ozone
and employ the relaxed lasso by specifying relax = TRUE
:
airq <- airquality[complete.cases(airquality), ] set.seed(42) airq.ens.rel <- pre(Ozone ~ ., data = airq, relax = TRUE)
If we specify relax = TRUE
, the gamma
argument (see ?cv.glmnet
for documentation on arguments relax
and gamma
) will by default be set to a range of five values in the interval [0, 1]. This can be overruled by specifying different values for argument gamma
in the call to function pre
(but the default likely suffices in most applications).
We take a look at the regularization paths for the relaxed fits:
## Overwrite glmnet plotting function, because secondary x axis and legend do ## not use cex, cex.axis or cex.lab arguments plot.cv.relaxed <- function (x, se.bands = TRUE, cex = .7, cex.axis= .7, cex.lab = .7, cex.main = .85,...) { xr = x$relaxed oldpar = par(mar = c(4, 4, 3, 4)) on.exit(par(oldpar)) statlist = xr$statlist gamma = xr$gamma ngamma = length(gamma) ylim = range(unlist(lapply(statlist, "[[", "cvm"))) if (se.bands) { cvup = lapply(statlist, "[[", "cvup") cvlo = lapply(statlist, "[[", "cvlo") ylim = range(ylim, unlist(cvup), unlist(cvlo)) } xlim = log(range(unlist(lapply(statlist, "[[", "lambda"))) + 1e-05) cvcolors = rainbow(ngamma, start = 0.1, end = 1) with(statlist[[ngamma]], plot(log(lambda), cvm, type = "n", xlab = expression(Log(lambda)), ylab = x$name, ylim = ylim, xlim = xlim, cex = cex, cex.axis= cex.axis, cex.lab = cex.lab, cex.main = cex.main)) if (se.bands) { for (i in seq(ngamma)) with(statlist[[i]], polygon(c(log(lambda), rev(log(lambda))), c(cvup, rev(cvlo)), col = "floralwhite", border = "antiquewhite")) } for (i in seq(ngamma)) with(statlist[[i]], lines(log(lambda), cvm, lwd = 1, col = cvcolors[i])) mins = log(c(xr$lambda.min, xr$lambda.1se)) abline(v = mins, lty = 3) dof = statlist[[1]]$nzero lambda = statlist[[1]]$lambda axis(side = 3, at = log(lambda), labels = paste(dof), tick = FALSE, line = 0, cex = cex, cex.axis= cex.axis, cex.lab = cex.lab) shape::colorlegend(posy = c(0.2, 0.8), posx = c(0.93, 0.945) - 0.03, col = rainbow(ngamma, start = 0.1, end = 1), zlim = c(0, 1), zval = gamma, main = expression(gamma), digit = 2, cex=cex) invisible() }
plot(airq.ens.rel$glmnet.fit)
We obtained one regularization path for each value of $\gamma$. $gamma$ is a mixing parameter, that determines the weight of the original lasso solution, relative to a solution containing only the selected variables, but with unpenalized coefficient estimates. The path for $\gamma = 1$ is the default lasso path, which we would also have obtained without specifying relax = TRUE
. Lower values of $\gamma$ 'unshrink' the value of the non-zero coefficients of the lasso towards their unpenalized values. We see that for the $\lambda$ value yielding the minimum MSE (indicated by the left-most vertical dotted line), the value of $\gamma$ does not make a lot of difference for the MSE, but when $\lambda$ values increase, higher values of $\gamma$ tend to improve predictive performance. This is a common pattern for $\lambda$ and $\gamma$.
For model selection using the "lambda.min"
criterion, by default the $\lambda$ and $\gamma$ combination yielding the lowest CV error is returned. For the "lambda.1se"
criterion, the $\lambda$ and $\gamma$ combination yielding the sparsest solution within 1 standard error of the error criterion of the minimum is returned:
fit <- airq.ens.rel$glmnet.fit$relaxed mat <- data.frame(lambda.1se = c(fit$lambda.1se, fit$gamma.1se, fit$nzero.1se), lambda.min = c(fit$lambda.min, fit$gamma.min, fit$nzero.min), row.names = c("lamda", "gamma", "# of non-zero terms")) mat
Thus, as the dotted vertical lines in the plots already suggest, with the default "lambda.1se"
criterion, a final model with r fit$nzero.1se
terms will be selected, with coefficients obtained using a $\lambda$ value of r round(fit$lambda.1se, digits = 3L)
and a $\gamma$ value of r fit$gamma.1se
. With the "lambda.min"
criterion, we obtain a more complex fit; $\gamma = 0$ still yields the lowest CV error. Note that use of "lambda.min"
increases the likelihood of overfitting, because function pre
uses the same data to extract the rules and fit the penalized regression, so in most cases the default "lambda.1se"
criterion can be expected to provide a less complex, better generalizable, and often more accurate fit.
The default of function pre
is to use the "lambda.1se"
criterion. When relax = TRUE
has been specified in the call to function pre
, the default of all functions and S3
methods applied to objects of class pre
(print
, plot
, coef
, predict
, importance
, explain
, cvpre
, singleplot
, pairplot
, interact
) is to use the solution obtained with "lambda.1se"
and the $\gamma$ value yielding lowest CV error at that value of $\lambda$. This can be overruled by specifying a different value of $\lambda$ (penalty.par.val
) and/or $\gamma$ (gamma
). Some examples:
summary(airq.ens.rel) summary(airq.ens.rel, penalty = "lambda.min") summary(airq.ens.rel, penalty = 8, gamma = 0) summary(airq.ens.rel, penalty = 8, gamma = 1)
Note how the lowest CV error is indeed obtained with the "lambda.min"
criterion, while the default "lambda.1se"
yields a sparser model, with accuracy within 1 standard error of "lambda.min"
. If we want to go (much) sparser, we need to specify a lower value for the $\lambda$ penalty, and a lower value of $\gamma$ should likely be preferred, to retain good-enough predictive accuracy.
Some rules for specification of $\lambda$ and $\gamma$:
If a numeric value of $\lambda$ has been supplied, a (numeric) value for $\gamma$ must be supplied.
Otherwise (if the default "lambda.1se"
criterion is employed, or "lambda.min"
specified), the $\gamma$ value yielding lowest CV error (at the $\lambda$ value associated with the specified criterion) will be used; this $\gamma$ value can be overruled by supplying the desired $\gamma$ value to the gamma
argument.
Multiple values of $\gamma$ can be passed to function pre
, but all other methods and functions accept only a single value for $\gamma$ (this differs from several glmnet
functions) .
If a specific $\lambda$ value is supplied, results are returned for a penalty parameter value that was used in the path, and closest to the specified value.
Also note that in the code chunk above we refer to the penalty.par.val
argument by abbreviating it to penalty
; this has the same effect as writing penalty.par.val
in full.
Using $\gamma = 0$ amounts to a forward stepwise selection approach, with entry order of the variables (rules and linear terms) determined by the lasso. This approach can be useful if we want a rule ensemble with low complexity and high generalizability, and especially when we want to decide a-priori on the number of terms we want to retain. By specifying a high value of $\lambda$, we can retain a small number of rules, while specifying $\gamma = 0$ will provide unbiased (unpenalized) coefficients. This avoids the overshrinking of large coefficients. In terms of predictive accuracy, this approach may not perform best, but if low complexity (interpretability) is most important, this is a very useful approach, which does not reduce predictive accuracy too much.
To use forward stepwise regression with variable entry order determined by the lasso, we specify a $\gamma$ value of 0, and specify the number of variables we want to retain through specification of $\lambda$ (penalty.par.val
). To find the value of $\lambda$ corresponding to the number of terms one want to retain, check (results not shown for space considerations):
airq.ens.rel$glmnet.fit$glmnet.fit
Function prune_pre
is helpful for selecting sparser ensembles. Say, we want to retain an ensemble with only five rules, then prune_pre
will return the $\lambda$ and $\gamma$ values that yield an ensemble of specified size, with optimal cross-validated predictive accuracy.
Here, we request the optimal parameter values for a five-term ensemble:
opt_pars <- prune_pre(airq.ens.rel, nonzero = 5)
Note that the mean_cv_error
may be slightly optimistic. Cross validation was performed on the same data that was used the generate the rules. A less optimistic estimate of generalization error can be obtained using function cvpre
.
Finally, we fit a PRE with adaptive lasso to predict Ozone
levels:
set.seed(42) airq.ens.ad <- pre(Ozone ~ ., data = airq, ad.alpha = 0) summary(airq.ens.ad)
The adaptive lasso did not provide a sparser ensemble, while the mean CV error suggests better predictive accuracy than the standard, but not the relaxed, lasso. Adaptive lasso settings can further be adjusted by specification of argument ad.penalty
.
We can also fit a rule ensemble using the relaxed adaptive lasso:
set.seed(42) airq.ens.rel.ad <- pre(Ozone ~ ., data = airq, relax = TRUE, ad.alpha = 0) print(airq.ens.rel.ad)
The summary suggests that the relaxed adaptive lasso provides the highest predictive accuracy compared to the standard, the relaxed and the adaptive lasso, when using the default "lambda.1se"
criterion. Note however that the training data have been used to generate the rules, to estimate the weights for the penalty factors using ridge regression and to estimate the final lasso model. Thus, the printed CV error can provide an overly optimistic estimate of predictive accuracy. To obtain an honest estimate of predictive accuracy, it should be computed on a separate test dataset or using an additional layer of cross validation (e.g., using function cvpre
or other approach).
Use of the relaxed lasso improves accuracy and sparsity of the final ensemble. Relaxed lasso can be used to obtain an ensemble of pre-specified sparsity, that still provides good predictive performance. Use of the adaptive lasso penalties may further improve predictive accuracy.
In case you obtained different results, the results above were obtained using the following:
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.