```r
# knitr setup for this vignette
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```
In this short note we present and briefly discuss the R package islasso, which deals with regression models having a large number of covariates. Estimation is carried out by penalizing the coefficients via a quasi-lasso penalty, wherein the nonsmooth lasso penalty is replaced by a smooth counterpart determined iteratively from the data according to the induced smoothing idea. The package includes functions to estimate the model and to test linear hypotheses on linear combinations of relevant coefficients. We illustrate the R code through a worked example, intentionally omitting technical details and an extended bibliography.
Let $\mathbf{y} = \mathbf{X}\mathbf{\beta} + \mathbf{\epsilon}$ be the linear model of interest, with the usual zero-mean and homoscedastic errors. As usual, $\mathbf{y} = (y_1,\ldots,y_n)^T$ is the response vector and $\mathbf{X}$ is the $n \times p$ design matrix (with $p$ quite large) with regression coefficients $\mathbf{\beta}$. When interest lies in selecting the non-noise covariates and estimating the relevant effects, one minimizes the lasso penalized objective function (Tibshirani, 1996), $$\frac{1}{2}||\mathbf{y}-\mathbf{X}\mathbf{\beta}||_2^2+\lambda||\mathbf{\beta}||_1.$$
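To make the notation concrete, the objective above can be coded in a few lines of R. The following is only a minimal sketch for illustration, not part of the islasso package; obj_lasso, beta, X, y and lambda are names introduced here.

```r
# Minimal sketch (not part of islasso): the lasso objective
# 0.5 * ||y - X beta||_2^2 + lambda * ||beta||_1
# for a candidate coefficient vector `beta`.
obj_lasso <- function(beta, X, y, lambda) {
  0.5 * sum((y - X %*% beta)^2) + lambda * sum(abs(beta))
}
```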
The main functions of the package are islasso(), where the user supplies the model formula as in the usual lm or glm functions, i.e.
```r
islasso(formula, family = gaussian, lambda, alpha = 1, data, weights, subset,
        offset, unpenalized, contrasts = NULL, control = is.control())
```
and islasso.path(), used to fit the regularization path via the induced smoothed lasso framework, i.e.
```r
islasso.path(formula, family = gaussian, lambda = NULL, nlambda = 100,
             lambda.min.ratio = ifelse(nobs < nvars, 1E-03, 1E-05), alpha = 1,
             data, weights, subset, offset, unpenalized, contrasts = NULL,
             control = is.control())
```
family accepts the specification of the family and link function as in Table 1, lambda is the tuning parameter, alpha is the elastic-net mixing parameter, nlambda is the number of lambda values, lambda.min.ratio is the smallest value of lambda (as a fraction of lambda.max), and unpenalized indicates covariates whose coefficients should not be penalized; a schematic example call is given after Table 1.
Table 1. Families and link functions allowed in islasso
| family | link |
|:-------|:----:|
| gaussian | identity |
| binomial | logit, probit |
| poisson | log |
| gamma | identity, log, inverse |
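As a schematic illustration of these arguments, the sketch below fits a probit model on simulated data while keeping one coefficient unpenalized. The data and variable names (mydata, y01, x1, ...) are invented for the example, and the exact format accepted by unpenalized (variable names versus indices) should be checked in ?islasso.

```r
library(islasso)

# simulated data, for illustration only
set.seed(1)
mydata <- data.frame(y01 = rbinom(100, 1, 0.5),
                     matrix(rnorm(100 * 5), 100, 5))
names(mydata)[-1] <- paste0("x", 1:5)

# binomial family with probit link (see Table 1); x1 is assumed to be
# excludable from the penalty by name via `unpenalized` (check ?islasso)
fit <- islasso(y01 ~ ., family = binomial(link = "probit"),
               data = mydata, unpenalized = "x1")
```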
The fitter functions are `islasso.fit()` and `islasso.path.fit()`, which read as
```r
islasso.fit(X, y, family = gaussian(), lambda, alpha = 1, intercept = FALSE,
            weights = NULL, offset = NULL, unpenalized = NULL,
            control = is.control())
```
and
```r
islasso.path.fit(X, y, family = gaussian(), lambda, nlambda, lambda.min.ratio,
                 alpha = 1, intercept = FALSE, weights = NULL, offset = NULL,
                 unpenalized = NULL, control = is.control())
```
which actually implement the estimating algorithm as described in the paper. The lambda argument in islasso.fit and islasso specifies the positive tuning parameter in the penalized objective. Any non-negative value can be provided, but if it is missing, it is computed via $K$-fold cross-validation by the function cv.glmnet() from package glmnet. The number of folds can be specified via the argument nfolds of the auxiliary function is.control(). The lambda argument in islasso.path.fit and islasso.path specifies the sequence of positive tuning parameters, either user-supplied or automatically computed from nlambda and lambda.min.ratio.
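Both mechanisms can be illustrated with simulated data. In the sketch below (data and variable names are arbitrary), the first call lets islasso() compute lambda by 5-fold cross-validation through the nfolds argument of is.control(), while the second passes a user-supplied decreasing lambda sequence to islasso.path().

```r
library(islasso)

# simulated data, for illustration only
set.seed(2)
mydata <- data.frame(y = rnorm(100), matrix(rnorm(100 * 10), 100, 10))

# lambda missing: computed by 5-fold cross-validation via cv.glmnet()
fit1 <- islasso(y ~ ., data = mydata, control = is.control(nfolds = 5))

# user-supplied decreasing sequence of tuning parameters for the path
fit2 <- islasso.path(y ~ ., data = mydata,
                     lambda = exp(seq(log(100), log(0.1), length.out = 50)))
```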
The Prostate dataset is a well-known benchmark in regression analysis and variable selection, originally published by Stamey et al. (1989). It contains clinical measurements from 97 male patients diagnosed with prostate cancer.
The primary goal is to model the level of prostate-specific antigen (PSA) based on several predictive variables, including:
- lcavol: log cancer volume
- lweight: log prostate weight
- age: patient's age
- lbph: log of benign prostatic hyperplasia amount
- svi: seminal vesicle invasion
- lcp: log capsular penetration
- gleason: Gleason score (histological grade of tumor)
- pgg45: percent of Gleason scores 4 or 5

The response variable is lpsa, the logarithm of PSA levels.
This dataset is particularly useful for testing variable-selection methods due to its moderate size and the presence of multicollinearity among the predictors.
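The size of the dataset and the correlation among (some of) the predictors can be checked with a couple of lines of standard exploratory code; this is not part of the analysis that follows.

```r
# quick exploratory check: dimensions and correlations among some of the
# continuous predictors
data("Prostate", package = "islasso")
dim(Prostate)
round(cor(Prostate[, c("lcavol", "lweight", "age", "lbph", "lcp", "pgg45")]), 2)
```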
To select the important terms in the regression equation, we could simply apply the lasso using the R package glmnet:
```r
library(islasso)
data("Prostate", package = "islasso")

# design matrix (without the intercept column) and response
x <- model.matrix(lpsa ~ ., data = Prostate)[, -1]
y <- Prostate$lpsa

# ten-fold cross-validation via glmnet
a1 <- cv.glmnet(x, y, family = "gaussian")
n <- nrow(Prostate)
a1$lambda.min * n

# coefficients at the selected lambda and number of non-null estimates
b <- drop(coef(a1, "lambda.min", exact = TRUE))
length(b[b != 0])
```
Ten-fold cross-validation "selects" $\lambda=$ `r round(a1$lambda.min * n, 3)`, corresponding to `r length(b[b != 0])` non-null coefficients:
```r
names(b[b != 0])
```
The last three estimates are
```r
tail(b[b != 0], n = 3)
```
A reasonable question is whether all the "selected" coefficients are significant in the model. Unfortunately, lasso regression does not return standard errors due to the nonsmoothness of the objective, and several alternative approaches have been proposed. Among the (few) strategies, including the 'covariance test' (Lockhart et al., 2013), 'post-selection inference' and the '(modified) residual bootstrap', here we illustrate the R package islasso, which implements the recent 'quasi-lasso' approach based on the induced smoothing idea (Brown and Wang, 2005), as discussed in Cilluffo et al. (2019).
While the optimal lambda could be selected automatically (by not supplying any value to lambda), here we use the value minimizing a specific criterion chosen among AIC, BIC, AICc, eBIC, GCV or GIC. From version 1.4.0 of the R package islasso, the recommended strategy is to build the regularization path first:
```r
out <- islasso.path(lpsa ~ ., data = Prostate, nlambda = 50L, family = gaussian())
out
```
and then to choose the best tuning parameter through one of the criteria listed above using the function GoF.islasso.path(), e.g.,
```r
lmb.best <- GoF.islasso.path(out)
lmb.best$lambda.min
```
The regularization path is also very useful to gain more insight into the coefficient, standard error and gradient profiles:
```r
p1 <- plot(out, yvar = "coefficients")
p2 <- plot(out, yvar = "se")
p3 <- plot(out, yvar = "gradient")
gridExtra::grid.arrange(p1, p2, p3, ncol = 1L)
```
Once the best lambda value has been selected, e.g. by minimizing the AIC criterion, the last step of the strategy consists in fitting a new islasso model:
```r
lambda.aic <- lmb.best$lambda.min["AIC"]
out2 <- islasso(lpsa ~ ., data = Prostate, lambda = lambda.aic, family = gaussian())
out2
```
The summary method quickly returns the main output of the fitted model, including point estimates, standard errors and $p$-values. Visualizing estimates for all covariates can be somewhat inconvenient, especially when the number of covariates is large, thus we choose to print estimates only if the $p$-value is less than a threshold value; here we use 0.10:
```r
summary(out2, pval = 0.10)
```
In addition to the usual information printed by the summary method, the output also includes the column Df, representing the degrees of freedom of each coefficient. Their sum is used to quantify the model complexity:
```r
sum(out2$internal$hi)
```
and the corresponding residual degrees of freedom (`r out2$internal$n - out2$rank`) as reported above. The Wald test (column z value) and the $p$-values can be used to assess important or significant covariates. Results suggest that the variables lcavol, lweight and svi are informative for predicting the logarithm of PSA levels, while lbph is borderline informative. Another way to obtain a similar result, without computing the regularization path, is to use the function aic.islasso(), which requires a preliminary islasso fit object and the specification of the criterion to be used. Hence
```r
lambda.aic2 <- aic.islasso(out2, method = "AIC", interval = c(.1, 50))
out3 <- update(out2, lambda = lambda.aic2)
summary(out3, pval = .10)
```
Comparisons between methods to select the tuning parameter and further discussions are beyond the scope of this short note. We conclude by emphasizing that islasso also accepts the so-called elastic-net penalty, $$\frac{1}{2}||\mathbf{y}-\mathbf{X}\mathbf{\beta}||_2^{2}+\lambda\left\{\alpha||\mathbf{\beta}||_1+\frac{1}{2}(1-\alpha)||\mathbf{\beta}||^{2}_2\right\}$$ where $0\le \alpha\le 1$ is the mixing parameter to be specified in islasso() and islasso.path() via the argument alpha, e.g.
```r
# update the islasso path to fit an elastic-net model
out4 <- update(out, alpha = .5)
out4

# some diagnostic plots
p4 <- plot(out4, yvar = "coefficients")
p5 <- plot(out4, yvar = "se")
gridExtra::grid.arrange(p4, p5, ncol = 1L)

# select the best tuning parameter
lmb.best2 <- GoF.islasso.path(out4)
lmb.best2$lambda.min

# fit a new islasso model with elastic-net penalty
lambda.aic3 <- lmb.best2$lambda.min["AIC"]
out5 <- update(out2, alpha = .5, lambda = lambda.aic3)
summary(out5, pval = .10)

# or select the best tuning parameter using AIC with an islasso object
lambda.aic4 <- aic.islasso(out5, method = "AIC", interval = c(.1, 100))
out6 <- update(out5, lambda = lambda.aic4)
summary(out6, pval = .10)
```