Linear regression is one of the most widely used statistical methods, and with the ease of modern data collection, many predictors are often associated with a response variable. However, most of the available variable selection methodologies (e.g., the elastic net and its special case, the lasso) require careful selection of a tuning parameter. This can cause numerous problems in practice and may lead to biased estimates. We present an automatic variable selection method based on a new risk criterion that is simple and compares favorably with many of the currently available variable selection methods for linear models. Numerical illustrations based on simulated and real data are presented using output from a user-friendly R package, AutoLasso, that is made available for easy implementation.
Required R packages: glmnet and MASS
Authors: Sujit Ghosh, Kaiyuan (Carl) Duan, and Guangjie Yu
Consider the familiar linear (multiple) regression model:
$y_i = \beta_0 + x_i^\top \beta + \epsilon_i, \quad i = 1, \dots, n,$
where the $\epsilon_i$'s satisfy the usual Gauss-Markov assumptions.
Step 1: Standardize the response and predictor variables:
$\tilde{y}_i = y_i - \bar{y}$ and $\tilde{x}_{ij} = (x_{ij} - \bar{x}_j)/s_j$ for all $i = 1, \dots, n$ and $j = 1, \dots, p$,
where $\bar{x}_j$ and $s_j$ are the sample mean and standard deviation of the $j$-th predictor.
Let $\tilde{y} = (\tilde{y}_1, \dots, \tilde{y}_n)^\top$ denote the centered response vector and $\tilde{X} = (\tilde{x}_{ij})$ denote the $n \times p$ centered and scaled design matrix.
Step 2: Obtain an initial (consistent) estimator of $\beta$ (e.g., using the glmnet R package) and denote it by $\hat{\beta}$.
Step 3: Choose a subset of variables minimizing the risks:
(i) Order the absolute values of the $\hat{\beta}_j$'s and choose the first $k$ variables.
(ii) Compute the risks $R_k$, where $\hat{\beta}^{(k)}$ is the least squares estimate based on the top $k$ variables selected in step (i) above.
(iii) Choose $\hat{k}$ that minimizes $R_k$ over $k = 1, 2, \dots$.
Step 4: Output the coefficient vector with $\hat{k}$ non-zero entries obtained from Step 3(iii) and the rest of the entries set to zero. Also output the estimated intercept (recovered from the centering in Step 1).
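The four steps above can be sketched in base R. The exact risk criterion $R_k$ used by AutoLasso is not spelled out here, so the sketch below substitutes a Mallows' Cp-style unbiased risk estimate ($\mathrm{RSS}_k + 2k\sigma^2$) and a simple marginal-correlation initial estimator in place of the glmnet fit; both are stated assumptions, not the package's actual choices.

```r
# Sketch of the AutoLasso steps (assumptions: Cp-style risk in place of the
# package's risk criterion; marginal correlations as the initial estimator).
auto_lasso_sketch <- function(X, y, sigma2 = NULL) {
  n <- nrow(X); p <- ncol(X)
  # Step 1: center the response, center and scale the predictors
  yc <- y - mean(y)
  Xc <- scale(X)
  # Step 2: initial estimator (assumption: marginal correlations stand in
  # for the glmnet-based consistent estimator used by the package)
  beta0 <- drop(crossprod(Xc, yc)) / (n - 1)
  # Step 3(i): order variables by the absolute initial estimates
  ord <- order(abs(beta0), decreasing = TRUE)
  K <- min(n - 1, p)
  if (is.null(sigma2)) sigma2 <- var(yc)  # crude noise-variance plug-in
  risks <- numeric(K)
  coefs <- vector("list", K)
  for (k in 1:K) {
    idx <- ord[1:k]
    # Step 3(ii): least squares fit on the top-k variables
    fit <- lm.fit(Xc[, idx, drop = FALSE], yc)
    risks[k] <- sum(fit$residuals^2) + 2 * k * sigma2  # assumed Cp-style risk
    coefs[[k]] <- fit$coefficients
  }
  # Step 3(iii): pick the k minimizing the risk
  k_hat <- which.min(risks)
  # Step 4: full-length coefficient vector with k_hat non-zero entries
  beta <- numeric(p)
  beta[ord[1:k_hat]] <- coefs[[k_hat]]
  list(coef = beta, k = k_hat)
}
```

On data with a sparse true signal, the selected size `k` should be close to the number of truly active predictors.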
Generate data using the following scenarios:
$y = X\beta + \epsilon$, with rows of $X$ drawn from $N_p(0, \Sigma)$.
Two choices for the correlation matrix $\Sigma$:
(i) Autoregressive of order 1 (AR(1)): $\Sigma_{jk} = \rho^{|j-k|}$
(ii) Compound symmetry: $\Sigma_{jk} = \rho$ for $j \neq k$ and $\Sigma_{jj} = 1$,
where $0 \le \rho < 1$.
The error variance $\sigma^2$ is chosen to achieve a desired signal-to-noise ratio (SNR) (e.g., SNR = 10).
We explore various combinations of $(n, p, \rho, \mathrm{SNR})$ for the above scenarios, each based on 1000 simulated data replicates.
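A minimal data generator matching the two correlation choices can be written in base R. The SNR calibration below assumes the standard definition $\mathrm{SNR} = \beta^\top \Sigma \beta / \sigma^2$, and the choice of setting the first `p.true` coefficients to 1 is an illustrative assumption, not necessarily the design used in the paper's simulations.

```r
# Generate one simulated data set (assumptions: SNR = beta' Sigma beta / sigma^2;
# the first p.true coefficients equal 1, the rest are 0).
gen_data <- function(n, p, p.true, rho, cor.type = c("ar1", "cs"), snr = 10) {
  cor.type <- match.arg(cor.type)
  Sigma <- if (cor.type == "ar1") {
    rho^abs(outer(1:p, 1:p, "-"))             # AR(1): Sigma[j,k] = rho^|j-k|
  } else {
    m <- matrix(rho, p, p); diag(m) <- 1; m   # compound symmetry
  }
  X <- matrix(rnorm(n * p), n, p) %*% chol(Sigma)  # rows ~ N_p(0, Sigma)
  beta <- c(rep(1, p.true), rep(0, p - p.true))
  sigma2 <- drop(t(beta) %*% Sigma %*% beta) / snr # error variance from SNR
  y <- drop(X %*% beta) + rnorm(n, sd = sqrt(sigma2))
  list(X = X, y = y, beta = beta, sigma2 = sigma2)
}
```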
We are concerned with overall accuracy rates and biases:
the false discovery rate (FDR), along with the true and false positive rates (TPR, FPR);
the biases of the coefficient estimates (performance relative to the true coefficients).
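Given the selected support and the true support of a replicate, the three accuracy rates reported in the tables below can be computed as:

```r
# Selection accuracy rates for a single replicate:
#   TPR = TP / (# truly active), FPR = FP / (# truly inactive),
#   FDR = FP / (# selected), with FDR = 0 when nothing is selected.
sel_rates <- function(selected, truth) {
  tp <- sum(selected & truth)
  fp <- sum(selected & !truth)
  c(TPR = tp / sum(truth),
    FPR = fp / sum(!truth),
    FDR = if (tp + fp == 0) 0 else fp / (tp + fp))
}

sel_rates(selected = c(TRUE, TRUE, FALSE, TRUE),
          truth    = c(TRUE, TRUE, FALSE, FALSE))
# TPR = 1, FPR = 0.5, FDR = 1/3
```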
| Methods   | TPR | FPR   | FDR   |
|-----------|-----|-------|-------|
| True      | 1   | 0.000 | 0.000 |
| AutoLASSO | 1   | 0.000 | 0.002 |
| LASSO     | 1   | 0.044 | 0.631 |
| Methods   | TPR | FPR   | FDR   |
|-----------|-----|-------|-------|
| True      | 1   | 0.000 | 0.000 |
| AutoLASSO | 1   | 0.000 | 0.000 |
| LASSO     | 1   | 0.064 | 0.715 |
To study the application of the automated LASSO approach, we compare it with linear regression (lm) and the classic LASSO method. Data were taken from the R package COVID19, available to the public for educational and scientific use. Because the coronavirus outbreak in the United States surged in March and stabilized at the beginning of June, we selected data between March 1 and June 10, 2020. The dataset presented in this paper consists of four predictors (selected from a preliminary analysis of a set of five variables) modeled on a selected response variable, with 5202 observations. Substantive conclusions about the data were not the goal; the data were used only to observe how linear regression (lm), automated LASSO, and classic LASSO models perform on highly correlated data sets. All computation was done with R software using the packages glmnet and AutoLasso.
Number of selected predictors (prediction accuracy in parentheses):

| day.ahead | AutoLASSO | LASSO     | lm        |
|-----------|-----------|-----------|-----------|
| 1         | 2 (0.99)  | 3 (0.99)  | 57 (0.97) |
| 5         | 4 (0.98)  | 46 (0.97) | 57 (0.95) |
| 7         | 8 (0.98)  | 47 (0.96) | 57 (0.94) |
| 10        | 21 (0.95) | 46 (0.95) | 57 (0.93) |
Kendall's tau measures the association between the predicted and held-out observations; AutoLasso achieves almost the same prediction accuracy using a much smaller subset of predictor variables.
USPROC 2020 Electronic Undergraduate Statistics Research Conference: https://www.causeweb.org/usproc/eusrc/2020/virtual-posters/6
```r
# Design matrix and (log-transformed) response for the COVID-19 data
x <- model.matrix(log1p(confirmed) ~ log1p(confirmed.lag) + day + cancel +
                    internal + state - 1, data = covid.data)
y <- log1p(covid.data$confirmed)

# Classic LASSO via glmnet, with lambda chosen by cross-validation
cv.lasso.fit <- cv.glmnet(x, y)
lasso.fit.glmnet <- glmnet(x, y)
log.confirmed.pred <- predict(cv.lasso.fit, newx = x[test, ],
                              type = "response", s = "lambda.1se")
p.cor <- cor(log1p(covid.data$confirmed[test]), log.confirmed.pred)

# AutoLasso fit and held-out predictions (coef includes the intercept)
autolasso.fit <- auto.lasso(x, y)
log.confirmed.pred <- cbind(rep(1, length(test)), x[test, ]) %*% autolasso.fit$coef
p.cor <- cor(log1p(covid.data$confirmed[test]), log.confirmed.pred)
```
```r
# Simulation utilities from the AutoLasso package
model.compare(n = 100, ratio = 0.5, p.true = 5, rho = 0, cor.type = 1,
              snr = 10, N.sim = 100)
sim.study(n = 100, N.sim = 100, ratio = 2, p.true = 5, rho = 0.25,
          cor.type = 1, snr = 10, method = 1)
```