set.seed(150)
r packageVersion("lsgl")
)library(lsgl)
Load data containing N samples and p features (covariates) and a response matrix containing K responses for each sample:
X <- # load design matrix (of size N x p) Y <- # load response matrix (of size N x K)
For the purpose of this tutorial we will load a data set consisting of airline ticket prices
data(AirlineTicketPrices) dim(X) dim(Y)
Hence, p = 411, N = 337 and the dimension of the response K = 6, this implies that the model has 6*(411+1) = 2472 parameters.
Let us take out a small test set:
idx <- sample(1:nrow(X), size = 50) Xtest <- X[idx, ] Ytest <- Y[idx, ] X <- X[-idx, ] Y <- Y[-idx, ]
Choose lambda
(fraction of lambda.max) and alpha
, with alpha = 1
for lasso, alpha = 0
for group lasso and alpha
in the range (0,1) for spares group lasso.
Use lsgl::cv
to estimate the error for each lambda in a sequence decreasing from the data derived lambda max to lambda
* lambda max.
Lambda max is the lambda at which the first penalized parameter becomes non-zero.
A smaller lambda
will take longer to fit and include more features.
The following command will run a 10 fold cross validation for each lambda value in the lambda sequence using 2 parallel units (using the foreach and doParallel packages.
cl <- makeCluster(2) registerDoParallel(cl) # Do cross validation -- this may take some time fit.cv <- lsgl::cv(X, Y, fold = 10, alpha = 0.5, lambda = 0.001, use_parallel = TRUE) stopCluster(cl)
(for the current version no progress bar will be shown)
Get a summery of the validated models. We have now cross validated the models corresponding to the lambda values, one model for each lambda value. We may get a summery of this validation by doing:
fit.cv
Hence, the best model is obtained using lambda index r best_model(fit.cv)
and it has a cross validation error of r round(Err(fit.cv)[best_model(fit.cv)],2)
. The expected number of selected features is r colMeans(features_stat(fit.cv))[best_model(fit.cv)]
and the expected number of parameters is r colMeans(parameters_stat(fit.cv))[best_model(fit.cv)]
.
Use lsgl to fit a final model.
fit <- lsgl::fit(X, Y, alpha = 0.5, lambda = 0.01)
Get a summery of the estimated models
fit
Take a look at the estimated models.
As we saw in the previous step the model with index r best_model(fit.cv)
had the best cross validation error, we may take a look at the included features using the command:
features(fit)[[best_model(fit.cv)]][1:10] # Ten first non-zero features in best model
Hence r length(features(fit)[[best_model(fit.cv)]])
features are included in the model, this is close to the expected number based on the cross validation estimate.
The sparsity structure of the parameters belonging to these r length(features(fit)[[best_model(fit.cv)]])
features may be viewed using
image(parameters(fit)[[best_model(fit.cv)]])
We may also take a look at the estimate parameters (or coefficients)
coef(fit, best_model(fit.cv))[,1:5] # First 5 non-zero parameters of best model
If we count the total number of non-zero parameters in the model we get, in this case r sum(parameters(fit)[[best_model(fit.cv)]])
which is close to the expected based on the cross validation estimate.
Load test data containing M samples and p features.
Xtest <- # load matrix with test data (of size M x p)
Use the final model to predict the price vector of the M=50 samples in Xtest.
res <- predict(fit, Xtest)
Plot predicted and true response
image(Ytest, main = "Observed prices") image(res$Yhat[[best_model(fit.cv)]], main = "Predicted prices")
Compute the error rates on the test set
plot(Err(fit, Xtest, Ytest), xlab = "lambda index")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.