Quick Start

set.seed(150)

Quick Start (for lsgl version r packageVersion("lsgl"))

1. Load the msgl library in R

library(lsgl)

2. Load your data

Load data containing N samples and p features (covariates) and a response matrix containing K responses for each sample:

X <- # load design matrix (of size N x p)
Y <- # load response matrix (of size N x K)

For the purpose of this tutorial we will load a data set consisting of airline ticket prices

data(AirlineTicketPrices)
dim(X)
dim(Y)

Hence, p = 411, N = 337 and the dimension of the response K = 6, this implies that the model has 6*(411+1) = 2472 parameters.

Let us take out a small test set:

idx <- sample(1:nrow(X), size = 50)

Xtest <- X[idx, ]
Ytest <- Y[idx, ]
X <- X[-idx, ]
Y <- Y[-idx, ]

3. Estimate error using cross validation

Choose lambda (fraction of lambda.max) and alpha, with alpha = 1 for lasso, alpha = 0 for group lasso and alpha in the range (0,1) for spares group lasso.

Use lsgl::cv to estimate the error for each lambda in a sequence decreasing from the data derived lambda max to lambda * lambda max. Lambda max is the lambda at which the first penalized parameter becomes non-zero. A smaller lambda will take longer to fit and include more features. The following command will run a 10 fold cross validation for each lambda value in the lambda sequence using 2 parallel units (using the foreach and doParallel packages.

cl <- makeCluster(2)
registerDoParallel(cl)

# Do cross validation -- this may take some time
fit.cv <- lsgl::cv(X, Y, fold = 10, alpha = 0.5, lambda = 0.01, use_parallel = TRUE)

stopCluster(cl)

(for the current version no progress bar will be shown)

Get a summery of the validated models. We have now cross validated the models corresponding to the lambda values, one model for each lambda value. We may get a summery of this validation by doing:

fit.cv

Hence, the best model is obtained using lambda index r best_model(fit.cv) and it has a cross validation error of r round(Err(fit.cv)[best_model(fit.cv)],2). The expected number of selected features is r colMeans(features_stat(fit.cv))[best_model(fit.cv)] and the expected number of parameters is r colMeans(parameters_stat(fit.cv))[best_model(fit.cv)].

4. Fit the final model

Use lsgl to fit a final model.

fit <- lsgl::fit(X, Y, alpha = 0.5, lambda = 0.01)

Get a summery of the estimated models

fit

Take a look at the estimated models. As we saw in the previous step the model with index r best_model(fit.cv) had the best cross validation error, we may take a look at the included features using the command:

features(fit)[[best_model(fit.cv)]][1:10] # Ten first non-zero features in best model

Hence r length(features(fit)[[best_model(fit.cv)]]) features are included in the model, this is close to the expected number based on the cross validation estimate.

The sparsity structure of the parameters belonging to these r length(features(fit)[[best_model(fit.cv)]]) features may be viewed using

image(parameters(fit)[[best_model(fit.cv)]])

We may also take a look at the estimate parameters (or coefficients)

coef(fit, best_model(fit.cv))[,1:5] # First 5 non-zero parameters of best model

If we count the total number of non-zero parameters in the model we get, in this case r sum(parameters(fit)[[best_model(fit.cv)]]) which is close to the expected based on the cross validation estimate.

6. Use your model for predictions

Load test data containing M samples and p features.

Xtest <- # load matrix with test data (of size M x p)

Use the final model to predict the price vector of the M=50 samples in Xtest.

res <- predict(fit, Xtest)

Plot predicted and true response

image(Ytest, main = "Observed prices")
image(res$Yhat[[best_model(fit.cv)]], main = "Predicted prices")

Compute the error rates on the test set

plot(Err(fit, Xtest, Ytest), xlab = "lambda index")


Try the lsgl package in your browser

Any scripts or data that you put into this service are public.

lsgl documentation built on May 29, 2017, 11:43 a.m.