# glasp: A variable selection and clustering method for generalized linear models

The R package `glasp` is an extension of the Sparse Group Lasso that computes groups automatically. The internal supervised variable clustering algorithm is an original contribution in its own right, and it integrates naturally within the Group Lasso penalty. Moreover, this implementation is flexible enough to change the risk function and address any regression problem.
To install `glasp` in R, use

```r
devtools::install_github("jlaria/glasp", dependencies = TRUE)
```
In this example, we show how to integrate `glasp` with `parsnip` and other tidy libraries. We first load the required libraries.

```r
library(glasp)
library(parsnip)
library(yardstick)
```
Next, we simulate some linear data with the `simulate_dummy_linear_data` function.

```r
set.seed(0)
data <- simulate_dummy_linear_data()
```
A `glasp` model can be computed using different approaches. This is the `parsnip` approach.

```r
model <- glasp_regression(l1 = 0.01, l2 = 0.001, frob = 0.0003) %>%
  set_engine("glasp") %>%
  fit(y ~ ., data)
```
The coefficients can be accessed through the `parsnip` model object as `model$fit$beta`.

```r
print(model)
```
Now we generate some out-of-sample data to check how the model predicts.

```r
set.seed(1)
new_data <- simulate_dummy_linear_data()
pred <- predict(model, new_data)
rmse <- rmse_vec(new_data$y, pred$.pred)
```
The resulting root mean squared error is stored in `rmse`.
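For reference, `rmse_vec` computes the usual root mean squared error. A minimal hand-rolled check on toy vectors (the variable names here are illustrative, not part of `glasp`):

```r
# Hand-rolled RMSE on a toy example, matching the formula behind
# yardstick::rmse_vec (toy vectors, not the simulated data above)
truth    <- c(1.0, 2.0, 3.0, 4.0)
estimate <- c(1.1, 1.9, 3.2, 3.8)
rmse_manual <- sqrt(mean((truth - estimate)^2))
```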
Hyper-parameter search is quite easy using the tools `glasp` integrates with. To show this, we will simulate some survival data.

```r
set.seed(0)
data <- simulate_dummy_surv_data()
head(data)
```
We create the `glasp` model, but this time we do not fit it. Instead, we will call the `tune` function.

```r
library(tune)

model <- glasp_cox(l1 = tune(), l2 = tune(), frob = tune(), num_comp = tune()) %>%
  set_engine("glasp")
```
We specify k-fold cross-validation and Bayesian optimization to search for hyper-parameters.

```r
library(rsample)

data_rs <- vfold_cv(data, v = 4)

hist <- tune_bayes(model, event ~ time + ., # <- notice the syntax, with time on the right-hand side
                   resamples = data_rs,
                   metrics = metric_set(roc_auc), # yardstick's roc_auc
                   iter = 10,                     # <- 10 iterations... change to 1000
                   control = control_bayes(verbose = TRUE))

# show_best(hist, metric = "roc_auc")
```
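Once tuning finishes, the best hyper-parameters can be extracted and plugged back into the model specification. A sketch using the standard `tune` helpers `select_best` and `finalize_model`, assuming the `hist`, `model`, and `data` objects from the chunks above:

```r
# Sketch: refit with the best hyper-parameters found by tune_bayes().
# select_best() and finalize_model() are standard tune helpers;
# `hist`, `model`, and `data` come from the chunks above.
best <- select_best(hist, metric = "roc_auc")

final_model <- finalize_model(model, best) %>%
  fit(event ~ time + ., data)
```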
The same workflow applies to classification. Here we fit a logistic model directly.

```r
library(glasp)
library(yardstick)

set.seed(0)
data <- simulate_dummy_logistic_data()

model <- logistic_regression(y ~ ., data, l1 = 0.01, l2 = 0.001, frob = 0.001, ncomp = 2)
print(model)

pred <- predict(model, data)
```
Grid search works similarly, using `tune_grid` instead of `tune_bayes`.

```r
library(rsample)

set.seed(0)
data <- simulate_dummy_logistic_data()

model <- glasp_classification(l1 = tune(), l2 = tune(), frob = tune(), num_comp = tune()) %>%
  set_engine("glasp")

data_rs <- vfold_cv(data, v = 4)

hist <- tune_grid(model, y ~ .,
                  resamples = data_rs,
                  metrics = metric_set(roc_auc),
                  grid = 10,
                  control = control_grid(verbose = FALSE))

# show_best(hist, metric = "roc_auc")
```
To replicate the environment used in development, you need Docker and Visual Studio Code with the Remote-Containers extension.

After installing the requisites above,

1. Clone this repository:

   ```
   git clone https://github.com/jlaria/glasp.git
   ```

2. In VS Code, `File/Open Folder` and open the root folder of this project, `glasp`.
3. From the command palette, run `Remote-Containers: Reopen in Container`.
4. Open `http://localhost:8787` in your browser and log in with username `rstudio`, password `rstudio`.

That's it! You can use an isolated rstudio-server for development.
See `inst/rstudio` for more details about the environment.