```r
knitr::opts_chunk$set(message = FALSE, warning = FALSE, fig.show = 'hold',
                      fig.width = 4, fig.height = 3.4)
set.seed(4395)
```
`droplasso` is a package that fits a generalized linear model via maximum likelihood regularized by droplasso [@Khalfaoui2018DropLasso], a procedure that combines dropout [@Srivastava2014Dropout] and lasso [@Tibshirani1996Regression] regularizations.
Given a training set of samples $x_1,\ldots,x_n\in\mathbb{R}^d$ with labels $y_1,\ldots,y_n\in\mathbb{R}$, the droplasso regularization estimates a linear model $f(x)=w^\top x$ for $x\in\mathbb{R}^d$ by solving:
\begin{equation}
\min_{w \in \mathbb{R}^d} \left\{ \frac{1}{n} \sum_{i=1}^{n} \underset{\delta_i \sim B(p)^d}{\mathbb{E}} \, L\left(w, \delta_i \odot \frac{x_i}{p}, y_i \right) + \lambda \left\| w \right\|_1 \right\} \,,
\end{equation}
where $L(w,x,y)$ is the negative log-likelihood of a linear model $w$ on an observation $x$ with label $y$, $B(p)$ is the Bernoulli distribution with parameter $p$, and $\odot$ is the entry-wise multiplication of vectors. When $p=1$, each $\delta_i$ is almost surely a vector of $1$'s and droplasso boils down to the classical lasso; when $\lambda=0$, droplasso boils down to dropout.
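To make the objective concrete, here is a minimal sketch (not part of the package) that approximates it for the logistic loss by Monte Carlo sampling of the dropout masks; the function name `droplasso_objective` and the number of sampled masks `nmask` are illustrative choices:

```r
# Monte Carlo approximation of the droplasso objective for the logistic loss.
# `droplasso_objective` and `nmask` are illustrative, not part of the package.
droplasso_objective <- function(w, x, y, p = 0.5, lambda = 0.01, nmask = 1000) {
  n <- nrow(x)
  loss <- 0
  for (b in seq_len(nmask)) {
    # Draw one Bernoulli(p) dropout mask per entry and rescale by 1/p
    delta <- matrix(rbinom(n * ncol(x), 1, p), nrow = n)
    eta <- (delta * x / p) %*% w
    # Logistic negative log-likelihood for labels y in {0,1}
    loss <- loss + mean(log(1 + exp(-(2 * y - 1) * eta)))
  }
  loss / nmask + lambda * sum(abs(w))
}
```

Evaluating this function at candidate weight vectors shows the quantity that the solver in the package minimizes; the package itself uses proximal stochastic gradient descent rather than this brute-force estimate.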
Here we illustrate the use of droplasso on simulated data, and compare it to standard dropout and elastic net regularization. We design a toy simulation to illustrate, in particular, how corruption by dropout noise impacts the performance of the different methods. The simulation goes as follows:

- We draw samples $z$ with correlated Poisson-distributed entries, obtained by mapping correlated multivariate normal samples through the Poisson quantile function.
- We define a sparse linear model $w$ with $d_1$ nonzero entries.
- We draw binary labels $y$ from a logistic model on $z$.
- We corrupt the samples with dropout noise: each entry of $z$ is kept with probability $q$ and set to zero otherwise, giving the observed samples $x$.
Let us simulate $n=100$ samples to form the training set, and $10{,}000$ samples each for a validation set and a test set:
```r
library(mvtnorm)

generate_data <- function(n=100, d=100, d1=10, pi=1, w=0.05, q=0.4) {
  # The samples z: correlated Poisson counts via a Gaussian copula
  mu <- rep(0, d)
  Sigma <- (matrix(1, nrow=d, ncol=d) + diag(d)) / 2
  rawvars <- rmvnorm(n, mean=mu, sigma=Sigma)
  pvars <- pnorm(rawvars)
  z <- qpois(pvars, pi)
  # The sparse model w
  w <- c(rep(w, d1), rep(0, d - d1))
  # The labels y
  y <- rbinom(n, 1, 1 / (1 + exp(-z %*% w)))
  # The corrupted samples x
  x <- sapply(1:d, function(i) z[, i] * rbinom(n, 1, q))
  return(list(z=z, x=x, y=y, w=w))
}

data_train <- generate_data()
data_valid <- generate_data(n=10000)
data_test <- generate_data(n=10000)
```
Let us illustrate how to fit a Droplasso model and test it.
Let us train a droplasso model with the default dropout parameter ($p=0.5$) for a binary classification problem (i.e., using the logistic loss). The `droplasso` function can directly fit the model over a grid of $\lambda$ values to estimate the regularization path, so here we compute the path over 50 values of $\lambda$:
```r
library(droplasso)
m <- droplasso(data_train$x, data_train$y, family="binomial", nlambda=50)
```
Since we did not specify a value for the regularization parameter $\lambda$, the function has fitted droplasso models over a whole grid of $\lambda$ values. We can visualize the resulting regularization path, i.e., the model weights as a function of $\lambda$:
```r
plot(m, main="Droplasso p=0.5")
```
Let us see what the model predicts on the test set (for all $\lambda$ values):

```r
ypred <- predict(m, data_test$x)
```
Since we know the true labels, we can compute the area under the ROC curve (AUC) to evaluate the quality of the predictions:
```r
library(ROCR)
pred <- sapply(seq(ncol(ypred)), function(j) prediction(ypred[, j], data_test$y))
auc <- unlist(lapply(pred, function(p) performance(p, "auc")@y.values[[1]]))
plot(log(m$lambda), auc, type='l', xlab="log(lambda)", ylab="AUC",
     main="Droplasso AUC on the test set")
grid()
```
We see that, as $\lambda$ decreases and the model goes from the null model to models with more variables, the AUC sharply increases and then decreases, which illustrates the benefits of regularization.
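We can check this directly by counting the nonzero weights along the path; the snippet below assumes the fitted object exposes a glmnet-style `beta` matrix ($d$ rows, one column per $\lambda$):

```r
# Number of selected variables along the path
# (assumes m$beta is a d x nlambda coefficient matrix)
nselected <- colSums(m$beta != 0)
plot(log(m$lambda), nselected, type='l', xlab="log(lambda)",
     ylab="Number of nonzero weights")
grid()
```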
So far we have only tested droplasso with the default dropout parameter $p=0.5$. However, $p$ is also a regularization parameter, which we can tune between 0 (high regularization) and 1 (no regularization). Note that when $p$ is small, the optimization is more challenging, and it may be useful to increase the number of epochs of the proximal stochastic gradient descent algorithm through the `n_passes` parameter of the `droplasso` function.
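For instance, a model with strong dropout regularization could be fitted with more epochs along these lines (the values of `keep_prob` and `n_passes` below are arbitrary choices for illustration):

```r
# Strong dropout (small p), compensated by more optimization epochs
m_lowp <- droplasso(data_train$x, data_train$y, family="binomial",
                    keep_prob=0.1, n_passes=10000, nlambda=50)
```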
Let us see how the performance varies when we vary $p$. Since convergence of the optimization is harder to reach for small values of $p$, the performance is subject to more fluctuations. In order to pick the "best" AUC fairly, we therefore use a validation set to select the best $\lambda$, and then assess the AUC of that particular $\lambda$ only on the test set.
```r
# The values of p we want to test
probalist <- 0.6^seq(0, 10)
# The number of lambda values along the regularization path
nlambda <- 100
# The number of epochs of the optimization procedure
n_passes <- 10000
auclist <- c()
for (proba in probalist) {
  # Train on the training set
  m <- droplasso(data_train$x, data_train$y, family="binomial",
                 keep_prob=proba, nlambda=nlambda, n_passes=n_passes)
  # Pick the lambda with the best AUC on the validation set
  ypred <- predict(m, data_valid$x)
  pred <- sapply(seq(ncol(ypred)), function(j) prediction(ypred[, j], data_valid$y))
  auc <- unlist(lapply(pred, function(p) performance(p, "auc")@y.values[[1]]))
  bestlambda <- m$lambda[which.max(auc)]
  # Assess the AUC of the best lambda on the test set
  ypred <- predict(m, data_test$x, s=bestlambda)
  auc <- performance(prediction(ypred, data_test$y), "auc")@y.values[[1]]
  auclist <- c(auclist, auc)
}
plot(log(probalist), auclist, type='b', xlab='ln(p)', ylab='Test AUC',
     main='Droplasso: AUC vs p for best lambda')
```
As explained in [@Khalfaoui2018DropLasso], droplasso is related to elastic net regularization [@Zou2005Regularization]. Let us check how elastic net behaves on the same dataset.
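For reference, with mixing parameter $\alpha\in[0,1]$, the elastic net solves
\begin{equation}
\min_{w \in \mathbb{R}^d} \left\{ \frac{1}{n} \sum_{i=1}^{n} L\left(w, x_i, y_i\right) + \lambda \left( \frac{1-\alpha}{2} \left\| w \right\|_2^2 + \alpha \left\| w \right\|_1 \right) \right\} \,,
\end{equation}
in the parametrization used by the `glmnet` package, so that $\alpha=1$ corresponds to the lasso and $\alpha=0$ to ridge regression.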
```r
library(glmnet)
# The values of alpha to interpolate between lasso (alpha=1) and ridge (alpha=0)
alphalist <- seq(0, 1, 0.1)
auclist <- c()
firstalpha <- TRUE
for (alpha in alphalist) {
  # Train glmnet model
  m_glm <- glmnet(data_train$x, data_train$y, family="binomial",
                  intercept=FALSE, alpha=alpha, standardize=FALSE,
                  nlambda=nlambda)
  # Predict on the validation set
  ypred <- data_valid$x %*% m_glm$beta
  pred <- sapply(seq(ncol(ypred)), function(j) prediction(ypred[, j], data_valid$y))
  auc <- unlist(lapply(pred, function(p) performance(p, "auc")@y.values[[1]]))
  if (length(auc) < nlambda) {
    # glmnet stopped early on the path: pad the missing lambda values
    # with chance-level AUC
    auc <- c(auc, rep(0.5, nlambda - length(auc)))
  }
  if (firstalpha) {
    auc_el <- auc
    firstalpha <- FALSE
  } else {
    auc_el <- cbind(auc_el, auc)
  }
  # Assess the AUC of the best lambda on the test set
  bestlambdaindex <- which.max(auc)
  ypred <- data_test$x %*% m_glm$beta[, bestlambdaindex]
  auc <- performance(prediction(ypred, data_test$y), "auc")@y.values[[1]]
  auclist <- c(auclist, auc)
}
plot(alphalist, auclist, type='b', xlab='alpha', ylab='Test AUC',
     main='Glmnet with best lambda')
```
We see that, similarly to droplasso, glmnet works best when there is no lasso regularization and a lot of ridge regularization, as shown in the following plot, which details the performance of glmnet for various values of $\lambda$ and $\alpha$.
```r
matplot(auc_el, type='l', lty=1, col=seq_along(alphalist),
        xlab='lambda index', ylab='Validation AUC',
        main="Glmnet, different alpha's")
legend("bottomright", legend=alphalist, col=seq_along(alphalist), lty=1, cex=0.5)
grid()
```
The best curve is obtained for `alpha=0` (ridge regression), with the best values reached for large $\lambda$ (left of the plot).