knitr::opts_chunk$set(message = FALSE, warning = FALSE, fig.show='hold', fig.width=4, fig.height=3.4)
set.seed(4395)

Introduction

droplasso is a package that fits a generalized linear model via maximum likelihood regularized by droplasso [@Khalfaoui2018DropLasso], a procedure that combines dropout [@Srivastava2014Dropout] and lasso [@Tibshirani1996Regression] regularizations.

Given a training set of samples $x_1,\ldots,x_n\in\mathbb{R}^d$ with labels $y_1,\ldots,y_n\in\mathbb{R}$, the droplasso regularization estimates a linear model $f(x)=w^\top x$ for $x\in\mathbb{R}^d$ by solving: \begin{equation} \min_{w \in \mathbb{R}^d} \left\{ \frac{1}{n} \sum_{i=1}^{n} \underset{\delta_i \sim B(p)^d}{\mathbb{E}} \, L\left(w, \delta_i \odot \frac{x_i}{p}, y_i \right) + \lambda \left\| w \right\|_1 \right\} \,, \end{equation} where $L(w,x,y)$ is the negative log-likelihood of a linear model $w$ on an observation $x$ with label $y$, $B(p)$ is the Bernoulli distribution with parameter $p$, and $\odot$ denotes entry-wise multiplication of vectors. When $p=1$, each $\delta_i$ is almost surely a vector of $1$'s and droplasso boils down to the classical lasso; when $\lambda=0$, it boils down to dropout.
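As a minimal sketch of the term inside the expectation, a single dropout-perturbed sample $\delta \odot x / p$ can be generated in R as follows (the names x, p and delta below are illustrative and not part of the package):

p <- 0.5                            # keep probability
x <- rnorm(10)                      # an arbitrary sample
delta <- rbinom(length(x), 1, p)    # Bernoulli(p) mask, one entry per feature
x_perturbed <- delta * x / p        # dropped entries become 0, kept entries are rescaled by 1/p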

Data simulation

Here we illustrate the use of droplasso on simulated data and compare it to standard dropout and elastic net regularization. We design a toy simulation to illustrate, in particular, how corruption by dropout noise impacts the performance of the different methods. The simulation goes as follows: we draw correlated count samples $z$ (Poisson marginals coupled by a Gaussian copula), define a sparse model $w$ whose first $d_1$ coefficients are nonzero, draw binary labels $y$ from a logistic model on $z$, and finally corrupt the samples with dropout noise, keeping each entry of $z$ with probability $q$ to form the observed samples $x$.

Let us simulate $n=100$ samples to form the training set, and two additional sets of $10,000$ samples each to validate and test the model:

library(mvtnorm)
generate_data <- function(n=100, d=100, d1=10, pi=1, w=0.05, q=0.4) {
  # The samples z: correlated Poisson(pi) variables obtained via a Gaussian copula
  mu <- rep(0, d)
  Sigma <- (matrix(1, nrow=d, ncol=d) + diag(d))/2
  rawvars <- rmvnorm(n, mean=mu, sigma=Sigma)
  pvars <- pnorm(rawvars)
  z <- qpois(pvars, pi)
  # The sparse model w: only the first d1 coefficients are nonzero
  w <- c(rep(w, d1), rep(0, d-d1))
  # The labels y, drawn from a logistic model on the uncorrupted samples z
  y <- rbinom(n, 1, 1/(1+exp(-z %*% w)))
  # The corrupted samples x: each entry of z is kept with probability q, otherwise set to 0
  x <- sapply(1:d, function(i) z[,i] * rbinom(n, 1, q))
  return(list(z=z, x=x, y=y, w=w))
}
data_train <- generate_data()
data_valid <- generate_data(n=10000)
data_test <- generate_data(n=10000)
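As a quick sanity check (not needed for the rest of the vignette), we can inspect the dimensions of the corrupted training matrix and the fraction of its entries that are zero after corruption:

dim(data_train$x)
mean(data_train$x == 0)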

Droplasso

Let us illustrate how to fit a Droplasso model and test it.

Droplasso for a given $p$

Let us train a droplasso model with the default dropout parameter ($p=0.5$) for a binary classification problem (i.e., using the logistic loss function). The droplasso function can fit the model directly over a grid of $\lambda$ values to estimate the regularization path, so here we compute the regularization path over 50 values of $\lambda$:

library(droplasso)
m <- droplasso(data_train$x, data_train$y, family="binomial", nlambda=50)

Since we did not specify a value for the regularization parameter $\lambda$, the function has computed all droplasso models over a grid of $\lambda$ values. We can visualize the resulting regularization path, i.e., the model weights as a function of $\lambda$:

plot(m, main="Droplasso p=0.5")

Let us see what the model predicts on the test set (for all $\lambda$ values):

ypred <- predict(m, data_test$x)

Since we know the true labels, we can compute the area under the ROC curve (AUC) to evaluate the quality of the predictions:

library(ROCR)
pred <- sapply(seq(ncol(ypred)), function(j) {prediction(ypred[,j], data_test$y)})
auc <- unlist(lapply(pred, function(p) {performance(p, "auc")@y.values[[1]]}))
plot(log(m$lambda), auc, type='l', xlab="log(lambda)", ylab="AUC", main="Droplasso AUC on the test set")
grid()

We see that, as $\lambda$ decreases and the model goes from the null model to models with more variables, the AUC sharply increases and then decreases, which illustrates the benefits of regularization.
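For illustration only (proper model selection on a validation set is done below), we can read off the $\lambda$ that achieves the best test AUC on this path, and the corresponding AUC:

bestlambda <- m$lambda[which.max(auc)]
bestlambda
max(auc)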

Varying $p$

So far we have only tested droplasso with the default dropout parameter $p$ set to $0.5$. However, $p$ is also a regularization parameter that we can tune between 0 (high regularization) and 1 (no regularization). Note that when $p$ is small, the optimization is more challenging and it may be useful to increase the number of epochs of the proximal stochastic gradient descent algorithm by playing with the n_passes parameter of the droplasso function.
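For instance, a sketch of such a fit with a more aggressive dropout level and more epochs might look as follows (keep_prob=0.2 and n_passes=5000 are arbitrary illustrative values, not recommendations):

m_lowp <- droplasso(data_train$x, data_train$y, family="binomial", keep_prob=0.2, nlambda=50, n_passes=5000)
plot(m_lowp, main="Droplasso p=0.2")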

Let us see how the performance varies when we vary $p$. Since the convergence of the optimization is harder to reach for small values of $p$, the performance is subject to more fluctuation. To pick the best model, we therefore use a validation set to select the "best" $\lambda$ for each $p$, and then assess the AUC of that particular $\lambda$ on the test set.

# The values of p we want to test
probalist <- 0.6^seq(0,10)
# The number of lambda values we want to test to form the regularization path
nlambda <- 100
# The number of epoch of the optimization procedure
n_passes <- 10000

auclist <- c()
for (proba in probalist) {
  # Train on the train set
  m <- droplasso(data_train$x, data_train$y, family="binomial", keep_prob=proba, nlambda=nlambda, n_passes = n_passes)
  # Pick lambda with the best AUC on the validation set
  ypred <- predict(m, data_valid$x)
  pred <- sapply(seq(ncol(ypred)), function(j) {prediction(ypred[,j], data_valid$y)})
  auc <- unlist(lapply(pred, function(p) {performance(p, "auc")@y.values[[1]]}))
  bestlambda <- m$lambda[which.max(auc)]
  # Assess AUC of the best lambda on the test set
  ypred <- predict(m, data_test$x, s=bestlambda)
  auc <- performance(prediction(ypred, data_test$y), "auc")@y.values[[1]]
  auclist <- c(auclist,auc)
}
plot(log(probalist), auclist, type='b', xlab='ln(p)', ylab='Test AUC', main='Droplasso: AUC vs p for best lambda')
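We can also print the dropout probability that achieved the best test AUC in this experiment, together with that AUC:

probalist[which.max(auclist)]
max(auclist)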

Comparison of Droplasso and Glmnet

As explained in [@Khalfaoui2018DropLasso], droplasso is related to elastic net regularization [@Zou2005Regularization]. Let us check how elastic net behaves on the same dataset.
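For reference, with the intercept and standardization disabled as in the code below, glmnet with mixing parameter $\alpha$ solves \begin{equation} \min_{w \in \mathbb{R}^d} \left\{ \frac{1}{n} \sum_{i=1}^{n} L(w, x_i, y_i) + \lambda \left( \alpha \left\|w\right\|_1 + \frac{1-\alpha}{2} \left\|w\right\|_2^2 \right) \right\} \,, \end{equation} so that $\alpha=1$ corresponds to the lasso and $\alpha=0$ to ridge regression.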

library(glmnet)
# The values of alpha to interpolate between lasso (alpha=1) and ridge (alpha=0)
alphalist <- seq(0,1,0.1)
auclist <- c()
firstalpha <- TRUE
for (alpha in alphalist) {
  # Train glmnet model
  m_glm <- glmnet(data_train$x, data_train$y, family="binomial", intercept=FALSE, alpha=alpha, standardize=FALSE, nlambda=nlambda)
  # Predict on the validation set
  ypred <- data_valid$x %*% m_glm$beta
  pred <- sapply(seq(ncol(ypred)), function(j) {prediction(ypred[,j], data_valid$y)})
  auc <- unlist(lapply(pred, function(p) {performance(p, "auc")@y.values[[1]]}))
  if (length(auc) < nlambda) { # glmnet did not converge for the last lambda values
    auc <- c(auc, rep(0.5, nlambda-length(auc)))
  }
  if (firstalpha) {
    auc_el <- auc
    firstalpha <- FALSE
  } else {
    auc_el <- cbind(auc_el, auc)
  }

  # Assess AUC of the best lambda on the test set
  bestlambdaindex <- which.max(auc)
  ypred <- data_test$x %*% m_glm$beta[,bestlambdaindex]
  auc <- performance(prediction(ypred, data_test$y), "auc")@y.values[[1]]
  auclist <- c(auclist,auc)
}
plot(alphalist, auclist, type='b', xlab='alpha', ylab='Test AUC', main='Glmnet with best lambda')

We see that, similar to droplasso, glmnet works best when there is no lasso regularization and a lot of ridge regularization, as shown in the following plot, which details the validation performance of glmnet for various values of $\lambda$ and $\alpha$.

matplot(auc_el, type='l', lty=1, col=seq_along(alphalist), xlab='lambda index', ylab='Validation AUC', main="Glmnet, different alpha's")
legend("bottomright", legend=alphalist, col=seq_along(alphalist), lty=1, cex=0.6, title="alpha")
grid()

The best curve is obtained for $\alpha=0$ (ridge regression), and the best values are reached for large $\lambda$ (left of the plot, since glmnet orders $\lambda$ from largest to smallest).
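For completeness, the $\alpha$ achieving the best test AUC (each $\alpha$ evaluated with its validation-selected $\lambda$) and the corresponding AUC can be printed directly:

alphalist[which.max(auclist)]
max(auclist)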

References


