knitr::opts_chunk$set( collapse = TRUE, message = FALSE, warnings = FALSE, comment = "#>", fig.width = 4, fig.height = 3.7, dev.args = list(pointsize = 8) ) library(lattice) trellis.par.set(fontsize = list(text = 8, points = 4))
Our first task is to fit a L1-regularized linear model to the spam data set from ElemStatLearn and analyze the results in terms of the selected features as well as test error and AUC. We will also compare our model to a naive model that predicts the most frequent class.
We begin by loading our libraries.
library(gsoc18saga) library(glmnet) library(ElemStatLearn)
We will use the ElemStatLearn::spam
dataset to try to classify email
as either spam or regular email. We load the data and set up balanced
train and test partitions using the caret package.
# extract the necessary data x <- as.matrix(spam[, -ncol(spam)]) y <- spam$spam n <- nrow(x) # create train and test sets train_id <- caret::createDataPartition(y, p = 0.75)[[1L]] train_x <- x[train_id, ] train_y <- y[train_id] test_x <- x[-train_id, ] test_y <- y[-train_id] # train the model set.seed(1) fit <- glmnet::cv.glmnet(train_x, train_y, family = "binomial")
The model's chosen features are
names(coef(fit)[coef(fit)[, 1] > 0, ])
Next we'll examine the performance of the model using
Receiver Operating Characteristics (ROC), Area Under the Curve (AUC), and
test error. We'll compare the model against a
naive classification scheme wherein we predict each observation (email)
to be the most prevalent category of the training set, namely.
r names(which.max(table(train_y)))
.
library(pROC) roc_glmnet <- roc(test_y, as.vector(predict(fit, test_x, type = "response")), ci = TRUE) roc_naive <- roc(test_y, rep(1, length(test_y)), ci = TRUE) plot(roc_glmnet, print.auc = TRUE)
The 95% confidence level for the AUC value for our
lasso model is $[r roc_glmnet$ci[1]
, r roc_glmnet$ci[3]
]$. We can compare
the glmnet model to the naive model using DeLong's test
roc.test(roc_glmnet, roc_naive)
which unsurprisingly shows that the glmnet model is significantly better.
Looking at the accuracy of the model, the naive model has an accuracy of
r sum(test_y == "email")/length(test_y)
whilst the lasso model
manages an accuracy of
r sum(predict(fit, test_x, type = "class") == test_y)/length(test_y)
.
In this part we are going to compare the performance of glmnet with
that of bigoptim for L2-regularized model fits. For this purpose,
we are going to use the gsoc18saga::covertype
data set and
and see how timings differ depending on the numbers of columns and rows we
use from the data. We will also study objective function values in the
same fashion.
The results for these simulations have been precomputed through code
that is available at https://github.com/jolars/gsoc18saga/blob/master/data-raw/medium.R.
The dataset is made part of this package and available as gsoc18saga::data_medium
.
Looking at the results from the simulation, we note that glmnet()
and
sag_fit()
(bigoptim) perform on par with one another
when either the rows or columns are relatively few.
As we add more observations (rows) and covariates (columns), however,
the SAG algorithm from bigoptim outperforms glmnet()
in speed.
gradpal <- colorRampPalette(RColorBrewer::brewer.pal(11, "RdBu"))(100) library(tidyr) library(dplyr) data_medium_diff <- data_medium %>% gather("var", "value", time, loss) %>% spread(package, value) %>% mutate(diff = glmnet - sag) %>% select(-glmnet, -sag) %>% spread(var, diff) z <- max(abs(data_medium_diff$time)) lattice::levelplot( time ~ cols*rows, data = data_medium_diff, asp = 1, at = seq(-z, z, length.out = 20), col.regions = gradpal, main = "Speed (glmnet - bigoptim)" )
In terms of loss, however, we see that bigoptim suffers when there are many features in the model, particularly when the number of observations are relatively low.
z <- max(abs(data_medium_diff$loss)) lattice::levelplot( loss ~ cols*rows, data = data_medium_diff, asp = 1, at = seq(-z, z, length.out = 20), col.regions = gradpal, main = "Loss (glmnet - bigoptim)" )
This section is similar to the previous one but relates to L1-regularization, rather than L2-regularization and instead compares glmnet with the logistic regression fitter of the python scikit-learn module. In particular, we will look at the SAGA algorithm. This time around, we will look at six different datasets:
Name Observations Features ---------- ---------------- --------- icjnn1 49,990 22 a9a 32,561 123 phishing 11,055 68 mushroooms 8,124 112 covtype 581,012 54 skin_nonskin 245,057 3
All of these have been collected from the libsvm binary dataset collection.
The R package reticulate is used to interface with python in the
data simulations, which, are featured at
https://github.com/jolars/gsoc18saga/blob/master/data-raw/medium-hard.R.
The dataset is made part of this package and available as
gsoc18saga::data_mediumhard
.
To put the two packages on equal footing, we now instead use the python-based implementation of glmnet to circumvent the possible overhead that calling python from R might introduce. For each dataset in our comparison, we iterate across a range of tolerance values (stopping criteria) for the optimizers in the two algorithms and collect the objective loss for each algorithm as well as the runtime at which they terminate.
library(latticeExtra) lattice::xyplot(loss ~ time | dataset, data_mediumhard, xlab = "Time (s)", ylab = "Objective loss", groups = package, scales = list(relation = "free"), auto.key = list(lines = TRUE, points = TRUE)) + glayer(panel.smoother(..., span = 0.5))
For all but the gsoc18saga::ijcnn1
dataset, the SAGA algorithm
converges faster towards the minimum than glmnet,
suggesting that there likely is benefit to be gained from
porting the SAGA algorithm to R.
The solution to the last test is presented in the package that
this vignette is part of. I have written the functions objective_r()
:
objective_r <- function(beta0, beta, x, y, lambda, alpha = 0) { n <- length(y) # binomial loglikelihood z <- beta0 + crossprod(x, beta) loglik <- sum(y*z - log(1 + exp(z))) # compute penalty penalty <- 0.5*(1 - alpha)*sum(beta^2) + alpha*sum(abs(beta)) -loglik/n + lambda*penalty }
and objective_cpp()
:
```{Rcpp cpp-objective, eval = FALSE}
// [[Rcpp::export]] double objective_cpp(double beta0, arma::vec beta, arma::mat x, arma::uvec y, double lambda, double alpha = 0) { int n = y.n_elem; arma::mat z = beta0 + x.t()beta; double loglik = arma::accu(y%z - arma::log(1 + arma::exp(z))); double penalty = 0.5(1 - alpha)arma::accu(arma::square(beta)) + alphaarma::accu(arma::abs(beta)); return -loglik/n + lambda*penalty; }
The concurrence of the two functions is examplified through the following lines, which have been added as a unit test to the package, the result of which you can view [here](https://travis-ci.org/jolars/gsoc18saga), as required by the GSoC test. ```r # fit a L1-regularized logistic regression x <- matrix(rnorm(100*20), 100, 20) y <- sample(1:2, 100, replace = TRUE) alpha <- 1 fit <- glmnet::cv.glmnet(x, y, family = "binomial", alpha = alpha) # collect parameters lambda <- fit$lambda.1se beta0 <- coef(fit, lambda)[1] beta <- coef(fit, lambda)[-1] # compute objective function values obj_r <- objective_r(beta0, beta, t(x), y, lambda, alpha = alpha) obj_cpp <- objective_cpp(beta0, beta, t(x), y, lambda, alpha = alpha) # check for equality all.equal(obj_r, obj_cpp)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.