vignette-spinners/ensr-examples.R

#'---
#'title: "Elastic Net SearcheR Examples"
#'author: "Peter E. DeWitt"
#'output:
#'  rmarkdown::html_vignette:
#'    toc: true
#'    number_sections: true
#'bibliography: references.bib
#'vignette: >
#'  %\VignetteEngine{knitr::rmarkdown}
#'  %\VignetteIndexEntry{ensr-examples}
#'  %\VignetteEncoding{UTF-8}
#'---

#'
#'ensr package version
{{ packageVersion('ensr') }}
#'

#+ label=setup, include = FALSE
library(knitr)
knitr::opts_chunk$set(collapse = TRUE, fig.width = 6, fig.height = 4)
library(qwraps2)
options(qwraps2_markup = "markdown")

#'
#' The primary purpose of the
{{ qwraps2::Rpkg(ensr) }}
#' package is to provide methods for simultaneously searching
#' for preferable values of $\lambda$ and $\alpha$ in elastic net regression.
{{ qwraps2::Rpkg(ensr) }}
#' is wrapped around the
{{ qwraps2::CRANpkg(glmnet) }}
#' package.  This vignette starts with a summary of elastic net regression and
#' its use and limitations.  Examples of data set preparation follow and the
#' vignette concludes with elastic net regression results.
#'
#+label="load_and_attach_ensr"
library(ensr)
library(data.table)
library(ggplot2)
library(ggforce)
library(doMC)
registerDoMC(cores = max(c(detectCores() - 2L, 1L)))
options(datatable.print.topn  = 3L,
        datatable.print.nrows = 3L)

#'
# /*
# Section: Elastic Net Regression ----------------------------------------- {{{
# */
#'
#' # Elastic Net Regression
#'
#' Elastic Net Regression [@friedman2010regularization] is a penalized linear
#' modeling approach that is a mixture of ridge regression [@hoerl1970ridge],
#' and least absolute shrinkage and selection operator (LASSO) regression [@tibshirani1996regression].
#' Ridge regression reduces the impact of collinearity on model parameters and LASSO
#' reduces the dimensionality of the support by shrinking some of the regression
#' coefficients to zero.  Elastic net does both of these by solving the
#' following equation (for Gaussian responses):
#' $$\min_{\beta_0, \beta \in \mathbb{R}^{p+1}} \frac{1}{2N} \sum_{i = 1}^{N}
#' \left( y_i - \beta_0 - x_i^T \beta \right)^2 + \lambda \left[ \left(1 -
#' \alpha \right) \frac{\left \lVert \beta \right \rVert_{2}^{2}}{2} + \alpha
#' \left \lVert \beta \right \rVert_{1} \right],$$
#' where $\lambda \geq 0$ is the complexity parameter and $0 \leq \alpha \leq 1$
#' is the compromise between ridge $\left(\alpha = 0\right)$ and LASSO $\left(
#' \alpha = 1 \right).$
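#'
#' To make the role of $\alpha$ concrete, the short sketch below fits
#' `glmnet::glmnet` to simulated data (illustrative only) with $\alpha$ fixed at
#' 0, 0.5, and 1 and counts the non-zero coefficients at the midpoint of each
#' fit's own $\lambda$ path:
#+ label = "alpha_roles", eval = FALSE
library(glmnet)
set.seed(101)
x_sim <- matrix(rnorm(100 * 10), ncol = 10)
y_sim <- x_sim[, 1] - 2 * x_sim[, 2] + rnorm(100)

# ridge (alpha = 0), an even mixture (alpha = 0.5), and LASSO (alpha = 1)
fits <- lapply(c(0, 0.5, 1), function(a) glmnet(x = x_sim, y = y_sim, alpha = a))

# evaluated at the midpoint of each fit's own lambda path: ridge keeps all ten
# coefficients non-zero while LASSO shrinks several of them exactly to zero
sapply(fits, function(f) sum(abs(coef(f, s = median(f$lambda))[-1]) > 0))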
#'
#' Ridge regression rarely shrinks coefficients all the way to zero and thus
#' does little for model parsimony. One potential benefit of elastic net regression is
#' that, like the LASSO, it can be used to perform variable selection by
#' shrinking coefficients to zero.  Compared to LASSO, one potential benefit of
#' elastic net regression is that it will reproducibly return the same set of
#' non-zero coefficients when some predictors are highly correlated.  LASSO,
#' $\alpha = 1,$ may return different sets of non-zero coefficients when highly
#' correlated predictors are in the model.
#'
#' Compared to other machine learning approaches, a potential benefit of elastic
#' net regression is that the $\beta$ vector is easily interpretable and can be
#' implemented in almost any downstream computational pipeline. More flexible
#' machine learning models such as gradient boosting machines may be able to fit
#' data more accurately, but they are extremely difficult to export to other
#' tools.
#'
#' The
{{ qwraps2::backtick(cv.glmnet)}}
#' call from the
{{ qwraps2::CRANpkg(glmnet) }}
#' package is widely used to fit elastic net regression models. However, the
#' current implementation of
{{ qwraps2::backtick(cv.glmnet)}}
#' requires that the value(s) of $\alpha$ be specified by the user (see
#' "Details" in
{{ qwraps2::backtick(help("cv.glmnet")) }}
#' ).  We designed the
{{ qwraps2::CRANpkg(ensr) }}
#' package to fill this gap by simultaneously searching for a
#' preferable set of $\lambda$ and $\alpha$ values.
{{ qwraps2::CRANpkg(ensr) }}
#' also provides additional plotting methods to facilitate visual identification
#' of the best choice for a given project.
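#'
#' As a schematic comparison of the two call shapes (with placeholder matrices
#' `x` and `y`; not evaluated here):
#+ label = "cv_glmnet_vs_ensr", eval = FALSE
# cv.glmnet requires a single, user-supplied value of alpha ...
cv_fit <- glmnet::cv.glmnet(x = x, y = y, alpha = 0.5)
# ... whereas ensr accepts a vector of alphas and searches a lambda-alpha grid
ensr_fit <- ensr::ensr(x = x, y = y, alphas = seq(0, 1, by = 0.1))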
#'
# /*
# -------------------------------------------------------------------------- }}}
# */
# /*
# Section: Data Sets ------------------------------------------------------- {{{
# */
#' # Data Sets
#'
#' Two data sets are provided in the
{{ qwraps2::Rpkg(ensr) }}
#' package for use in examples.
#'
#' 1.
{{ qwraps2::backtick(tbi)}}
#' is a synthetic data set with which traumatic brain injury can be classified
#' into three different types using a set of predictors.
#'
#' 2.
{{ qwraps2::backtick(landfill)}}
#' is a synthetic data set similar to those generated by computer models of
#' water percolating through a landfill.
#'
#' More information about each of these data sets is
#' provided in the "ensr-datasets" vignette:
{{ paste0(qwraps2::backtick(vignette("ensr-datasets", package = "ensr")), ".") }}
#'
#+ label = "load_data"
data(tbi, landfill, package = "ensr")

# /*
# -------------------------------------------------------------------------- }}}
# */
#'
# /*
# Section: Searching for lambda and alpha ---------------------------------- {{{
# */
#'
#' # Searching for $\lambda$ and $\alpha$
#'
#' The optimal pair of $\lambda$ and $\alpha$ is likely project-specific.  Some
#' defaults are provided, but the user is encouraged to carefully consider the
#' balance between parsimony and model error that is right for their project.
#' For example, model error that is lower by 0.1% at the expense of three
#' additional parameters may or may not be desirable.
#'
# /*
# {{{ ----------------- Subsection: Univariate Response ------------------------
# */
#'
#' ## Univariate Response
#'
#' A call to
{{ qwraps2::backtick(ensr) }}
#' produces a search for a combination of $\lambda$ and $\alpha$
#' that results in the lowest cross validation error.  The arguments to
#' `ensr` are the same as those made to `cv.glmnet` with the addition of
#' `alphas`, a sequence of $\alpha$ values to use.  Please note that `ensr` will
#' add `length(alphas) - 1` additional values, the midpoints between the given
#' set, in the construction of a $\lambda$--$\alpha$ grid to search.  For the
#' initial example we will fit an elastic net for modeling the evaporation in
#' the landfill data loaded above.
set.seed(42)
y_matrix <- as.matrix(landfill$evap)
x_matrix <- as.matrix(landfill[, topsoil_porosity:weather_temp])

ensr_obj <- ensr(y = y_matrix, x = x_matrix, standardize = FALSE)
ensr_obj

#'
#' `ensr_obj` is an `ensr` object, which is a list of `cv.glmnet` objects.
#' The length of the list is determined by the length of the `alphas` argument.
#' The default for `alphas` is `r deparse(as.list(args(ensr))$alphas)`.
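#'
#' A small sketch of the midpoint expansion described above; the arithmetic below
#' is illustrative and is not the internal `ensr` code:
#+ label = "alpha_midpoints", eval = FALSE
alphas <- seq(0, 1, by = 0.2)
# midpoints between adjacent values of the supplied alphas
midpoints <- (head(alphas, -1) + tail(alphas, -1)) / 2
# the alpha values used in building the lambda-alpha grid
sort(c(alphas, midpoints))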
#'
#' The summary method for `ensr` objects returns a `data.table` with values of
#' $\lambda$, $\alpha$, the mean cross-validation error `cvm`, and the number of
#' non-zero coefficients.  The `l_index` column is the index into the `ensr`
#' list for the `cv.glmnet` fit associated with the noted $\alpha$ value.
ensr_obj_summary <- summary(object = ensr_obj)
ensr_obj_summary

#'
#' The preferable model may be the one with the minimum cross-validation error.
ensr_obj_summary[cvm == min(cvm)]

#'
#' A quick way to get the preferable model is to call the `preferable` method.
str(preferable(ensr_obj), max.level = 1L)

#'
#' `preferable` returns an `elnet` `glmnet` object with one additional list element, the
#' `ensr_summary` used to select this preferable model.
#'
#' Because the output of `preferable` inherits the same class as an object
#' returned from a call to `glmnet::glmnet`, the same methods can be used.
#' Plotting methods are one example:
par(mfrow = c(1, 3))
plot(preferable(ensr_obj), xvar = "norm")
plot(preferable(ensr_obj), xvar = "lambda")
plot(preferable(ensr_obj), xvar = "dev")

#'
#' Another graphical way to look at the results of the `ensr` search is to use the
#' `ensr`-provided plotting method.  In the plot below, each of the $\lambda$ (y-axis,
#' $\log_{10}$ scale) and $\alpha$ (x-axis) values considered in the `ensr_obj`
#' are plotted.  The coloring is denoted as `log10(z)` where `z = (cvm -
#' min(cvm)) / sd(cvm)`.   The color scale is set to have low values (values
#' near the minimum mean cross validation error) be dark green. Values moving
#' further from the minimum are lighter green, then white, then purple. A red cross
#' identifies the minimum mean cross-validation error.
plot(ensr_obj)
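#'
#' The color variable can be reproduced by hand from the summary table; a short
#' sketch of the `z` transformation described above:
#+ label = "cvm_z_by_hand", eval = FALSE
sm <- summary(ensr_obj)
# the quantity plotted on the log10 color scale
sm[, z := (cvm - min(cvm)) / sd(cvm)]
head(sm[order(z)])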

#'
#' The `ensr` `plot` method produces a `ggplot` object and thus can be customized.
#' In this example, we add the minimal theme and use `ggforce::facet_zoom` to zoom in
#' on a section of the graphic:
#+ eval = FALSE
# /*
if (FALSE) {
# */
plot(ensr_obj) +
  theme_minimal() +
  facet_zoom(x = 0.50 < alpha & alpha < 0.90, y = 5e-4 < lambda & lambda < 1.5e-3)
# /*
}
# */

#'
#' In this figure we see the minimum mean cross validation error occurs within
#' the models with $\alpha$ = `r summary(ensr_obj)[cvm == min(cvm), "alpha"]`.
summary(ensr_obj)[cvm == min(cvm)]

#'
#' Inspection of the plot suggests there is another minimum worth considering, for
#' $\alpha$ = `r summary(ensr_obj)[l_index == 16][cvm == min(cvm), "alpha"]`.
summary(ensr_obj)[, .SD[cvm == min(cvm)], by = alpha][l_index %in% c(13, 16)]

#'
#' The difference in the mean cross validation error between these two results
#' is very small and may not be meaningful. However, the number of non-zero (`nzero`)
#' coefficients is quite different.  With a very small
#' increase in the mean cross validation error, one more variable has its
#' regression coefficient shrunk to zero.  If parsimony is your primary
#' objective, the second model might be preferable.
#'
#' We can also look at the mean cross validation errors by `nzero`.
summary(ensr_obj)[, .SD[cvm == min(cvm)], by = nzero]

#'
#' The `plot` method for `ensr` objects has a `type` argument.  The default,
#' `type = 1`, produces the plot shown above; `type = 2` produces the following:
plot(ensr_obj, type = 2)

#'
#' Some customization to the plot:
plot(ensr_obj, type = 2) +
  theme_bw() +
  aes(x = nzero, y = cvm) +
  geom_point() +
  geom_line() +
  facet_zoom(xy = cvm < 0.05)

#'
#' Based on the figure above, if the objective is lowest cross validation error
#' and parsimony, the model with 14 non-zero coefficients
#' may be the preferable model. The additional non-zero coefficients in models
#' with 15 or more non-zero coefficients do not meaningfully reduce
#' the mean cross validation error. Further examination shows that the model with
#' only four non-zero coefficients might also be a reasonable choice.
summary(ensr_obj)[nzero %in% c(4, 14)][, .SD[cvm == min(cvm)], by = nzero]

#'
#' A quick side note: you can get both types of plots in one call:
plot(ensr_obj, type = c(1, 2))

#'
#' To obtain the coefficients from the above models:
landfill_evap_coef4  <- coef(ensr_obj[[19]], s = 0.0023788208)
landfill_evap_coef14 <- coef(ensr_obj[[16]], s = 0.0007415985)

#'
#' The table below shows the variables for each model in descending order of the
#' absolute value of the regression coefficients.  Because the predictors were
#' standardized, the relative magnitude of the coefficients can serve as a
#' sensitivity/influence/importance metric.
qwraps2::qable(
               data.table(
                          variable = c(rownames(landfill_evap_coef4),
                                       rownames(landfill_evap_coef14)),
                          value    = c(as.vector(matrix(landfill_evap_coef4)),
                                       as.vector(matrix(landfill_evap_coef14))),
                          nzero    = c(rep(4, 36), rep(14, 36))
                          )[
                            abs(value) >= 1e-10 & variable != "(Intercept)"
                            ][
                            order(nzero, -abs(value))
                            ][
                            ,
                            nzero := NULL
                            ]
               ,
               rgroup = c("nzero = 4" = 4, "nzero = 14"= 14),
               rnames = c(1:4, 1:14))

#'
#' The variables most important for modeling evaporation are `weather_temp`
#' (average temperature over the last 100 years),
#' `wind` (average wind speed), `weather_solrad` (average solar radiation), and
#' `rh` (relative humidity).  The fifth and higher coefficients are considerably
#' smaller and less important than the first four.
#'
#/*
# End of Univariate Response ----------------------------------------------- }}}
#*/
# /*
# {{{ --------------- Subsection: Cross Validation Issues ----------------------
# */
#'
#' ## Cross Validation Issues
#'
#' Cross-validation results may be dependent on the `foldid` randomly assigned to
#' each record. This is because some records may always be
#' considered together in either the training or validation data sets.  To
#' illustrate, three random fold assignments for 10-fold cross-validation are
#' generated below and the results of the corresponding `ensr` calls are compared.
foldid1 <- sample(seq(10), size = nrow(x_matrix), replace = TRUE)
foldid2 <- sample(seq(10), size = nrow(x_matrix), replace = TRUE)
foldid3 <- sample(seq(10), size = nrow(x_matrix), replace = TRUE)

ensr_obj_1 <- ensr(y = y_matrix, x = x_matrix, standardize = FALSE, foldid = foldid1)
ensr_obj_2 <- ensr(y = y_matrix, x = x_matrix, standardize = FALSE, foldid = foldid2)
ensr_obj_3 <- ensr(y = y_matrix, x = x_matrix, standardize = FALSE, foldid = foldid3)

summary(ensr_obj_1)[cvm == min(cvm)]
summary(ensr_obj_2)[cvm == min(cvm)]
summary(ensr_obj_3)[cvm == min(cvm)]

#'
#' There are small differences in the cross validation errors and in the
#' $\lambda$ values.  There is a large difference, however, in the $\alpha$ values.  It
#' is notable that the differences in the regression coefficients are minor;
#' the coefficient magnitudes remain similar across the three fits.
cbind(coef(ensr_obj_1), coef(ensr_obj_2), coef(ensr_obj_3))

#'
#' One could argue there is a major difference in the results between the fold
#' assignments.  Judging by the cvm, `foldid3` leads to 18 non-zero coefficients
#' whereas `foldid1` and `foldid2` lead to only 14 non-zero coefficients.
#'
#' Because of these issues, we recommend multiple cross-validation runs or bootstrapping to select a
#' final model.
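#'
#' A hedged sketch of one such approach: repeat the `ensr` search over several
#' random fold assignments and tabulate which model sizes are selected.  The
#' object names below are illustrative and not part of the `ensr` API:
#+ label = "repeated_cv_sketch", eval = FALSE
n_reps <- 10
cv_selections <-
  rbindlist(lapply(seq_len(n_reps), function(i) {
    fid <- sample(seq(10), size = nrow(x_matrix), replace = TRUE)
    fit <- ensr(y = y_matrix, x = x_matrix, standardize = FALSE, foldid = fid)
    summary(fit)[cvm == min(cvm)]
  }))
# how often each model size attains the minimum mean cross validation error
cv_selections[, .N, by = nzero]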
#'
#/*
# End of Cross Validation Issues ------------------------------------------- }}}
#*/
# /*
# {{{ ---------------- Subsection: Multivariate Response -----------------------
# */
#'
#' ## Multivariate Response
#'
#'
#' There are three outcomes, the injury indicators, in the `tbi` data set.  It
#' would be reasonable to expect the models for the three injuries to share
#' common variables with non-zero coefficients.  The end user could fit three
#' univariate models or fit one multinomial model using the tools provided by
#' `r Rpkg(glmnet)`.
#'
#' To illustrate these options we will run `ensr` five times: three univariate
#' models, one multinomial model with `type.multinomial` set to the default
#' "ungrouped," and one grouped multinomial model.
#+ eval = FALSE
# /* Omit the multivar section until #14 is fixed.
if (FALSE) {
# */
ymat_1 <- matrix(tbi$injury1, ncol = 1)
ymat_2 <- matrix(tbi$injury2, ncol = 1)
ymat_3 <- matrix(tbi$injury3, ncol = 1)
ymat_123 <- as.matrix(tbi[, c("injury1", "injury2", "injury3")], ncol = 3)
xmat     <- as.matrix(tbi[, -c("age", "los", "injury1", "injury2", "injury3")])

foldid <- sample(seq(10), size = nrow(ymat_1), replace = TRUE)

fit_1 <- ensr(y = ymat_1, x = xmat,
              standardize = FALSE, standardize.response = FALSE,
              foldid = foldid, family = "binomial",
              parallel = TRUE)

fit_2 <- ensr(y = ymat_2, x = xmat,
              standardize = FALSE, standardize.response = FALSE,
              foldid = foldid, family = "binomial",
              parallel = TRUE)

fit_3 <- ensr(y = ymat_3, x = xmat,
              standardize = FALSE, standardize.response = FALSE,
              foldid = foldid, family = "binomial",
              parallel = TRUE)

fit_123_ungrouped <-
  ensr(y = ymat_123, x = xmat,
       standardize = FALSE, standardize.response = FALSE,
       foldid = foldid,
       family = "multinomial",
       type.multinomial = "ungrouped",
       parallel = TRUE)

fit_123_grouped <-
  ensr(y = ymat_123, x = xmat,
       standardize = FALSE, standardize.response = FALSE,
       foldid = foldid,
       family = "multinomial",
       type.multinomial = "grouped",
       parallel = TRUE)

#'
#' Plots of the results show that for injury2 and injury3, an
#' $\alpha$ of 1 would be preferable. For injury1, a slightly lower $\alpha$
#' is preferable. When fitting the multinomial responses, grouped or
#' ungrouped, it appears that the preferable $\alpha$ is similar to the univariate
#' fits.
#'
#+ fig.width = 8, fig.height = 8, eval = FALSE
gridExtra::grid.arrange(plot(fit_1) + ggplot2::ggtitle("Fit 1"),
                        plot(fit_2) + ggplot2::ggtitle("Fit 2"),
                        plot(fit_3) + ggplot2::ggtitle("Fit 3"),
                        plot(fit_123_ungrouped) + ggplot2::ggtitle("Fit 123 Ungrouped"),
                        plot(fit_123_grouped) + ggplot2::ggtitle("Fit 123 Grouped"),
                        layout_matrix = rbind(c(1, 1, 2, 2, 3, 3),
                                              c(4, 4, 4, 5, 5, 5)))

#'
#' The `summary` model output:
#+ eval = FALSE
all_summaries <-
  rbindlist(list(summary(fit_1),
                 summary(fit_2),
                 summary(fit_3),
                 summary(fit_123_ungrouped),
                 summary(fit_123_grouped)),
            idcol = "fit")

all_summaries[, fit := factor(fit, 1:5, c("Fit 1", "Fit 2", "Fit 3", "Fit 123 Ungrouped", "Fit 123 Grouped"))]

#'
#' The models with the lowest mean cross validation error are:
#+ eval = FALSE
all_summaries[, .SD[cvm == min(cvm)], by = fit]

#'
#'
#+ eval = FALSE
ggplot(
       all_summaries[, .SD[cvm == min(cvm)], by = c("fit", "nzero")]
       ) +
  theme_bw() +
  aes(x = nzero, y = cvm, color = factor(fit), linetype = factor(fit)) +
  geom_point() +
  geom_line()

#'
#' It appears that either an ungrouped model with 9 non-zero coefficients or a grouped
#' model with 14 non-zero coefficients would be preferable.
#'
#' Here are the models with the lowest
#' mean cross validation error and eight non-zero coefficients:
#+ eval = FALSE
pref_models <-
  all_summaries[nzero == 8,
                .SD[cvm == min(cvm)],
                by = c("fit")]
pref_models

#'
#' Let's look at the coefficients for these models, starting with the three
#' univariate models
#+ eval = FALSE
cbind(
      coef(fit_1[[pref_models$l_index[1]]], s = pref_models$lambda[1])
      ,
      coef(fit_2[[pref_models$l_index[2]]], s = pref_models$lambda[2])
      ,
      coef(fit_3[[pref_models$l_index[3]]], s = pref_models$lambda[3])
      )

#'
#' The following are the non-zero coefficients for the ungrouped models.
#+ eval = FALSE
do.call(cbind,
        coef(fit_123_ungrouped[[pref_models$l_index[4]]], s = pref_models$lambda[4]))

#'
#' The grouped results are:
#+ eval = FALSE
do.call(cbind,
        coef(fit_123_grouped[[pref_models$l_index[5]]], s = pref_models$lambda[5]))

# /* end of the if (FALSE) to omit the multivariate section
}
# */
#'
#' `ensr` can also analyze multivariate Gaussian responses (`family = "mgaussian"`).
#' See the documentation `help("glmnet", package = "glmnet")` for details.
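#'
#' A minimal sketch of such a fit; the two-column response below is built from
#' the evaporation outcome plus simulated noise, purely for illustration:
#+ label = "mgaussian_sketch", eval = FALSE
y_mg <- cbind(evap     = landfill$evap,
              evap_sim = landfill$evap + rnorm(nrow(landfill), sd = 0.1))
fit_mg <- ensr(y = y_mg, x = x_matrix, family = "mgaussian", standardize = FALSE)
summary(fit_mg)[cvm == min(cvm)]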
#'
# /*
# End of Multivariate Response ----------------------------------------------}}}
# */
# /*
# End of Searching for lambda and alpha--------------------------------------}}}
# */

#/*
# {{{ An alternative approach --------------------------------------------------
#*/
#' # Alternative Approaches
#'
#' Our `r Rpkg(ensr)` package is not the only approach to searching for $\lambda$ and
#' $\alpha$. The
#' [`r Rpkg(glmnetUtils)`](https://cran.r-project.org/package=glmnetUtils)
#' package [@glmnetUtils] is the most notable comparable package.  We encourage the
#' reader to explore the `r Rpkg(glmnetUtils)` package and determine whether it meets
#' their needs.
#'
#' There are two major differences in the implementation of `r Rpkg(ensr)` and
#' `r Rpkg(glmnetUtils)`.  The first major difference is that
#' `r Rpkg(glmnetUtils)` allows users to specify `glmnet::glmnet` models with a
#' formula, e.g., `y ~ x1 + x2 + x3`, whereas `r Rpkg(ensr)` maintains the `r Rpkg(glmnet)`
#' requirement of the user providing y and x matrices.  We opted for the
#' simplicity of staying with `glmnet::glmnet` arguments for programming efficiency and as
#' a check on reasonable models.  It is likely that a model with a factor on the
#' right-hand side of the formula should not be evaluated with elastic net.
#' There does not appear to be a check or warning in `r Rpkg(glmnetUtils)` for factor or
#' character (which will be coerced to a factor) variables on the right-hand
#' side of the formula statement.  `r Rpkg(ensr)` and `r Rpkg(glmnet)`, by requiring the user to
#' specify the response and support matrices, force the user to be aware of and
#' explicitly handle possible character/factor predictor variables.  Binary
#' factors are non-trivial in this context as well, but are considerably less
#' difficult to deal with than factors with three or more possible values.
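#'
#' If your data do contain factor predictors, the design matrix should be built
#' explicitly before calling `ensr`.  A minimal sketch with a made-up data frame,
#' using base `model.matrix`:
#+ label = "factor_design_matrix", eval = FALSE
dat <- data.frame(y  = rnorm(20),
                  x1 = rnorm(20),
                  f  = factor(sample(c("a", "b", "c"), 20, replace = TRUE)))
# expand the factor into indicator columns and drop the intercept column
x_design <- model.matrix(~ x1 + f, data = dat)[, -1]
# ensr(y = matrix(dat$y, ncol = 1), x = x_design)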
#'
#' The second major difference between `r Rpkg(ensr)` and `r Rpkg(glmnetUtils)` is that
#' `r Rpkg(ensr)` builds a $\lambda$-$\alpha$ grid and evaluates
#' `glmnet::cv.glmnet` so that each value of $\lambda$ is considered at two or
#' more values of $\alpha$.  `r Rpkg(glmnetUtils)` only uses the default
#' $\lambda$ values for each specific $\alpha$ value.  By constructing a
#' $\lambda$-$\alpha$ grid, `r Rpkg(ensr)` provides minimally sufficient support for
#' estimating a contour plot for a (x = $\alpha,$ y = $\lambda,$ z =
#' cross-validation mean error) surface.  The `r Rpkg(glmnetUtils)` results require
#' imputation before such a surface could be estimated.  We feel that `r Rpkg(ensr)`'s use
#' of a $\lambda$-$\alpha$ grid is a significant advantage relative to `r Rpkg(glmnetUtils)`.
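#'
#' One way to see the grid in practice is to tabulate the $\lambda$ values
#' evaluated at each $\alpha$ in an `ensr` summary; a short sketch reusing
#' `ensr_obj` from the univariate example above:
#+ label = "grid_inspect", eval = FALSE
summary(ensr_obj)[, .(n_lambda   = .N,
                      min_lambda = min(lambda),
                      max_lambda = max(lambda)),
                  by = alpha]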
#'
#' See the example script `compare-to-glmnetUtils.R` in the `examples` directory
#' on the `r Rpkg(ensr)` GitHub page: https://github.com/dewittpe/ensr.
#'
#/*
# end An alternative approach ---------------------------------------------- }}}
#*/

# /*
# --------------------------- Session Information -----------------------------
# */
#'
#' # Session Info
#'
print(sessionInfo(), locale = FALSE)

#'
#' # References
#'
# /*
# ------------------------------- End of File ----------------------------------
# */