---
title: "`hal9001`: Scalable highly adaptive lasso regression in `R`"
tags:
  - machine learning
  - targeted learning
  - causal inference
  - R
authors:
  - name: Nima S. Hejazi
    orcid: 0000-0002-7127-2789
    affiliation: 1, 2, 4
  - name: Jeremy R. Coyle
    orcid: 0000-0002-9874-6649
    affiliation: 2
  - name: Mark J. van der Laan
    orcid: 0000-0003-1432-5511
    affiliation: 2, 3, 4
affiliations:
  - name: Graduate Group in Biostatistics, University of California, Berkeley
    index: 1
  - name: Division of Biostatistics, School of Public Health, University of California, Berkeley
    index: 2
  - name: Department of Statistics, University of California, Berkeley
    index: 3
  - name: Center for Computational Biology, University of California, Berkeley
    index: 4
date: 25 September 2020
bibliography: refs.bib
---

The `hal9001` `R` package provides a computationally efficient implementation of
the highly adaptive lasso (HAL), a flexible nonparametric regression and
machine learning algorithm endowed with several theoretically convenient
properties. `hal9001` pairs an implementation of this estimator with an array of
practical variable selection tools and sensible defaults in order to improve the
scalability of the algorithm. By building on existing `R` packages for lasso
regression and leveraging compiled code in key internal functions, the `hal9001`
`R` package provides a family of highly adaptive lasso estimators suitable for
use in both modern large-scale data analysis and cutting-edge research efforts
at the intersection of statistics and machine learning, including the emerging
subfield of computational causal inference [@wong2020computational].

The highly adaptive lasso (HAL) is a nonparametric regression function capable
of estimating complex (e.g., possibly infinite-dimensional) functional
parameters at a fast $n^{-1/3}$ rate under only relatively mild conditions
[@vdl2017generally; @vdl2017uniform; @bibaut2019fast]. HAL requires that the
space of the functional parameter be a subset of the set of càdlàg (right-hand
continuous with left-hand limits) functions with sectional variation norm
bounded by a constant. In contrast to the wealth of data adaptive regression
techniques that make strong local smoothness assumptions on the true form of the
target functional, HAL regression's assumption of a finite sectional variation
norm constitutes only a global smoothness assumption, making it a powerful and
versatile approach. The `hal9001` package primarily implements a zeroth-order
HAL estimator, which constructs and selects by lasso penalization a linear
combination of indicator basis functions, minimizing the loss-specific empirical
risk under the constraint that the $L_1$-norm of the resultant vector of
coefficients be bounded by a finite constant. Importantly, the estimator is
formulated such that this finite constant is the sectional variation norm of the
target functional.
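
The form of the zeroth-order estimator described above can be sketched as follows, in notation close to that of @benkeser2016hal. The symbols $f_\beta$, $\phi_{s,i}$, $\beta_{s,i}$, $L$, $O_i$, $d$, and $M$ are introduced here only for illustration and are not defined elsewhere in this paper; counting the intercept in the $L_1$ bound is likewise a convention assumed for this sketch. For covariates $x \in \mathbb{R}^d$ and observed covariate vectors $x_1, \ldots, x_n$, with $x_s$ denoting the coordinates of $x$ indexed by a subset $s$ and the inequality read coordinate-wise:

$$
f_\beta(x) = \beta_0 + \sum_{\emptyset \neq s \subseteq \{1, \ldots, d\}} \sum_{i=1}^{n} \beta_{s,i} \, \phi_{s,i}(x),
\qquad
\phi_{s,i}(x) = \mathbb{I}(x_{i,s} \leq x_s),
$$

$$
\hat{\beta} = \arg\min_{\beta} \frac{1}{n} \sum_{i=1}^{n} L(f_\beta)(O_i)
\quad \text{subject to} \quad
|\beta_0| + \sum_{s} \sum_{i=1}^{n} |\beta_{s,i}| \leq M.
$$

Here $L$ stands for the chosen loss, $O_i$ for the $i$-th observation, and $M$ for the finite constant bounding the $L_1$-norm of the coefficients.
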

Intuitively, construction of a HAL estimator proceeds in two steps. First,
a design matrix composed of basis functions is generated based on the available
set of covariates. The zeroth-order HAL makes use of indicator basis functions,
resulting in a large, sparse matrix with binary entries; higher-order HAL
estimators, which replace the use of indicator basis functions with splines,
have been formulated, with implementation in a nascent stage. Representation of
the target functional $f$ in terms of indicator basis functions partitions the
support of $f$ into knot points, with such basis functions placed over subsets
of sections of $f$. Generally, numerous basis functions are created, with an
appropriate set of indicator bases then selected through lasso penalization.
Thus, the second step of fitting a HAL model is performing $L_1$-penalized
regression on the large, sparse design matrix of indicator bases. The selected
HAL regression model approximates the sectional variation norm of the target
functional as the absolute sum of the estimated coefficients of indicator basis
functions. The $L_1$ penalization parameter $\lambda$ can be data adaptively
chosen via a cross-validation selector [@vdl2003unified; @vdv2006oracle];
however, alternative selection criteria may be more appropriate when the
estimand functional is not the target parameter but instead a nuisance function
of a possibly complex parameter [e.g., @vdl2019efficient;
@ertefaie2020nonparametric]. An extensive set of simulation experiments was
used to assess the prediction performance of HAL regression [@benkeser2016hal];
these studies relied upon the subsequently deprecated `halplus` `R` package.
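
The first of the two steps above can be illustrated with a toy sketch in base `R`. This is not the `hal9001` implementation (which uses compiled code and sparse matrices); the helper `main_basis()` and the coefficient vector `beta` are hypothetical names introduced only for this illustration, and knot points are assumed to sit at the observed covariate values.

```r
# Toy sketch in base R (not the hal9001 internals): build zeroth-order
# indicator bases for two covariates, with knots at the observed values.
set.seed(1)
n <- 5
x <- cbind(x1 = runif(n), x2 = runif(n))

# Main-term bases for column j: entry (i, k) is I(x[i, j] >= x[k, j]).
main_basis <- function(x, j) {
  outer(x[, j], x[, j], FUN = ">=") * 1L
}

phi_1 <- main_basis(x, 1)
phi_2 <- main_basis(x, 2)

# The two-way interaction bases are elementwise products of the main-term
# indicators: I(x[i, 1] >= x[k, 1] and x[i, 2] >= x[k, 2]).
phi_12 <- phi_1 * phi_2

# Binary design matrix with n rows and 3 * n columns.
design <- cbind(phi_1, phi_2, phi_12)
dim(design)

# For a given (hypothetical) coefficient vector, the sectional variation norm
# is approximated by the sum of absolute coefficients.
beta <- c(0.5, -1, rep(0, ncol(design) - 2))
l1_norm <- sum(abs(beta))
l1_norm
```

Even with two covariates and five observations, the design matrix already has fifteen columns, which suggests why sparse storage and lasso-based selection matter as the number of covariates and interactions grows.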

# `hal9001`'s core functionality

The `hal9001`
package, for the `R` language and environment for statistical
computing [@R], aims to provide a scalable implementation of the HAL
nonparametric regression function. To provide a single, unified interface, the
principal user-facing function is `fit_hal()`, which, at minimum, requires
a matrix of predictors `X` and an outcome `Y`. By default, invocation of
`fit_hal()` will build a HAL model using indicator basis functions for up to
a limited number of interactions of the variables in `X`, fitting the penalized
regression model via the lasso procedure available in the extremely popular
`glmnet` `R` package [@friedman2009glmnet]. As creation of the design matrix of
indicator basis functions can be computationally expensive, several utility
functions (e.g., `make_design_matrix()`, `make_basis_list()`, `make_copy_map()`)
have been written in C++ and integrated into the package via the `Rcpp`
framework [@eddelbuettel2011rcpp; @eddelbuettel2013seamless]. `hal9001`
additionally supports the fitting of standard (Gaussian), logistic, and Cox
proportional hazards models (via the `family` argument), including variations
that accommodate offsets (via the `offset` argument) and partially penalized
models (via the `X_unpenalized` argument).
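
A minimal usage sketch of the interface described above follows. It assumes the `hal9001` package is installed; the simulated data and the object names (`x`, `y`, `hal_fit`, `preds`) are made up for illustration. Only the `X` and `Y` arguments named in the text are passed to `fit_hal()`, since defaults for other arguments may differ across package versions. The `new_data` argument of the `predict()` method is not described in the text above and should be checked against the installed version's documentation.

```r
# Simulated data for illustration only (not from the paper).
set.seed(2020)
n <- 200
x <- matrix(rnorm(n * 3), nrow = n, ncol = 3)
y <- sin(x[, 1]) + 0.5 * x[, 2]^2 + rnorm(n, sd = 0.2)

# Run the fit only if hal9001 is available, so the sketch never errors out.
if (requireNamespace("hal9001", quietly = TRUE)) {
  # Fit with defaults: predictors X and outcome Y are the only required inputs.
  hal_fit <- hal9001::fit_hal(X = x, Y = y)

  # Predictions on the training covariates and the in-sample mean squared error.
  preds <- predict(hal_fit, new_data = x)
  print(mean((preds - y)^2))
}
```

Other model types would be requested through the `family`, `offset`, and `X_unpenalized` arguments mentioned above; their accepted values are best checked in the documentation of the installed version.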

Over several years of development and usage, it was found that the performance
of HAL regression can suffer in high-dimensional settings. To alleviate these
computational limitations, several screening and filtering approaches were
investigated and implemented. These include screening of variables prior to
creating the design matrix and filtering of indicator basis functions (via the
`reduce_basis` argument) as well as early stopping when fitting the sequence of
HAL models in the $L_1$-norm penalization parameter $\lambda$. Future software
development efforts will continue to improve upon the computational aspects and
performance of the HAL regression options supported by `hal9001`. Currently,
stable releases of the `hal9001` package are made available on the Comprehensive
`R` Archive Network at https://CRAN.R-project.org/package=hal9001, while both
stable (branch `master`) and development (branch `devel`) versions of the
package are hosted at https://github.com/tlverse/hal9001. Releases of the
package use both GitHub and Zenodo (https://doi.org/10.5281/zenodo.3558313).
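
To give a rough sense of what filtering indicator basis functions can look like, the base `R` sketch below drops basis functions that are active for too few observations. The threshold rule and the name `min_prop` are assumptions made for this illustration; they are not taken from the `hal9001` source code, and the exact behavior of the `reduce_basis` argument should be checked in the package documentation.

```r
# Toy illustration in base R; the filtering rule is an assumption for this
# sketch and is not taken from the hal9001 source code.
set.seed(3)
x <- runif(50)
phi <- outer(x, x, FUN = ">=") * 1L  # 50 x 50 indicator design matrix

# Keep only basis functions that equal 1 for at least a minimum proportion
# of observations (hypothetical threshold).
min_prop <- 0.25
keep <- colMeans(phi) >= min_prop
phi_reduced <- phi[, keep, drop = FALSE]

# Number of basis functions before and after filtering.
c(before = ncol(phi), after = ncol(phi_reduced))
```

Columns that are almost entirely zero carry little information for the lasso fit, so removing them shrinks the design matrix at a small cost in flexibility.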

As `hal9001` is the canonical implementation of the highly adaptive lasso, the
package has been relied upon in a variety of statistical applications. Speaking
generally, HAL regression is often used in order to develop efficient estimation
strategies in challenging estimation and inference problems; thus, we interpret
statistical applications of HAL regression chiefly as examples of novel
theoretical developments that have been thoroughly investigated in simulation
experiments and with illustrative data analysis examples. In the sequel, we
briefly point out a few recently successful examples:
`txshift` R package [@hejazi2020txshift-rpkg;
@hejazi2020txshift-joss], which relies upon `hal9001`. These authors
additionally provide examples in simulation experiments and a re-analysis of
a recent HIV-1 vaccine efficacy trial using their proposed statistical
approach.

As further theoretical advances continue to be made with HAL regression, and the
resultant statistical methodology explored, we expect both the number and
variety of such examples to steadily increase.