fit_hal: HAL: The Highly Adaptive Lasso
In jeremyrcoyle/mangolassi: The Scalable Highly Adaptive Lasso

fit_hal

R Documentation

HAL: The Highly Adaptive Lasso

Description

Estimation procedure for HAL, the Highly Adaptive Lasso

Usage

fit_hal(
  X,
  Y,
  formula = NULL,
  X_unpenalized = NULL,
  max_degree = ifelse(ncol(X) >= 20, 2, 3),
  smoothness_orders = 1,
  num_knots = num_knots_generator(max_degree = max_degree, smoothness_orders =
    smoothness_orders, base_num_knots_0 = 20, base_num_knots_1 = 10),
  reduce_basis = NULL,
  family = c("gaussian", "binomial", "poisson", "cox", "mgaussian"),
  lambda = NULL,
  id = NULL,
  weights = NULL,
  offset = NULL,
  fit_control = list(cv_select = TRUE, use_min = TRUE, lambda.min.ratio = 1e-04,
    prediction_bounds = "default"),
  basis_list = NULL,
  return_lasso = TRUE,
  return_x_basis = FALSE,
  yolo = FALSE
)

Arguments

`X`	An input `matrix` with dimensions number of observations -by- number of covariates that will be used to derive the design matrix of basis functions.
`Y`	A `numeric` vector of observations of the outcome variable. For `family="mgaussian"`, `Y` is a matrix of observations of the outcome variables.
`formula`	A character string formula to be used in `formula_hal`. See its documentation for details.
`X_unpenalized`	An input `matrix` with the same number of rows as `X`, for which no L1 penalization will be performed. Note that `X_unpenalized` is directly appended to the design matrix; no basis expansion is performed on `X_unpenalized`.
`max_degree`	The highest order of interaction terms for which basis functions ought to be generated.
`smoothness_orders`	An `integer`, specifying the smoothness of the basis functions. See details for `smoothness_orders` for more information.
`num_knots`	An `integer` vector of length 1 or `max_degree`, specifying the maximum number of knot points (i.e., bins) for any covariate for generating basis functions. If `num_knots` is a unit-length vector, then the same `num_knots` are used for each degree (this is not recommended). The default settings for `num_knots` are recommended, and these defaults decrease `num_knots` with increasing `max_degree` and `smoothness_orders`, which prevents (expensive) combinatorial explosions in the number of higher-degree and higher-order basis functions generated. This allows the complexity of the optimization problem to grow scalably. See details of `num_knots` more information.
`reduce_basis`	Am optional `numeric` value bounded in the open unit interval indicating the minimum proportion of 1's in a basis function column needed for the basis function to be included in the procedure to fit the lasso. Any basis functions with a lower proportion of 1's than the cutoff will be removed. Defaults to 1 over the square root of the number of observations. Only applicable for models fit with zero-order splines, i.e. `smoothness_orders = 0`.
`family`	A `character` or a `family` object (supported by `glmnet`) specifying the error/link family for a generalized linear model. `character` options are limited to "gaussian" for fitting a standard penalized linear model, "binomial" for penalized logistic regression, "poisson" for penalized Poisson regression, "cox" for a penalized proportional hazards model, and "mgaussian" for multivariate penalized linear model. Note that passing in family objects leads to slower performance relative to passing in a character family (if supported). For example, one should set `family = "binomial"` instead of `family = binomial()` when calling `fit_hal`.
`lambda`	User-specified sequence of values of the regularization parameter for the lasso L1 regression. If `NULL`, the default sequence in `cv.glmnet` will be used. The cross-validated optimal value of this regularization parameter will be selected with `cv.glmnet`. If `fit_control`'s `cv_select` argument is set to `FALSE`, then the lasso model will be fit via `glmnet`, and regularized coefficient values for each lambda in the input array will be returned.
`id`	A vector of ID values that is used to generate cross-validation folds for `cv.glmnet`. This argument is ignored when `fit_control`'s `cv_select` argument is `FALSE`.
`weights`	observation weights; defaults to 1 per observation.
`offset`	a vector of offset values, used in fitting.
`fit_control`	List of arguments, including the following, and any others to be passed to `cv.glmnet` or `glmnet`. `cv_select`: A `logical` specifying if the sequence of specified `lambda` values should be passed to `cv.glmnet` in order for a single, optimal value of `lambda` to be selected according to cross-validation. When `cv_select = FALSE`, a `glmnet` model will be used to fit the sequence of (or single) `lambda`. `use_min`: Specify the choice of lambda to be selected by `cv.glmnet`. When `TRUE`, `"lambda.min"` is used; otherwise, `"lambda.1se"`. Only used when `cv_select = TRUE`. `lambda.min.ratio`: A `glmnet` argument specifying the smallest value for `lambda`, as a fraction of `lambda.max`, the (data derived) entry value (i.e. the smallest value for which all coefficients are zero). We've seen that not setting `lambda.min.ratio` can lead to no `lambda` values that fit the data sufficiently well. `prediction_bounds`: An optional vector of size two that provides the lower and upper bounds predictions; not used when `family = "cox"`. When `prediction_bounds = "default"`, the predictions are bounded between `min(Y) - sd(Y)` and `max(Y) + sd(Y)` for each outcome (when `family = "mgaussian"`, each outcome can have different bounds). Bounding ensures that there is no extrapolation.
`basis_list`	The full set of basis functions generated from `X`.
`return_lasso`	A `logical` indicating whether or not to return the `glmnet` fit object of the lasso model.
`return_x_basis`	A `logical` indicating whether or not to return the matrix of (possibly reduced) basis functions used in `fit_hal`.
`yolo`	A `logical` indicating whether to print one of a curated selection of quotes from the HAL9000 computer, from the critically acclaimed epic science-fiction film "2001: A Space Odyssey" (1968).

Details

The procedure uses a custom C++ implementation to generate a design matrix of spline basis functions of covariates and interactions of covariates. The lasso regression is fit to this design matrix via cv.glmnet or a custom implementation derived from origami. The maximum dimension of the design matrix is n -by- (n * 2^(d-1)), where where n is the number of observations and d is the number of covariates.

For smoothness_orders = 0, only zero-order splines (piece-wise constant) are generated, which assume the true regression function has no smoothness or continuity. When smoothness_orders = 1, first-order splines (piece-wise linear) are generated, which assume continuity of the true regression function. When smoothness_orders = 2, second-order splines (piece-wise quadratic and linear terms) are generated, which assume a the true regression function has a single order of differentiability.

num_knots argument specifies the number of knot points for each covariate and for each max_degree. Fewer knot points can significantly decrease runtime, but might be overly simplistic. When considering smoothness_orders = 0, too few knot points (e.g., < 50) can significantly reduce performance. When smoothness_orders = 1 or higher, then fewer knot points (e.g., 10-30) is actually better for performance. We recommend specifying num_knots with respect to smoothness_orders, and as a vector of length max_degree with values decreasing exponentially. This prevents combinatorial explosions in the number of higher-degree basis functions generated. The default behavior of num_knots follows this logic — for smoothness_orders = 0, num_knots is set to 500 / 2^{j-1}, and for smoothness_orders = 1 or higher, num_knots is set to 200 / 2^{j-1}, where j is the interaction degree. We also include some other suitable settings for num_knots below, all of which are less complex than default num_knots and will thus result in a faster runtime:

Some good settings for little to no cost in performance:
- If smoothness_orders = 0 and max_degree = 3, num_knots = c(400, 200, 100).
- If smoothness_orders = 1+ and max_degree = 3, num_knots = c(100, 75, 50).
Recommended settings for fairly fast runtime:
- If smoothness_orders = 0 and max_degree = 3, num_knots = c(200, 100, 50).
- If smoothness_orders = 1+ and max_degree = 3, num_knots = c(50, 25, 15).
Recommended settings for fast runtime:
- If smoothness_orders = 0 and max_degree = 3, num_knots = c(100, 50, 25).
- If smoothness_orders = 1+ and max_degree = 3, num_knots = c(40, 15, 10).
Recommended settings for very fast runtime:
- If smoothness_orders = 0 and max_degree = 3, num_knots = c(50, 25, 10).
- If smoothness_orders = 1+ and max_degree = 3, num_knots = c(25, 10, 5).

Value

Object of class hal9001, containing a list of basis functions, a copy map, coefficients estimated for basis functions, and timing results (for assessing computational efficiency).

Examples

n <- 100
p <- 3
x <- xmat <- matrix(rnorm(n * p), n, p)
y_prob <- plogis(3 * sin(x[, 1]) + sin(x[, 2]))
y <- rbinom(n = n, size = 1, prob = y_prob)
hal_fit <- fit_hal(X = x, Y = y, family = "binomial")
preds <- predict(hal_fit, new_data = x)

jeremyrcoyle/mangolassi documentation built on Nov. 22, 2024, 10:14 a.m.