View source: R/lasso_screenr.R

lasso_screenr    R Documentation

Fitting Screening Tools Using Lasso-Like Regularization of Logistic Models

Description

lasso_screenr is a convenience function which combines logistic regression using L1 regularization, k-fold cross-validation, and estimation of the receiver-operating characteristic (ROC). The in-sample and out-of-sample performance is estimated from the models which produced the minimum AIC and minimum BIC. Execute methods(class = "lasso_screenr") to identify available methods.

Usage

lasso_screenr(
  formula,
  data = NULL,
  Nfolds = 10,
  L2 = TRUE,
  partial_auc = c(0.8, 1),
  partial_auc_focus = "sensitivity",
  partial_auc_correct = TRUE,
  boot_n = 4000,
  conf_level = 0.95,
  standardize = FALSE,
  seed = Sys.time(),
  ...
)

Arguments

formula

an object of class stats::formula defining the testing outcome and predictor variables.

data

a dataframe containing the variables defined in formula. The testing outcome must be binary (0 = no/negative, 1 = yes/positive) or logical (FALSE/TRUE). The predictor variables are typically binary or logical responses to questions which may be predictive of the test result, but numeric variables can also be used.

Nfolds

the number of folds used for k-fold cross validation. Default = 10; minimum = 2, maximum = 100.

L2

(logical) switch controlling penalization using the L2 norm of the parameters. Default: TRUE.

partial_auc

either a logical FALSE or a numeric vector of the form c(left, right) where left and right are numbers in the interval [0, 1] specifying the endpoints for computation of the partial area under the ROC curve (pAUC). The total AUC is computed if partial_auc = FALSE. Default: c(0.8, 1.0).

partial_auc_focus

one of "sensitivity" or "specificity", specifying whether the pAUC should be computed over a range of sensitivity or specificity. partial_auc_focus is ignored if partial_auc = FALSE. Default: "sensitivity".

partial_auc_correct

logical value indicating whether the pAUC should be transformed to the interval from 0.5 to 1.0. partial_auc_correct is ignored if partial_auc = FALSE. Default: TRUE.

boot_n

number of bootstrap replications for computation of confidence intervals for the (partial) AUC. Default: 4000.

conf_level

a number between 0 and 1 specifying the confidence level for confidence intervals for the (partial) AUC. Default: 0.95.

standardize

logical; if TRUE predictors are standardized to unit variance. Default: FALSE (sensible for binary and logical predictors).

seed

random number generator seed for cross-validation data splitting.

...

additional arguments passed to glmpath, roc, auc or ci.

Details

The results provide information from which to choose a probability threshold above which individual out-of-sample probabilities indicate the need to perform a diagnostic test. Out-of-sample performance is estimated using k-fold cross-validation.

lasso_screenr uses the L1 path regularizer of Park and Hastie (2007), as implemented in the glmpath package. Park-Hastie regularization is similar to the conventional lasso and the elastic net. It differs from the lasso with the inclusion of a very small, fixed (1e-5) penalty on the L2 norm of the parameter vector, and differs from the elastic net in that the L2 penalty is fixed. Like the elastic net, the Park-Hastie regularization is robust to highly correlated predictors. The L2 penalization can be turned off (L2 = FALSE), in which case the regularization is similar to the conventional lasso. Like all L1 regularizers, the Park-Hastie algorithm automatically "deletes" covariates by shrinking their parameter estimates to 0.
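For instance, the two regularization variants can be compared side by side; a sketch using the unicorns example data from this package (the formula is shortened here for illustration):

```r
## Default Park-Hastie fit (small fixed L2 penalty) versus a
## lasso-like fit with the L2 penalty turned off.
library(screenr)
data(unicorns)
fit_ph    <- lasso_screenr(testresult ~ Q1 + Q2 + Q3,
                           data = unicorns, Nfolds = 10, L2 = TRUE)
fit_lasso <- lasso_screenr(testresult ~ Q1 + Q2 + Q3,
                           data = unicorns, Nfolds = 10, L2 = FALSE)
```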

The coefficients produced by L1 regularization are biased toward zero. Therefore one might consider refitting the model selected by regularization using maximum-likelihood estimation as implemented in logreg_screenr.

The receiver-operating characteristics are computed using the pROC package.

By default, the partial area under the ROC curve is computed from that portion of the curve for which sensitivity is in the closed interval [0.8, 1.0]. However, the total AUC can be obtained using the argument partial_auc = FALSE. Partial areas can be computed for either ranges of sensitivity or specificity using the arguments partial_auc_focus and partial_auc. By default, partial areas are standardized.
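The pAUC options described above can be combined as follows; a sketch, again using the unicorns example data with a shortened formula:

```r
## Total AUC instead of the default pAUC over sensitivity in [0.8, 1.0]:
fit_total <- lasso_screenr(testresult ~ Q1 + Q2 + Q3,
                           data = unicorns, partial_auc = FALSE)

## pAUC over specificity in [0.9, 1.0] rather than sensitivity:
fit_spec  <- lasso_screenr(testresult ~ Q1 + Q2 + Q3,
                           data = unicorns,
                           partial_auc = c(0.9, 1.0),
                           partial_auc_focus = "specificity")
```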

Out-of-sample performance is estimated using k-fold cross-validation. For a gentle but Python-centric introduction to k-fold cross-validation, see https://machinelearningmastery.com/k-fold-cross-validation/.

Value

lasso_screenr returns (invisibly) an object of class lasso_screenr containing the components:

Call

The function call.

Prevalence

Prevalence of the binary response variable.

glmpathObj

An object of class glmpath returned by glmpath::glmpath. See help(glmpath) and methods(class = "glmpath").

Xmat

The matrix of predictors.

isResults

A list structure containing the results from the two model fits which produced the minimum AIC and BIC values, respectively. The results consist of Coefficients (the logit-scale parameter estimates, including the intercept), isPreds (the in-sample predicted probabilities) and isROC (the in-sample receiver-operating characteristic (ROC) of class roc).

RNG

Specification of the random-number generator used for k-fold data splitting.

RNGseed

RNG seed.

cvResults

A list structure containing the results of k-fold cross-validation estimation of out-of-sample performance.

The list elements of cvResults are:

Nfolds

the number of folds, k

X_ho

the matrix of held-out predictors for each cross-validation fold

minAICcvPreds

the held-out responses and out-of-sample predicted probabilities from AIC-best model selection

minAICcvROC

the out-of-sample ROC object of class roc from AIC-best model selection

minBICcvPreds

the held-out responses and out-of-sample predicted probabilities from BIC-best model selection

minBICcvROC

the out-of-sample ROC object of class roc from BIC-best model selection
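The components listed above can be extracted from the returned object with standard list indexing; a sketch (component names as documented above, formula shortened for illustration):

```r
res <- lasso_screenr(testresult ~ Q1 + Q2 + Q3,
                     data = unicorns, Nfolds = 10)
res$cvResults$Nfolds               ## number of folds k
head(res$cvResults$minAICcvPreds)  ## held-out responses and predictions
res$cvResults$minAICcvROC          ## out-of-sample ROC (class "roc")
```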

References

Park MY, Hastie T. L1-regularization path algorithm for generalized linear models. Journal of the Royal Statistical Society Series B. 2007;69(4):659-677. https://doi.org/10.1111/j.1467-9868.2007.00607.x

Kim J-H. Estimating classification error rate: Repeated cross-validation, repeated hold-out and bootstrap. Computational Statistics and Data Analysis. 2009:53(11):3735-3745. http://doi.org/10.1016/j.csda.2009.04.009

Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez J-C, Muller M. pROC: An open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics. 2011;12(77):1-8. http://doi.org/10.1186/1471-2105-12-77

Teferi W, Gutreuter S, Bekele A et al. Adapting strategies for effective and efficient pediatric HIV case finding: Risk screening tool for testing children presenting at high-risk entry points. BMC Infectious Diseases. 2022; 22:480. http://doi.org/10.1186/s12879-022-07460-w

See Also

glmpath, roc and auc.

Examples

## Not run: 
data(unicorns)
uniobj1 <- lasso_screenr(testresult ~ Q1 + Q2 + Q3 + Q4 + Q5 + Q6 + Q7,
                          data = unicorns, Nfolds = 10)
methods(class = class(uniobj1))
summary(uniobj1)

## End(Not run)


sgutreuter/screenr documentation built on Nov. 20, 2022, 2:41 a.m.