View source: R/lasso_screenr.R

lasso_screenr    R Documentation

Fitting Screening Tools Using Lasso-Like Regularization of Logistic Models

Description

lasso_screenr is a convenience function which combines logistic regression using L1 regularization, k-fold cross-validation, and estimation of the receiver-operating characteristic (ROC). The in-sample and out-of-sample performance is estimated from the models which produced the minimum AIC and minimum BIC. Execute methods(class = "lasso_screenr") to identify available methods.

Usage

lasso_screenr(
  formula,
  data = NULL,
  Nfolds = 10,
  L2 = TRUE,
  partial_auc = c(0.8, 1),
  partial_auc_focus = "sensitivity",
  partial_auc_correct = TRUE,
  boot_n = 4000,
  conf_level = 0.95,
  standardize = FALSE,
  seed = Sys.time(),
  ...
)

Arguments

formula

an object of class stats::formula defining the testing outcome and predictor variables.

data

a dataframe containing the variables defined in formula. The testing outcome must be binary (0 = no/negative, 1 = yes/positive) or logical (FALSE/TRUE). The predictor variables are typically binary or logical responses to questions which may be predictive of the test result, but numeric variables can also be used.

Nfolds

the number of folds used for k-fold cross validation. Default = 10; minimum = 2, maximum = 100.

L2

(logical) switch controlling penalization using the L2 norm of the parameters. Default: TRUE.

partial_auc

either a logical FALSE or a numeric vector of the form c(left, right) where left and right are numbers in the interval [0, 1] specifying the endpoints for computation of the partial area under the ROC curve (pAUC). The total AUC is computed if partial_auc = FALSE. Default: c(0.8, 1.0).

partial_auc_focus

one of "sensitivity" or "specificity", specifying whether the pAUC should be computed over a range of sensitivity or specificity. partial_auc_focus is ignored if partial_auc = FALSE. Default: "sensitivity".

partial_auc_correct

logical value indicating whether the pAUC should be transformed to the interval from 0.5 to 1.0. partial_auc_correct is ignored if partial_auc = FALSE. Default: TRUE.

boot_n

number of bootstrap replications for computation of confidence intervals for the (partial) AUC. Default: 4000.

conf_level

a number between 0 and 1 specifying the confidence level for confidence intervals for the (partial) AUC. Default: 0.95.

standardize

logical; if TRUE predictors are standardized to unit variance. Default: FALSE (sensible for binary and logical predictors).

seed

random number generator seed for cross-validation data splitting.

...

additional arguments passed to glmpath, roc, auc or ci.

Details

The results provide information from which to choose a probability threshold above which individual out-of-sample probabilities indicate the need to perform a diagnostic test. Out-of-sample performance is estimated using k-fold cross-validation.

lasso_screenr uses the L1 path regularizer of Park and Hastie (2007), as implemented in the glmpath package. Park-Hastie regularization is similar to the conventional lasso and the elastic net. It differs from the lasso with the inclusion of a very small, fixed (1e-5) penalty on the L2 norm of the parameter vector, and differs from the elastic net in that the L2 penalty is fixed. Like the elastic net, the Park-Hastie regularization is robust to highly correlated predictors. The L2 penalization can be turned off (L2 = FALSE), in which case the regularization is similar to the conventional lasso. Like all L1 regularizers, the Park-Hastie algorithm automatically "deletes" covariates by shrinking their parameter estimates to 0.
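For instance, the two regularization variants can be compared side by side; a sketch using the unicorns example data from this package (the formula is shortened here for illustration):

```r
## Default Park-Hastie fit (small fixed L2 penalty) versus a
## lasso-like fit with the L2 penalty turned off.
library(screenr)
data(unicorns)
fit_ph    <- lasso_screenr(testresult ~ Q1 + Q2 + Q3,
                           data = unicorns, Nfolds = 10, L2 = TRUE)
fit_lasso <- lasso_screenr(testresult ~ Q1 + Q2 + Q3,
                           data = unicorns, Nfolds = 10, L2 = FALSE)
```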

The coefficients produced by L1 regularization are biased toward zero. Therefore one might consider refitting the model selected by regularization using maximum-likelihood estimation as implemented in logreg_screenr.

The receiver-operating characteristics are computed using the pROC package.

By default, the partial area under the ROC curve is computed from that portion of the curve for which sensitivity is in the closed interval [0.8, 1.0]. However, the total AUC can be obtained using the argument partial_auc = FALSE. Partial areas can be computed for either ranges of sensitivity or specificity using the arguments partial_auc_focus and partial_auc. By default, partial areas are standardized.
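The pAUC options described above can be combined as follows; a sketch, again using the unicorns example data with a shortened formula:

```r
## Total AUC instead of the default pAUC over sensitivity in [0.8, 1.0]:
fit_total <- lasso_screenr(testresult ~ Q1 + Q2 + Q3,
                           data = unicorns, partial_auc = FALSE)

## pAUC over specificity in [0.9, 1.0] rather than sensitivity:
fit_spec  <- lasso_screenr(testresult ~ Q1 + Q2 + Q3,
                           data = unicorns,
                           partial_auc = c(0.9, 1.0),
                           partial_auc_focus = "specificity")
```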

Out-of-sample performance is estimated using k-fold cross-validation. For a gentle but Python-centric introduction to k-fold cross-validation, see https://machinelearningmastery.com/k-fold-cross-validation/.

Value

lasso_screenr returns (invisibly) an object of class lasso_screenr containing the components:

Call

The function call.

Prevalence

Prevalence of the binary response variable.

glmpathObj

An object of class glmpath returned by glmpath::glmpath. See help(glmpath) and methods(class = "glmpath").

Xmat

The matrix of predictors.

isResults

A list structure containing the results from the two model fits which produced the minimum AIC and BIC values, respectively. The results consist of Coefficients (the logit-scale parameter estimates, including the intercept), isPreds (the in-sample predicted probabilities) and isROC (the in-sample receiver-operating characteristic (ROC) of class roc).

RNG

Specification of the random-number generator used for k-fold data splitting.

RNGseed

RNG seed.

cvResults

A list structure containing the results of k-fold cross-validation estimation of out-of-sample performance.

The list elements of cvResults are:

Nfolds

the number of folds, k

X_ho

the matrix of held-out predictors for each cross-validation fold

minAICcvPreds

the held-out responses and out-of-sample predicted probabilities from AIC-best model selection

minAICcvROC

the out-of-sample ROC object of class roc from AIC-best model selection

minBICcvPreds

the held-out responses and out-of-sample predicted probabilities from BIC-best model selection

minBICcvROC

the out-of-sample ROC object of class roc from BIC-best model selection
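The components listed above can be extracted from the returned object with standard list indexing; a sketch (component names as documented above, formula shortened for illustration):

```r
res <- lasso_screenr(testresult ~ Q1 + Q2 + Q3,
                     data = unicorns, Nfolds = 10)
res$cvResults$Nfolds               ## number of folds k
head(res$cvResults$minAICcvPreds)  ## held-out responses and predictions
res$cvResults$minAICcvROC          ## out-of-sample ROC (class "roc")
```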

References

Park MY, Hastie T. L1-regularization path algorithm for generalized linear models. Journal of the Royal Statistical Society Series B. 2007;69(4):659-677. https://doi.org/10.1111/j.1467-9868.2007.00607.x

Kim J-H. Estimating classification error rate: Repeated cross-validation, repeated hold-out and bootstrap. Computational Statistics and Data Analysis. 2009:53(11):3735-3745. http://doi.org/10.1016/j.csda.2009.04.009

Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez J-C, Muller M. pROC: An open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics. 2011;12(77):1-8. http://doi.org/10.1186/1471-2105-12-77

Teferi W, Gutreuter S, Bekele A et al. Adapting strategies for effective and efficient pediatric HIV case finding: Risk screening tool for testing children presenting at high-risk entry points. BMC Infectious Diseases. 2022; 22:480. http://doi.org/10.1186/s12879-022-07460-w

See Also

glmpath, roc and auc.

Examples

## Not run: 
data(unicorns)
uniobj1 <- lasso_screenr(testresult ~ Q1 + Q2 + Q3 + Q4 + Q5 + Q6 + Q7,
                          data = unicorns, Nfolds = 10)
methods(class = class(uniobj1))
summary(uniobj1)

## End(Not run)


sgutreuter/screenr documentation built on Nov. 20, 2022, 2:41 a.m.