evalload = TRUE ; evalmodel = FALSE ; # evalload = FALSE ; evalmodel = TRUE ; knitr::opts_chunk$set(echo = TRUE) #library(glmnetr) #library(survival) source("~/Documents/Rpackages/glmnetr_supl/_load_glmnetr_vig_241228.R" , echo=TRUE) #library(Matrix)
The stepreg() and cv.stepreg() funcitons in the glmnetr package were written for convenience and stability as opposed to speed or broad applicability. When fitting lasso models we wanted to compare these to standard stepwise regression models. Keeping a more modern approach we tune by either number of terms included in the model (James, Witten, Hastie and Tibshirani, An Introduction to Statistical Learning with applications in R, 2nd ed., Springer, New York, 2021) or by the p critical value for model inclusion, as this too is a common tuning parameter when fitting stepwise models.
When fitting lasso models we often use one-hot coding for predictor factors when setting up the design matrix. This allows lasso to identify and add to the model a term for any one group that might be particularly different from the others. By the penalty lasso stabilizes the model coefficients and keeps them from going to infinity, while ridge will generally uniquely identify coefficients despite any strict collinearities.
Before writing this program we tried different available packages to fit stepwise models for the Cox repression framework but all we tried had difficulties with numerical stability for the large and wide clinical datasets we were working with, and which involved one-hot coding. There may well be a package that would be stable for the data we were analyzing but we decided to write this small function to be able to tune for stability.
This program is slow but our goal was not for routine usage but to use the stepwise procedure on occasion as a reference for the lasso models. For many clinical datasets the lasso clearly outperformed the stepwise procedure, and ran much faster. For many simulated data sets with simplified covariance structures, i.e. independence of the underlying predictors, the lasso did not appear to do much better than the stepwise procedure tuned by number of model terms or p.
The data requirements for stepreg() and cv.stepreg() are similar to those of cv.glmnetr() and we refer to the Using glmnetr vignette for a description.
To demonstrate usage of cv.stepreg we first generate a data set for analysis, run an analysis and evaluate. Following the Using glmnetr vignette, the code
# Simulate data for use in an example survival model fit # first, optionally, assign a seed for random number generation to get applicable results set.seed(116291950) simdata=glmnetr.simdata(nrows=1000, ncols=100, beta=NULL)
generates simulated data for analysis. We extract data in the format required for input to the cv.stepreg (and glmnetr) programs.
# Extract simulated survival data xs = simdata$xs # matrix of predictors y_ = simdata$yt # vector of survival times event = simdata$event # indicator of event vs. censoring
Inspecting the predictor matrix we see
# Check the sample size and number of predictors print(dim(xs)) # Check the rank of the design matrix, i.e. the degrees of freedom in the predictors Matrix::rankMatrix(xs)[[1]] # Inspect the first few rows and some select columns print(round(xs[1:10,c(1:12,18:20)],digits=6))
To fit stepwise regression models where the number of model terms are informed by cross validation to select df, the number of model terms, and p, the entry threshold, we can use the function cv.stepreg() function.
load( file="~/Documents/Rpackages/glmnetr_supl/using_stepreg_outputs_241228.RLIST" ) #nested.gau.fit$fits[1] = 0
# Fit a relaxed lasso model informed by cross validation cv.stepwise.fit = cv.stepreg(xs,NULL,y_,event,family="cox",folds_n=5,steps_n=30,track=0)
Note, in the derivation of the stepwise regression models, individual coefficients may be unstable even when the model may be stable which elicits warning messages. Thus we "wrapped" the call to cv.stepreg() within the suppressWarnings() function to suppress excessive warning messages in this vignette. The first term in the call to cv.stepreg(), xs, is the design matrix for predictors. The second input term, here NULL, is for the start time in case (start, stop) time data setup is used in a Cox survival model. The third term is the outcome variable for the linear regression or logistic regression model and the time of event or censoring in case of the Cox model, and finally the forth term is the event indicator variable for the Cox model taking the value 1 in case of an event or 0 in case of censoring at time y_. The forth term would be NULL for either linear or logistic regression. Currently the options for family are "guassian" for linear regression, "binomial" for logistic regression (both using the stats glm() function) and "cox" for the Cox proportional hazards regression model using the coxph() function of the R survival package. If one sets track=1 the program will update progress in the R console. For track=0 it will not. To summarize the model fit and inspect the coefficient estimates we use the summary() function.
# summarize model fit ... summary(cv.stepwise.fit)
To extract beta's or calculate predicteds we use the predict() function.
# get betas ... betas = predict(cv.stepwise.fit) t( betas[1:20,] ) # predicteds ... preds = predict(cv.stepwise.fit, xs) t( preds[1:14,] )
Because the values choice for df (number of model terms) or p (significnae level for inclusion) informed by CV are specifically chosen to give a best fit, model fit statistics for the CV derived model will be biased. To address this one can perform a CV on the CV derived estimates, that is a nested cross validation as argued for in SRDM (Simon R, Radmacher MD, Dobbin K, McShane LM. Pitfalls in the Use of DNA Microarray Data for Diagnostic and Prognostic Classification. J Natl Cancer Inst (2003) 95 (1): 14-18. https://academic.oup.com/jnci/article/95/1/14/2520188). This is done here by the nested.glmnetr() function.
# A nested cross validation to evaluate a cross validation informed stepwise fit #nested.cox.fit = nested.glmnetr(xs,NULL,y_,event,family="cox", # dostep=1,doaic=1,folds_n=5,steps_n=30,track=0) y_ = simdata$y_ nested.gau.fit = nested.glmnetr(xs,NULL,y_,NULL,family="gaussian", dostep=1,doaic=1,folds_n=3,steps_n=30,track=0)
# A nested cross validation to evaluate a cross validation informed stepwise fit y_ = simdata$y_ nested.gau.fit = nested.glmnetr(xs,NULL,y_,NULL,family="gaussian", dostep=1,doaic=1,folds_n=3,steps_n=30,track=1)
For this example we use 3 folds. We would generally using between 5 or 10 folds in practice, to get reasonable run times and to better allow variability in variable selection.
#names(nested.gau.fit) summary(nested.gau.fit)
Before providing analysis results the output first reports sample information line sample size, the number of predictors and the df (degrees of freedom) of the design matrix.
Next are the nested cross validation results. First are the per record (or per event in case of the Cox model) log-likelihoods which reflect the amount of information in each observation. Since we are not using large sample theory to base inferences we feel the per record are more intuitive, and they allow comparisons between datasets with unequal sample sizes. Next are the average number of model terms which reflect the complexity of the different models, even if in a naive sense, followed by the agreement statistics, concordance or r-square. These nested cross validated concordances should be essentially unbiased for the given design, unlike the naive concordances where the same data are used to derive the model and calculate the concordances (see SRDM). In this output we are also able to compare the performance of the stepwise regression models with those of the lasso models.
In addition to evaluating the CV informed model fits using another layer of CV, the nested.glmnetr() function does the CV fits based upon the whole data set. Here we see, not unexpectedly, that the model fit measures from the nested CV are somewhat smaller than those naively calculated using the original dataset. Depending on the data the nested CV and naive agreement measures can be very similar or disparate.
Fit information for the CV fit can be gotten by extracting the object$cv.stepreg.fit object and calling the summary() and predict() functions.
# Summary of a CV model fit from a nested CV output object summary(nested.gau.fit$cv.stepreg.fit) # get betas ... betas = predict(nested.gau.fit$cv.stepreg.fit) t( betas[1:10,] ) # get predicteds ... preds = predict(nested.gau.fit$cv.stepreg.fit,xs) t( preds[1:8,] )
save( cv.stepwise.fit , nested.gau.fit , file="~/Documents/Rpackages/glmnetr_supl/using_stepreg_outputs_241228.RLIST" )
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.