Calibration of Machine Learning Models in glmnetr

knitr::opts_chunk$set(echo=TRUE)
#library( glmnetr )
#library(survival)
source("~/Documents/Rpackages/glmnetr_supl/_load_glmnetr_vig_241228.R" , echo=TRUE)
set.seed(116291949) 
simdata=glmnetr.simdata(nrows=1000, ncols=100, beta=NULL)
xs = simdata$xs                           # matrix of predictors 
yt = simdata$yt                           # vector of survival times (time to event or censoring)
yg = simdata$y_                           # vector of Gaussian (normal) outcomes
yb = simdata$yb                           # vector of yes/no (binomial) outcomes
event = simdata$event                     # vector of event indicators, 1 if an event and 0 if censored
load( file="~/Documents/Rpackages/glmnetr_supl/using_glmnetr_outputs_241228.RLIST" ) ## 240828 

The Package

The "An Overview of glmnetr" vignette shows how to run the main package function nested.glmnetr() and how to summarize model performances. If one identifies a well performing model according to the metrics in this summary, e.g. concordance, correlation, deviance ratio, linear calibration, one may want to do further evaluation in terms of calibration. The strongest calibration and validation will involve calibration with new, independent datasets. Frequently one will not have immediate access to such new data sets, or one may want first to do an internal validation before subjecting a model to an external validation. Here we consider an internal validation approach using cross validation or bootstrap re-sampling, similar to how we numerically assessed model performance.

An Example Analysis

To explore calibration we first consider the nested.glmnetr() call from the "An Overview of glmnetr" vignette, which fits machine learning models to survival data with family="cox", i.e.

set.seed(465783345) 
nested.cox.fit = nested.glmnetr(xs, NULL, yt, event, family="cox", 
                                dolasso=1, dostep=1, steps_n=40, folds_n=10, track=1)  
nested.cox.fit$fits[c(2:5,7,8)] = 0 

Linear Calibration

Using either print() or summary() on the output object nested.cox.fit one obtains, amongst other information, summaries of the linear calibration slopes and intercepts, as in

summary( nested.cox.fit ) 

Here we see that for many of the models the linear calibration slope term is near 1, the ideal for perfect calibration. For the Cox model any intercept term can be absorbed into the baseline survival function and there is no pertinent intercept term for calibration.
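
As an illustration of what the linear calibration slope represents, the following sketch regresses the observed survival outcome on a model's X*Beta. This is a conceptual sketch only, using the naive (in-sample) X*Beta's; it is not the computation used internally by nested.glmnetr(), which is based upon the hold-out X*Beta's from the resamples.

# Conceptual sketch: a Cox linear calibration slope is the coefficient from
# regressing the outcome on the model X*Beta (here the naive, in-sample X*Beta)
xb.naive = predict(object=nested.cox.fit, xs_new=xs, lam=NULL, gam=NULL, comment=FALSE)
cal.lin = coxph(Surv(yt, event) ~ xb.naive)      # requires the survival package
coef(cal.lin)                                    # a slope near 1 suggests good linear calibration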

A First Visual

An initial calibration assessment was made in the overview vignette by regressing the observed outcomes on the predicteds from the final model based upon the relaxed lasso. This regression was made using splines, in particular the pspline() function within a coxph() call, as in

# Get predicteds from CV relaxed lasso model embedded in nested CV outputs & Plot
xb.hat = predict( object=nested.cox.fit , xs_new=xs, lam=NULL, gam=NULL, comment=FALSE) 
# Fit a spline to xb.hat using coxph, and plot 
#library( survival )                  ## load survival package for Cox model fits 
fit1 = coxph(Surv(yt, event) ~ pspline(xb.hat, df=3))
summary(fit1)

followed by plotting with

termplot(fit1,term=1,se=TRUE, rug=TRUE, lwd.term=2, lwd.se=2, lty.se=1) # , col.term=1, col.se=2
abline(a=0,b=1,lty=3)
title ("Naive calibration curve for relaxed lasso model")

The spline fits may help to understand potential nonlinearities in the model. Here we see a calibration curve which is not far from linear. Still, as noted in the "An Overview of glmnetr" vignette, because the same data are used for model evaluation as for model derivation, it is hard to put much confidence in such a calibration plot: the potential bias may suggest a better fit than can be expected for new data.

Calibration Using Spline Fits and Resampling

For each of the models fit, nested.glmnetr() saves the X*Beta's from the final model. The nested.glmnetr() function also calculates the X*Beta's for the hold-out data of each partitioning, i.e. each hold-out fold of the outer loop of nested cross validation, or the out-of-bag items not selected by the sampling with replacement of the bootstrap. In this manner there are multiple subsets, e.g. k from the k-fold nested CV, for which the X*Beta's are calculated from independent observations, and each of these subsets can contribute to calibrating the final model. While each of these calibrations will individually carry limited information, when combined following the principles of cross validation or bootstrap sampling they collectively provide a more meaningful evaluation. This is done by the calplot() function, as in

calplot(nested.cox.fit, wbeta=5)

Here we see a smooth, nearly linear predicted log hazard ratio as a function of the model X*Beta from the relaxed lasso model. The bounding lines in red depict the average +/- 2 standard errors (SE), to assist in assessing whether any deviation from the ideal identity line, or any nonlinearity, is meaningful. In these curves the central region drawn with solid lines denotes the region within the range of all the calibration spline fits, i.e. where the spline fits from all the different leave-out folds of the CV overlap without extrapolation. The dashed lines depict regions out of range for at least one of the leave-out folds. Because spline fits can be rather uncertain when extrapolating beyond the data range, one should be more cautious about drawing strong conclusions in the dashed regions of these plots. Here we see somewhat wider confidence bounds about the overall calibration curve.

Bengio et al. showed that the standard deviations from cross validation might not accurately estimate the actual standard errors. Bates et al. showed that the cross validation (CV) estimates and standard errors may be biased, and should thus be viewed with caution. In particular, the CV estimates may more closely estimate the expected performance measure over multiple samples than the performance of the model based upon the observed data, and using the CV standard errors when constructing confidence intervals may be associated with mis-coverage of three times the nominal non-coverage. This is discussed further in the vignette "An Overview of glmnetr".

A naive calibration curve similar to that shown above (the first figure) can also easily be obtained with the calplot() function by specifying resample=0, i.e. not using the resample information for construction, as in

calplot(nested.cox.fit, wbeta=5, resample=0, xlim=c(-10,9), ylim=c(-15,10), col.term=2, col.se=7)

The code above using the termplot() function is provided to show our general approach for deriving the calibration curves.

In the nested CV figure we see two rugs, one below and one above the plotted region. The rug below depicts the model X*Beta's which are not associated with an event and the rug above depicts the X*Beta's which are associated with events. When there are many data points it can be hard to read these rugs. One can use the vref option in calplot() to draw two vertical lines, the first separating the smaller vref% of the X*Beta's from the rest, and the second separating the larger vref% of the data. To depict the hazard ratios (HR) instead of the X*Beta for the Cox model one can use the option plothr, assigning a numerical value for the product between tick marks, e.g. exp(1) or 10. Combining these two options we have the example

calplot(nested.cox.fit, wbeta=5, vref=1, plothr=100)

The user can also choose different colors for the lines with the options col.term and col.se. One can also specify xlim and ylim in case a few data points cause an excessive amount of white space or an odd aspect ratio in the plots.

To view the calibration plots from the individual leave-out cross validation folds, one may specify foldplot=1. Because this generates many figures, we omit producing plots with this option in this vignette and instead assign plotfold=1, which overlays the individual calibration curves, albeit without the +/- 2 SE limits for the individual CV folds. The overall calibration (the average of the individual CV fold calibrations) and the overall +/- 2 SE limits are maintained.

nested.cox.fit$xbetas.cv = nested.cox.fit$xbetas.cv[,c(1:15,26,27)] 
calplot(nested.cox.fit, 5, plotfold=1, vref=1, plothr=100)

As we see from the above calls, the first term in the calplot() function call is an output object from a nested.glmnetr() call. The second term, wbeta, specifies "which beta", i.e. which model is to be used for deriving the model X*Beta's. Here, as we see in the figure x-axis label, the 5 selects the relaxed lasso model. Rather than memorizing a hard to remember key, the user can leave this term unspecified and a key will be printed to the R console. The actual numbers for the different models depend on which models are fit, so this key is dynamic.

calplot(nested.cox.fit)

From this key we read off the numbers corresponding to the respective models. The variance in X*Beta for the "null" model is 0 as the intercept for the Cox model is arbitrarily assigned the value of 0 for each resample model fit. From this key we see we can produce a calibration plot for the ridge regression model by setting wbeta = 8 (wbeta for "which beta"), as in

calplot(nested.cox.fit, 8)

Here we see the model is not ideally calibrated, as the calibration curve largely does not include the identity line, and it requires a correction to achieve an unbiased (or less biased) estimation of the hazard ratio. Inspecting the calibration curve for a stepwise regression model

calplot(nested.cox.fit, 16)

we see that the response is roughly linear, but numerically at least there appears to be some correction for overfitting.

To obtain the numerical values used to construct these calibration plots one may specify plot=0 (or plot=2 to both plot and obtain the numerical data); the data are returned in list format, as in

tmp = calplot(nested.cox.fit, 5, plot=0)
str(tmp)

These data may be further processed by the user.

A Binomial Model

For nested.glmnetr() analyses with family="binomial", using the call

yb = simdata$yb 
nested.bin.fit = nested.glmnetr(xs,NULL,yb,NULL,family="binomial",
      dolasso=1, doxgb=list(nrounds=250), dorf=1, doorf=1, doann=1, 
      folds_n=10, seed=219301029, track=1) 

an example calibration plot is

calplot(nested.bin.fit, 5, plotfold=0) 

Since these data were generated with probabilities exp(X*Beta)/(1+exp(X*Beta)), we expect the model to calibrate linearly (a sketch of this generating mechanism is given at the end of this section). Next we look at the "fully" relaxed model, where an unpenalized model is fit based upon the non-zero terms of the fully penalized lasso model.

calplot(nested.bin.fit, 7, plotfold=0) 

This may calibrate slightly better but is not as linear as we might expect.
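
As a point of reference, the following minimal sketch illustrates a logistic data generating mechanism of the form described above, i.e. with P(y=1) = exp(X*Beta)/(1+exp(X*Beta)). This is an illustration of the general mechanism only, with hypothetical dimensions and coefficients, and is not the actual code used by glmnetr.simdata().

# Illustration only: simulate binomial outcomes from a logistic model
n = 1000 ; p = 10
x = matrix(rnorm(n*p), nrow=n, ncol=p)        # hypothetical predictor matrix
b = c(1, -0.5, 0.5, rep(0, p-3))              # hypothetical coefficient vector
xb = x %*% b                                  # linear predictor X*Beta
pr = exp(xb) / (1 + exp(xb))                  # event probabilities
y = rbinom(n, size=1, prob=pr)                # yes/no (binomial) outcomes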

A Binomial Model Calibrated Using Bootstrap

Considering this, we next fit models using the bootstrap, that is, we fit models based upon random samples drawn with replacement from the original sample, of the same size as the original sample. Then we fit calibration curves using the out-of-bag sample units for each bootstrap-sample fitted model, that is, the elements of the original sample that are not selected by the bootstrap sample. The number of bootstrap samples used for the calculation is specified using the bootstrap option in the nested.glmnetr() call,

yb = simdata$yb 
nested.bin.boot.fit = nested.glmnetr(xs,NULL,yb,NULL,family="binomial",
      dolasso=1, dorf=1, 
      folds_n=10, seed=219301029, track=1, bootstrap=20) 
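
To make the out-of-bag idea concrete, the following minimal sketch shows how a single bootstrap sample and its out-of-bag units could be identified. This is an illustration only and not the glmnetr internals.

# Illustration only: one bootstrap resample and its out-of-bag (OOB) units
n = nrow(xs)
inbag = sample(1:n, size=n, replace=TRUE)     # drawn with replacement, same size as the original sample
oob = setdiff(1:n, inbag)                     # original observations never selected by the resample
length(oob)                                   # typically about 37% of n, i.e. roughly n*exp(-1)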

The Out-Of-Bag (OOB) calibration plots can then be constructed using the calplot() function as before,

calplot(nested.bin.boot.fit, 5, plotfold=1) 

Here we see similar yet slightly different characteristics between the (nested) cross validation and bootstrap calibration plots. While technically the bootstrap might seem more accurate due to the confidence interval containing the ideal line (which we know applies due to the way we simulated the data), this may also be due to the random nature of the re-sample selection process. For the bootstrap one can run a larger number of re-samples to minimize this effect. For the cross validation one may repeat the whole process many times (see the glmnet package vignettes) and then average, but we have not done this here.

Variability in Bootstrap In-Bag Calibrations

In earlier works Austin et al. as well as Riley et al. fit models based upon bootstrap samples and then fit calibration curves based upon the in-bag data points and the model X*Beta's (predicteds). This mimics the usual procedure for bootstrap analysis. This can be done with the calplot() function by setting the oob option to 0,

calplot(nested.bin.boot.fit, wbeta=5, oob=0, plotfold=1)

This figure shows the variability expected in the construction of calibration curves as described by the above authors. These curves, however, make no adjustment for the bias correction possible using the bootstrap. Making the usual bias adjustment we have XBeta_full - (XBeta_i - XBeta_full), or 2*XBeta_full - XBeta_i (a small numerical illustration is given below, after the figure). These can be produced by setting the bootci option to 1, as with the call

calplot(nested.bin.boot.fit, wbeta=5, bootci=1, plotfold=1)

This figure differs from the simple in-bag calibrations. These two figures differ meaningfully from the out-of-bag calibration curves described above.
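
For clarity, the bias adjustment described above amounts to the following simple transformation, shown here with small hypothetical vectors rather than glmnetr output.

# Illustration only: the usual bootstrap bias adjustment of the X*Beta's
xbeta.full = c(-1.2, 0.3, 2.0)                # X*Beta's from the full-sample model fit
xbeta.i = c(-1.5, 0.4, 2.6)                   # X*Beta's from the i-th bootstrap model fit
xbeta.adj = 2*xbeta.full - xbeta.i            # = xbeta.full - (xbeta.i - xbeta.full)
xbeta.adj                                     # bias adjusted X*Beta's: -0.9 0.2 1.4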

Bootstrap In-Bag Calibration for a Random Forest

These bootstrap plots above were for the relaxed lasso model fit. We now examine similar bootstrap calibration plots for random forest models. The variability in the in-bag bootstrap calibration curves for the random forest model is depicted in the graph

calplot(nested.bin.boot.fit, wbeta=16, bootci=0, oob=0, plotfold=1) 

Here we see a strong deviation from the identity line. The bias corrected bootstrap calibration curves

calplot(nested.bin.boot.fit, wbeta=16, bootci=1, plotfold=1)

have a more unusual and unexpected deviation from the ideal calibration curve of the identity line. Returning to the out-of-bag bootstrap calibration plots

calplot(nested.bin.boot.fit, wbeta=16, plotfold=1)

we see that these too are highly variable but more consistent with the ideal calibration curve, the identity line, than the in-bag calibration curves.

While these two in-bag calibration curve sets strongly depart from the ideal of the identity line, the out-of-bag curves, while possibly highly variable, were as a group consistent with the ideal of the identity line. The bootstrap out-of-bag calibration curves have a clearer basis and depict more reasonable findings. In a sense this is not unexpected. When fitting machine learning models one typically evaluates model fit based upon the out-of-bag data points and not the in-bag data points. A difficulty here could be that the calibration curves, calculated at particular values of X*Beta, do not describe a well defined parameter in terms of the distribution function. Potentially as well, the fitting process for machine learning models might not lend itself to the assumptions typically used to support bootstrap inferences. In particular, as we see here when fitting random forest models, with the multiple repeat observations in the bootstrap resamples the model refits are overly optimistic, more strongly overfit, and give more extreme X*Beta's, as evidenced in these figures. We present the in-bag and bias adjusted bootstrap calibration curves not to promote their usage but to 1) show the variability in model fit and 2) show how they may give poor inferences even for a simple dataset. The user may want to generate other datasets of different known form and see how the bootstrap performs for their data.

A Normal (Gaussian Errors) Model

First calculating the numerical summaries of prediction performance

nested.gau.fit = nested.glmnetr(xs,NULL,yg,NULL,family="gaussian",
      dolasso=1, doxgb=list(nrounds=250), dorf=1, doorf=1, doann=list(bestof=10), 
      folds_n=10, seed=219301029, track=1) 

and then plotting

calplot(nested.gau.fit, wbeta=7)

we see a small but probably not significant deviation from the ideal calibration line, the identity function.

Perspective

The summary and calibration plot functions used here do not address everything needed for model validation and calibration, but they do allow a meaningful and unbiased (or minimally biased) summary of model fits. The original outcome variable and the X*Beta's are stored as vectors or matrices in the output, with names y_, xbetas (full model), xbetas.cv (cross validation), and xbetas.boot.oob and betas.boot.inb (bootstrap out-of-bag and in-bag), allowing the user to further inspect the model fits and to perform other means of calibration. For cross validation analyses the fold information is contained in the output object in the vector foldid. For bootstrap analyses the first column of xbetas.boot.oob and betas.boot.inb identifies the replication of the bootstrap sampling process and the second column the sequential index of the data point in the input dataset.
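
As a hedged sketch of such further inspection, one could for example calculate a per-fold linear calibration slope for the gaussian model from the stored cross validation X*Beta's. This sketch assumes xbetas.cv is a matrix whose rows align with y_ and foldid and whose columns correspond to the fitted models; the column index used below is illustrative and should be matched to a model using the key printed by calplot(nested.gau.fit) or the column names.

# Hedged sketch: per-fold linear calibration slopes from stored CV X*Beta's
y.obs = nested.gau.fit$y_                     # original outcomes, as stored in the output
xb.cv = nested.gau.fit$xbetas.cv[ , 5]        # hold-out X*Beta's for one model (illustrative column)
fold = nested.gau.fit$foldid                  # fold assignments from the outer CV loop
sapply(sort(unique(fold)), function(k) {
  coef(lm(y.obs[fold == k] ~ xb.cv[fold == k]))[2]   # calibration slope within each hold-out fold
})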

References

Austin PC, Steyerberg EW. Graphical assessment of internal and external calibration of logistic regression models by using loess smoothers. Stat Med. 2014; 33(3):517-35. doi: 10.1002/sim.5941.

Bates S, Hastie T, Tibshirani R. Cross-validation: what does it estimate and how well does it do it? 2022. https://arxiv.org/abs/2104.00673.

Bengio Y, Grandvalet Y. No Unbiased Estimator of the Variance of K-Fold Cross-Validation. Journal of Machine Learning Research. 2004; 5:1089-1105. https://www.jmlr.org/papers/volume5/grandvalet04a/grandvalet04a.pdf.

Riley RD, Pate A, Dhiman P, Archer L, Martin GP, Collins GS. Clinical prediction models and the multiverse of madness. BMC Med. 2023; 21(1):502. doi: 10.1186/s12916-023-03212-y.


