modelvalid | R Documentation |
The function performs internal validation of a binary logistic regression model,
implementing most of the procedure described in:
Arboretti Giancristofaro R, Salmaso L. "Model performance analysis and model validation
in logistic regression". Statistica 2003(63): 375–396.
modelvalid(data, fit, B = 200, g = 10, oneplot = TRUE, excludeInterc = FALSE)
data |
Dataframe containing the dataset (Dependent Variable must be stored in the first column to the left). |
fit |
Object returned from glm() function. |
B |
Desired number of iterations (200 by default). |
g |
Number of groups to be used for the Hosmer-Lemeshow test (10 by default). |
oneplot |
TRUE (default) if the user wants the charts returned in a single visualization. |
excludeInterc |
If set to TRUE, the chart showing the boxplots of the parameters' distribution across the selected iterations will have y-axis limits corresponding to the min and max of the parameters' values; this displays the boxplots of the model parameters more clearly when they would otherwise appear squeezed because the intercept is comparatively much higher or lower. FALSE is default. |
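The g argument sets the number of groups used by the Hosmer-Lemeshow test. As a rough illustration of what such a test computes, the sketch below groups observations by quantiles of the predicted probabilities and compares observed with expected event counts. The function name hl_test and the simulated data are hypothetical, and this is not necessarily how modelvalid() computes the test internally.

```r
# Minimal Hosmer-Lemeshow sketch (hl_test is a hypothetical helper,
# not part of the package described here)
hl_test <- function(y, p, g = 10) {
  # Group observations by quantiles of the predicted probabilities
  breaks <- unique(quantile(p, probs = seq(0, 1, length.out = g + 1)))
  grp <- cut(p, breaks = breaks, include.lowest = TRUE)
  obs <- tapply(y, grp, sum)      # observed events per group
  expd <- tapply(p, grp, sum)     # expected events per group
  n_g <- tapply(p, grp, length)   # group sizes
  stat <- sum((obs - expd)^2 / (expd * (1 - expd / n_g)))
  df <- length(obs) - 2
  list(statistic = stat, df = df,
       p.value = pchisq(stat, df, lower.tail = FALSE))
}

# Simulated data, for illustration only
set.seed(42)
x <- rnorm(500)
y <- rbinom(500, 1, plogis(0.3 + x))
m <- glm(y ~ x, family = binomial)
hl <- hl_test(y, fitted(m), g = 10)
```

A non-significant p-value in hl$p.value hints at an adequate fit, which is why a small share of significant HL p-values across iterations is the desirable outcome.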
The procedure consists of the following steps:
(1) the whole dataset is randomly split into two parts, a fitting (75 percent) and a
validation (25 percent) portion;
(2) the model is fitted on the fitting portion (i.e., its coefficients are computed considering
only the observations in that portion) and its performance is evaluated on both the fitting and
the validation portion, using AUC as performance measure;
(3) the model's estimated coefficients, their p-values, and the p-value of the Hosmer and
Lemeshow test are stored;
(4) steps 1-3 are repeated B times, yielding a fitting and a validation distribution of the AUC
values and of the HL test p-values, as well as a fitting distribution of the coefficients and of
the associated p-values.
The AUC fitting distribution provides an estimate of the performance of the model in the
population of all the theoretical fitting samples; the AUC validation distribution represents an
estimate of the model’s performance on new and independent data.
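The repeated split-and-fit procedure can be sketched in base R as follows. This is only an illustration under stated assumptions: the data are simulated, the variable names (dat, auc_rank, auc_fit, auc_val) are hypothetical, and the AUC is computed with the rank-sum identity, which may differ from what modelvalid() does internally. The Hosmer-Lemeshow step is left out for brevity.

```r
set.seed(1)

# Simulated data (hypothetical); dependent variable in the first column
n <- 400
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x1 - 0.8 * x2))
dat <- data.frame(y = y, x1 = x1, x2 = x2)

# AUC via the rank-sum (Mann-Whitney) identity; ties get average ranks
auc_rank <- function(labels, scores) {
  r <- rank(scores)
  n1 <- sum(labels == 1)
  n0 <- sum(labels == 0)
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

B <- 50
auc_fit <- numeric(B)
auc_val <- numeric(B)
coefs <- matrix(NA_real_, nrow = B, ncol = 3)
pvals <- matrix(NA_real_, nrow = B, ncol = 3)

for (b in seq_len(B)) {
  # (1) random 75/25 split into fitting and validation portions
  idx <- sample(seq_len(n), size = round(0.75 * n))
  fit_part <- dat[idx, ]
  val_part <- dat[-idx, ]
  # (2) fit on the fitting portion, evaluate AUC on both portions
  m <- glm(y ~ x1 + x2, data = fit_part, family = binomial)
  auc_fit[b] <- auc_rank(fit_part$y, predict(m, type = "response"))
  auc_val[b] <- auc_rank(val_part$y,
                         predict(m, newdata = val_part, type = "response"))
  # (3) store the coefficients and their p-values
  coefs[b, ] <- coef(m)
  pvals[b, ] <- summary(m)$coefficients[, "Pr(>|z|)"]
}
```

Comparing boxplot(auc_fit, auc_val) then gives a picture analogous to the AUC chart returned by the function.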
The function returns:
-a chart with boxplots representing the fitting distribution of the
estimated model's coefficients; coefficients' labels are flagged with an asterisk when the
proportion of p-values smaller than 0.05 across the selected iterations is at least 95
percent;
-a chart with boxplots representing the fitting and the validation distribution of
the AUC value across the selected iterations. For an example of the interpretation of the chart,
see the aforementioned article, especially pages 390-391;
-a chart of the levels of the
dependent variable plotted against the predicted probabilities (if the model has a high
discriminatory power, the two stripes of points will tend to be well separated, i.e. the positive
outcome of the dependent variable will tend to cluster around high values of the predicted
probability, while the opposite will hold true for the negative outcome of the dependent
variable);
-a list containing:
$overall.model.significance: statistics related to the overall model p-value and to its distribution across the selected iterations
$parameters.stability: statistics related to the stability of the estimated coefficients across the selected iterations
$p.values.stability: statistics related to the stability of the estimated p-values across the selected iterations
$AUCstatistics: statistics about the fitting and validation AUC distribution
$Hosmer-Lemeshow statistics: statistics about the fitting and validation distribution of the HL test p-values
As for the abovementioned statistics:
-full: statistic estimated on the full dataset;
-median: median of the statistic across the selected iterations;
-QRNG: interquartile range across the selected iterations;
-QRNGoverMedian: ratio between the QRNG and the median,
expressed as percentage;
-min: minimum of the statistic across the selected iterations;
-max: maximum of the statistic across the selected iterations;
-percent_smaller_0.05 (only for $overall.model.significance, $p.values.stability,
and $Hosmer-Lemeshow statistics): proportion of times in which the p-values are smaller
than 0.05; note that for the overall model significance and for the p-values
stability it is desirable that the percentage is at least 95 percent, whereas for the HL test
p-values it is desirable that the proportion is not larger than 5 percent
(in line with the interpretation of the test p-value, which has to be NOT significant in order
to hint at a good fit);
-significant (only for $p.values.stability): asterisk indicating that the p-values of the corresponding coefficient were smaller than 0.05 in at least 95 percent of the iterations.
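As a rough illustration of the statistics listed above, the sketch below computes them for a vector of values collected across iterations. The helper names summarise_iter and percent_smaller are hypothetical, the input values are made up, and the exact quantile method used by modelvalid() is an assumption (R's default is used here).

```r
# Hypothetical helpers, not part of the package described here
summarise_iter <- function(x, full = NA_real_) {
  q <- IQR(x)  # interquartile range (R's default quantile type)
  c(full = full,
    median = median(x),
    QRNG = q,
    QRNGoverMedian = 100 * q / median(x),  # expressed as a percentage
    min = min(x),
    max = max(x))
}

# Share of p-values below the threshold, expressed as a percentage
percent_smaller <- function(p, alpha = 0.05) 100 * mean(p < alpha)

# Made-up AUC values and p-values across five and four iterations
auc_values <- c(0.70, 0.72, 0.74, 0.76, 0.78)
stats <- summarise_iter(auc_values, full = 0.75)
share <- percent_smaller(c(0.01, 0.02, 0.20, 0.03))
```

A small QRNGoverMedian indicates that the statistic varies little across iterations relative to its typical value.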
logregr, aucadj
# load the sample dataset
data(log_regr_data)

# fit a logistic regression model, storing the results into an object called 'model'
model <- glm(admit ~ gre + gpa + rank, data = log_regr_data, family = "binomial")

# run the function, using 100 iterations, and store the result in the 'res' object
res <- modelvalid(data = log_regr_data, fit = model, B = 100)