View source: R/calculate_performance.R
calculate_performance    R Documentation
This function calculates the different performance measures of prognostic models and factors. Please see below for more details.
calculate_performance(outcome_type, time, outcome_count, actual, predicted,
develop_model, lp)
outcome_type
One of 'binary', 'time-to-event', 'quantitative'. Count outcomes are included in the 'quantitative' outcome type and can be differentiated from continuous outcomes by specifying outcome_count as TRUE. Please see examples below.
time
Times at which the outcome was measured. This is applicable only for the 'time-to-event' outcome. For other outcome types, enter NA.
outcome_count
TRUE if the outcome was a count outcome and FALSE otherwise.
actual
A vector of actual values.
predicted
A vector of predicted values.
develop_model
TRUE if a model was developed; FALSE if a scoring system with a predetermined threshold (if applicable) was used.
lp
A vector of linear predictors (applicable only if you have developed a model).
General comment
Most of the input parameters are already available from the generic and specific input
parameters created using create_generic_input_parameters and
create_specific_input_parameters. This function is used by the
compile_results function, which supplies the correct input parameters
based on the entries made in create_generic_input_parameters and
create_specific_input_parameters and on the output from
calculate_actual_predicted.
Performance measures
The performance was measured by the following parameters.
Accuracy
Number of correct predictions divided by the number of participants in whom the predictions were made (Rainio et al., 2024).
Calibration
Three measures of calibration are used.
Observed/expected ratio
Please see Riley et al., 2024. Values closer to 1 are better; ratios < 1 indicate overestimation of risk by the model, while ratios > 1 indicate underestimation of risk by the model (Riley et al., 2024).
We treated underestimation and overestimation equally, i.e., an observed/expected ratio of 0.8 was considered equivalent to 1/0.8 = 1.25. Therefore, we converted the observed/expected ratios to be in the same direction (less than 1) ('modified observed/expected ratio'). This ensured that, while calculating the test performance and bootstrap performance, lower numbers consistently indicated worse performance (as they are more distant from 1) and higher numbers consistently indicated better performance (noting that the maximum value of the modified observed/expected ratio is 1). This modification also helps in interpreting comparisons between different models, some of which may overestimate the risk while others may underestimate it.
For assessing the calibration, when the expected events were zero, 0.5 was added to both the observed events and the expected events.
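As a minimal sketch (not the package implementation), the accuracy, observed/expected ratio, and modified observed/expected ratio described above could be computed as follows for a binary outcome; the vectors actual, predicted_risk, and predicted_class are hypothetical.
# Illustrative sketch only, assuming a binary outcome coded 0/1
actual          <- c(0, 1, 1, 0, 1, 0, 0, 1)
predicted_risk  <- c(0.2, 0.7, 0.6, 0.3, 0.4, 0.1, 0.5, 0.8)
predicted_class <- as.numeric(predicted_risk >= 0.5)
# Accuracy: correct predictions / number of participants
accuracy <- mean(predicted_class == actual)
# Observed/expected ratio: observed events / sum of predicted risks,
# adding 0.5 to both when the expected events are zero
observed <- sum(actual)
expected <- sum(predicted_risk)
if (expected == 0) {
  observed <- observed + 0.5
  expected <- expected + 0.5
}
oe_ratio <- observed / expected
# Modified observed/expected ratio: converted to be at most 1, so lower
# values consistently indicate worse calibration
modified_oe_ratio <- min(oe_ratio, 1 / oe_ratio)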
Calibration intercept and calibration slope
Calibration slope quantifies the spread of the risk probabilities in relation to the observed events (Stevens et al., 2020). We used the methods described by Riley et al., 2024 to calculate the calibration intercept and slope for all outcomes other than time-to-event outcomes. Essentially, this involves the following regression equation: Y = calibration intercept + coefficient * linear predictor, where 'Y' is the log odds of the observed event, the log risk of the observed event, and the untransformed outcome for binary, count, and continuous outcomes respectively.
Estimation for time-to-event outcomes is a lot more uncertain and should be considered experimental. Please note that Cox regression does not have a separate intercept, as the intercept is included in the baseline hazard (SAS Support, 2017).
Values closer to 1 indicate better performance when the intercept is close to 0; values further away from 1 indicate that the predictions are incorrect in some ranges (Van Calster et al., 2019; Riley et al., 2024; Stevens et al., 2020). The further away from 1, the worse the relationship of the linear predictor with the log odds of the observed event, the log hazard, the log risk of the observed event, or the untransformed outcome.
To allow easy comparison, with lower values indicating a slope closer to 1 and higher values indicating a slope further away from 1, this function also calculates the 'modified calibration slope' using the following formula: modified calibration slope = absolute value(1 - calibration slope).
Calibration intercept (also called "calibration-in-the-large") in the calibration regression equation evaluates whether the observed event proportion equals the average predicted risk (Van Calster et al., 2019).
This function also calculates the 'modified calibration intercept' as the absolute value of the calibration intercept, so that lower values indicate an intercept closer to 0 and higher values indicate an intercept further away from 0.
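As a minimal sketch (not the package implementation), the calibration intercept, calibration slope, and their modified versions could be obtained for a binary outcome from the regression equation above; the vectors actual and lp are hypothetical.
# Illustrative sketch only: logistic recalibration for a binary outcome
actual <- c(0, 1, 1, 0, 1, 0, 0, 1)
lp     <- c(-1.4, 0.8, 0.4, -0.9, -0.4, -2.2, 0.0, 1.4)
# Log odds of the observed event regressed on the linear predictor
recalibration_fit <- glm(actual ~ lp, family = binomial)
calibration_intercept <- unname(coef(recalibration_fit)[1])
calibration_slope     <- unname(coef(recalibration_fit)[2])
# Modified versions: lower values indicate an intercept closer to 0 and a
# slope closer to 1
modified_calibration_intercept <- abs(calibration_intercept)
modified_calibration_slope     <- abs(1 - calibration_slope)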
C-statistic
This is the area under the ROC curve and a measure of discrimination (Riley et al., 2024). This was calculated using roc. Higher values indicate better performance.
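As a minimal sketch (not the package implementation), the C-statistic could be calculated for a binary outcome with the roc function (assumed here to be pROC::roc); the vectors actual and predicted_risk are hypothetical.
# Illustrative sketch only: C-statistic (area under the ROC curve)
library(pROC)
actual         <- c(0, 1, 1, 0, 1, 0, 0, 1)
predicted_risk <- c(0.2, 0.7, 0.6, 0.3, 0.4, 0.1, 0.5, 0.8)
roc_curve   <- roc(response = actual, predictor = predicted_risk, quiet = TRUE)
c_statistic <- as.numeric(auc(roc_curve))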
output
A dataframe of the calculated evaluation parameters.
Kurinchi Gurusamy
Rainio O, Teuho J, Klén R. Evaluation metrics and statistical tests for machine learning. Scientific Reports. 2024;14(1):6086.
Riley RD, Archer L, Snell KIE, Ensor J, Dhiman P, Martin GP, et al. Evaluation of clinical prediction models (part 2): how to undertake an external validation study. BMJ. 2024;384:e074820.
SAS Support. https://support.sas.com/kb/24/457.html (accessed on 16 January 2026).
Stevens RJ, Poppe KK. Validation of clinical prediction models: what does the "calibration slope" really measure? Journal of Clinical Epidemiology. 2020;118:93-9.
Van Calster B, McLernon DJ, van Smeden M, Wynants L, Steyerberg EW, Bossuyt P, et al. Calibration: the Achilles heel of predictive analytics. BMC Medicine. 2019;17(1):230.
roc
library(survival)
colon$status <- factor(as.character(colon$status))
# For testing, only 5 simulations are used here. Usually at least 300 to 500
# simulations are recommended as a minimum; increasing the number of simulations
# leads to more reliable results. The default of 2000 simulations should provide
# reasonably reliable results.
generic_input_parameters <- create_generic_input_parameters(
general_title = "Prediction of colon cancer death", simulations = 5,
simulations_per_file = 20, seed = 1, df = colon, outcome_name = "status",
outcome_type = "time-to-event", outcome_time = "time", outcome_count = FALSE,
verbose = FALSE)$generic_input_parameters
analysis_details <- cbind.data.frame(
name = c('age', 'single_mandatory_predictor', 'complex_models',
'complex_models_only_optional_predictors', 'predetermined_model_text'),
analysis_title = c('Simple cut-off based on age', 'Single mandatory predictor (rx)',
'Multiple mandatory and optional predictors',
'Multiple optional predictors only', 'Predetermined model text'),
develop_model = c(FALSE, TRUE, TRUE, TRUE, TRUE),
predetermined_model_text = c(NA, NA, NA, NA,
"cph(Surv(time, status) ~ rx * age, data = df_training_complete, x = TRUE, y = TRUE)"),
mandatory_predictors = c(NA, 'rx', 'rx; differ; perfor; adhere; extent', NA, "rx; age"),
optional_predictors = c(NA, NA, 'sex; age; nodes', 'rx; differ; perfor', NA),
mandatory_interactions = c(NA, NA, 'rx; differ; extent', NA, NA),
optional_interactions = c(NA, NA, 'perfor; adhere; sex; age; nodes', 'rx; differ', NA),
model_threshold_method = c(NA, 'youden', 'youden', 'youden', 'youden'),
scoring_system = c('age', NA, NA, NA, NA),
predetermined_threshold = c('60', NA, NA, NA, NA),
higher_values_event = c(TRUE, NA, NA, NA, NA)
)
write.csv(analysis_details, paste0(tempdir(), "/analysis_details.csv"),
row.names = FALSE, na = "")
analysis_details_path <- paste0(tempdir(), "/analysis_details.csv")
# verbose is TRUE by default. If you do not want the output displayed, you can
# change this to FALSE
results <- create_specific_input_parameters(
generic_input_parameters = generic_input_parameters,
analysis_details_path = analysis_details_path, verbose = TRUE)
specific_input_parameters <- results$specific_input_parameters
# Set a seed for reproducibility - Please see details above
set.seed(generic_input_parameters$seed)
prepared_datasets <- {prepare_datasets(
df = generic_input_parameters$df,
simulations = generic_input_parameters$simulations,
outcome_name = generic_input_parameters$outcome_name,
outcome_type = generic_input_parameters$outcome_type,
outcome_time = generic_input_parameters$outcome_time,
verbose = TRUE)}
# There is usually no requirement to call this function directly. It is used
# by the perform_analysis function to create the actual and predicted values.
specific_input_parameters_each_analysis <- specific_input_parameters[[1]]
actual_predicted_results_apparent <- {calculate_actual_predicted(
prepared_datasets = prepared_datasets,
outcome_name = generic_input_parameters$outcome_name,
outcome_type = generic_input_parameters$outcome_type,
outcome_time = generic_input_parameters$outcome_time,
outcome_count = generic_input_parameters$outcome_count,
develop_model = specific_input_parameters_each_analysis$develop_model,
predetermined_model_text =
specific_input_parameters_each_analysis$predetermined_model_text,
mandatory_predictors = specific_input_parameters_each_analysis$mandatory_predictors,
optional_predictors = specific_input_parameters_each_analysis$optional_predictors,
mandatory_interactions = specific_input_parameters_each_analysis$mandatory_interactions,
optional_interactions = specific_input_parameters_each_analysis$optional_interactions,
model_threshold_method = specific_input_parameters_each_analysis$model_threshold_method,
scoring_system = specific_input_parameters_each_analysis$scoring_system,
predetermined_threshold = specific_input_parameters_each_analysis$predetermined_threshold,
higher_values_event = specific_input_parameters_each_analysis$higher_values_event,
each_simulation = 1, bootstrap_sample = FALSE, verbose = TRUE
)}
bootstrap_results <- lapply(1:generic_input_parameters$simulations,
function(each_simulation) {
calculate_actual_predicted(
prepared_datasets = prepared_datasets,
outcome_name = generic_input_parameters$outcome_name,
outcome_type = generic_input_parameters$outcome_type,
outcome_time = generic_input_parameters$outcome_time,
outcome_count = generic_input_parameters$outcome_count,
develop_model = specific_input_parameters_each_analysis$develop_model,
predetermined_model_text =
specific_input_parameters_each_analysis$predetermined_model_text,
mandatory_predictors = specific_input_parameters_each_analysis$mandatory_predictors,
optional_predictors = specific_input_parameters_each_analysis$optional_predictors,
mandatory_interactions = specific_input_parameters_each_analysis$mandatory_interactions,
optional_interactions = specific_input_parameters_each_analysis$optional_interactions,
model_threshold_method = specific_input_parameters_each_analysis$model_threshold_method,
scoring_system = specific_input_parameters_each_analysis$scoring_system,
predetermined_threshold = specific_input_parameters_each_analysis$predetermined_threshold,
higher_values_event = specific_input_parameters_each_analysis$higher_values_event,
each_simulation = each_simulation, bootstrap_sample = TRUE, verbose = TRUE
)
})
apparent_performance <- {cbind.data.frame(
performance = "apparent", simulation = NA,
calculate_performance(
outcome_type = generic_input_parameters$outcome_type,
time = actual_predicted_results_apparent$time_all_subjects,
outcome_count = generic_input_parameters$outcome_count,
actual = actual_predicted_results_apparent$actual_all_subjects,
predicted = actual_predicted_results_apparent$predicted_all_subjects,
develop_model = specific_input_parameters_each_analysis$develop_model,
lp = actual_predicted_results_apparent$lp_all_subjects
))}
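# An illustrative continuation (not part of the original example): the
# bootstrap performance could be assembled from bootstrap_results in the
# same way, assuming the same element names as above
bootstrap_performance <- do.call(rbind, lapply(
  seq_along(bootstrap_results), function(each_simulation) {
    cbind.data.frame(
      performance = "bootstrap", simulation = each_simulation,
      calculate_performance(
        outcome_type = generic_input_parameters$outcome_type,
        time = bootstrap_results[[each_simulation]]$time_all_subjects,
        outcome_count = generic_input_parameters$outcome_count,
        actual = bootstrap_results[[each_simulation]]$actual_all_subjects,
        predicted = bootstrap_results[[each_simulation]]$predicted_all_subjects,
        develop_model = specific_input_parameters_each_analysis$develop_model,
        lp = bootstrap_results[[each_simulation]]$lp_all_subjects
      ))
  }))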