calculate_performance: Calculate performance of prognostic models

View source: R/calculate_performance.R


Calculate performance of prognostic models

Description

This function calculates the different performance measures of prognostic models and factors. Please see below for more details.

Usage

calculate_performance(outcome_type, time, outcome_count, actual, predicted,
develop_model, lp)

Arguments

outcome_type

One of 'binary', 'time-to-event', 'quantitative'. Count outcomes are included in the 'quantitative' outcome type and can be differentiated from continuous outcomes by specifying outcome_count as TRUE. Please see the examples below.

time

Times at which the outcome was measured. This is applicable only for the 'time-to-event' outcome type. For other outcome types, enter NA.

outcome_count

TRUE if the outcome was a count outcome and FALSE otherwise.

actual

A vector of actual values.

predicted

A vector of predicted values.

develop_model

TRUE, if a model was developed; FALSE, if a scoring system with a predetermined threshold (if applicable) was used.

lp

A vector of linear predictors (applicable only if a model was developed).

Details

General comment: Most of the input parameters are already available from the generic and specific input parameters created using create_generic_input_parameters and create_specific_input_parameters. This function is used by the compile_results function, which supplies the correct input parameters based on the entries made while using create_generic_input_parameters and create_specific_input_parameters and on the output from calculate_actual_predicted.

Performance measures: The performance was measured by the following parameters.

Accuracy: The number of correct predictions divided by the number of participants in whom the predictions were made (Rainio et al., 2024).

Calibration: Three measures of calibration are used.

Observed/expected ratio: Please see Riley et al., 2024. Values closer to 1 are better; ratios < 1 indicate overestimation of risk by the model, while ratios > 1 indicate underestimation of risk by the model (Riley et al., 2024).
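As an illustration only (not the package's internal code), accuracy and the observed/expected ratio can be computed along the following lines for a binary outcome; all object names and values here are hypothetical.

actual <- c(0, 1, 1, 0, 1, 0, 1, 0)                           # illustrative observed outcomes
predicted_risk <- c(0.2, 0.8, 0.4, 0.1, 0.9, 0.6, 0.7, 0.3)   # illustrative predicted risks
predicted_class <- as.numeric(predicted_risk >= 0.5)          # predicted classes at an illustrative threshold
accuracy <- mean(predicted_class == actual)                   # correct predictions / number of participants
oe_ratio <- sum(actual) / sum(predicted_risk)                 # observed events / expected events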

We treated underestimation and overestimation equally, i.e., an observed-expected ratio of 0.8 was considered equivalent to 1/0.8 = 1.25. Therefore, we converted the observed-expected ratios to be in the same direction (less than 1) ('modified observed-expected ratio'). This ensured that, while calculating the test performance and bootstrap performance, lower numbers consistently indicated worse test performance (as they are more distant from 1) and higher numbers consistently indicated better performance (noting that the maximum value of the modified observed-expected ratio was 1). This modification also helps when interpreting comparisons between different models, some of which may overestimate the risk while others may underestimate it.

For assessing the calibration, when the expected events were zero, 0.5 was added to both the observed events and the expected events.
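Continuing the illustration above, a minimal sketch of the 'modified observed-expected ratio', including the 0.5 correction when the expected events are zero (names are illustrative, not the package internals):

observed <- sum(actual)            # observed events
expected <- sum(predicted_risk)    # expected events (sum of predicted risks)
if (expected == 0) {               # 0.5 correction described above
  observed <- observed + 0.5
  expected <- expected + 0.5
}
oe_ratio <- observed / expected
modified_oe_ratio <- min(oe_ratio, 1 / oe_ratio)  # fold ratios above 1 to below 1 so that lower = worse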

Calibration intercept and calibration slope: The calibration slope quantifies the spread of the risk probabilities in relation to the observed events (Stevens et al., 2020). We used the methods described by Riley et al., 2024 to calculate the calibration intercept and slope for all outcomes other than time-to-event outcomes. Essentially, this involves the following regression equation: Y = calibration intercept + coefficient * linear predictor, where 'Y' is the log odds of the observed event for binary outcomes, the log risk of the observed event for count outcomes, and the untransformed outcome for continuous outcomes.
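For a binary outcome, this regression can be sketched as follows, continuing the illustrative vectors above. This is only an outline of the general approach, not the package's exact code, and the linear predictor lp is constructed here purely for illustration.

lp <- qlogis(predicted_risk)                              # illustrative linear predictor on the log-odds scale
calibration_fit <- glm(actual ~ lp, family = binomial)    # Y = intercept + coefficient * linear predictor
calibration_intercept <- unname(coef(calibration_fit)[1])
calibration_slope <- unname(coef(calibration_fit)[2])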

Estimation for time-to-event outcomes is a lot more uncertain and should be considered experimental. Please note that Cox regression does not have a separate intercept, as the intercept is included in the baseline hazard (SAS Support, 2017).

Values closer to 1 indicate better performance when the intercept is close to 0; values further away from 1 indicate that the predictions are incorrect in some ranges (Van Calster et al., 2019; Riley et al., 2024; Stevens et al., 2020). The further away from 1, the worse the relationship between the linear predictor and the log odds of the observed event, the log hazard, the log risk of the observed event, or the untransformed outcome.

To allow easy comparison, with lower values indicating closer to 1 and higher values indicating further away from 1, this function also calculates the 'modified calibration slope' using the following formula: modified calibration slope = |1 - calibration slope|.

The calibration intercept (also called "calibration-in-the-large") in the calibration regression equation evaluates whether the observed event proportion equals the average predicted risk (Van Calster et al., 2019).

This function also calculates the 'modified calibration intercept' as the absolute value of the calibration intercept, so that lower values indicate being closer to 0 and higher values indicate being further away from 0.
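The two 'modified' quantities follow directly from the definitions above (continuing the illustrative calibration fit):

modified_calibration_slope <- abs(1 - calibration_slope)
modified_calibration_intercept <- abs(calibration_intercept)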

C-statistic: This is the area under the ROC curve and a measure of discrimination (Riley et al., 2025). It is calculated using roc. Higher values indicate better performance.
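Assuming that roc here refers to pROC::roc (an assumption; see 'See Also'), the C-statistic can be obtained along the following lines, continuing the illustrative vectors above.

library(pROC)                                                  # assumed source of roc()
roc_obj <- roc(response = actual, predictor = predicted_risk)  # observed outcomes and predicted risks
c_statistic <- as.numeric(auc(roc_obj))                        # area under the ROC curve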

Value

output

A data frame of the calculated evaluation parameters.

Author(s)

Kurinchi Gurusamy

References

Rainio O, Teuho J, Klén R. Evaluation metrics and statistical tests for machine learning. Scientific Reports. 2024;14(1):6086.

Riley RD, Archer L, Snell KIE, Ensor J, Dhiman P, Martin GP, et al. Evaluation of clinical prediction models (part 2): how to undertake an external validation study. BMJ. 2024;384:e074820.

SAS Support. https://support.sas.com/kb/24/457.html (accessed on 16 January 2026).

Stevens RJ, Poppe KK. Validation of clinical prediction models: what does the "calibration slope" really measure? Journal of Clinical Epidemiology. 2020;118:93-9.

Van Calster B, McLernon DJ, van Smeden M, Wynants L, Steyerberg EW, Bossuyt P, et al. Calibration: the Achilles heel of predictive analytics. BMC Medicine. 2019;17(1):230.

See Also

roc

Examples

library(survival)
colon$status <- factor(as.character(colon$status))
# For testing, only 5 simulations are used here. Usually at least 300 to 500
# simulations should be regarded as a minimum; increasing the number of simulations
# leads to more reliable results. The default value of 2000 simulations should
# provide reasonably reliable results.
generic_input_parameters <- create_generic_input_parameters(
  general_title = "Prediction of colon cancer death", simulations = 5,
  simulations_per_file = 20, seed = 1, df = colon, outcome_name = "status",
  outcome_type = "time-to-event", outcome_time = "time", outcome_count = FALSE,
  verbose = FALSE)$generic_input_parameters
analysis_details <- cbind.data.frame(
  name = c('age', 'single_mandatory_predictor', 'complex_models',
           'complex_models_only_optional_predictors', 'predetermined_model_text'),
  analysis_title = c('Simple cut-off based on age', 'Single mandatory predictor (rx)',
                     'Multiple mandatory and optional predictors',
                     'Multiple optional predictors only', 'Predetermined model text'),
  develop_model = c(FALSE, TRUE, TRUE, TRUE, TRUE),
  predetermined_model_text = c(NA, NA, NA, NA,
  "cph(Surv(time, status) ~ rx * age, data = df_training_complete, x = TRUE, y = TRUE)"),
  mandatory_predictors = c(NA, 'rx', 'rx; differ; perfor; adhere; extent', NA, "rx; age"),
  optional_predictors = c(NA, NA, 'sex; age; nodes', 'rx; differ; perfor', NA),
  mandatory_interactions = c(NA, NA, 'rx; differ; extent', NA, NA),
  optional_interactions = c(NA, NA, 'perfor; adhere; sex; age; nodes', 'rx; differ', NA),
  model_threshold_method = c(NA, 'youden', 'youden', 'youden', 'youden'),
  scoring_system = c('age', NA, NA, NA, NA),
  predetermined_threshold = c('60', NA, NA, NA, NA),
  higher_values_event = c(TRUE, NA, NA, NA, NA)
)
write.csv(analysis_details, paste0(tempdir(), "/analysis_details.csv"),
          row.names = FALSE, na = "")
analysis_details_path <- paste0(tempdir(), "/analysis_details.csv")
# verbose is TRUE by default. If you do not want the output displayed, you can
# change this to FALSE
results <- create_specific_input_parameters(
  generic_input_parameters = generic_input_parameters,
  analysis_details_path = analysis_details_path, verbose = TRUE)
specific_input_parameters <- results$specific_input_parameters
# Set a seed for reproducibility - Please see details above
set.seed(generic_input_parameters$seed)
prepared_datasets <- {prepare_datasets(
  df = generic_input_parameters$df,
  simulations = generic_input_parameters$simulations,
  outcome_name = generic_input_parameters$outcome_name,
  outcome_type = generic_input_parameters$outcome_type,
  outcome_time = generic_input_parameters$outcome_time,
  verbose = TRUE)}
# There is usually no requirement to call this function directly. It is used
# by the perform_analysis function to create the actual and predicted values.
specific_input_parameters_each_analysis <- specific_input_parameters[[1]]
actual_predicted_results_apparent <- {calculate_actual_predicted(
      prepared_datasets = prepared_datasets,
      outcome_name = generic_input_parameters$outcome_name,
      outcome_type = generic_input_parameters$outcome_type,
      outcome_time = generic_input_parameters$outcome_time,
      outcome_count = generic_input_parameters$outcome_count,
      develop_model = specific_input_parameters_each_analysis$develop_model,
      predetermined_model_text =
      specific_input_parameters_each_analysis$predetermined_model_text,
      mandatory_predictors = specific_input_parameters_each_analysis$mandatory_predictors,
      optional_predictors = specific_input_parameters_each_analysis$optional_predictors,
      mandatory_interactions = specific_input_parameters_each_analysis$mandatory_interactions,
      optional_interactions = specific_input_parameters_each_analysis$optional_interactions,
      model_threshold_method = specific_input_parameters_each_analysis$model_threshold_method,
      scoring_system = specific_input_parameters_each_analysis$scoring_system,
      predetermined_threshold = specific_input_parameters_each_analysis$predetermined_threshold,
      higher_values_event = specific_input_parameters_each_analysis$higher_values_event,
      each_simulation = 1, bootstrap_sample = FALSE, verbose = TRUE
    )}
bootstrap_results <- lapply(1:generic_input_parameters$simulations,
  function(each_simulation) {
  calculate_actual_predicted(
    prepared_datasets = prepared_datasets,
    outcome_name = generic_input_parameters$outcome_name,
    outcome_type = generic_input_parameters$outcome_type,
    outcome_time = generic_input_parameters$outcome_time,
    outcome_count = generic_input_parameters$outcome_count,
    develop_model = specific_input_parameters_each_analysis$develop_model,
    predetermined_model_text =
      specific_input_parameters_each_analysis$predetermined_model_text,
    mandatory_predictors = specific_input_parameters_each_analysis$mandatory_predictors,
    optional_predictors = specific_input_parameters_each_analysis$optional_predictors,
    mandatory_interactions = specific_input_parameters_each_analysis$mandatory_interactions,
    optional_interactions = specific_input_parameters_each_analysis$optional_interactions,
    model_threshold_method = specific_input_parameters_each_analysis$model_threshold_method,
    scoring_system = specific_input_parameters_each_analysis$scoring_system,
    predetermined_threshold = specific_input_parameters_each_analysis$predetermined_threshold,
    higher_values_event = specific_input_parameters_each_analysis$higher_values_event,
    each_simulation = each_simulation, bootstrap_sample = TRUE, verbose = TRUE
  )
})
apparent_performance <- {cbind.data.frame(
  performance = "apparent", simulation = NA,
  calculate_performance(
    outcome_type = generic_input_parameters$outcome_type,
    time = actual_predicted_results_apparent$time_all_subjects,
    outcome_count = generic_input_parameters$outcome_count,
    actual = actual_predicted_results_apparent$actual_all_subjects,
    predicted = actual_predicted_results_apparent$predicted_all_subjects,
    develop_model = specific_input_parameters_each_analysis$develop_model,
    lp = actual_predicted_results_apparent$lp_all_subjects
  ))}
