calculate_actual_predicted: Calculate actual and predicted values

View source: R/calculate_actual_predicted.R

Description

This takes the datasets, prepared using the prepare_datasets function and the input parameters described below, creates a model, runs the model, and predicts the outcome.

Usage

calculate_actual_predicted(prepared_datasets, outcome_name, outcome_type,
outcome_time, outcome_count, develop_model, predetermined_model_text,
mandatory_predictors, optional_predictors, mandatory_interactions,
optional_interactions, model_threshold_method, scoring_system,
predetermined_threshold, higher_values_event, each_simulation,
bootstrap_sample, verbose)

Arguments

prepared_datasets

Datasets prepared using the prepare_datasets function.

outcome_name

Name of the column that contains the outcome data. This must be a column name in the 'df' provided as input.

outcome_type

One of 'binary', 'time-to-event', 'quantitative'. Count outcomes are included in 'quantitative' outcome type and can be differentiated from continuous outcomes by specifying outcome_count as TRUE. Please see examples below.

outcome_time

The name of the column that provides the follow-up time. This is applicable only for 'time-to-event' outcome. For other outcome types, enter NA.

outcome_count

TRUE if the outcome was a count outcome and FALSE otherwise.

develop_model

TRUE, if you want to develop a model; FALSE, if you want to use a scoring system with a predetermined threshold (if applicable).

predetermined_model_text

You can let the function create the model text from the mandatory and optional predictors and interactions or, for finer control of the model, you can provide the model text directly.

mandatory_predictors

Predictors that must be included in the model. These should be provided even if you provide the 'predetermined_model_text'.

optional_predictors

Optional predictors that may be included in the model by stepwise selection. These should be provided even if you provide the 'predetermined_model_text'.

mandatory_interactions

Interactions that must be included in the model. These should be provided even if you provide the 'predetermined_model_text'.

optional_interactions

Optional interactions that may be included in the model by stepwise selection. These should be provided even if you provide the 'predetermined_model_text'.

model_threshold_method

One of 'youden', 'topleft', 'heuristic'. Please see the Details section below.

scoring_system

Name of the pre-existing scoring system. This is ignored if develop_model is TRUE.

predetermined_threshold

Pre-determined threshold of the pre-existing scoring system. This is mandatory when develop_model is FALSE and when the outcome_type is 'binary' or 'time-to-event'. This is ignored if develop_model is TRUE or when the outcome_type is 'quantitative'.

higher_values_event

TRUE if higher values of the pre-existing system indicate an event and FALSE otherwise. This is mandatory when develop_model is FALSE and when the outcome_type is 'binary' or 'time-to-event'. This is ignored if develop_model is TRUE or when the outcome_type is 'quantitative'.

each_simulation

The number of the simulation in the prepared datasets. Please see prepare_datasets.

bootstrap_sample

TRUE if you are calculating the bootstrap and test performance and FALSE if you are calculating the apparent performance. Please see below and Collins et al, 2024.

verbose

TRUE if the progress must be displayed and FALSE otherwise.

Details

General comment

Most of the input parameters are already available from the generic and specific input parameters created using create_generic_input_parameters and create_specific_input_parameters. This function is used by the perform_analysis function, which provides the correct input parameters based on the entries provided in create_generic_input_parameters and create_specific_input_parameters.

Overview

This is a form of enhanced bootstrapping internal validation approach to calculate the optimism-corrected performance measures described by Collins et al., 2024. This involves:

1. calculating the apparent performance by developing the model in the entire dataset;
2. repeated sampling with replacement (bootstrap sample);
3. evaluating the performance of the model in each simulation of the bootstrap sample (bootstrap performance);
4. evaluating the performance of the model (developed in the bootstrap sample of each simulation) on the 'test sample', i.e., all the subjects in the dataset from which the bootstrap sample was obtained (test performance);
5. calculating the optimism as the difference between bootstrap performance and test performance in each simulation;
6. calculating the average optimism; and
7. subtracting the average optimism from the apparent performance to obtain the optimism-corrected performance measures (Collins et al., 2024).
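
The optimism-correction procedure can be sketched as follows. This is an illustration only, assuming hypothetical helper functions fit_model() and performance() (neither is part of this package):

```r
# Illustrative sketch of enhanced bootstrapping (Collins et al., 2024).
# fit_model() and performance() are hypothetical placeholders.
apparent <- performance(fit_model(df), df)       # apparent performance

optimism <- replicate(500, {
  boot <- df[sample(nrow(df), replace = TRUE), ] # bootstrap sample
  m <- fit_model(boot)
  performance(m, boot) - performance(m, df)      # bootstrap minus test performance
})

corrected <- apparent - mean(optimism)           # optimism-corrected performance
```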

The model development is performed using glm for all outcomes other than time-to-event outcomes and coxph for time-to-event outcomes. You can either provide a model text that you have developed, for finer control of the interactions to be considered or included, or you can let the function build the model text based on the mandatory and optional predictors and interactions.

Linear predictors

The linear predictor describes the relationship between the outcome and the predictors, and is a function of the covariate (predictor) values and the regression coefficients. It can be described by the following relation: linear predictor = alpha + beta_predictors * predictors + beta_(predictor_interactions) (if interactions between predictors are included) + error.

However, except for linear regression, the linear predictor must be transformed to obtain the outcome. This is because of the way generalised linear regression attempts to create a linear relationship between the outcome and predictors.

If 'Y' is the outcome, the linear predictor is 'logit Y' for binary outcomes; therefore, inverse logit transformation must be performed to convert the linear predictor to obtain the probability of an outcome. For count outcomes, the linear predictor is 'log Y'; therefore, exponential transformation is required to obtain the predicted number of events.
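
As a minimal illustration of these back-transformations in base R (the linear predictor value here is arbitrary):

```r
# Back-transforming a linear predictor 'lp' to the outcome scale.
lp <- 0.8                  # hypothetical linear predictor value

# Binary outcome: inverse logit gives the probability of the event
prob_binary <- plogis(lp)  # equivalently exp(lp) / (1 + exp(lp))

# Count outcome: exponentiation gives the predicted number of events
predicted_count <- exp(lp)
```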

For time-to-event outcomes, the linear predictor gives the hazard of an event at various time points for a subject at the given covariate levels. The function basehaz provides a more clinically meaningful quantity: the cumulative hazard of an event by time 't', denoted 'H(t)', at each time point 't' for each subject in the 'training' set. The cumulative hazard of the event by time 't' of a new subject can be calculated using the relation mentioned in the description of basehaz (please see the section on calculating H(t;x)). Using this relation, one can find the time point in the output of basehaz closest to the follow-up time of the 'new' subject, take the corresponding cumulative hazard for the 'first' subject (or any other subject for whom the cumulative hazard at each time point is available), and use the differences in covariate values between the 'new' subject and the 'first' subject to calculate the cumulative hazard of the event by the follow-up time of the new subject. The survival probability is exp(-H(t)) (Simon et al., 2024); therefore, one can calculate the probability of an event by time 't' as 1 - exp(-H(t)).
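
A minimal sketch of this calculation using the survival package (the dataset, model, and time point here are illustrative, not part of the package):

```r
library(survival)

# Illustrative Cox model on the lung dataset shipped with survival
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)
bh <- basehaz(fit, centered = TRUE)  # data frame with columns 'hazard' and 'time'

# Cumulative hazard at the time point closest to t = 365 days
# (centered = TRUE gives the baseline at the mean covariate values; for a
# specific subject this would be scaled by exp of their centred linear predictor)
H_t <- bh$hazard[which.min(abs(bh$time - 365))]

# Survival probability and probability of event by time t (Simon et al., 2024)
surv_prob  <- exp(-H_t)
event_prob <- 1 - exp(-H_t)
```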

For continuous outcomes, no transformation of the linear predictor is required to obtain the outcome.

Obtaining linear predictors

In regression models, we can get the transformed values of the linear predictors (lp) (i.e., after inverse logit transformation for binary outcomes and exponentiation for count outcomes) from the regression model directly. For example, using type = "response" in the predict function gives this information directly for all outcome types other than time-to-event outcomes, which are analysed with coxph. For time-to-event outcomes, type = "expected" gives the cumulative hazard by 't' after adjusting for the covariates (predict.coxph), from which one can estimate the probability of the event by time 't' using the relations described above.
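
A minimal sketch of these two forms of predict (illustrative models and datasets, not part of the package):

```r
library(survival)

# Binary outcome: type = "response" returns probabilities after inverse logit
fit_glm <- glm(am ~ mpg + wt, data = mtcars, family = binomial)
probs <- predict(fit_glm, newdata = mtcars, type = "response")

# Time-to-event outcome: type = "expected" returns the cumulative hazard by
# each subject's follow-up time, from which the event probability follows
fit_cox <- coxph(Surv(time, status) ~ age + sex, data = lung)
H_t <- predict(fit_cox, type = "expected")
event_prob <- 1 - exp(-H_t)
```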

Missing linear predictors

When predicting using the regression models directly as described above, there must be no missing data for the predictors included in the model. One possibility is not to make a prediction at all. However, in real life some of these predictors will be missing and a decision must still be made. One possibility is multiple imputation; however, some assumptions about the missing data can be difficult to verify (Heymans et al., 2022). Another possibility is to exclude the missing predictor (whose value is missing) from the regression equation. Although the coefficient values would have been different without the predictor, it is impossible to develop and validate models for all scenarios of missing predictors. This function calculates the linear predictor by excluding the predictors that contain missing data (for that subject), using the regression model developed on subjects without missing data. If a coefficient value in the model is NA (which should alert people to overfitting or levels with sparse data), the variable level itself is removed from the calculation of the linear predictor. To a large extent, this method assumes that external validation will be performed before changing clinical practice, and the application of this method compared with other methods of handling missing data must be assessed as part of external validation.
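
The principle of dropping missing-predictor terms can be sketched as follows. This is an illustration of the idea, not the package's internal code; the model and subject values are hypothetical:

```r
# Compute a linear predictor manually from the model coefficients, dropping
# terms whose predictor value is missing or whose coefficient is NA.
fit <- glm(am ~ mpg + wt + hp, data = mtcars, family = binomial)
coefs <- coef(fit)

# Hypothetical new subject with 'wt' missing; 1 multiplies the intercept
new_subject <- c("(Intercept)" = 1, mpg = 21, wt = NA, hp = 110)

terms <- coefs * new_subject[names(coefs)]
lp <- sum(terms[!is.na(terms)])  # excludes missing-predictor and NA-coefficient terms
prob <- plogis(lp)               # inverse logit for a binary outcome
```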

Conversion of probabilities of event (linear predictors) for binary and time-to-event outcomes to event versus no event

There are multiple ways of converting probabilities of event (linear predictors) for binary and time-to-event outcomes to event versus no event. For example, one can consider that the probabilities of event are drawn from a binomial distribution for binary outcomes. Alternatively, one can choose an 'optimal threshold' (on the training set) using the roc and coords functions. There are two types of threshold calculated by the coords function: 'Youden' and 'closest to top left'. For further information, please see coords. Occasionally, it may not be possible to obtain the threshold using roc and coords. A function that performs a rough estimation of the threshold based on prevalence is included in the source code of this function (please see the 'calculate_heuristic_threshold' function, included as part of this function).
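
A minimal sketch of threshold selection with the pROC package (illustrative model and data, not part of the package):

```r
library(pROC)

# Illustrative binary-outcome model and training-set probabilities
fit <- glm(am ~ mpg + wt, data = mtcars, family = binomial)
probs <- predict(fit, type = "response")

roc_obj <- roc(mtcars$am, probs, quiet = TRUE)

# The two threshold types mentioned above
threshold_youden  <- coords(roc_obj, "best", best.method = "youden",
                            ret = "threshold")
threshold_topleft <- coords(roc_obj, "best", best.method = "closest.topleft",
                            ret = "threshold")

# Convert probabilities to event (1) versus no event (0)
predicted_event <- as.integer(probs >= threshold_youden$threshold)
```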

Intercept-slope adjustment

In regression models, the intercept and slope can be adjusted (Van Calster et al., 2019). The calibration intercept and slope are calculated according to the supplement of Van Calster et al., 2019. The paper provides details only for logistic regression, but the procedures are based on glm, i.e., they are applicable to glm models. The calibration regression equation is Y = calibration_intercept + calibration_slope * linear predictor.

Note that the linear predictor in this equation is used as a variable rather than as an offset term, as is done when calculating the calibration intercept only. The linear predictors must be back-transformed to the original scale before their use in the calibration regression equation.
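
A minimal sketch of this calibration regression for a binary outcome (illustrative model and data; note that on the development data itself the fitted slope will be 1 by construction, so the adjustment only becomes meaningful when the linear predictor comes from another sample):

```r
# Logistic calibration: regress the observed outcome on the linear predictor.
fit <- glm(am ~ mpg + wt, data = mtcars, family = binomial)
lp <- predict(fit, type = "link")  # linear predictor on the logit scale

calib_df <- data.frame(am = mtcars$am, lp = lp)
calibration_fit <- glm(am ~ lp, data = calib_df, family = binomial)

calibration_intercept <- coef(calibration_fit)[1]
calibration_slope     <- coef(calibration_fit)[2]

# Calibration-adjusted linear predictor and probability
lp_adjusted   <- calibration_intercept + calibration_slope * lp
prob_adjusted <- plogis(lp_adjusted)
```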

Robust methods for calibration slope adjustment for time-to-event outcomes are still being developed. Until such methods become widely available, this function uses principles similar to those described for binary outcomes. These should be considered experimental until further evaluation of the performance of calibration adjustment in external samples. It should be noted, however, that for time-to-event outcomes, Cox regression does not have a separate intercept, as the intercept is included in the baseline hazard (SAS Support, 2017). Therefore, for time-to-event outcomes, there is no change to the intercept, but there is a change to the slope when calibration-adjusted models are created.

Model with only the mandatory predictors but based on the coefficients of the entire model

This is solely for research purposes. A potential use of such a model, with only the mandatory predictors but based on the coefficients of the entire model, is to find the added value of measuring the optional predictors, particularly when there is a single mandatory predictor, for example, a treatment. It would be practically impossible to develop all the possible models with missing optional predictors. This model has the potential to provide predictions in this situation.

Value

actual_training

Actual values in the training sample.

predicted_training

Predicted values in the training sample.

predicted_training_calibration_adjusted

Predicted values after calibration adjustment.

predicted_training_adjusted_mandatory_predictors_only

Predicted values of a model with only the mandatory predictors, but based on the coefficients of the entire model.

actual_only_validation

Actual values in the 'out-of-sample' subjects, i.e., the subjects excluded from the model development in each simulation.

predicted_only_validation

Predicted values in the 'out-of-sample' subjects.

predicted_only_validation_calibration_adjusted

Predicted values in the 'out-of-sample' subjects after calibration adjustment.

predicted_only_validation_adjusted_mandatory_predictors_only

Predicted values in the out-of-sample subjects using a model with only the mandatory predictors, but based on the coefficients of the entire model.

actual_all_subjects

Actual values in all subjects with outcomes.

predicted_all_subjects

Predicted values in all subjects with outcomes.

predicted_all_subjects_calibration_adjusted

Predicted values in all subjects with outcomes after calibration adjustment.

predicted_all_subjects_adjusted_mandatory_predictors_only

Predicted values in all subjects using a model with only the mandatory predictors, but based on the coefficients of the entire model.

lp_training

Linear predictors in the 'training' sample.

lp_only_validation

Linear predictors in the 'out-of-sample' subjects.

lp_all_subjects

Linear predictors in all subjects with outcomes.

lp_training_calibration_adjusted

Linear predictors in the training sample after calibration adjustment.

lp_only_validation_calibration_adjusted

Linear predictors in the 'out-of-sample' subjects after calibration adjustment.

lp_all_subjects_calibration_adjusted

Linear predictors in all subjects with outcomes after calibration adjustment.

lp_training_adjusted_mandatory_predictors_only

Linear predictors in the training sample using a model with only the mandatory predictors, but based on the coefficients of the entire model.

lp_only_validation_adjusted_mandatory_predictors_only

Linear predictors in the 'out-of-sample' subjects using a model with only the mandatory predictors, but based on the coefficients of the entire model.

lp_all_subjects_adjusted_mandatory_predictors_only

Linear predictors in all subjects with outcomes using a model with only the mandatory predictors, but based on the coefficients of the entire model.

time_training

Follow-up time in the training sample (applicable only for time-to-event outcomes).

time_only_validation

Follow-up time in the 'out-of-sample' subjects (applicable only for time-to-event outcomes).

time_all_subjects

Follow-up time in all subjects with outcomes (applicable only for time-to-event outcomes).

regression_model

The regression model.

html_file

Some output in HTML format, which will be used for the final output.

outcome

Whether calculations could be made.

Author(s)

Kurinchi Gurusamy

References

Collins GS, Dhiman P, Ma J, Schlussel MM, Archer L, Van Calster B, et al. Evaluation of clinical prediction models (part 1): from development to external validation. BMJ. 2024;384:e074819.

Heymans MW, Twisk JWR. Handling missing data in clinical research. J Clin Epidemiol. 2022 Nov;151:185-188.

SAS Support. https://support.sas.com/kb/24/457.html (accessed on 16 January 2026).

Simon G, Aliferis C. Appendix A: Models for Time-to-Event Outcomes. In: Simon GJ, Aliferis C, editors. Artificial Intelligence and Machine Learning in Health Care and Medical Sciences: Best Practices and Pitfalls [Internet]. Cham (CH): Springer; 2024. https://www.ncbi.nlm.nih.gov/books/NBK610554/ (accessed on 13 December 2025).

Van Calster B, McLernon DJ, van Smeden M, Wynants L, Steyerberg EW, Bossuyt P, et al. Calibration: the Achilles heel of predictive analytics. BMC Medicine. 2019;17(1):230.

See Also

prepare_datasets glm predict coxph basehaz predict.coxph roc coords

Examples

library(survival)
colon$status <- factor(as.character(colon$status))
# For testing, only 5 simulations are used here. Usually, at least 300 to 500
# simulations should be used; increasing the simulations leads to more reliable results.
# The default value of 2000 simulations should provide reasonably reliable results.
generic_input_parameters <- create_generic_input_parameters(
  general_title = "Prediction of colon cancer death", simulations = 5,
  simulations_per_file = 20, seed = 1, df = colon, outcome_name = "status",
  outcome_type = "time-to-event", outcome_time = "time", outcome_count = FALSE,
  verbose = FALSE)$generic_input_parameters
analysis_details <- cbind.data.frame(
  name = c('age', 'single_mandatory_predictor', 'complex_models',
           'complex_models_only_optional_predictors', 'predetermined_model_text'),
  analysis_title = c('Simple cut-off based on age', 'Single mandatory predictor (rx)',
                     'Multiple mandatory and optional predictors',
                     'Multiple optional predictors only', 'Predetermined model text'),
  develop_model = c(FALSE, TRUE, TRUE, TRUE, TRUE),
  predetermined_model_text = c(NA, NA, NA, NA,
  "cph(Surv(time, status) ~ rx * age, data = df_training_complete, x = TRUE, y = TRUE)"),
  mandatory_predictors = c(NA, 'rx', 'rx; differ; perfor; adhere; extent', NA, "rx; age"),
  optional_predictors = c(NA, NA, 'sex; age; nodes', 'rx; differ; perfor', NA),
  mandatory_interactions = c(NA, NA, 'rx; differ; extent', NA, NA),
  optional_interactions = c(NA, NA, 'perfor; adhere; sex; age; nodes', 'rx; differ', NA),
  model_threshold_method = c(NA, 'youden', 'youden', 'youden', 'youden'),
  scoring_system = c('age', NA, NA, NA, NA),
  predetermined_threshold = c('60', NA, NA, NA, NA),
  higher_values_event = c(TRUE, NA, NA, NA, NA)
)
write.csv(analysis_details, paste0(tempdir(), "/analysis_details.csv"),
          row.names = FALSE, na = "")
analysis_details_path <- paste0(tempdir(), "/analysis_details.csv")
# verbose is TRUE as default. If you do not want the outcome displayed, you can
# change this to FALSE
results <- create_specific_input_parameters(
  generic_input_parameters = generic_input_parameters,
  analysis_details_path = analysis_details_path, verbose = TRUE)
specific_input_parameters <- results$specific_input_parameters
# Set a seed for reproducibility - Please see details above
set.seed(generic_input_parameters$seed)
prepared_datasets <- prepare_datasets(
  df = generic_input_parameters$df,
  simulations = generic_input_parameters$simulations,
  outcome_name = generic_input_parameters$outcome_name,
  outcome_type = generic_input_parameters$outcome_type,
  outcome_time = generic_input_parameters$outcome_time,
  verbose = TRUE)
# There is usually no requirement to call this function directly. This is used
# by the perform_analysis function to create the actual and predicted values.
specific_input_parameters_each_analysis <- specific_input_parameters[[1]]
actual_predicted_results_apparent <- calculate_actual_predicted(
  prepared_datasets = prepared_datasets,
  outcome_name = generic_input_parameters$outcome_name,
  outcome_type = generic_input_parameters$outcome_type,
  outcome_time = generic_input_parameters$outcome_time,
  outcome_count = generic_input_parameters$outcome_count,
  develop_model = specific_input_parameters_each_analysis$develop_model,
  predetermined_model_text =
    specific_input_parameters_each_analysis$predetermined_model_text,
  mandatory_predictors = specific_input_parameters_each_analysis$mandatory_predictors,
  optional_predictors = specific_input_parameters_each_analysis$optional_predictors,
  mandatory_interactions = specific_input_parameters_each_analysis$mandatory_interactions,
  optional_interactions = specific_input_parameters_each_analysis$optional_interactions,
  model_threshold_method = specific_input_parameters_each_analysis$model_threshold_method,
  scoring_system = specific_input_parameters_each_analysis$scoring_system,
  predetermined_threshold = specific_input_parameters_each_analysis$predetermined_threshold,
  higher_values_event = specific_input_parameters_each_analysis$higher_values_event,
  each_simulation = 1, bootstrap_sample = FALSE, verbose = TRUE
)
bootstrap_results <- lapply(1:generic_input_parameters$simulations,
  function(each_simulation) {
  calculate_actual_predicted(
    prepared_datasets = prepared_datasets,
    outcome_name = generic_input_parameters$outcome_name,
    outcome_type = generic_input_parameters$outcome_type,
    outcome_time = generic_input_parameters$outcome_time,
    outcome_count = generic_input_parameters$outcome_count,
    develop_model = specific_input_parameters_each_analysis$develop_model,
    predetermined_model_text =
      specific_input_parameters_each_analysis$predetermined_model_text,
    mandatory_predictors = specific_input_parameters_each_analysis$mandatory_predictors,
    optional_predictors = specific_input_parameters_each_analysis$optional_predictors,
    mandatory_interactions = specific_input_parameters_each_analysis$mandatory_interactions,
    optional_interactions = specific_input_parameters_each_analysis$optional_interactions,
    model_threshold_method = specific_input_parameters_each_analysis$model_threshold_method,
    scoring_system = specific_input_parameters_each_analysis$scoring_system,
    predetermined_threshold = specific_input_parameters_each_analysis$predetermined_threshold,
    higher_values_event = specific_input_parameters_each_analysis$higher_values_event,
    each_simulation = each_simulation, bootstrap_sample = TRUE, verbose = TRUE
  )
})

EQUALPrognosis documentation built on Feb. 4, 2026, 5:15 p.m.