Nothing
#' @title Create Nonlinear Probability of Survival Bins
#'
#' @description
#'
#' This function generates nonlinear bins for probability of survival data based
#' on specified thresholds and divisors as specified in Napoli et al. (2017),
#' Schroeder et al. (2019), and Kassar et al. (2016). This function calculates
#' bin statistics, including mean, standard deviation, total alive, total dead,
#' count, and percentage for each bin.
#'
#' @param data A `data.frame` or `tibble` containing the probability of survival
#' data for a set of patients.
#' @param Ps_col The name of the column containing the survival probabilities
#' (Ps). Should be numeric on a scale from 0 to 1.
#' @param outcome_col The name of the column containing the outcome data. It
#' should be binary, with values indicating patient survival. A value of `1`
#' should represent "alive" (survived), while `0` should represent "dead" (did
#' not survive). `TRUE/FALSE` are accepted as well. Ensure the column
#' contains only these possible values.
#' @param group_vars Optional grouping variables for bin statistics
#' calculations. These should be specified as quoted column names.
#' @param divisor1 A parameter to control the width of the probability of
#' survival range bins. Affects the creation of step sizes for the beginning
#' of each bin range. Defaults to `5`.
#' @param divisor2 A parameter to control the width of the probability of
#' survival range bins. Affects the creation of step sizes for the beginning
#' of each bin range. Defaults to `5`.
#' @param threshold_1 A parameter to decide where data indices will begin to
#' create step sizes. Defaults to `0.9`.
#' @param threshold_2 A parameter to decide where data indices will end to
#' create step sizes. Defaults to `0.99`.
#'
#' @details
#' Like other statistical computing functions, `nonlinear_bins()` is happiest
#' without missing data. It is best to pass complete probability of survival
#' and outcome data to the function for optimal performance. With smaller
#' datasets, this is especially helpful. However, `nonlinear_bins()` will throw
#' a warning about missing values, if any exist in `Ps_col` and/or
#' `outcome_col`.
#'
#' `nonlinear_bins()` assumes `Ps_col` contains probabilities derived from
#' real-world inputs for the Trauma Injury Severity Score (TRISS) model.
#' Synthetic or low-variability data (especially with small sample sizes) may
#' not reflect the distribution of TRISS-derived survival probabilities. This
#' can result in unstable estimates or function failure due to insufficient
#' dispersion. With small sample sizes, it may be important to use smaller
#' values with the divisor arguments and adjust the thresholds (based on the
#' distribution of the `Ps_col` values) to create bins that better accommodate
#' the data.
#'
#' By default, `nonlinear_bins()` derives bin cut points from the full dataset’s
#' distribution. This ensures comparability across groups when `group_vars` is
#' used. To tailor binning to a specific group (e.g., a single hospital), filter
#' the dataset to that subgroup before calling `nonlinear_bins()`. The function
#' will then compute bins and related statistics using only that subset’s
#' `Ps_col` distribution. When `group_vars` is used, and ff a group lacks
#' observations within one or more bins, `rm_bin_summary()` will compute
#' statistics only for the bins that contain data. Bins with no observations are
#' excluded from the summary for that group.
#'
#' @returns A list with two elements:
#' - `intervals`: A vector defining bin boundaries for probability of
#' survival.
#' - `bin_stats`: A `tibble` containing:
#' - `bin_number`: Bin index.
#' - `bin_start`, `bin_end`: Bin range.
#' - `mean`, `sd`: Mean and standard deviation of `Ps_col` within the bin.
#' - `Pred_Survivors_b`, `Pred_Deaths_b`: Predicted counts of survivors and
#' decedents, respectively.
#' - `AntiS_b`, `AntiM_b`: Anticipated proportion survived, and deceased,
#' respectively.
#' - `alive`, `dead`: Count of observed survivors and non-survivors.
#' - `count`: Total records in the bin.
#' - `percent`: Percentage of total records within each bin.
#'
#' @note
#'
#' This function will produce the most reliable and interpretable results when
#' using a dataset that has one row per patient, with each column being a
#' feature.
#'
#' The `mean` and `AntiS_b` are approximately equivalent in this context. They
#' are kept in the output for clarity.
#'
#' @export
#'
#' @references
#'
#' Kassar, O.M., Eklund, E.A., Barnhardt, W.F., Napoli, N.J., Barnes, L.E.,
#' Young, J.S. (2016). Trauma survival margin analysis: A dissection of trauma
#' center performance through initial lactate. The American Surgeon, 82(7),
#' 649-653. <doi:10.1177/000313481608200733>
#'
#' Napoli, N. J., Barnhardt, W., Kotoriy, M. E., Young, J. S., & Barnes, L. E.
#' (2017). Relative mortality analysis: A new tool to evaluate clinical
#' performance in trauma centers. IISE Transactions on Healthcare Systems
#' Engineering, 7(3), 181–191. <doi:10.1080/24725579.2017.1325948>
#'
#' Schroeder, P. H., Napoli, N. J., Barnhardt, W. F., Barnes, L. E., & Young, J.
#' S. (2018). Relative mortality analysis of the “golden hour”: A comprehensive
#' acuity stratification approach to address disagreement in current literature.
#' Prehospital Emergency Care, 23(2), 254–262.
#' <doi:10.1080/10903127.2018.1489021>
#'
#' @seealso [probability_of_survival()], [rmm()], and [rm_bin_summary()]
#'
#' @examples
#' # Generate example data
#' set.seed(123)
#'
#' # Parameters
#' # Total number of patients
#' n_patients <- 5000
#'
#' # Arbitrary group labels
#' groups <- sample(x = LETTERS[1:2], size = n_patients, replace = TRUE)
#'
#' # Trauma types
#' trauma_type_values <- sample(
#' x = c("Blunt", "Penetrating"),
#' size = n_patients,
#' replace = TRUE
#' )
#'
#' # RTS values
#' rts_values <- sample(
#' x = seq(from = 0, to = 7.8408, by = 0.005),
#' size = n_patients,
#' replace = TRUE
#' )
#'
#' # patient ages
#' ages <- sample(
#' x = seq(from = 0, to = 100, by = 1),
#' size = n_patients,
#' replace = TRUE
#' )
#'
#' # ISS scores
#' iss_scores <- sample(
#' x = seq(from = 0, to = 75, by = 1),
#' size = n_patients,
#' replace = TRUE
#' )
#'
#' # Generate survival probabilities (Ps)
#' Ps <- traumar::probability_of_survival(
#' trauma_type = trauma_type_values,
#' age = ages,
#' rts = rts_values,
#' iss = iss_scores
#' )
#'
#' # Simulate survival outcomes based on Ps
#' survival_outcomes <- rbinom(n_patients, size = 1, prob = Ps)
#'
#' # Create data frame
#' data <- data.frame(Ps = Ps, survival = survival_outcomes, groups = groups) |>
#' dplyr::mutate(death = dplyr::if_else(survival == 1, 0, 1))
#'
#' # Apply the nonlinear_bins function
#' results <- nonlinear_bins(
#' data = data,
#' Ps_col = Ps,
#' outcome_col = survival,
#' divisor1 = 4,
#' divisor2 = 4,
#' threshold_1 = 0.9,
#' threshold_2 = 0.99
#' )
#'
#' # View results
#' results$intervals
#' results$bin_stats
#'
#' # Example with grouping by a categorical variable
#'
#' # Run the function using a single grouping variable
#' results_grouped <- nonlinear_bins(
#' data,
#' Ps_col = Ps,
#' outcome_col = survival,
#' group_vars = "groups"
#' )
#'
#' # View grouped results
#' results_grouped$bin_stats
#'
#' @author Nicolas Foss, Ed.D, MS, original implementation in MATLAB by Nicholas
#' J. Napoli, Ph.D., MS
#'
nonlinear_bins <- function(
data,
Ps_col,
outcome_col,
group_vars = NULL,
divisor1 = 5,
divisor2 = 5,
threshold_1 = 0.9,
threshold_2 = 0.99
) {
# Validation checks using `cli` for robust error messaging:
# Ensures the input data is a data frame or tibble.
if (!is.data.frame(data) && !tibble::is_tibble(data)) {
cli::cli_abort("The input data must be a data frame or tibble.")
}
# Ensure Ps_col and outcome_col arguments are provided with tailored error messages
if (missing(Ps_col) && missing(outcome_col)) {
cli::cli_abort(
"Both {.var Ps_col} and {.var outcome_col} arguments must be provided."
)
} else if (missing(Ps_col)) {
cli::cli_abort("The {.var Ps_col} argument must be provided.")
} else if (missing(outcome_col)) {
cli::cli_abort("The {.var outcome_col} argument must be provided.")
}
# Check if Ps column is numeric
# dplyr::pull the Ps data
Ps_check <- data |> dplyr::pull({{ Ps_col }})
# check the Ps_check remains continuous
if (!is.numeric(Ps_check)) {
cli::cli_abort("The {.var Ps_col} must contain numeric values.")
}
# Check if Ps column is continuous (values between 0 and 1)
if (any(Ps_check < 0 | Ps_check > 1, na.rm = TRUE)) {
cli::cli_abort(
"The probability of survival (Ps) values must be between 0 and 1."
)
}
if (any(is.na(Ps_check))) {
cli::cli_warn(
"Missing values detected in {.var Ps_col}; please apply an appropriate treatment to the missings and rerun {.fn nonlinear_bins}."
)
}
# Pull and check the outcome column
binary_data <- data |> dplyr::pull({{ outcome_col }})
# Ensure the column is either logical or numeric
if (!is.logical(binary_data) && !is.numeric(binary_data)) {
cli::cli_abort(
"The {.var outcome_col} must be of type logical (TRUE/FALSE) or numeric (1/0)."
)
}
# Get unique non-missing values
non_missing <- stats::na.omit(binary_data)
# Validate type and values
if (is.logical(binary_data)) {
# Logical vector: ensure only TRUE/FALSE (no coercion needed)
invalid_vals <- setdiff(unique(non_missing), c(TRUE, FALSE))
if (length(invalid_vals) > 0) {
cli::cli_abort(
"The {.var outcome_col} contains invalid logical values: {.val {invalid_vals}}."
)
}
} else if (is.numeric(binary_data)) {
# Numeric vector: ensure strictly 0 or 1
invalid_vals <- setdiff(unique(non_missing), c(0, 1))
if (length(invalid_vals) > 0) {
cli::cli_abort(
"The {.var outcome_col} contains numeric values other than 0 and 1: {.val {invalid_vals}}."
)
}
} else {
# Not logical or numeric
cli::cli_abort(
"The {.var outcome_col} must be either logical (TRUE/FALSE) or numeric (1/0)."
)
}
# Warn if missing
if (any(is.na(binary_data))) {
cli::cli_warn(
"Missing values detected in {.var outcome_col}; please apply an appropriate treatment to the missings and rerun {.fn nonlinear_bins}."
)
}
# Check if all elements in group_vars are strings (i.e., character vectors)
if (!all(sapply(group_vars, is.character))) {
cli::cli_abort(c(
"All elements in {.var group_vars} must be strings.",
"i" = "You passed a {.cls {class(group_vars)}} variable to {.var group_vars}."
))
}
# Check if all group_vars exist in the data
if (!all(group_vars %in% names(data))) {
invalid_vars <- group_vars[!group_vars %in% names(data)]
cli::cli_abort(
"The following group variable(s) are not valid columns in the data: {paste(invalid_vars, collapse = ', ')}"
)
}
# Treat the column-names-as-strings as symbols
if (!is.null(group_vars)) {
group_vars <- rlang::syms(group_vars)
}
# Select and sort the column
survival_data <- data |> dplyr::pull({{ Ps_col }}) |> sort()
total <- length(survival_data)
# length of non-missing `Ps_col` must be >= 2
if (na.omit(total) < 2) {
cli::cli_abort(c(
"At least two non-missing values are required in {.var Ps_col} to compute survival probability intervals.",
"v" = "Ensure {.var Ps_col} contains at least two valid, non-missing numeric entries."
))
}
# Step 1: Find indices for level thresholds
loc_9A <- which(survival_data > threshold_1) # Everything above 0.9 or other threshold
loc_9B <- which(survival_data > threshold_2) # Everything above 0.99 or other threshold
loc_9C <- which(survival_data > threshold_1 & survival_data <= threshold_2) # Between 0.9 and 0.99 or other thresholds
# Step 2: Define step sizes based on the data
step1 <- round(suppressWarnings(min(loc_9A, na.rm = TRUE)) / divisor1)
step2 <- round(length(loc_9C) / divisor2)
# Step 3: Define intervals
# Check that loc_9A and loc_9B are finite before using them in seq()
if (
suppressWarnings(
is.finite(min(loc_9A)) && is.finite(min(loc_9B))
)
) {
# Defensive approach to control for potential error when `Ps_col`
# is not sufficiently dispersed and/or has a low sample size.
attempt <- try(
unique(
c(
seq(
from = 1,
to = suppressWarnings(min(loc_9A, na.rm = TRUE)),
by = step1
),
seq(
from = suppressWarnings(min(loc_9A, na.rm = TRUE)),
to = suppressWarnings(min(loc_9B, na.rm = TRUE)),
by = step2
),
suppressWarnings(max(loc_9B, na.rm = TRUE))
)
),
silent = TRUE
)
if (inherits(attempt, "try-error")) {
cli::cli_abort(c(
"Unable to calculate valid step sizes for defining survival probability intervals due to non-finite values.",
"i" = "This typically occurs when {.var Ps_col} has low variability or the sample size is too small to support the requested binning.",
"v" = "Try reducing {.var divisor1} and {.var divisor2} to create fewer, broader bins.",
"v" = "Examine the distribution of {.var Ps_col} and consider adjusting {.var threshold_1} and {.var threshold_2} to better match the data.",
"v" = "Increasing the sample size may also improve dispersion and allow for more stable interval boundaries."
))
} else {
# If `Ps_col` meets distributional assumptions, proceed
len <- attempt
}
} else {
# If either of the parts of the interval calculation are not finite throw an
# error
cli::cli_abort(
c(
"Unable to calculate valid step sizes for defining survival probability intervals due to non-finite values.",
"i" = "This typically occurs when {.var Ps_col} has low variability or the sample size is too small to support the requested binning.",
"v" = "Try reducing {.var divisor1} and {.var divisor2} to create fewer, broader bins.",
"v" = "Examine the distribution of {.var Ps_col} and consider adjusting {.var threshold_1} and {.var threshold_2} to better match the data.",
"v" = "Increasing the sample size may also improve dispersion and allow for more stable interval boundaries."
)
)
}
# Generate intervals based on these positions
intervals <- unique(survival_data[len])
# Step 4: Bin statistics
# Apply binning to each group separately
data <- data |>
dplyr::mutate(
bin_number = .bincode(
{{ Ps_col }},
breaks = intervals,
include.lowest = TRUE
),
bin_start = intervals[bin_number], # Start of the bin
bin_end = intervals[bin_number + 1] # End of the bin
)
# Optionally group data by dynamic group_vars
# Or run the bin statistics on the whole dataset
if (!is.null(group_vars)) {
grouped_stats <- data |>
dplyr::group_by(!!!group_vars, bin_number, bin_start, bin_end) |>
dplyr::summarize(
mean = mean({{ Ps_col }}, na.rm = TRUE),
sd = stats::sd({{ Ps_col }}, na.rm = TRUE),
Pred_Survivors_b = sum({{ Ps_col }}, na.rm = TRUE),
Pred_Deaths_b = sum(1 - {{ Ps_col }}, na.rm = TRUE),
AntiS_b = dplyr::if_else(
dplyr::n() > 0,
Pred_Survivors_b / dplyr::n(),
NA_real_
),
AntiM_b = dplyr::if_else(
dplyr::n() > 0,
Pred_Deaths_b / dplyr::n(),
NA_real_
),
alive = sum({{ outcome_col }} == 1, na.rm = TRUE),
dead = sum({{ outcome_col }} == 0, na.rm = TRUE),
count = dplyr::n(),
.groups = "drop"
) |>
dplyr::ungroup() |>
dplyr::group_by(!!!group_vars) |>
dplyr::mutate(
percent = count / sum(count, na.rm = TRUE)
) |>
dplyr::ungroup()
} else {
grouped_stats <- data |>
dplyr::group_by(bin_number, bin_start, bin_end) |>
dplyr::summarize(
mean = mean({{ Ps_col }}, na.rm = TRUE),
sd = stats::sd({{ Ps_col }}, na.rm = TRUE),
Pred_Survivors_b = sum({{ Ps_col }}, na.rm = TRUE),
Pred_Deaths_b = sum(1 - {{ Ps_col }}, na.rm = TRUE),
AntiS_b = dplyr::if_else(
dplyr::n() > 0,
Pred_Survivors_b / dplyr::n(),
NA_real_
),
AntiM_b = dplyr::if_else(
dplyr::n() > 0,
Pred_Deaths_b / dplyr::n(),
NA_real_
),
alive = sum({{ outcome_col }} == 1, na.rm = TRUE),
dead = sum({{ outcome_col }} == 0, na.rm = TRUE),
count = dplyr::n(),
.groups = "drop"
) |>
dplyr::mutate(
percent = count / sum(count, na.rm = TRUE)
)
}
# Return a list with intervals and the bin statistics
return(list(intervals = intervals, bin_stats = grouped_stats))
}
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.