In pintervals: Model Agnostic Prediction Intervals

pinterval_ccp

R Documentation

Clustered Conformal Prediction Intervals for Continuous Predictions

Description

This function computes conformal prediction intervals with a confidence level of 1 - \alpha by first grouping Mondrian classes into data-driven clusters based on the distribution of their nonconformity scores. The resulting clusters are used as strata for computing class-conditional (Mondrian-style) conformal prediction intervals. This approach improves local validity and statistical efficiency when there are many small or similar classes with overlapping prediction behavior. The coverage level 1 - \alpha is approximate within each cluster, assuming exchangeability of nonconformity scores within clusters.

The method supports additional features such as prediction calibration, distance-weighted conformal scores, and clustering optimization via internal validity measures (e.g., Calinski-Harabasz index or minimum cluster size heuristics).
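The idea of computing intervals separately within each stratum can be illustrated with a minimal base-R sketch. It is not the package's implementation: it uses the closed-form conformal quantile of absolute-error scores rather than a grid search, and all object names (`stratum`, `conformal_quantile`, `q_by_stratum`) are hypothetical.

```r
# Minimal sketch of split conformal intervals computed separately per stratum.
# Simulated data; names are illustrative and not part of pintervals.
set.seed(1)
n <- 400
stratum <- sample(c("a", "b"), n, replace = TRUE)
truth <- rnorm(n, mean = 10, sd = ifelse(stratum == "a", 1, 3))
pred <- rep(10, n)

alpha <- 0.1
scores <- abs(truth - pred)  # absolute-error nonconformity scores

# Finite-sample conformal quantile: the ceiling((n + 1) * (1 - alpha))-th
# smallest score, or Inf if the stratum is too small for the requested level
conformal_quantile <- function(s, alpha) {
  k <- ceiling((length(s) + 1) * (1 - alpha))
  if (k > length(s)) return(Inf)
  sort(s)[k]
}

# One quantile per stratum, so each stratum gets its own interval width
q_by_stratum <- tapply(scores, stratum, conformal_quantile, alpha = alpha)

# Intervals for a new prediction of 10 in each stratum
new_pred <- 10
data.frame(
  stratum = names(q_by_stratum),
  lower = new_pred - as.numeric(q_by_stratum),
  upper = new_pred + as.numeric(q_by_stratum)
)
```

The noisier stratum receives the wider interval, which is the behavior that clustering aims to preserve while pooling classes that are too small to calibrate on their own.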

Usage

pinterval_ccp(
  pred,
  pred_class = NULL,
  calib = NULL,
  calib_truth = NULL,
  calib_class = NULL,
  lower_bound = NULL,
  upper_bound = NULL,
  alpha = 0.1,
  ncs_type = c("absolute_error", "relative_error", "za_relative_error",
    "heterogeneous_error", "raw_error"),
  grid_size = 10000,
  resolution = NULL,
  n_clusters = NULL,
  cluster_method = c("kmeans", "ks"),
  cluster_train_fraction = 0.5,
  optimize_n_clusters = TRUE,
  optimize_n_clusters_method = c("calinhara", "min_cluster_size"),
  min_cluster_size = 150,
  min_n_clusters = 2,
  max_n_clusters = NULL,
  distance_weighted_cp = FALSE,
  distance_features_calib = NULL,
  distance_features_pred = NULL,
  distance_type = c("mahalanobis", "euclidean"),
  normalize_distance = "none",
  weight_function = c("gaussian_kernel", "caucy_kernel", "logistic", "reciprocal_linear")
)

Arguments

`pred`	Vector of predicted values
`pred_class`	A vector of class identifiers for the predicted values. This is used to group the predictions by class for Mondrian conformal prediction.
`calib`	A numeric vector of predicted values in the calibration partition, or a 2 column tibble or matrix with the first column being the predicted values and the second column being the truth values. If calib is a numeric vector, calib_truth must be provided.
`calib_truth`	A numeric vector of true values in the calibration partition. Only required if calib is a numeric vector
`calib_class`	A vector of class identifiers for the calibration set.
`lower_bound`	Optional minimum value for the prediction intervals. If not provided, the minimum (true) value of the calibration partition will be used. Primarily useful when the possible outcome values are outside the range of values observed in the calibration set.
`upper_bound`	Optional maximum value for the prediction intervals. If not provided, the maximum (true) value of the calibration partition will be used. Primarily useful when the possible outcome values are outside the range of values observed in the calibration set.
`alpha`	The significance (miscoverage) level of the prediction intervals, which target a coverage of 1 - alpha. Must be a single numeric value between 0 and 1.
`ncs_type`	A string specifying the type of nonconformity score to use. Available options are: `"absolute_error"`: `|y - \hat{y}|`; `"relative_error"`: `|y - \hat{y}| / \hat{y}`; `"za_relative_error"` (zero-adjusted relative error): `|y - \hat{y}| / (\hat{y} + 1)`; `"heterogeneous_error"`: `|y - \hat{y}| / \sigma_{\hat{y}}`, the absolute error divided by a measure of heteroskedasticity, computed as the predicted value from a linear model of the absolute error on the predicted values; `"raw_error"`: the signed error `y - \hat{y}`. The default is `"absolute_error"`.
`grid_size`	The number of points to use in the grid search between the lower and upper bound. Default is 10,000. A larger grid size increases the resolution of the prediction intervals but also increases computation time.
`resolution`	Alternative to grid_size: the minimum step size between grid points. Useful if a specific resolution is desired. Default is NULL.
`n_clusters`	Number of clusters to use when combining Mondrian classes. Required if `optimize_n_clusters = FALSE`.
`cluster_method`	Clustering method used to group Mondrian classes. Options are `"kmeans"` or `"ks"` (Kolmogorov-Smirnov). Default is `"kmeans"`.
`cluster_train_fraction`	Fraction of the calibration data used to estimate nonconformity scores and compute clustering; the remainder is used for interval estimation. The Usage signature sets the default to 0.5. A value of 1 uses the entire calibration set for both clustering and interval estimation. See details for more discussion.
`optimize_n_clusters`	Logical. If `TRUE`, the number of clusters is chosen automatically based on internal clustering criteria.
`optimize_n_clusters_method`	Method used for cluster optimization. One of `"calinhara"` (Calinski-Harabasz index) or `"min_cluster_size"`. Default is `"calinhara"`.
`min_cluster_size`	Minimum number of calibration points per cluster. Used only when `optimize_n_clusters_method = "min_cluster_size"`.
`min_n_clusters`	Minimum number of clusters to consider when optimizing.
`max_n_clusters`	Maximum number of clusters to consider. If `NULL`, the upper limit is set to the number of unique Mondrian classes minus 1.
`distance_weighted_cp`	Logical. If `TRUE`, weighted conformal prediction is performed where the non-conformity scores are weighted based on the distance between calibration and prediction points in feature space. Default is `FALSE`. See details for more information.
`distance_features_calib`	A matrix, data frame, or numeric vector of features from which to compute distances when `distance_weighted_cp = TRUE`. This should contain the feature values for the calibration set. Must have the same number of rows as the calibration set. Can be the predicted values themselves, or any other features which give a meaningful distance measure.
`distance_features_pred`	A matrix, data frame, or numeric vector of feature values for the prediction set. Must be the same features as specified in `distance_features_calib`. Required if `distance_weighted_cp = TRUE`.
`distance_type`	The type of distance metric to use when computing distances between calibration and prediction points. Options are 'mahalanobis' (default) and 'euclidean'.
`normalize_distance`	Either 'minmax', 'sd', or 'none'. Indicates if and how to normalize the distances when distance_weighted_cp is TRUE. Normalization helps ensure that distances are on a comparable scale across features. Default is 'none'.
`weight_function`	A character string specifying the weighting kernel to use for distance-weighted conformal prediction. Options are: `"gaussian_kernel"`: `w(d) = e^{-d^2}`; `"caucy_kernel"` (Cauchy kernel): `w(d) = 1/(1 + d^2)`; `"logistic"`: `w(d) = 1/(1 + e^{d})`; `"reciprocal_linear"`: `w(d) = 1/(1 + d)`. The default is `"gaussian_kernel"`. Here `d` is the distance between the calibration and prediction feature vectors, computed according to `distance_type`.
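The four weighting kernels listed under `weight_function` can be written as plain R functions to compare how quickly each one discounts distant calibration points. This is only a transcription of the formulas given in the argument description; the list name `weight_functions` is hypothetical and the package's internal code may differ.

```r
# Weight kernels as given in the argument description (d = distance, d >= 0).
# The list itself is illustrative and not an object exported by pintervals.
weight_functions <- list(
  gaussian_kernel   = function(d) exp(-d^2),
  caucy_kernel      = function(d) 1 / (1 + d^2),
  logistic          = function(d) 1 / (1 + exp(d)),
  reciprocal_linear = function(d) 1 / (1 + d)
)

# Evaluate every kernel on a few distances; rows are distances, columns kernels
d <- c(0, 0.5, 1, 2)
weights <- sapply(weight_functions, function(w) w(d))
rownames(weights) <- paste0("d=", d)
round(weights, 3)
```

All four kernels decrease monotonically in `d`. The logistic kernel, as written, starts at 0.5 for `d = 0`, while the other three start at 1.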

Details

'pinterval_ccp()' builds on [pinterval_mondrian()] by introducing a clustered conformal prediction framework. Calibrating separately for every Mondrian class can produce unstable or noisy intervals when there are many small groups. Instead, the method assigns Mondrian classes with similar nonconformity score distributions (that is, similar prediction-error behavior) to the same cluster. Each resulting cluster is then treated as a stratum for standard inductive conformal prediction.
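The grouping step can be sketched in base R. The example below represents each class by a few quantiles of its nonconformity scores and runs `kmeans()` on those vectors. The quantile embedding and the chosen probabilities are assumptions made for illustration; the documentation does not state how the package represents each class internally.

```r
# Sketch: cluster Mondrian classes by the distribution of their scores.
# Simulated scores; the quantile embedding is an assumed representation.
set.seed(42)
classes <- rep(paste0("c", 1:6), each = 200)

# Classes c1-c3 have small errors, c4-c6 have large errors
error_sd <- ifelse(classes %in% c("c1", "c2", "c3"), 1, 4)
scores <- abs(rnorm(length(classes), sd = error_sd))

# One row per class, one column per score quantile
probs <- c(0.5, 0.6, 0.7, 0.8, 0.9)
embedding <- t(sapply(split(scores, classes), quantile, probs = probs))

# Group classes with similar score distributions
km <- kmeans(embedding, centers = 2, nstart = 10)
cluster_of_class <- km$cluster
cluster_of_class

# Each observation inherits the cluster of its class
cluster_of_obs <- cluster_of_class[classes]
table(classes, cluster_of_obs)
```

Calibration then proceeds within each cluster, so a class with few calibration points borrows strength from classes whose errors behave similarly.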

Users may specify the number of clusters directly using the 'n_clusters' argument or optimize the number of clusters using the Calinski–Harabasz index or minimum cluster size heuristics.
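As a rough illustration of the Calinski-Harabasz criterion, the sketch below computes the index from the sums of squares returned by `kmeans()` and picks the candidate number of clusters with the largest value. The helper `calinski_harabasz()` and the simulated `embedding` matrix are hypothetical; the package may compute the index differently.

```r
# Sketch: choose the number of clusters by the Calinski-Harabasz index.
# Hypothetical class embeddings forming three well-separated groups.
set.seed(7)
embedding <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.1), ncol = 2),
  matrix(rnorm(60, mean = 3, sd = 0.1), ncol = 2),
  matrix(rnorm(60, mean = 6, sd = 0.1), ncol = 2)
)

# CH(k) = (between-cluster SS / (k - 1)) / (within-cluster SS / (n - k))
calinski_harabasz <- function(x, k) {
  km <- kmeans(x, centers = k, nstart = 20)
  n <- nrow(x)
  (km$betweenss / (k - 1)) / (km$tot.withinss / (n - k))
}

# Evaluate each candidate between the minimum and maximum number of clusters
candidate_k <- 2:5
ch <- sapply(candidate_k, function(k) calinski_harabasz(embedding, k))
best_k <- candidate_k[which.max(ch)]
best_k
```

Higher values indicate clusters that are compact relative to their separation, so the candidate with the largest index is selected.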

Clustering can be computed using all calibration data or a subsample defined by 'cluster_train_fraction'. With 'cluster_train_fraction = 1', the entire calibration set is used for both clustering and interval estimation, which may lead to overfitting. A value less than 1 (such as 0.5, the default in the Usage signature) mitigates this risk by using separate data for clustering and interval estimation, at the cost of potentially less stable cluster assignments with smaller calibration subsets. If data is limited, using the full calibration set for clustering may still be preferable, but users should be aware of the potential for overfitting and optimistic coverage estimates in this case.
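A minimal sketch of such a split is shown below. It draws the clustering subset separately within each class so that every class contributes rows to both parts; this stratified draw is an assumption, and the package may split the calibration data differently.

```r
# Sketch: split calibration rows into a clustering part and an interval part.
# Stratifying the draw by class is an assumption made for illustration.
set.seed(99)
n_calib <- 250
calib_class <- sample(1:6, n_calib, replace = TRUE)
cluster_train_fraction <- 0.5

# Row indices of the calibration set, grouped by class
rows_by_class <- split(seq_len(n_calib), calib_class)

# Within each class, draw the requested share of rows for clustering
cluster_idx <- unlist(lapply(rows_by_class, function(idx) {
  n_take <- floor(cluster_train_fraction * length(idx))
  idx[sample.int(length(idx), size = n_take)]
}), use.names = FALSE)

# The remaining rows are reserved for interval estimation
interval_idx <- setdiff(seq_len(n_calib), cluster_idx)

c(clustering = length(cluster_idx), intervals = length(interval_idx))
```

Because the two index sets are disjoint, the scores used to form clusters are never reused when calibrating the intervals.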

Clustering is based on either k-means or Kolmogorov-Smirnov distance between nonconformity score distributions of the Mondrian classes, selected via the 'cluster_method' argument.
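For the Kolmogorov-Smirnov variant, the pairwise distance between two classes can be taken as the two-sample KS statistic of their nonconformity scores. The sketch below builds that distance matrix with `ks.test()` and, as one possible way to obtain clusters from it, applies average-linkage hierarchical clustering. The use of `hclust()` is an assumption; the documentation does not say which algorithm the package applies to the KS distances.

```r
# Sketch: KS distances between class-wise score distributions.
# Simulated scores; hclust() is an assumed way to cluster the distances.
set.seed(2024)
scores_by_class <- list(
  c1 = abs(rnorm(150, sd = 1)),
  c2 = abs(rnorm(150, sd = 1)),
  c3 = abs(rnorm(150, sd = 4)),
  c4 = abs(rnorm(150, sd = 4))
)

class_names <- names(scores_by_class)
ks_dist <- matrix(0, length(class_names), length(class_names),
                  dimnames = list(class_names, class_names))

# Two-sample KS statistic for every pair of classes
for (i in class_names) {
  for (j in class_names) {
    if (i != j) {
      ks_dist[i, j] <- unname(
        ks.test(scores_by_class[[i]], scores_by_class[[j]])$statistic
      )
    }
  }
}
round(ks_dist, 2)

# Turn the distance matrix into two clusters
hc <- hclust(as.dist(ks_dist), method = "average")
clusters <- cutree(hc, k = 2)
clusters
```

Unlike a quantile summary, the KS statistic compares the full empirical distribution functions, so it can separate classes whose scores differ in shape as well as in scale.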

For a detailed description of non-conformity scores, distance-weighting, and the general conformal prediction framework, see [pinterval_conformal()], and for a description of Mondrian conformal prediction, see [pinterval_mondrian()].
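As a quick reference, the score definitions listed under `ncs_type` can be computed by hand as follows. This is a sketch based only on the formulas in the argument description, with made-up values for `y` and `y_hat`; the package's internal handling (for example of non-positive predictions or non-positive fitted values) is not shown and may differ.

```r
# Sketch: the five nonconformity scores for a few made-up observations
y     <- c(2.0, 5.5, 0.8, 10.2, 3.1)  # true values
y_hat <- c(2.5, 5.0, 1.0,  9.0, 3.0)  # predicted values

abs_err <- abs(y - y_hat)

# Heteroskedasticity measure: fitted values from a linear model of the
# absolute error on the predicted values
sigma_hat <- fitted(lm(abs_err ~ y_hat))

ncs <- data.frame(
  absolute_error      = abs_err,
  relative_error      = abs_err / y_hat,
  za_relative_error   = abs_err / (y_hat + 1),
  heterogeneous_error = abs_err / sigma_hat,
  raw_error           = y - y_hat
)
round(ncs, 3)
```

Only `raw_error` keeps the sign of the error; the other four scores are non-negative whenever the denominators are positive.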

Value

A tibble with predicted values, lower and upper prediction interval bounds, class labels, and assigned cluster labels. Attributes include clustering diagnostics (e.g., cluster assignments, coverage gaps, internal validity scores).

Examples

library(dplyr)
library(tibble)

# Simulate data with 6 Mondrian classes forming 3 natural clusters
set.seed(123)
x1 <- runif(1000)
x2 <- runif(1000)
class_raw <- sample(1:6, size = 1000, replace = TRUE)

# Construct 3 latent clusters: (1,2), (3,4), (5,6)
mu <- ifelse(class_raw %in% c(1, 2), 1 + x1 + x2,
      ifelse(class_raw %in% c(3, 4), 2 + x1 + x2,
                               3 + x1 + x2))

sds <- ifelse(class_raw %in% c(1, 2), 0.5,
      ifelse(class_raw %in% c(3, 4), 0.3,
                        0.4))

y <- rlnorm(1000, meanlog = mu, sdlog = sds)

df <- tibble(x1, x2, class = factor(class_raw), y)

# Split into training, calibration, and test sets
df_train <- df %>% slice(1:500)
df_cal <- df %>% slice(501:750)
df_test <- df %>% slice(751:1000)

# Fit model (on log-scale)
mod <- lm(log(y) ~ x1 + x2, data = df_train)

# Generate predictions
pred_cal <- exp(predict(mod, newdata = df_cal))
pred_test <- exp(predict(mod, newdata = df_test))

# Apply clustered conformal prediction
intervals <- pinterval_ccp(
  pred = pred_test,
  pred_class = df_test$class,
  calib = pred_cal,
  calib_truth = df_cal$y,
  calib_class = df_cal$class,
  alpha = 0.1,
  ncs_type = "absolute_error",
  optimize_n_clusters = TRUE,
  optimize_n_clusters_method = "calinhara",
  min_n_clusters = 2,
  max_n_clusters = 4
)

# View clustered prediction intervals
head(intervals)

pintervals documentation built on March 3, 2026, 5:06 p.m.