knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.align = "center", fig.width = 7, fig.height = 6, cache = TRUE )
This vignette provides an overview of how to use the functions in the catalytic package that focuses on COX Regression. The other catalytic vignettes go into other model-estimating functions.
The goal of the catalytic package is to build framework for catalytic prior distributions. Stabilizing high-dimensional working models by shrinking them towards simplified models. This is achieved by supplementing observed data with weighted synthetic data generated from a predictive distribution under the simpler model. For more information, see [@catalytic_cox].
The two steps of using catalytic package for COX Regression are
Initialization: The cat_cox_initialization
function constructs a cat_init
object based on the formula provided by the user to generate synthetic data. The resulting cat_init
object is tailored to facilitate further analysis, and is integral for subsequent modeling steps in the catalytic
package.
Choose Method(s): Users have the flexibility to choose from four main functions within the catalytic package: cat_cox
, cat_cox_tune
, cat_cox_bayes
, and cat_cox_bayes_joint
. Each function serves a specific purpose in modeling with catalytic priors and offers distinct capabilities tailored to different modeling scenarios for COX Regression. This approach enables users to seamlessly incorporate synthetic data with varying weights from different method into COX Regression analyses, providing flexibility and control over the modeling process.
Creating a high-dimensional dataset with a low data size. This step involves increasing the number of features (dimensions) while keeping the number of observations (data size) relatively small. This is useful for testing the performance of catalytic
models in high-dimensional settings.
The pbc
dataset is loaded and split into training (train_data
) and test (cox_cox_cat_test
) datasets.
library(survival) library(catalytic) set.seed(1) # Compute the Partial Likelihood for the Cox Proportional Hazards Model get_cox_partial_likelihood <- function(X, time, status, coefs, entry_points) { # Identify the risk set indices for Cox proportional hazards model get_cox_risk_set_idx <- function(time_of_interest, entry_vector, time_vector, status_vector) { time_vector <- as.vector(time_vector) entry_vector <- as.vector(entry_vector) status_vector <- as.vector(status_vector) # Find indices where subjects are at risk at the given time of interest return(which((time_of_interest >= entry_vector) & ( (time_vector == time_of_interest & status_vector == 1) | ( time_vector + 1e-08 > time_of_interest)))) } X <- as.matrix(X) time <- as.vector(time) status <- as.vector(status) pl <- 0 # Calculate partial likelihood for each censored observation for (i in which(status == 1)) { risk_set_idx <- get_cox_risk_set_idx( time_of_interest = time[i], entry_vector = entry_points, time_vector = time, status_vector = status ) # Compute the linear predictors for the risk set exp_risk_lp <- exp(X[risk_set_idx, , drop = FALSE] %*% coefs) # Update partial likelihood pl <- pl + sum(X[i, , drop = FALSE] * coefs) - log(sum(exp_risk_lp)) } return(pl) } # Load pbc dataset for Cox proportional hazards model data("pbc") pbc <- pbc[stats::complete.cases(pbc), 2:20] # Remove NA values and `id` column pbc$trt <- ifelse(pbc$trt == 1, 1, 0) # Convert trt to binary pbc$sex <- ifelse(pbc$sex == "f", 1, 0) # Convert sex to binary pbc$edema2 <- ifelse(pbc$edema == 0.5, 1, 0) # Convert edema2 to binary pbc$edema <- ifelse(pbc$edema == 1, 1, 0) # Convert edema to binary pbc$time <- pbc$time / 365 # Convert time to years pbc$status <- (pbc$status == 2) * 1 # Convert status to binary # Identify columns to be standardized not_standarlized_index <- c(1, 2, 3, 5, 6, 7, 8, 9, 20) # Standardize the columns pbc[, -not_standarlized_index] <- scale(pbc[, -not_standarlized_index]) # Seperate observation data into train and test data train_n <- (ncol(pbc) - 2) * 5 train_idx <- sample(1:nrow(pbc), train_n) train_data <- pbc[train_idx, ] test_data <- pbc[-train_idx, ] test_x <- test_data[, -which(names(test_data) %in% c("time", "status"))] test_time <- test_data$time test_status <- test_data$status test_entry_points <- rep(0, nrow(test_x)) dim(train_data)
In this section, we explore the foundation steps of fitting a COX regression model (GLM) using the survival::coxph
function with the gaussian family.
We then proceed to evaluate model performance using the catalytic::get_discrepancy
function, specifically calculating the logarithmic error discrepancy. This involves comparing the actual response (Y
) against the estimated response (est_Y
) derived from the COX regression model fitted on the train_data
.
Finally, we display the computed logarithmic error to assess the model's performance in predicting the response variable mpg
.
# Calculate MPLE (Marginal Partial Likelihood Estimate) prediction score with 0 coefficients MPLE_0 <- get_cox_partial_likelihood( X = test_x, time = test_time, status = test_status, coefs = rep(0, (ncol(test_x))), entry_points = test_entry_points ) # Fit a COX regression model cox_model <- survival::coxph( formula = survival::Surv(time, status) ~ ., data = train_data ) # Calculate MPLE (Marginal Partial Likelihood Estimate) prediction score of estimated coefficients minus 0 coefficients cat( "MLE COX Model - Predition Score:", get_cox_partial_likelihood( X = test_x, time = test_time, status = test_status, coefs = coef(cox_model), entry_points = test_entry_points ) - MPLE_0 )
catalytic
To initialize data for a Cox Proportional-Hazards Model using the catalytic
package, the cat_cox_initialization
function is employed. This function facilitates the setup by preparing synthetic data tailored for modeling purposes.
Here's a breakdown of the parameters used:
formula
: Specifies the model formula, indicating the time and status variables used for the Cox model. See ?survival::coxph
for details on formula syntax.
data
: Represents the original dataset, structured as a data.frame
.
syn_size
: Determines the size of the synthetic data generated, defaulted to four times the number of columns in the original data.
hazard_constant
: A constant used in the hazard rate calculation. Default is the sum of observed statuses divided by the mean observed time divided by the number of observations.
entry_points
: An optional vector specifying the entry points for the data. Default is a vector of zeros.
x_degree
: Determines the degree of the polynomial terms for the predictors in the model, which affects the diversity or complexity of the predictor terms included in the model. Defaulted to NULL
, which means the degree for all predictors is 1.
resample_only
: Determines whether synthetic data is exclusively generated by resampling from the original dataset. By default, it is set to FALSE
, allowing the function to apply various resampling strategies based on the characteristics of dataset.
na_replace
: Specifies the method for handling missing values in the dataset, defaulted to stats::na.omit
.
...
(ellipsis): Allows users to specify additional arguments that are passed directly to the simple model. This model is used within the function to generate synthetic responses by fitting observation data using survival::coxph
. By leveraging ...
, users can customize the behavior of survival::coxph
within cat_cox_initialization
with additional options.
cat_init <- cat_cox_initialization( formula = survival::Surv(time, status) ~ 1, data = train_data, syn_size = 100, # Default: 4 times the number of predictors hazard_constant = 1, # Default: calculated from observed data entry_points = rep(0, nrow(train_data)), # Default: Null, which is vector of zeros x_degree = NULL, # Default: NULL, which means degrees of all predictors are 1 resample_only = FALSE, # Default: FALSE na_replace = mean # Default: stats::na.omit ) cat_init
Here shows how users can simplify the input for cat_cox_initialization
. User do not have to specify syn_size
and other parameters, as they have default values, which mentioned above.
cat_init
object contains a list of attributes, which is typically generated from the above function [cat_cox_initialization
]{#Initialization}. These attributes provide comprehensive information for subsequent modeling tasks or for user checks.
Here's a breakdown of all attributes except the input parameters:
function_name
: The name of this function, which is cat_cox_initialization.
obs_size
: The number of observations (rows) in the original dataset (obs_data
).
obs_data
: The original dataset (data
).
obs_x
: The covariates from the original dataset (obs_data
).
obs_time
: The time variable from the original dataset (obs_data
).
obs_status
: The status variable from the original dataset (obs_data
).
syn_size
: The size of synthetic data generated.
syn_data
: The synthetic data created for modeling purposes, based on the original dataset (obs_data
) characteristics.
syn_x
: The covariates from the synthetic dataset (syn_data
).
syn_time
: The time variable from the synthetic dataset (syn_data
).
syn_status
: The status variable from the synthetic dataset (syn_data
).
syn_x_resample_inform
: The information detailing the process of resampling synthetic data.
size
: The total size of the combined dataset (obs_size
and syn_size
).
data
: The combined dataset (obs_data
and syn_data
).
x
: The combined covariates (obs_x
and syn_x
).
time
: The combined time variable (obs_time
and syn_time
).
status
: The combined status variable (obs_status
and syn_status
).
For more details, please check ?cat_cox_initialization
.
names(cat_init)
And of course, user can extract items mentioned above from cat_cox_initialization
object.
# The number of observations (rows) in the original dataset (`obs_data`) cat_init$obs_size # The information detailing the process of resampling synthetic data cat_init$syn_x_resample_inform
The cat_cox
function fits a Generalized COX Model (GLM) with a catalytic prior on the regression coefficients. It utilizes information from the cat_init
object generated during the initialization step, which includes both observed and synthetic data, plus other relevant information.
The model is then fitted using the specified formula, family, and a single tau
(synthetic data down-weight factor). The resulting cat_cox
object encapsulates the fitted model, including estimated coefficients and family information, facilitating further analysis.
Here's a breakdown of the parameters used:
To fit a Cox Proportional-Hazards Model using the catalytic
package, the cat_cox
function is employed. This function facilitates the setup by preparing synthetic data tailored for modeling purposes.
Here's a breakdown of the parameters used:
formula
: This parameter specifies the model formula used in the Cox model. It defines the relationship between the response variable and the predictors. Alternatively, besides using in format survival:Surv(TIME, STATUS) ~ COVARIATES
, user can also use ~ COVARIATES
without specifying the response name, since the response name is defined in the initialization step.
cat_init
: This parameter is essential and represents the initialization object (cat_init
) created by using cat_cox_initialization
. It contains both observed and synthetic data, plus other relevant information, necessary for model fitting.
tau
: This parameter determines the down-weight assigned to synthetic data relative to observed data. It influences the influence of synthetic data in the model fitting process. If not specified (NULL
), it defaults to a one forth of the number of predictors.
method
: The method for fitting the Cox model, either "CRE" (Catalytic-regularized Estimator) or "WME" (weighted Mixture Estimator). Default is "CRE".
init_coefficients
: Initial coefficient values before iteration. Defaults to zero if not provided (if using CRE
).
tol
: Convergence tolerance for iterative methods. Default is 1e-5
(if using CRE
).
max_iter
: Maximum number of iterations allowed for convergence. Default is 25
(if using CRE
).
cat_cox_model <- cat_cox( formula = survival::Surv(time, status) ~ ., # Same as `~ .` cat_init = cat_init, tau = 10, # Default: number of predictors method = "WME", # Default: CRE init_coefficients = NULL, # Default: all zeros tol = 0.01, # Default: 1e-5 max_iter = 5 # Default: 25 ) cat_cox_model
Here shows how users can simplify the input for cat_cox
. User do not have to specify tau
and so on, as tau
and other parameters has their own default value , which mentioned above.
Let's check the prediction score.
cat( "Catalytic `cat_cox` - Predition Score:", get_cox_partial_likelihood( X = test_x, time = test_time, status = test_status, coefs = coef(cat_cox_model), entry_points = test_entry_points ) - MPLE_0 )
Both cat_cox_model
and cat_cox_model
objects are outputs from the cat_cox
function, providing a list of attributes for further analysis or user inspection.
Here's a breakdown of all attributes except the input parameters:
function_name
: The name of this function, which is cat_cox.
coefficients
: The coefficients of the fitted Cox model.
model
: The fitted survival model (only for "WME" method).
iter
: The number of iterations performed (only for "CRE" method).
iteration_log
: The collected coefficients for each iteration (only for "CRE" method).
For more details, please check ?cat_cox
.
names(cat_cox_model)
User can extract items mentioned above from cat_cox
object.
# The formula used for modeling cat_cox_model$formula # The fitted model object obtained from `survival::coxph` with `tau` cat_cox_model$coefficients
The cat_cox_tune
function fits a with a catalytic prior on the regression coefficients and provides options for optimizing model performance over a range of tau values(tau_seq
).
These methods empower users to fit and optimize models with catalytic priors, leveraging both observed and synthetic data to enhance model performance and robustness in various statistical analyses.
This method computes the partial likelihood across a specified range of tau values (tau_seq
). It iterates through each tau value, evaluating its performance based on cross-validation folds (cross_validation_fold_num
) to select the optimal tau that minimizes the discrepancy error.
Here's a breakdown of the parameters used:
formula
, cat_init
and method
are same as above.
tau_seq
: Vector of positive numeric values for down-weighting synthetic data. Defaults to a sequence around the number of predictors.
cross_validation_fold_num
: Number of folds for cross-validation. Defaults to 5.
...
(ellipsis): Additional arguments passed to the cat_cox
function for model fitting.
cat_cox_tune_model <- cat_cox_tune( formula = survival::Surv(time, status) ~ ., # Same as `~ .` cat_init = cat_init, method = "CRE", # Default: "CRE" tau_seq = seq(1, 5, 0.5), # Default is a numeric sequence around the number of predictors cross_validation_fold_num = 3, # Default: 5 tol = 1 # Additional arguments passed to the `cat_cox` function ) cat_cox_tune_model
User can plot the tau_seq (x) against discrepancy error (y) using the plot()
function. This plot will show the lowest discrepancy error at the optimal tau value.
plot(cat_cox_tune_model, text_pos = 2, legend_pos = "bottomright")
Here shows how users can simplify the input for cat_cox_tune
. User do not have to specify tau_seq
or some other augments, as most of the augments has default values, which mentioned above.
Let's check the prediction score.
cat( "Catalytic `cat_cox_tune` - Predition Score:", get_cox_partial_likelihood( X = test_x, time = test_time, status = test_status, coefs = coef(cat_cox_tune_model), entry_points = test_entry_points ) - MPLE_0 )
cat_cox_tune_model
and cat_cox_tune_model
objects are outputs from the cat_cox_tune
function, providing a list of attributes for further analysis or user inspection.
Here's a breakdown of all attributes except the input parameters:
function_name
: The name of the function (cat_cox_tune
).
likelihood_list
: The collected likelihood values for each tau.
tau
: Selected optimal tau value from tau_seq
that maximizes likelihood score.
model
: The fitted COX model object obtained with the selected optimal tau (tau
).
coefficients
: The estimated coefficients from the fitted COX model (model
).
For more details, please check ?cat_cox_tune
.
names(cat_cox_tune_model)
User can extract items mentioned above from cat_cox_tune
object.
# The estimated coefficients from the fitted model cat_cox_tune_model$coefficients # Selected optimal tau value from `tau_seq` that maximize the likelihood cat_cox_tune_model$tau
Now, we will explore advanced Bayesian modeling techniques tailored for COX using the catalytic
package. Bayesian inference offers a powerful framework to estimate model parameters and quantify uncertainty by integrating prior knowledge with observed data.
Below functions enable Bayesian inference for COX Regression Model with catalytic priors. This function utilizes Markov chain Monte Carlo (MCMC) methods, implemented using the rstan
package, to sample from the posterior distribution of model parameters. Users can specify various options such as the number of MCMC chains (chains
), iterations (iter
) and warmup steps (warmup
). User could also apply other attributes to rstan::sampling
, like refresh
and control
.
In this section, we explore Bayesian approaches using the cat_cox_bayes
function from the catalytic
package. This function can fit a Bayesian Generalized COX Model (GLM) using a fixed tau value. The MCMC sampling process will generate posterior distributions for the coefficients based on the specified tau.
Here's a breakdown of the parameters used:
formula
, cat_init
and tau
are same as above.
chains
: Number of Markov chains to run during MCMC sampling in rstan::sampling
. Defaults to 4.
iter
: Total number of iterations per chain in the MCMC sampling process in rstan::sampling
. Defaults to 2000.
warmup
: Number of warmup iterations in the MCMC sampling process in rstan::sampling
, discarded as burn-in. Defaults to 1000.
hazard_beta
: Shape parameter for the Gamma distribution in the hazard model. Defaults to 2.
...
(ellipsis): Denotes additional arguments that can be passed directly to the underlying rstan::sampling
function used within cat_cox_bayes
to fit the Bayesian model. These arguments allow for customization of the Bayesian fitting process, such as control
, refresh
, or other model-specific settings.
For more details, please refer to ?cat_cox_bayes
.
cat_cox_bayes_model <- cat_cox_bayes( formula = survival::Surv(time, status) ~ ., # Same as `~ . cat_init = cat_init, tau = 1, # Default: NULL chains = 1, # Default: 4 iter = 100, # Default: 2000 warmup = 50, # Default: 1000 hazard_beta = 2 # Default: 2 ) cat_cox_bayes_model
Here shows how users can simplify the input for cat_cox_bayes
. User do not have to specify tau
and other attributes, as tau
and other attributes have default value, which mentioned above. Here we assign lower value to chains
, iter
and warmup
for quicker processing time.
User can also get the traceplot of the rstan
model by using traceplot()
directly into the output from cat_cox_bayes
.
traceplot(cat_cox_bayes_model)
Plus, user can use this catlaytic::traceplot
just like the rstan::traceplot
, user can add parameters used in rstan::traceplot
, like include
and inc_warmup
.
traceplot(cat_cox_bayes_model, inc_warmup = TRUE)
Let's check the prediction score.
cat( "Catalytic `cat_cox_bayes` - Predition Score:", get_cox_partial_likelihood( X = test_x, time = test_time, status = test_status, coefs = coef(cat_cox_bayes_model), entry_points = test_entry_points ) - MPLE_0 )
Both cat_cox_bayes_model
and cat_cox_bayes_model
objects are outputs from the cat_cox_bayes
function, providing a list of attributes for further analysis or user inspection.
Here's a breakdown of all attributes except the input parameters:
function_name
: The name of the function (cat_cox_bayes
).
stan_data
: The data list used for rstan::sampling
.
stan_model
: The rstan::stan_model
object used for Bayesian modeling.
stan_sample_model
: The result object obtained from rstan::sampling
, encapsulating the MCMC sampling results.
coefficients
: The mean estimated coefficients from the Bayesian model, extracted from rstan::summary(stan_sample_model)$summary
.
increment_cumulative_baseline_hazard
: The mean of the increment in cumulative baseline hazard extracted from rstan::summary(stan_sample_model)$summary
.
For more details, please refer to ?cat_cox_bayes
.
names(cat_cox_bayes_model)
User can extract items mentioned above from cat_cox_bayes
object.
# The number of Markov chains used during MCMC sampling in `rstan`. cat_cox_bayes_model$chain # The mean estimated coefficients from the Bayesian model cat_cox_bayes_model$coefficients
In this section, we delve into Bayesian methodologies employing the cat_cox_bayes_joint
function within the catalytic
package. Unlike its non-adaptive counterpart (cat_cox_bayes
), this method employs a joint tau prior approach where tau is treated as a parameter within the MCMC sampling process, improving the robustness and accuracy of parameter estimation in Bayesian gaussian modeling.
In this section, we explore Bayesian approaches using the cat_cox_bayes_joint
function from the catalytic
package. These functions are similar to their non-adaptive (non-joint) version (cat_cox_bayes
), but corporate tau
into the MCMC sampling process.
Here's a breakdown of the parameters used:
formula
and cat_init
are same as above.
chains
, iter
, warmup
and hazard_beta
are same in section Bayesian Posterior Sampling with Fixed Tau.
tau_alpha
: Alpha parameter controlling degrees of freedom for distribution in the joint tau approach. Default is 2.
tau_gamma
: Gamma parameter in the joint tau approach. Default is 1.
...
(ellipsis): Denotes additional arguments that can be passed directly to the underlying rstan::sampling
function used within cat_cox_bayes_joint
to fit the Bayesian model. These arguments allow for customization of the Bayesian fitting process, such as control
, refresh
, or other model-specific settings.
For more details, please refer to ?cat_cox_bayes_joint
.
cat_cox_bayes_joint_model <- cat_cox_bayes_joint( formula = survival::Surv(time, status) ~ ., # Same as `~ . cat_init = cat_init, chains = 1, # Default: 4 iter = 100, # Default: 2000 warmup = 50, # Default: 1000 tau_alpha = 2, # Default: 2 tau_gamma = 1, # Default: 1 hazard_beta = 2 # Default: 2 ) cat_cox_bayes_joint_model
Here shows how users can simplify the input for cat_cox_bayes_joint
. User do not have to specify chains
and other attributes, as they have default values, which mentioned above. Here we assign lower value to chains
, iter
and warmup
for quicker processing time.
User can also get the traceplot of the rstan
model by using traceplot()
directly into the output from cat_cox_bayes_joint
.
traceplot(cat_cox_bayes_joint_model)
Like the traceplot
shown in the cat_cox_bayes
function , user can add parameters used in rstan::traceplot
, like include
and inc_warmup
.
traceplot(cat_cox_bayes_joint_model, inc_warmup = TRUE)
Let's check the prediction score.
cat( "Catalytic `cat_cox_bayes_joint` - Predition Score:", get_cox_partial_likelihood( X = test_x, time = test_time, status = test_status, coefs = coef(cat_cox_bayes_joint_model), entry_point = test_entry_points ) - MPLE_0 )
Both cat_cox_bayes_joint_model
and cat_cox_bayes_joint_model
objects are outputs from the cat_cox_bayes_joint
function, providing a list of attributes for further analysis or user inspection.
Here's a breakdown of all attributes except the input parameters:
function_name
: The name of the function (cat_cox_bayes_joint
).
tau
: The estimated tau parameter from the MCMC sampling rstan::sampling
, depending on the model configuration.
stan_data
, stan_model
, stan_sample_model
, coefficients
and increment_cumulative_baseline_hazard
are same in section Bayesian Posterior Sampling with Fixed Tau.
For more details, please refer to ?cat_cox_bayes_joint
.
names(cat_cox_bayes_joint_model)
User can extract items mentioned above from cat_cox_bayes_joint
object.
# The estimated tau parameter from the MCMC sampling `rstan::sampling`, cat_cox_bayes_joint_model$tau # The mean estimated coefficients from the Bayesian model cat_cox_bayes_joint_model$coefficients
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.