cat_glm_initialization: Initialization for Catalytic Generalized Linear Models (GLMs)
In catalytic: Tools for Applying Catalytic Priors in Statistical Modeling

cat_glm_initialization

R Documentation

Initialization for Catalytic Generalized Linear Models (GLMs)

Description

This function prepares and initializes a catalytic Generalized Linear Model (GLM) by processing the input data, extracting the necessary variables, generating a synthetic dataset, and fitting a simple model.

Usage

cat_glm_initialization(
  formula,
  family = "gaussian",
  data,
  syn_size = NULL,
  custom_variance = NULL,
  gaussian_known_variance = FALSE,
  x_degree = NULL,
  resample_only = FALSE,
  na_replace = stats::na.omit
)

Arguments

`formula`	A formula specifying the GLM. Should include the response and predictor variables.
`family`	The GLM family to use. Defaults to `"gaussian"`.
`data`	A data frame containing the data for modeling.
`syn_size`	An integer specifying the size of the synthetic dataset to be generated. If `NULL` (the default), four times the number of predictor columns is used.
`custom_variance`	A custom variance value to be applied if using a Gaussian model. Defaults to `NULL`.
`gaussian_known_variance`	A logical value indicating whether the data variance is known. Defaults to `FALSE`. Only applicable to Gaussian family.
`x_degree`	A numeric vector giving the degree of polynomial expansion for each predictor. If `NULL` (the default), a degree of 1 is used for each predictor.
`resample_only`	A logical value indicating whether to perform resampling only. Defaults to `FALSE`.
`na_replace`	A function to handle NA values in the data. Default is `stats::na.omit`.
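
The following is a minimal base R sketch of two of the defaults described above: how `stats::na.omit` treats rows with missing values, and how the documented "four times the number of predictor columns" rule for `syn_size` works out on a small data frame. The data frame `toy_data` and the variables `n_predictors` and `default_syn_size` are hypothetical names used only for illustration. The sketch assumes every column other than the response counts as a predictor and ignores any polynomial expansion, so it may not match the package's internal computation exactly.

```r
# Toy data with one missing predictor value (hypothetical example data)
toy_data <- data.frame(
  X1 = c(0.5, NA, 1.2, -0.3),
  X2 = c(1.0, 2.0, 3.0, 4.0),
  Y  = c(0.1, 0.4, 0.9, 1.6)
)

# stats::na.omit drops every row that contains at least one NA
clean_data <- stats::na.omit(toy_data)
n_obs <- nrow(clean_data)

# Documented default rule for syn_size: four times the number of predictor columns.
# Assumption: all columns except the response Y are predictors.
n_predictors <- ncol(clean_data) - 1
default_syn_size <- 4 * n_predictors
```

Any function with the same contract as `stats::na.omit` (a data frame in, a data frame out) could presumably be passed as `na_replace`, for example one that imputes values instead of dropping rows.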

Value

A list containing the values of all the input arguments and the following components:

Function Information
- function_name: The name of the function, "cat_glm_initialization".
- y_col_name: The name of the response variable in the dataset.
- simple_model: An object of class stats::glm, the model fitted to the original data and used to generate the synthetic response.
Observation Data Information
- obs_size: Number of observations in the original dataset.
- obs_data: Data frame of standardized observation data.
- obs_x: Predictor variables for observed data.
- obs_y: Response variable for observed data.
Synthetic Data Information
- syn_size: Number of synthetic observations generated.
- syn_data: Data frame of synthetic predictor and response variables.
- syn_x: Synthetic predictor variables.
- syn_y: Synthetic response variable.
- syn_x_resample_inform: Information about resampling methods for synthetic predictors:
  - Coordinate: Preserves the original data values as reference coordinates during processing.
  - Deskewing: Adjusts the data distribution to reduce skewness and enhance symmetry.
  - Smoothing: Reduces noise in the data to stabilize the dataset and prevent overfitting.
  - Flattening: Creates a more uniform distribution by modifying low-frequency categories in categorical variables.
  - Symmetrizing: Balances the data around its mean to improve statistical properties for model fitting.
Whole Data Information
- size: Total number of combined original and synthetic observations.
- data: Data frame combining original and synthetic datasets.
- x: Combined predictor variables from original and synthetic data.
- y: Combined response variable from original and synthetic data.
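
To make the relationship between the observed, synthetic, and whole-data components concrete, here is a rough base R sketch. It is not the package's implementation: the resampling scheme (drawing each predictor column independently with replacement) and the way the synthetic response is filled in (the fitted mean of an intercept-only Gaussian model) are assumptions chosen for simplicity, and the variable names merely mirror the component names listed above.

```r
set.seed(1)  # for reproducibility of this illustration only

# Observed data with two predictors and one response
obs_data <- data.frame(
  X1 = stats::rnorm(10),
  X2 = stats::rnorm(10),
  Y  = stats::rnorm(10)
)
obs_size <- nrow(obs_data)
syn_size <- 100

# Assumed resampling scheme: draw each predictor column independently,
# with replacement, from its observed values
syn_x <- data.frame(
  X1 = sample(obs_data$X1, syn_size, replace = TRUE),
  X2 = sample(obs_data$X2, syn_size, replace = TRUE)
)

# Simple (intercept-only) model fitted to the observed response
simple_model <- stats::glm(Y ~ 1, family = stats::gaussian(), data = obs_data)

# With the identity link, the fitted mean equals the intercept, so every
# synthetic row receives the same predicted response
mean_hat <- unname(stats::coef(simple_model)[["(Intercept)"]])
syn_data <- cbind(syn_x, Y = rep(mean_hat, syn_size))

# Whole data: observed rows followed by synthetic rows
whole_data <- rbind(obs_data, syn_data)
whole_size <- nrow(whole_data)
```

In this sketch `whole_size` is simply `obs_size + syn_size`, which is how the `size` component above is described.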

Examples

gaussian_data <- data.frame(
  X1 = stats::rnorm(10),
  X2 = stats::rnorm(10),
  Y = stats::rnorm(10)
)

cat_init <- cat_glm_initialization(
  formula = Y ~ 1, # formula for simple model
  data = gaussian_data,
  syn_size = 100, # Synthetic data size
  custom_variance = NULL, # User customized variance value
  gaussian_known_variance = TRUE, # Indicating whether the data variance is known
  x_degree = c(1, 1), # Degrees for polynomial expansion of predictors
  resample_only = FALSE, # Whether to perform resampling only
  na_replace = stats::na.omit # How to handle NA values in data
)
cat_init
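
The returned list can be inspected component by component. The snippet below is a small usage sketch that relies only on the component names listed in the Value section (`obs_size`, `syn_size`, `data`, `syn_data`). It rebuilds the example data so it can be run on its own, leaves the remaining arguments at their defaults, and skips the call entirely if the catalytic package is not installed.

```r
# Rebuild the example inputs so this snippet is self-contained
gaussian_data <- data.frame(
  X1 = stats::rnorm(10),
  X2 = stats::rnorm(10),
  Y = stats::rnorm(10)
)

# Only run the package call if catalytic is available
if (requireNamespace("catalytic", quietly = TRUE)) {
  cat_init <- catalytic::cat_glm_initialization(
    formula = Y ~ 1,      # formula for the simple model
    data = gaussian_data,
    syn_size = 100        # synthetic data size
  )

  # Component names below are taken from the Value section
  print(cat_init$obs_size)        # number of original observations
  print(cat_init$syn_size)        # number of synthetic observations
  print(dim(cat_init$data))       # combined original and synthetic data
  print(head(cat_init$syn_data))  # first rows of the synthetic data
}
```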

catalytic documentation built on April 4, 2025, 5:51 a.m.