cat_glm_initialization: Initialization for Catalytic Generalized Linear Models (GLMs)

View source: R/cat_glm_initialization.R

cat_glm_initialization    R Documentation

Initialization for Catalytic Generalized Linear Models (GLMs)

Description

This function prepares and initializes a catalytic Generalized Linear Model (GLM) by processing the input data, extracting the necessary variables, generating a synthetic dataset, and fitting a simple model.

Usage

cat_glm_initialization(
  formula,
  family = "gaussian",
  data,
  syn_size = NULL,
  custom_variance = NULL,
  gaussian_known_variance = FALSE,
  x_degree = NULL,
  resample_only = FALSE,
  na_replace = stats::na.omit
)

Arguments

formula

A formula specifying the GLM. Should include the response and predictor variables.

family

The GLM family, given as a character string. Defaults to "gaussian".

data

A data frame containing the data for modeling.

syn_size

An integer specifying the size of the synthetic dataset to be generated. If NULL (the default), it is set to four times the number of predictor columns.

custom_variance

A custom variance value to be applied if using a Gaussian model. Defaults to NULL.

gaussian_known_variance

A logical value indicating whether the data variance is known. Defaults to FALSE. Only applicable to Gaussian family.

x_degree

A numeric vector giving the degree of polynomial expansion for each predictor. If NULL (the default), a degree of 1 is used for each predictor.

resample_only

A logical indicating whether to perform resampling only. Default is FALSE.

na_replace

A function to handle NA values in the data. Default is stats::na.omit.
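
As a rough illustration of the na_replace and syn_size arguments described above, the following base R sketch applies stats::na.omit to a toy data frame and then works out the documented default synthetic size. The data frame raw_data and the variables clean_data, predictor_count and default_syn_size are hypothetical names used only here, and counting every non-response column as a predictor is an assumption, not something this page confirms about the package internals.

```r
# Toy data with one missing predictor value (hypothetical example data).
raw_data <- data.frame(
  X1 = c(0.5, NA, 1.2, -0.3),
  X2 = c(1.0, 2.0, 3.0, 4.0),
  Y = c(0.1, 0.4, 0.9, 1.6)
)

# stats::na.omit drops every row that contains at least one NA.
clean_data <- stats::na.omit(raw_data)

# Assumption: the predictors are all columns other than the response Y.
predictor_count <- ncol(clean_data) - 1

# Documented default: four times the number of predictor columns.
default_syn_size <- 4 * predictor_count
```

Any function that takes a data frame and returns a data frame without NA values could presumably be passed as na_replace in the same way, for example one that imputes column means instead of dropping rows.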

Value

A list containing the values of all the input arguments and the following components:

  • Function Information

    • function_name: The name of the function, "cat_glm_initialization".

    • y_col_name: The name of the response variable in the dataset.

    • simple_model: A model object returned by stats::glm, fitted to the original data and used to generate the synthetic response.

  • Observation Data Information

    • obs_size: Number of observations in the original dataset.

    • obs_data: Data frame of standardized observation data.

    • obs_x: Predictor variables for observed data.

    • obs_y: Response variable for observed data.

  • Synthetic Data Information

    • syn_size: Number of synthetic observations generated.

    • syn_data: Data frame of synthetic predictor and response variables.

    • syn_x: Synthetic predictor variables.

    • syn_y: Synthetic response variable.

    • syn_x_resample_inform: Information about resampling methods for synthetic predictors:

      • Coordinate: Preserves the original data values as reference coordinates during processing.

      • Deskewing: Adjusts the data distribution to reduce skewness and enhance symmetry.

      • Smoothing: Reduces noise in the data to stabilize the dataset and prevent overfitting.

      • Flattening: Creates a more uniform distribution by modifying low-frequency categories in categorical variables.

      • Symmetrizing: Balances the data around its mean to improve statistical properties for model fitting.

  • Whole Data Information

    • size: Total number of combined original and synthetic observations.

    • data: Data frame combining original and synthetic datasets.

    • x: Combined predictor variables from original and synthetic data.

    • y: Combined response variable from original and synthetic data.
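
To make the relationship between the observed, synthetic and whole data components more concrete, here is a minimal base R sketch of one way such a list could be assembled. It is not the package's implementation: drawing each predictor column independently with replacement, and using the intercept of a Gaussian intercept-only model as the synthetic response, are assumptions chosen for illustration.

```r
set.seed(1)

# Observed data with the same shape as in the Examples section.
obs_data <- data.frame(
  X1 = stats::rnorm(10),
  X2 = stats::rnorm(10),
  Y = stats::rnorm(10)
)
obs_x <- obs_data[, c("X1", "X2")]
obs_size <- nrow(obs_data)
syn_size <- 100

# Assumed resampling scheme: draw each predictor column independently,
# with replacement, from its observed values.
syn_x <- as.data.frame(lapply(
  obs_x,
  function(column) sample(column, size = syn_size, replace = TRUE)
))

# Simple intercept-only Gaussian model fitted to the observed data.
simple_model <- stats::glm(Y ~ 1, family = stats::gaussian(), data = obs_data)

# With an identity link, the fitted mean is the intercept itself,
# so every synthetic response takes that value.
syn_y <- rep(unname(stats::coef(simple_model)[1]), syn_size)

# Stack observed and synthetic rows to form the whole data.
syn_data <- cbind(syn_x, Y = syn_y)
whole_data <- rbind(obs_data, syn_data)
whole_size <- obs_size + syn_size
```

In this sketch whole_size plays the role of size, and whole_data the role of data, in the Whole Data Information group above.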

Examples

library(catalytic)

gaussian_data <- data.frame(
  X1 = stats::rnorm(10),
  X2 = stats::rnorm(10),
  Y = stats::rnorm(10)
)

cat_init <- cat_glm_initialization(
  formula = Y ~ 1, # formula for simple model
  data = gaussian_data,
  syn_size = 100, # Synthetic data size
  custom_variance = NULL, # User customized variance value
  gaussian_known_variance = TRUE, # Indicating whether the data variance is known
  x_degree = c(1, 1), # Degrees for polynomial expansion of predictors
  resample_only = FALSE, # Whether to perform resampling only
  na_replace = stats::na.omit # How to handle NA values in data
)
cat_init
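
The returned list can then be inspected by component name. The sketch below repeats the call above and reads a few of the components listed in the Value section. It assumes the catalytic package is installed and that the components are accessible with `$` under the names documented here; the requireNamespace guard is only there so that the snippet does nothing when the package is missing.

```r
if (requireNamespace("catalytic", quietly = TRUE)) {
  gaussian_data <- data.frame(
    X1 = stats::rnorm(10),
    X2 = stats::rnorm(10),
    Y = stats::rnorm(10)
  )

  cat_init <- catalytic::cat_glm_initialization(
    formula = Y ~ 1,
    data = gaussian_data,
    syn_size = 100,
    gaussian_known_variance = TRUE,
    x_degree = c(1, 1)
  )

  # Sizes of the observed, synthetic and combined datasets.
  print(c(
    obs = cat_init$obs_size,
    syn = cat_init$syn_size,
    whole = cat_init$size
  ))

  # First rows of the synthetic data and of the combined data.
  print(utils::head(cat_init$syn_data))
  print(utils::head(cat_init$data))
}
```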

catalytic documentation built on April 4, 2025, 5:51 a.m.