cat_lmm_initialization: Initialization for Catalytic Linear Mixed Model (LMM)

View source: R/cat_lmm_initialization.R

cat_lmm_initialization    R Documentation

Description

This function initializes a catalytic linear mixed model: it processes the input data, extracts the required variables, generates a synthetic dataset, and fits a simple model. Only a single random-effect variance is considered.

Usage

cat_lmm_initialization(
  formula,
  data,
  x_cols,
  y_col,
  z_cols,
  group_col = NULL,
  syn_size = NULL,
  resample_by_group = FALSE,
  resample_only = FALSE,
  na_replace = mean
)

Arguments

formula

A formula specifying the model. It should include the response and predictor variables.

data

A data frame containing the data for modeling.

x_cols

A character vector of column names for fixed effects (predictors).

y_col

A character string for the name of the response variable.

z_cols

A character vector of column names for random effects.

group_col

A character string naming the grouping variable (optional). If NULL (the default), the grouping variable is extracted from the formula.

syn_size

An integer specifying the number of rows in the synthetic dataset to be generated. Defaults to length(x_cols) * 4.

resample_by_group

A logical indicating whether to resample by group. Defaults to FALSE.

resample_only

A logical indicating whether to perform resampling only. Defaults to FALSE.

na_replace

A function used to replace NA values in the data. Defaults to mean.
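
The two defaults described above (syn_size and na_replace) can be illustrated in base R. This is only a sketch under stated assumptions: the toy data frame and the helper fill_na are hypothetical, and the package may apply na_replace differently internally (here it is assumed to be applied column by column to the non-missing values).

```r
# Toy data with missing values (hypothetical, for illustration only)
toy <- data.frame(wt = c(2.5, NA, 3.5), hp = c(110, 95, NA))
x_cols <- c("wt", "hp")

# Documented default when syn_size is NULL: length(x_cols) * 4
syn_size <- length(x_cols) * 4

# Hypothetical helper (not part of the package): fill the NAs of one
# column with a summary of its observed values.
# Assumption: na_replace is applied per column to the non-missing values.
fill_na <- function(col, na_replace = mean) {
  col[is.na(col)] <- na_replace(col[!is.na(col)])
  col
}

toy_filled <- as.data.frame(lapply(toy, fill_na))
```

Passing a different function, for example median, in place of mean would change only the value used to fill the gaps.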

Value

A list containing the values of all the input arguments and the following components:

  • Function Information:

    • function_name: A character string representing the name of the function, "cat_lmm_initialization".

    • simple_model: A fitted model object returned by lme4::lmer or stats::lm, used to generate the synthetic response from the original data.

  • Observation Data Information:

    • obs_size: An integer representing the number of observations in the original dataset.

    • obs_data: The original data used for fitting the model, returned as a data frame.

    • obs_x: A data frame containing the standardized predictor variables from the original dataset.

    • obs_y: A numeric vector of the standardized response variable from the original dataset.

    • obs_z: A data frame containing the standardized random effect variables from the original dataset.

    • obs_group: A numeric vector representing the grouping variable for the original observations.

  • Synthetic Data Information:

    • syn_size: An integer representing the number of synthetic observations generated.

    • syn_data: A data frame containing the synthetic dataset, combining synthetic predictor and response variables.

    • syn_x: A data frame containing the synthetic predictor variables.

    • syn_y: A numeric vector of the synthetic response variable values.

    • syn_z: A data frame containing the synthetic random effect variables.

    • syn_group: A numeric vector representing the grouping variable for the synthetic observations.

    • syn_x_resample_inform: A data frame containing information about the resampling process for synthetic predictors:

      • Coordinate: Preserves the original data values as reference coordinates during processing.

      • Deskewing: Adjusts the data distribution to reduce skewness and enhance symmetry.

      • Smoothing: Reduces noise in the data to stabilize the dataset and prevent overfitting.

      • Flattening: Creates a more uniform distribution by modifying low-frequency categories in categorical variables.

      • Symmetrizing: Balances the data around its mean to improve statistical properties for model fitting.

    • syn_z_resample_inform: A data frame containing information about the resampling process for synthetic random effects. The resampling methods are the same as those from syn_x_resample_inform.

  • Whole Data Information:

    • size: An integer representing the total size of the combined original and synthetic datasets.

    • data: A combined data frame of the original and synthetic datasets.

    • x: A combined data frame of the original and synthetic predictor variables.

    • y: A combined numeric vector of the original and synthetic response variables.

    • z: A combined data frame of the original and synthetic random effect variables.

    • group: A combined numeric vector representing the grouping variable for both original and synthetic datasets.
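
To make the relationship between the observation, synthetic, and whole-data components more concrete, the following base-R sketch builds analogous objects by hand. It is an illustration only, not the package's implementation: resampling the predictor with replacement and predicting the response from a plain stats::lm fit are both assumptions, and the package's actual procedure (for example, when lme4::lmer is used) may differ.

```r
set.seed(42)

# Observed data: response and one fixed-effect predictor
obs_data <- mtcars[, c("mpg", "wt")]
obs_size <- nrow(obs_data)
syn_size <- 100

# Assumption: synthetic predictors are drawn by resampling observed values
syn_x <- data.frame(wt = sample(obs_data$wt, syn_size, replace = TRUE))

# Assumption: a simple stats::lm fit supplies the synthetic response
simple_model <- lm(mpg ~ wt, data = obs_data)
syn_y <- as.numeric(predict(simple_model, newdata = syn_x))

# Synthetic dataset, then the combined ("whole") dataset
syn_data <- data.frame(mpg = syn_y, wt = syn_x$wt)
data <- rbind(obs_data, syn_data)
size <- nrow(data)
```

Under these assumptions, size equals obs_size + syn_size, which mirrors how the size, data, x, and y components are described above.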

Examples

library(catalytic)
data(mtcars)
cat_init <- cat_lmm_initialization(
  formula = mpg ~ wt + (1 | cyl), # formula for simple model
  data = mtcars,
  x_cols = c("wt"), # Fixed effects
  y_col = "mpg", # Response variable
  z_cols = c("disp", "hp", "drat", "qsec", "vs", "am", "gear", "carb"), # Random effects
  group_col = "cyl", # Grouping column
  syn_size = 100, # Synthetic data size
  resample_by_group = FALSE, # Resampling option
  resample_only = FALSE, # Whether to perform resampling only
  na_replace = mean # NA replacement method
)
cat_init
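
The returned list can then be inspected through the component names documented in the Value section. The sketch below repeats a shortened version of the call so that it is self-contained, and it runs only if the catalytic package is installed. The exact printed output is not shown because it depends on the installed package version.

```r
# Runs only if the catalytic package is available
if (requireNamespace("catalytic", quietly = TRUE)) {
  cat_init <- catalytic::cat_lmm_initialization(
    formula = mpg ~ wt + (1 | cyl),
    data = mtcars,
    x_cols = c("wt"),
    y_col = "mpg",
    z_cols = c("disp", "hp", "drat", "qsec", "vs", "am", "gear", "carb"),
    group_col = "cyl",
    syn_size = 100
  )

  # Sizes of the observed, synthetic, and combined datasets
  sizes <- c(
    obs = cat_init$obs_size,
    syn = cat_init$syn_size,
    all = cat_init$size
  )
  print(sizes)

  # First rows of the synthetic data, and the class of the helper model
  print(head(cat_init$syn_data))
  print(class(cat_init$simple_model))
}
```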

catalytic documentation built on April 4, 2025, 5:51 a.m.