cat_lmm_initialization: Initialization for Catalytic Linear Mixed Model (LMM)
In catalytic: Tools for Applying Catalytic Priors in Statistical Modeling

cat_lmm_initialization

R Documentation

Initialization for Catalytic Linear Mixed Model (LMM)

Description

This function prepares and initializes a catalytic linear mixed model by processing the input data, extracting the necessary variables, generating synthetic datasets, and fitting a simple model. Only a single random-effect variance is considered.

Usage

cat_lmm_initialization(
  formula,
  data,
  x_cols,
  y_col,
  z_cols,
  group_col = NULL,
  syn_size = NULL,
  resample_by_group = FALSE,
  resample_only = FALSE,
  na_replace = mean
)

Arguments

`formula`	A formula specifying the model. Should include response and predictor variables.
`data`	A data frame containing the data for modeling.
`x_cols`	A character vector of column names for fixed effects (predictors).
`y_col`	A character string for the name of the response variable.
`z_cols`	A character vector of column names for random effects.
`group_col`	A character string for the grouping variable (optional). If not given (NULL), it is extracted from the formula.
`syn_size`	An integer specifying the number of synthetic observations to generate. If NULL (the default), it is set to length(x_cols) * 4.
`resample_by_group`	A logical indicating whether to resample by group, default is FALSE.
`resample_only`	A logical indicating whether to perform resampling only, default is FALSE.
`na_replace`	A function to replace NA values in the data, default is mean.
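
The two defaults described above can be illustrated with a minimal base R sketch. It is not the package's internal code: `fill_na` is a hypothetical helper written only for this illustration, and it assumes that `na_replace` is applied column by column to the non-missing values of each numeric column.

```r
# Hypothetical helper (not part of the catalytic package): replace the NA
# entries of each numeric column with na_replace() of the observed values.
fill_na <- function(df, na_replace = mean) {
  for (col in names(df)) {
    missing <- is.na(df[[col]])
    if (is.numeric(df[[col]]) && any(missing)) {
      df[[col]][missing] <- na_replace(df[[col]][!missing])
    }
  }
  df
}

# Documented default: when syn_size is NULL, use length(x_cols) * 4.
x_cols <- c("wt", "hp")
syn_size <- NULL
if (is.null(syn_size)) {
  syn_size <- length(x_cols) * 4
}

# Toy data with one missing value per column.
toy <- data.frame(wt = c(2.5, NA, 3.5), hp = c(100, 110, NA))
filled <- fill_na(toy, na_replace = mean)
print(syn_size)
print(filled)
```

Any function that maps a numeric vector to a single value, such as `median`, could be passed in place of `mean` in this sketch.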

Value

A list containing the values of all the input arguments and the following components:

Function Information:
- function_name: A character string representing the name of the function, "cat_lmm_initialization".
- simple_model: A fitted model object returned by lme4::lmer or stats::lm, used to generate the synthetic response from the original data.
Observation Data Information:
- obs_size: An integer representing the number of observations in the original dataset.
- obs_data: The original data used for fitting the model, returned as a data frame.
- obs_x: A data frame containing the standardized predictor variables from the original dataset.
- obs_y: A numeric vector of the standardized response variable from the original dataset.
- obs_z: A data frame containing the standardized random effect variables from the original dataset.
- obs_group: A numeric vector representing the grouping variable for the original observations.
Synthetic Data Information:
- syn_size: An integer representing the number of synthetic observations generated.
- syn_data: A data frame containing the synthetic dataset, combining synthetic predictor and response variables.
- syn_x: A data frame containing the synthetic predictor variables.
- syn_y: A numeric vector of the synthetic response variable values.
- syn_z: A data frame containing the synthetic random effect variables.
- syn_group: A numeric vector representing the grouping variable for the synthetic observations.
- syn_x_resample_inform: A data frame containing information about the resampling process for synthetic predictors:
  - Coordinate: Preserves the original data values as reference coordinates during processing.
  - Deskewing: Adjusts the data distribution to reduce skewness and enhance symmetry.
  - Smoothing: Reduces noise in the data to stabilize the dataset and prevent overfitting.
  - Flattening: Creates a more uniform distribution by modifying low-frequency categories in categorical variables.
  - Symmetrizing: Balances the data around its mean to improve statistical properties for model fitting.
- syn_z_resample_inform: A data frame containing information about the resampling process for synthetic random effects. The resampling methods are the same as those from syn_x_resample_inform.
Whole Data Information:
- size: An integer representing the total size of the combined original and synthetic datasets.
- data: A combined data frame of the original and synthetic datasets.
- x: A combined data frame of the original and synthetic predictor variables.
- y: A combined numeric vector of the original and synthetic response variables.
- z: A combined data frame of the original and synthetic random effect variables.
- group: A combined numeric vector representing the grouping variable for both original and synthetic datasets.
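
The relationship between the observation, synthetic, and whole-data components listed above can be sketched in base R. This is an illustration under stated assumptions, not the package's implementation: it assumes synthetic predictors are drawn by independently resampling each observed column with replacement, and it does not reproduce the Deskewing, Smoothing, Flattening, or Symmetrizing steps.

```r
set.seed(1)

# Observed predictors (stand-in for obs_x) taken from a built-in dataset.
obs_x <- mtcars[, c("wt", "hp")]
obs_size <- nrow(obs_x)

# Assumed resampling scheme: draw each column independently with replacement,
# so synthetic rows mix values from different original rows.
syn_size <- 8
syn_x <- as.data.frame(
  lapply(obs_x, function(col) sample(col, size = syn_size, replace = TRUE))
)

# Whole-data components: observed rows followed by synthetic rows.
x <- rbind(obs_x, syn_x)
size <- obs_size + syn_size

print(size)
print(tail(x, syn_size))
```

Because every synthetic value is drawn from the observed values of the same column, the synthetic predictors stay within the observed range of each variable.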

Examples

data(mtcars)
cat_init <- cat_lmm_initialization(
  formula = mpg ~ wt + (1 | cyl), # formula for simple model
  data = mtcars,
  x_cols = c("wt"), # Fixed effects
  y_col = "mpg", # Response variable
  z_cols = c("disp", "hp", "drat", "qsec", "vs", "am", "gear", "carb"), # Random effects
  group_col = "cyl", # Grouping column
  syn_size = 100, # Synthetic data size
  resample_by_group = FALSE, # Whether to resample by group
  resample_only = FALSE, # Whether to perform resampling only
  na_replace = mean # NA replacement method
)
cat_init
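
Printing the whole list can be verbose, so it may help to inspect individual components. The sketch below assumes the catalytic package is installed (it does nothing otherwise) and uses only component names documented in the Value section; the helper vector `size_components` is introduced here purely for illustration.

```r
# Names of the documented size components to inspect.
size_components <- c("obs_size", "syn_size", "size")

# Run only when the catalytic package is available.
if (requireNamespace("catalytic", quietly = TRUE)) {
  cat_init <- catalytic::cat_lmm_initialization(
    formula = mpg ~ wt + (1 | cyl),
    data = mtcars,
    x_cols = c("wt"),
    y_col = "mpg",
    z_cols = c("disp", "hp", "drat", "qsec", "vs", "am", "gear", "carb"),
    group_col = "cyl",
    syn_size = 100
  )

  # Sizes of the observed, synthetic, and combined datasets.
  for (name in size_components) {
    cat(name, ":", cat_init[[name]], "\n")
  }

  # First rows of the synthetic data and the class of the simple model.
  print(head(cat_init$syn_data))
  print(class(cat_init$simple_model))
}
```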

catalytic documentation built on April 4, 2025, 5:51 a.m.