Adding New Data-Generating Mechanisms
In PublicationBiasBenchmark: Benchmark for Publication Bias Correction Methods

This vignette explains how to add new data-generating mechanisms (DGMs) to the PublicationBiasBenchmark package, using the no_bias DGM as a running example. (See the Using Presimulated Datasets vignette for details on working with the datasets that have already been simulated and stored.)

Overview

Each DGM in the package consists of three key components:

Main DGM function: Implements the data-generating mechanism
Validation function: Validates input parameters and settings
Conditions function: Defines pre-specified conditions

All three functions must be implemented in a single file named dgm-{DGM_NAME}.R in the R/ directory. Implementation of these three functions allows users to generate data from the DGM via the simulate_dgm() function.

File Structure and Naming

For a DGM called "no_bias", you need to create a file named R/dgm-no_bias.R containing three functions:

dgm.no_bias(): The main data-generating mechanism implementation
validate_dgm_setting.no_bias(): Parameter validation
dgm_conditions.no_bias(): Pre-defined conditions

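The naming pattern matters because the package relies on S3 method dispatch: a method is found only if its name follows the generic.{DGM_NAME} form, so a misspelled suffix means the DGM cannot be located.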
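As a minimal sketch of why the suffix matters, the toy generic below dispatches on a name by attaching it as an S3 class. The functions toy_dgm(), toy_dgm.no_bias(), and toy_dgm.default() are hypothetical stand-ins introduced for illustration and are not the package's actual internals; they only show how R resolves generic.class function names.

```r
# Hypothetical generic (not the package's implementation): dispatches on a
# throwaway object whose S3 class is the DGM name
toy_dgm <- function(dgm_name, settings) {
  dispatch_on <- structure(list(), class = dgm_name)
  UseMethod("toy_dgm", dispatch_on)
}

# Found because its name is <generic>.<DGM name>
toy_dgm.no_bias <- function(dgm_name, settings) {
  paste("no_bias method called with", settings$n_studies, "studies")
}

# Fallback when no method matches the supplied name
toy_dgm.default <- function(dgm_name, settings) {
  stop("Unknown DGM: '", dgm_name, "'")
}

toy_dgm("no_bias", list(n_studies = 10))
#> [1] "no_bias method called with 10 studies"

# A name without a matching method falls through to the default and errors
outcome <- tryCatch(toy_dgm("misspelled", list()), error = function(e) "error")
```

The same lookup rule is why all three functions of a DGM need to carry the same {DGM_NAME} suffix.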

1. Main DGM Function: `dgm.{DGM_NAME}()`

This is the core function that implements your data-generating mechanism. Here is the no_bias implementation as an example:

#' @title Normal Unbiased Data-Generating Mechanism
#'
#' @description
#' An example data-generating mechanism to simulate effect sizes without
#' publication bias.
#'
#' @param dgm_name DGM name (automatically passed)
#' @param settings List containing \describe{
#'   \item{mean_effect}{Mean effect}
#'   \item{heterogeneity}{Effect heterogeneity}
#'   \item{n_studies}{Number of effect size estimates}
#' }
#'
#'
#' @return Data frame with \describe{
#'   \item{yi}{effect size}
#'   \item{sei}{standard error}
#'   \item{ni}{sample size}
#' }
#'
#' @references
#' \insertAllCited{}
#'
#' @seealso [dgm()], [validate_dgm_setting()]
#' @export
dgm.no_bias <- function(dgm_name, settings) {

  # Extract settings
  n_studies     <- settings[["n_studies"]]
  mean_effect   <- settings[["mean_effect"]]
  heterogeneity <- settings[["heterogeneity"]]

  # Simulate sample sizes from a negative binomial distribution truncated to
  # [N_low, N_high] (parameters based on an empirical distribution)
  N_shape <- 2
  N_scale <- 58
  N_low   <- 25
  N_high  <- 500

  N_seq <- seq(N_low, N_high, 1)
  N_den <- stats::dnbinom(N_seq, size = N_shape, prob = 1/(N_scale+1)) /
      (stats::pnbinom(N_high, size = N_shape, prob = 1/(N_scale+1)) - 
       stats::pnbinom(N_low - 1, size = N_shape, prob = 1/(N_scale+1)))

  N <- sample(N_seq, n_studies, TRUE, N_den)

  # Compute standard errors based on sample sizes (Cohen's d formula)
  standard_errors <- sqrt(4/N)

  # Simulate observed effect sizes: the standard deviation combines
  # between-study heterogeneity and sampling error
  effect_sizes <- stats::rnorm(n_studies, mean_effect, 
                              sqrt(heterogeneity^2 + standard_errors^2))

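  # Return standardized data frame
  data <- data.frame(
    yi      = effect_sizes,
    sei     = standard_errors,
    ni      = N,
    # es_type is listed as a required column; "SMD" is an assumed label here,
    # chosen because the standard errors use the Cohen's d formula
    es_type = "SMD"
  )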

  return(data)
}
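Read directly from the code above, the mechanism can be summarized in the following notation, which is introduced here for illustration: $\mu$ is mean_effect, $\tau$ is heterogeneity (it enters as a standard deviation, since the code squares it), and $K$ is n_studies.

```latex
\begin{aligned}
N_i &\sim \text{NegBinomial}(\text{size} = 2,\ \text{prob} = 1/59)
      \ \text{truncated to}\ \{25, \dots, 500\}, \\
\mathrm{se}_i &= \sqrt{4 / N_i}, \\
y_i &\sim \mathcal{N}\!\left(\mu,\ \tau^2 + \mathrm{se}_i^2\right),
      \qquad i = 1, \dots, K .
\end{aligned}
```

The last line is the marginal form of the usual random-effects model: drawing a true effect $\theta_i \sim \mathcal{N}(\mu, \tau^2)$ and then $y_i \mid \theta_i \sim \mathcal{N}(\theta_i, \mathrm{se}_i^2)$ gives the same distribution for $y_i$.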

Key Requirements for the Main Function:

Input Parameters:

dgm_name: Automatically passed by the framework
settings: Named list containing all DGM parameters, or the condition_id value of a pre-defined condition

Output: Must return a data frame with these required columns:

yi: Effect sizes
sei: Standard errors
ni: Sample sizes
es_type: Type of effect size (e.g., "SMD", "logOR", "none")

Optional additional columns (commonly used):

study_id: Unique identifier for each study/cluster (in the presence of multilevel/clustered data)
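While developing a DGM, it may help to check its output against the column requirements listed above. The helper below, check_dgm_output(), is a hypothetical illustration and not a function of the package; the toy data frame is made up solely to exercise it.

```r
# Hypothetical helper (not part of the package): checks that a DGM's output
# is a data frame containing the four required columns
check_dgm_output <- function(data) {
  if (!is.data.frame(data))
    stop("DGM output must be a data frame")

  required_cols <- c("yi", "sei", "ni", "es_type")
  missing_cols  <- setdiff(required_cols, names(data))
  if (length(missing_cols) > 0)
    stop("DGM output is missing columns: ", paste(missing_cols, collapse = ", "))

  if (any(is.na(data$sei)) || any(data$sei <= 0))
    stop("'sei' must contain positive, non-missing standard errors")

  invisible(TRUE)
}

# Made-up output with the required columns
toy_ni     <- c(100, 178, 44)
toy_output <- data.frame(
  yi      = c(0.21, -0.05, 0.40),
  sei     = sqrt(4 / toy_ni),
  ni      = toy_ni,
  es_type = "SMD"
)

check_dgm_output(toy_output)  # passes silently

# Dropping required columns triggers an informative error
incomplete <- tryCatch(
  check_dgm_output(toy_output[, c("yi", "sei")]),
  error = function(e) conditionMessage(e)
)
incomplete
#> [1] "DGM output is missing columns: ni, es_type"
```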

2. Validation Function: `validate_dgm_setting.{DGM_NAME}()`

This function validates that all required parameters are provided and have valid values:

#' @export
validate_dgm_setting.no_bias <- function(dgm_name, settings) {

  # Check that all required settings are specified
  required_params <- c("n_studies", "mean_effect", "heterogeneity")
  missing_params <- setdiff(required_params, names(settings))
  if (length(missing_params) > 0)
    stop("Missing required settings: ", paste(missing_params, collapse = ", "))

  # Extract settings for validation
  n_studies     <- settings[["n_studies"]]
  mean_effect   <- settings[["mean_effect"]]
  heterogeneity <- settings[["heterogeneity"]]

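  # Validate each parameter
  # (is.wholenumber() is not part of base R; it is assumed here to be a
  # helper defined elsewhere in the package)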
  if (length(n_studies) != 1 || !is.numeric(n_studies) || is.na(n_studies) || 
      !is.wholenumber(n_studies) || n_studies < 1)
    stop("'n_studies' must be an integer larger than 0")

  if (length(mean_effect) != 1 || !is.numeric(mean_effect) || is.na(mean_effect))
    stop("'mean_effect' must be numeric")

  if (length(heterogeneity) != 1 || !is.numeric(heterogeneity) || 
      is.na(heterogeneity) || heterogeneity < 0)
    stop("'heterogeneity' must be non-negative")

  return(invisible(TRUE))
}

Key Points for Validation:

Check for missing required parameters
Validate parameter types (numeric, integer, character, etc.)
Check parameter ranges and constraints
Provide clear, informative error messages
Return invisible(TRUE) on successful validation
Use stop() for validation failures
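The points above can be sketched with a small standalone validator. Both check_scalar_numeric() and validate_toy_settings() are hypothetical names introduced for this illustration; the first shows one possible way to factor out the repeated scalar checks, and neither is part of the package.

```r
# Hypothetical helper: `x` must be a single non-missing number >= `lower`
check_scalar_numeric <- function(x, name, lower = -Inf) {
  if (length(x) != 1 || !is.numeric(x) || is.na(x))
    stop("'", name, "' must be a single non-missing number")
  if (x < lower)
    stop("'", name, "' must be at least ", lower)
  invisible(TRUE)
}

# Hypothetical validator following the same steps as the example above
validate_toy_settings <- function(settings) {
  required <- c("n_studies", "mean_effect", "heterogeneity")
  absent   <- setdiff(required, names(settings))
  if (length(absent) > 0)
    stop("Missing required settings: ", paste(absent, collapse = ", "))

  check_scalar_numeric(settings[["n_studies"]], "n_studies", lower = 1)
  if (settings[["n_studies"]] %% 1 != 0)
    stop("'n_studies' must be a whole number")
  check_scalar_numeric(settings[["mean_effect"]], "mean_effect")
  check_scalar_numeric(settings[["heterogeneity"]], "heterogeneity", lower = 0)

  invisible(TRUE)
}

# Valid settings pass silently
validate_toy_settings(list(n_studies = 50, mean_effect = 0.2, heterogeneity = 0.1))

# A missing setting produces an informative error message
msg <- tryCatch(
  validate_toy_settings(list(n_studies = 50, mean_effect = 0.2)),
  error = function(e) conditionMessage(e)
)
msg
#> [1] "Missing required settings: heterogeneity"
```

Checking for missing names first keeps the later checks simple, because each extracted value is then known to exist.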

3. Conditions Function: `dgm_conditions.{DGM_NAME}()`

This function defines pre-specified conditions for benchmarking studies:

#' @export
dgm_conditions.no_bias <- function(dgm_name) {

  # Generate a grid of pre-specified settings
  settings <- data.frame(expand.grid(
    mean_effect    = c(0, 0.3),
    heterogeneity  = c(0, 0.15),
    n_studies      = c(10, 100)
  ))

  # Attach unique condition identifiers
  settings$condition_id <- seq_len(nrow(settings))

  return(settings)
}

Always add a condition_id column with unique identifiers. This column is used for generating data from the pre-defined conditions.

To ensure reproducibility and continuity of the benchmark, these settings must not be changed once they have been defined.
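To illustrate how a condition_id can stand in for a full settings list, the sketch below looks up one row of a conditions table and converts it to a named list. The helper get_condition_settings() is hypothetical and only illustrates the idea; how the package resolves condition identifiers internally is not shown here.

```r
# Toy conditions table with the same grid as the example above
toy_conditions <- expand.grid(
  mean_effect   = c(0, 0.3),
  heterogeneity = c(0, 0.15),
  n_studies     = c(10, 100)
)
toy_conditions$condition_id <- seq_len(nrow(toy_conditions))

# Hypothetical helper: returns the settings of one condition as a named list
get_condition_settings <- function(conditions, id) {
  row <- conditions[conditions$condition_id == id, , drop = FALSE]
  if (nrow(row) != 1)
    stop("No unique condition with condition_id = ", id)
  as.list(row[, setdiff(names(row), "condition_id"), drop = FALSE])
}

nrow(toy_conditions)
#> [1] 8

# The last condition combines the largest value of every setting
last_condition <- get_condition_settings(toy_conditions, 8)
last_condition$n_studies
#> [1] 100
```

Because the identifiers are derived from row order, reordering or editing the grid would silently change which settings a given condition_id refers to, which is one reason to leave the grid untouched once it is published.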

Using Your New DGM

Once implemented, your DGM can be used through a unified interface:

# Use with custom settings
data <- simulate_dgm("no_bias", list(
  mean_effect = 0.2,
  heterogeneity = 0.1,
  n_studies = 50
))
head(data)

# Use with pre-defined conditions
data <- simulate_dgm("no_bias", settings = 1)
head(data)

# View available conditions
conditions <- dgm_conditions("no_bias")
conditions