R/api-simulation.R

Defines functions plot.mfrm_signal_detection signal_detection_metric_label summary.mfrm_signal_detection evaluate_mfrm_signal_detection signal_eval_metric_col signal_eval_summary signal_eval_false_positive_rate signal_eval_get_p signal_eval_find_bias_row signal_eval_find_dif_row signal_eval_resolve_level recommend_mfrm_design plot.mfrm_design_evaluation summary.mfrm_design_evaluation evaluate_mfrm_design design_eval_summarize_results design_eval_build_notes design_eval_match_metric design_eval_recovery_metrics simulation_build_ademp simulation_recovery_contract simulation_resolve_fit_step_facet simulation_threshold_summary simulation_mcse_proportion simulation_mcse_mean design_eval_safe_sd design_eval_safe_mean design_eval_extract_truth simulation_apply_effects simulation_normalize_effects simulation_generate_skeleton_assignment simulation_generate_resampled_assignment simulation_sample_empirical_support simulation_center_numeric simulate_mfrm_data

Documented in evaluate_mfrm_design evaluate_mfrm_signal_detection plot.mfrm_design_evaluation plot.mfrm_signal_detection recommend_mfrm_design simulate_mfrm_data summary.mfrm_design_evaluation summary.mfrm_signal_detection

#' Simulate long-format many-facet Rasch data for design studies
#'
#' @param n_person Number of persons/respondents.
#' @param n_rater Number of rater facet levels.
#' @param n_criterion Number of criterion/item facet levels.
#' @param raters_per_person Number of raters assigned to each person.
#' @param score_levels Number of ordered score categories.
#' @param theta_sd Standard deviation of simulated person measures.
#' @param rater_sd Standard deviation of simulated rater severities.
#' @param criterion_sd Standard deviation of simulated criterion difficulties.
#' @param noise_sd Optional observation-level noise added to the linear predictor.
#' @param step_span Spread of step thresholds on the logit scale.
#' @param group_levels Optional character vector of group labels. When supplied,
#'   a balanced `Group` column is added to the simulated data.
#' @param dif_effects Optional data.frame describing true group-linked DIF
#'   effects. Must include `Group`, at least one design column such as
#'   `Criterion`, and numeric `Effect`.
#' @param interaction_effects Optional data.frame describing true non-group
#'   interaction effects. Must include at least one design column such as
#'   `Rater` or `Criterion`, plus numeric `Effect`.
#' @param seed Optional random seed.
#' @param model Measurement model recorded in the simulation setup.
#' @param step_facet Step facet used when `model = "PCM"` and threshold values
#'   vary across levels. Currently `"Criterion"` and `"Rater"` are supported.
#' @param thresholds Optional threshold specification. Use either a numeric
#'   vector of common thresholds or a data frame with columns `StepFacet`,
#'   `Step`/`StepIndex`, and `Estimate`.
#' @param assignment Assignment design. `"crossed"` means every person sees
#'   every rater; `"rotating"` uses a balanced rotating subset; `"resampled"`
#'   reuses person-level rater-assignment profiles stored in `sim_spec`;
#'   `"skeleton"` reuses an observed response skeleton stored in `sim_spec`,
#'   including optional `Group`/`Weight` columns when available. When omitted,
#'   the function chooses `"crossed"` if
#'   `raters_per_person == n_rater`, otherwise `"rotating"`.
#' @param sim_spec Optional output from [build_mfrm_sim_spec()] or
#'   [extract_mfrm_sim_spec()]. When supplied, it defines the generator setup;
#'   direct scalar arguments are treated as legacy inputs and should generally
#'   be left at their defaults except for `seed`.
#'
#' @details
#' This function generates synthetic MFRM data from the Rasch model.
#' The data-generating process is:
#'
#' 1. Draw person abilities: \eqn{\theta_n \sim N(0, \texttt{theta\_sd}^2)}
#' 2. Draw rater severities: \eqn{\delta_j \sim N(0, \texttt{rater\_sd}^2)}
#' 3. Draw criterion difficulties: \eqn{\beta_i \sim N(0, \texttt{criterion\_sd}^2)}
#' 4. Generate evenly-spaced step thresholds spanning \eqn{\pm}\code{step_span/2}
#' 5. For each observation, compute the linear predictor
#'    \eqn{\eta = \theta_n - \delta_j - \beta_i + \epsilon} where
#'    \eqn{\epsilon \sim N(0, \texttt{noise\_sd}^2)} (optional)
#' 6. Compute category probabilities under the recorded measurement model
#'    (`RSM` or `PCM`) and sample the response
#'
#' Latent-value generation is explicit:
#' - `latent_distribution = "normal"` draws centered normal person/rater/
#'   criterion values using the supplied standard deviations
#' - `latent_distribution = "empirical"` resamples centered support values
#'   recorded in `sim_spec$empirical_support`
#'
#' When `dif_effects` is supplied, the specified logit shift is added to
#' \eqn{\eta} for the focal group on the target facet level, creating a
#' known DIF signal.  Similarly, `interaction_effects` injects a known
#' bias into specific facet-level combinations.
#'
#' The generator targets the common two-facet rating design (persons
#' \eqn{\times} raters \eqn{\times} criteria).  `raters_per_person`
#' controls the incomplete-block structure: when less than `n_rater`,
#' each person is assigned a rotating subset of raters to keep coverage
#' balanced and reproducible.
#'
#' Threshold handling is intentionally explicit:
#' - if `thresholds = NULL`, common equally spaced thresholds are generated
#'   from `step_span`
#' - if `thresholds` is a numeric vector, it is used as one common threshold set
#' - if `thresholds` is a data frame, threshold values may vary by `StepFacet`
#'   (currently `Criterion` or `Rater`)
#'
#' Assignment handling is also explicit:
#' - `"crossed"` uses the full person x rater x criterion design
#' - `"rotating"` assigns a deterministic rotating subset of raters per person
#' - `"resampled"` reuses empirical person-level rater profiles stored in
#'   `sim_spec$assignment_profiles`, optionally carrying over person-level
#'   `Group`
#' - `"skeleton"` reuses an observed person-by-rater-by-criterion response
#'   skeleton stored in `sim_spec$design_skeleton`, optionally carrying over
#'   `Group` and `Weight`
#'
#' For more controlled workflows, build a reusable simulation specification
#' first via [build_mfrm_sim_spec()] or derive one from an observed fit with
#' [extract_mfrm_sim_spec()], then pass it through `sim_spec`.
#'
#' Returned data include attributes:
#' - `mfrm_truth`: simulated true parameters (for parameter-recovery checks)
#' - `mfrm_truth$signals`: injected DIF and interaction signal tables
#' - `mfrm_simulation_spec`: generation settings (for reproducibility)
#'
#' @section Interpreting output:
#' - Higher `theta` values in `mfrm_truth$person` indicate higher person measures.
#' - Higher values in `mfrm_truth$facets$Rater` indicate more severe raters.
#' - Higher values in `mfrm_truth$facets$Criterion` indicate more difficult criteria.
#' - `mfrm_truth$signals$dif_effects` and `mfrm_truth$signals$interaction_effects`
#'   record any injected detection targets.
#'
#' @section Typical workflow:
#' 1. Generate one design with `simulate_mfrm_data()`.
#' 2. Fit with [fit_mfrm()] and diagnose with [diagnose_mfrm()].
#' 3. For repeated design studies, use [evaluate_mfrm_design()].
#'
#' @return A long-format `data.frame` with core columns `Study`, `Person`,
#'   `Rater`, `Criterion`, and `Score`. If group labels are simulated or
#'   reused from an observed response skeleton, a `Group` column is included.
#'   If a weighted response skeleton is reused, a `Weight` column is also
#'   included.
#' @seealso [evaluate_mfrm_design()], [fit_mfrm()], [diagnose_mfrm()]
#' @examples
#' sim <- simulate_mfrm_data(
#'   n_person = 40,
#'   n_rater = 4,
#'   n_criterion = 4,
#'   raters_per_person = 2,
#'   seed = 123
#' )
#' head(sim)
#' names(attr(sim, "mfrm_truth"))
#' @export
simulate_mfrm_data <- function(n_person = 50,
                               n_rater = 4,
                               n_criterion = 4,
                               raters_per_person = n_rater,
                               score_levels = 4,
                               theta_sd = 1,
                               rater_sd = 0.35,
                               criterion_sd = 0.25,
                               noise_sd = 0,
                               step_span = 1.4,
                               group_levels = NULL,
                               dif_effects = NULL,
                               interaction_effects = NULL,
                               seed = NULL,
                               model = c("RSM", "PCM"),
                               step_facet = "Criterion",
                               thresholds = NULL,
                               assignment = NULL,
                               sim_spec = NULL) {
  if (!is.null(sim_spec)) {
    if (!inherits(sim_spec, "mfrm_sim_spec")) {
      stop("`sim_spec` must be output from build_mfrm_sim_spec() or extract_mfrm_sim_spec().", call. = FALSE)
    }
    n_person <- sim_spec$n_person
    n_rater <- sim_spec$n_rater
    n_criterion <- sim_spec$n_criterion
    raters_per_person <- sim_spec$raters_per_person
    score_levels <- sim_spec$score_levels
    theta_sd <- sim_spec$theta_sd
    rater_sd <- sim_spec$rater_sd
    criterion_sd <- sim_spec$criterion_sd
    noise_sd <- sim_spec$noise_sd
    step_span <- sim_spec$step_span
    group_levels <- sim_spec$group_levels
    dif_effects <- sim_spec$dif_effects
    interaction_effects <- sim_spec$interaction_effects
    model <- sim_spec$model
    step_facet <- sim_spec$step_facet
    thresholds <- sim_spec$threshold_table
    assignment <- sim_spec$assignment
    latent_distribution <- as.character(sim_spec$latent_distribution %||% "normal")
    empirical_support <- sim_spec$empirical_support %||% NULL
    assignment_profiles <- sim_spec$assignment_profiles %||% NULL
    design_skeleton <- sim_spec$design_skeleton %||% NULL
  } else {
    n_person <- as.integer(n_person[1])
    n_rater <- as.integer(n_rater[1])
    n_criterion <- as.integer(n_criterion[1])
    raters_per_person <- as.integer(raters_per_person[1])
    score_levels <- as.integer(score_levels[1])
    model <- match.arg(toupper(as.character(model[1])), c("RSM", "PCM"))
    step_facet <- as.character(step_facet[1] %||% "Criterion")
    assignment <- if (is.null(assignment)) {
      if (identical(raters_per_person, n_rater)) "crossed" else "rotating"
    } else {
      match.arg(tolower(as.character(assignment[1])), c("crossed", "rotating"))
    }
    thresholds <- simulation_build_threshold_table(
      thresholds = thresholds,
      score_levels = score_levels,
      step_span = as.numeric(step_span[1]),
      model = model
    )
    latent_distribution <- "normal"
    empirical_support <- NULL
    assignment_profiles <- NULL
    design_skeleton <- NULL
  }

  if (!is.finite(n_person) || n_person < 2L) stop("`n_person` must be >= 2.", call. = FALSE)
  if (!is.finite(n_rater) || n_rater < 2L) stop("`n_rater` must be >= 2.", call. = FALSE)
  if (!is.finite(n_criterion) || n_criterion < 2L) stop("`n_criterion` must be >= 2.", call. = FALSE)
  if (!is.finite(raters_per_person) || raters_per_person < 1L) stop("`raters_per_person` must be >= 1.", call. = FALSE)
  if (raters_per_person > n_rater) stop("`raters_per_person` cannot exceed `n_rater`.", call. = FALSE)
  if (!is.finite(score_levels) || score_levels < 2L) stop("`score_levels` must be >= 2.", call. = FALSE)
  if (!step_facet %in% c("Criterion", "Rater")) {
    stop("`step_facet` must currently be either \"Criterion\" or \"Rater\" for simulation.", call. = FALSE)
  }
  if (!identical(assignment, "crossed") && !identical(assignment, "rotating") &&
      !identical(assignment, "resampled") && !identical(assignment, "skeleton")) {
    stop("`assignment` must be one of \"crossed\", \"rotating\", \"resampled\", or \"skeleton\".", call. = FALSE)
  }
  if (identical(assignment, "crossed") && raters_per_person != n_rater) {
    stop("`assignment = \"crossed\"` requires `raters_per_person == n_rater`.", call. = FALSE)
  }
  if (!identical(latent_distribution, "normal") && !identical(latent_distribution, "empirical")) {
    stop("`latent_distribution` must be either \"normal\" or \"empirical\".", call. = FALSE)
  }
  if (identical(latent_distribution, "empirical") && is.null(empirical_support)) {
    stop("`latent_distribution = \"empirical\"` requires empirical support values in `sim_spec`.", call. = FALSE)
  }
  if (identical(assignment, "resampled") && is.null(assignment_profiles)) {
    stop("`assignment = \"resampled\"` requires assignment profiles in `sim_spec`.", call. = FALSE)
  }
  if (identical(assignment, "skeleton") && is.null(design_skeleton)) {
    stop("`assignment = \"skeleton\"` requires a design skeleton in `sim_spec`.", call. = FALSE)
  }

  if (!is.null(group_levels)) {
    group_levels <- as.character(group_levels)
    group_levels <- group_levels[!is.na(group_levels) & nzchar(group_levels)]
    group_levels <- unique(group_levels)
    if (length(group_levels) < 1L) stop("`group_levels` must contain at least one non-empty label.", call. = FALSE)
  }

  with_preserved_rng_seed(seed, {
    person_ids <- sprintf("P%03d", seq_len(n_person))
    rater_ids <- if (identical(assignment, "skeleton") && !is.null(design_skeleton)) {
      sort(unique(as.character(design_skeleton$Rater)))
    } else if (identical(assignment, "resampled") && !is.null(assignment_profiles)) {
      sort(unique(as.character(assignment_profiles$Rater)))
    } else {
      sprintf("R%02d", seq_len(n_rater))
    }
    criterion_ids <- if (identical(assignment, "skeleton") && !is.null(design_skeleton)) {
      sort(unique(as.character(design_skeleton$Criterion)))
    } else {
      sprintf("C%02d", seq_len(n_criterion))
    }

    theta <- stats::rnorm(n_person, mean = 0, sd = as.numeric(theta_sd[1]))
    names(theta) <- person_ids

    rater_effects <- sort(stats::rnorm(n_rater, mean = 0, sd = as.numeric(rater_sd[1])))
    names(rater_effects) <- rater_ids

    criterion_effects <- sort(stats::rnorm(n_criterion, mean = 0, sd = as.numeric(criterion_sd[1])))
    names(criterion_effects) <- criterion_ids

    if (identical(latent_distribution, "empirical")) {
      theta <- simulation_sample_empirical_support(empirical_support$person, n_person)
      names(theta) <- person_ids
      rater_effects <- simulation_sample_empirical_support(empirical_support$rater, n_rater)
      names(rater_effects) <- rater_ids
      criterion_effects <- simulation_sample_empirical_support(empirical_support$criterion, n_criterion)
      names(criterion_effects) <- criterion_ids
    }

    if (score_levels == 2L) {
      steps <- thresholds$Estimate[thresholds$StepFacet %in% c("Common", unique(thresholds$StepFacet)[1])][1]
    }

    if (assignment == "crossed") {
      dat <- expand.grid(
        Person = person_ids,
        Rater = rater_ids,
        Criterion = criterion_ids,
        stringsAsFactors = FALSE
      )
    } else if (assignment == "skeleton") {
      dat <- simulation_generate_skeleton_assignment(
        person_ids = person_ids,
        design_skeleton = design_skeleton
      )
    } else if (assignment == "resampled") {
      dat <- simulation_generate_resampled_assignment(
        person_ids = person_ids,
        criterion_ids = criterion_ids,
        assignment_profiles = assignment_profiles
      )
    } else {
      rows <- vector("list", length(person_ids))
      for (i in seq_along(person_ids)) {
        assigned <- rater_ids[((i - 1L) + seq_len(raters_per_person) - 1L) %% n_rater + 1L]
        rows[[i]] <- expand.grid(
          Person = person_ids[i],
          Rater = assigned,
          Criterion = criterion_ids,
          stringsAsFactors = FALSE
        )
      }
      dat <- dplyr::bind_rows(rows)
    }

    threshold_table <- tibble::as_tibble(thresholds) |>
      dplyr::mutate(
        StepFacet = as.character(.data$StepFacet),
        StepIndex = as.integer(.data$StepIndex),
        Step = as.character(.data$Step),
        Estimate = as.numeric(.data$Estimate)
      ) |>
      dplyr::arrange(.data$StepFacet, .data$StepIndex)
    threshold_lookup <- split(threshold_table$Estimate, threshold_table$StepFacet)
    if (!"Common" %in% names(threshold_lookup) && identical(model, "RSM")) {
      stop("RSM simulation requires one common threshold set.", call. = FALSE)
    }

    if ("Group" %in% names(dat)) {
      group_assign_tbl <- dat |>
        dplyr::distinct(.data$Person, .data$Group)
      group_assign <- stats::setNames(as.character(group_assign_tbl$Group),
                                      as.character(group_assign_tbl$Person))
    } else if (!is.null(group_levels)) {
      group_assign <- rep(group_levels, length.out = n_person)
      group_assign <- sample(group_assign, size = n_person, replace = FALSE)
      names(group_assign) <- person_ids
      dat$Group <- unname(group_assign[dat$Person])
    } else {
      group_assign <- NULL
    }

    allowed_effect_cols <- intersect(c("Group", "Person", "Rater", "Criterion"), names(dat))
    dif_effects <- simulation_normalize_effects(
      effects = dif_effects,
      arg_name = "dif_effects",
      allowed_cols = allowed_effect_cols
    )
    if (nrow(dif_effects) > 0) {
      if (!"Group" %in% names(dif_effects)) {
        stop("`dif_effects` must include a `Group` column.", call. = FALSE)
      }
      if (!any(c("Person", "Rater", "Criterion") %in% names(dif_effects))) {
        stop("`dif_effects` must include at least one of `Person`, `Rater`, or `Criterion`.", call. = FALSE)
      }
    }

    interaction_effects <- simulation_normalize_effects(
      effects = interaction_effects,
      arg_name = "interaction_effects",
      allowed_cols = allowed_effect_cols
    )
    if (nrow(interaction_effects) > 0 && !any(c("Person", "Rater", "Criterion") %in% names(interaction_effects))) {
      stop("`interaction_effects` must include at least one of `Person`, `Rater`, or `Criterion`.", call. = FALSE)
    }

    eta <- theta[dat$Person] - rater_effects[dat$Rater] - criterion_effects[dat$Criterion]
    eta <- eta + simulation_apply_effects(dat, dif_effects)
    eta <- eta + simulation_apply_effects(dat, interaction_effects)
    if (isTRUE(is.finite(noise_sd)) && as.numeric(noise_sd[1]) > 0) {
      eta <- eta + stats::rnorm(length(eta), mean = 0, sd = as.numeric(noise_sd[1]))
    }

    threshold_key <- if ("Common" %in% names(threshold_lookup)) {
      rep("Common", nrow(dat))
    } else {
      as.character(dat[[step_facet]])
    }
    if (!all(threshold_key %in% names(threshold_lookup))) {
      missing_keys <- setdiff(unique(threshold_key), names(threshold_lookup))
      stop(
        "Threshold specification is missing step-facet levels required for simulation: ",
        paste(missing_keys, collapse = ", "),
        ".",
        call. = FALSE
      )
    }
    if (identical(model, "PCM")) {
      step_levels <- if ("Common" %in% names(threshold_lookup)) {
        sort(unique(as.character(dat[[step_facet]])))
      } else {
        unique(as.character(threshold_table$StepFacet))
      }
      step_cum_mat <- t(vapply(step_levels, function(level) {
        step_vec <- if ("Common" %in% names(threshold_lookup)) {
          threshold_lookup[["Common"]]
        } else {
          threshold_lookup[[level]]
        }
        c(0, cumsum(step_vec))
      }, numeric(score_levels)))
      criterion_idx <- match(as.character(dat[[step_facet]]), step_levels)
      if (anyNA(criterion_idx)) {
        stop("PCM simulation could not align observations to step-facet thresholds.", call. = FALSE)
      }
      criterion_splits <- split(seq_along(criterion_idx), criterion_idx)
      prob_mat <- category_prob_pcm(
        eta = eta,
        step_cum_mat = step_cum_mat,
        criterion_idx = criterion_idx,
        criterion_splits = criterion_splits
      )
    } else {
      step_cum <- c(0, cumsum(threshold_lookup[["Common"]]))
      prob_mat <- category_prob_rsm(eta, step_cum)
    }
    dat$Score <- apply(prob_mat, 1, function(p) sample.int(score_levels, size = 1L, prob = p))
    dat$Study <- "SimulatedDesign"
    keep_cols <- c("Study", "Person", "Rater", "Criterion", "Score")
    if ("Group" %in% names(dat)) keep_cols <- c(keep_cols, "Group")
    if ("Weight" %in% names(dat)) keep_cols <- c(keep_cols, "Weight")
    dat <- dat[, keep_cols]

    attr(dat, "mfrm_truth") <- list(
      person = theta,
      facets = list(
        Rater = rater_effects,
        Criterion = criterion_effects
      ),
      steps = if ("Common" %in% names(threshold_lookup)) threshold_lookup$Common else threshold_table,
      step_table = threshold_table,
      groups = group_assign,
      signals = list(
        dif_effects = dif_effects,
        interaction_effects = interaction_effects
      )
    )
    attr(dat, "mfrm_simulation_spec") <- list(
      n_person = n_person,
      n_rater = n_rater,
      n_criterion = n_criterion,
      raters_per_person = raters_per_person,
      score_levels = score_levels,
      theta_sd = theta_sd,
      rater_sd = rater_sd,
      criterion_sd = criterion_sd,
      noise_sd = noise_sd,
      step_span = step_span,
      group_levels = group_levels,
      model = model,
      step_facet = step_facet,
      assignment = assignment,
      latent_distribution = latent_distribution,
      threshold_table = threshold_table,
      design_skeleton = design_skeleton
    )
    dat
  })
}

simulation_center_numeric <- function(x) {
  x <- suppressWarnings(as.numeric(x))
  x <- x[is.finite(x)]
  if (length(x) == 0) return(numeric(0))
  x - mean(x)
}

simulation_sample_empirical_support <- function(values, size) {
  centered <- simulation_center_numeric(values)
  if (length(centered) < 2L) {
    stop("Empirical latent support must contain at least two finite values.", call. = FALSE)
  }
  sample(centered, size = size, replace = TRUE)
}

simulation_generate_resampled_assignment <- function(person_ids,
                                                     criterion_ids,
                                                     assignment_profiles) {
  if (!is.data.frame(assignment_profiles) || nrow(assignment_profiles) == 0) {
    stop("Empirical assignment profiles are required for `assignment = \"resampled\"`.",
         call. = FALSE)
  }

  profiles <- tibble::as_tibble(assignment_profiles) |>
    dplyr::distinct(.data$TemplatePerson, .data$Rater) |>
    dplyr::arrange(.data$TemplatePerson, .data$Rater)
  group_map <- NULL
  if ("Group" %in% names(assignment_profiles)) {
    group_map <- tibble::as_tibble(assignment_profiles) |>
      dplyr::filter(!is.na(.data$Group), nzchar(.data$Group)) |>
      dplyr::distinct(.data$TemplatePerson, .data$Group)
  }
  template_people <- unique(profiles$TemplatePerson)
  sampled_templates <- sample(template_people, size = length(person_ids), replace = TRUE)

  rows <- vector("list", length(person_ids))
  for (i in seq_along(person_ids)) {
    assigned <- profiles |>
      dplyr::filter(.data$TemplatePerson == sampled_templates[i]) |>
      dplyr::pull(.data$Rater)
    rows[[i]] <- expand.grid(
      Person = person_ids[i],
      Rater = assigned,
      Criterion = criterion_ids,
      stringsAsFactors = FALSE
    )
    if (!is.null(group_map)) {
      matched_group <- group_map$Group[group_map$TemplatePerson == sampled_templates[i]][1]
      if (is.character(matched_group) && nzchar(matched_group)) {
        rows[[i]]$Group <- matched_group
      }
    }
  }
  dplyr::bind_rows(rows)
}

simulation_generate_skeleton_assignment <- function(person_ids, design_skeleton) {
  if (!is.data.frame(design_skeleton) || nrow(design_skeleton) == 0) {
    stop("Observed design skeleton is required for `assignment = \"skeleton\"`.",
         call. = FALSE)
  }

  skeleton <- tibble::as_tibble(design_skeleton) |>
    dplyr::distinct()
  template_people <- unique(skeleton$TemplatePerson)
  sampled_templates <- sample(template_people, size = length(person_ids), replace = TRUE)
  keep_group <- "Group" %in% names(skeleton)
  keep_weight <- "Weight" %in% names(skeleton)

  rows <- vector("list", length(person_ids))
  for (i in seq_along(person_ids)) {
    template_rows <- skeleton |>
      dplyr::filter(.data$TemplatePerson == sampled_templates[i]) |>
      dplyr::mutate(Person = person_ids[i], .before = 1)
    select_cols <- c("Person", "Rater", "Criterion")
    if (keep_group) select_cols <- c(select_cols, "Group")
    if (keep_weight) select_cols <- c(select_cols, "Weight")
    rows[[i]] <- dplyr::select(template_rows, dplyr::all_of(select_cols))
  }
  dplyr::bind_rows(rows)
}

simulation_normalize_effects <- function(effects, arg_name, allowed_cols) {
  if (is.null(effects)) return(tibble::tibble())
  if (!is.data.frame(effects)) {
    stop("`", arg_name, "` must be a data.frame with an `Effect` column.", call. = FALSE)
  }
  eff <- tibble::as_tibble(effects)
  if (nrow(eff) == 0L && ncol(eff) == 0L) {
    return(tibble::tibble())
  }
  if (!"Effect" %in% names(eff)) {
    stop("`", arg_name, "` must include an `Effect` column.", call. = FALSE)
  }
  unknown <- setdiff(names(eff), c(allowed_cols, "Effect"))
  if (length(unknown) > 0) {
    stop("`", arg_name, "` includes unsupported columns: ", paste(unknown, collapse = ", "), call. = FALSE)
  }
  key_cols <- setdiff(names(eff), "Effect")
  if (length(key_cols) == 0) {
    stop("`", arg_name, "` must include at least one matching design column besides `Effect`.", call. = FALSE)
  }
  eff$Effect <- suppressWarnings(as.numeric(eff$Effect))
  if (any(!is.finite(eff$Effect))) {
    stop("`", arg_name, "` has non-finite values in `Effect`.", call. = FALSE)
  }
  eff
}

simulation_apply_effects <- function(dat, effects) {
  if (is.null(effects) || !is.data.frame(effects) || nrow(effects) == 0) {
    return(rep(0, nrow(dat)))
  }
  key_cols <- setdiff(names(effects), "Effect")
  adj <- numeric(nrow(dat))
  for (i in seq_len(nrow(effects))) {
    mask <- rep(TRUE, nrow(dat))
    for (col in key_cols) {
      value <- effects[[col]][i]
      if (is.na(value) || !nzchar(as.character(value))) next
      mask <- mask & as.character(dat[[col]]) == as.character(value)
    }
    adj[mask] <- adj[mask] + as.numeric(effects$Effect[i])
  }
  adj
}

design_eval_extract_truth <- function(truth, facet) {
  if (is.null(truth) || !is.list(truth)) return(NULL)
  if (identical(facet, "Person")) return(truth$person)
  if (is.null(truth$facets) || !facet %in% names(truth$facets)) return(NULL)
  truth$facets[[facet]]
}

design_eval_safe_mean <- function(x) {
  x <- suppressWarnings(as.numeric(x))
  if (length(x) == 0 || all(is.na(x))) return(NA_real_)
  mean(x, na.rm = TRUE)
}

design_eval_safe_sd <- function(x) {
  x <- suppressWarnings(as.numeric(x))
  x <- x[is.finite(x)]
  if (length(x) <= 1L) return(NA_real_)
  stats::sd(x)
}

simulation_mcse_mean <- function(x) {
  x <- suppressWarnings(as.numeric(x))
  x <- x[is.finite(x)]
  n <- length(x)
  if (n <= 1L) return(NA_real_)
  stats::sd(x) / sqrt(n)
}

simulation_mcse_proportion <- function(x) {
  x <- suppressWarnings(as.numeric(x))
  x <- x[!is.na(x)]
  n <- length(x)
  if (n <= 1L) return(NA_real_)
  p <- mean(x, na.rm = TRUE)
  sqrt(p * (1 - p) / n)
}

simulation_threshold_summary <- function(sim_spec) {
  if (is.null(sim_spec)) {
    return(list(mode = "generated_common", step_facet = NA_character_))
  }
  list(
    mode = simulation_spec_threshold_mode(sim_spec),
    step_facet = as.character(sim_spec$step_facet %||% NA_character_)
  )
}

simulation_resolve_fit_step_facet <- function(model, step_facet, generator_step_facet) {
  if (!identical(model, "PCM")) {
    return(NA_character_)
  }
  explicit <- as.character(step_facet[1] %||% NA_character_)
  if (is.na(explicit) || !nzchar(explicit)) {
    inherited <- as.character(generator_step_facet[1] %||% NA_character_)
    if (!is.na(inherited) && nzchar(inherited)) {
      return(inherited)
    }
    return("Criterion")
  }
  explicit
}

simulation_recovery_contract <- function(generator_model,
                                         generator_step_facet,
                                         fitted_model,
                                         fitted_step_facet) {
  same_model <- identical(as.character(generator_model), as.character(fitted_model))
  if (!same_model) {
    return(list(
      comparable = FALSE,
      basis = "generator_fit_model_mismatch"
    ))
  }
  if (identical(as.character(fitted_model), "PCM")) {
    same_step_facet <- identical(
      as.character(generator_step_facet %||% NA_character_),
      as.character(fitted_step_facet %||% NA_character_)
    )
    if (!same_step_facet) {
      return(list(
        comparable = FALSE,
        basis = "generator_fit_step_facet_mismatch"
      ))
    }
  }
  list(
    comparable = TRUE,
    basis = "generator_fit_contract_aligned"
  )
}

simulation_build_ademp <- function(purpose,
                                   design_grid,
                                   generator_model,
                                   generator_step_facet,
                                   generator_assignment,
                                   sim_spec = NULL,
                                   estimands,
                                   analysis_methods,
                                   performance_measures) {
  threshold_info <- simulation_threshold_summary(sim_spec)
  list(
    aims = purpose,
    data_generating_mechanism = list(
      source = if (is.null(sim_spec)) "scalar_arguments" else as.character(sim_spec$source %||% "mfrm_sim_spec"),
      model = generator_model,
      step_facet = generator_step_facet,
      assignment = generator_assignment,
      latent_distribution = if (is.null(sim_spec)) "normal" else as.character(sim_spec$latent_distribution %||% "normal"),
      threshold_mode = threshold_info$mode,
      threshold_step_facet = threshold_info$step_facet,
      design_variables = names(design_grid)[names(design_grid) %in% c("n_person", "n_rater", "n_criterion", "raters_per_person")]
    ),
    estimands = estimands,
    methods = analysis_methods,
    performance_measures = performance_measures
  )
}

design_eval_recovery_metrics <- function(est_levels, est_values, truth_vec) {
  idx <- match(as.character(est_levels), names(truth_vec))
  ok <- is.finite(idx) & is.finite(est_values)
  if (!any(ok)) {
    return(list(
      raw_rmse = NA_real_,
      raw_bias = NA_real_,
      aligned_rmse = NA_real_,
      aligned_bias = NA_real_
    ))
  }

  truth_vals <- as.numeric(truth_vec[idx[ok]])
  est_vals <- as.numeric(est_values[ok])
  diffs <- est_vals - truth_vals
  shift <- mean(diffs, na.rm = TRUE)
  aligned_diffs <- diffs - shift

  list(
    raw_rmse = sqrt(mean(diffs^2, na.rm = TRUE)),
    raw_bias = mean(diffs, na.rm = TRUE),
    aligned_rmse = sqrt(mean(aligned_diffs^2, na.rm = TRUE)),
    aligned_bias = mean(aligned_diffs, na.rm = TRUE)
  )
}

design_eval_match_metric <- function(metric) {
  switch(
    metric,
    separation = "MeanSeparation",
    reliability = "MeanReliability",
    infit = "MeanInfit",
    outfit = "MeanOutfit",
    misfitrate = "MeanMisfitRate",
    severityrmse = "MeanSeverityRMSE",
    severitybias = "MeanSeverityBias",
    convergencerate = "ConvergenceRate",
    elapsedsec = "MeanElapsedSec",
    mincategorycount = "MeanMinCategoryCount",
    stop("Unknown metric: ", metric, call. = FALSE)
  )
}

design_eval_build_notes <- function(summary_tbl) {
  notes <- character(0)
  if (!is.data.frame(summary_tbl) || nrow(summary_tbl) == 0) return(notes)
  if (any(summary_tbl$ConvergenceRate < 1, na.rm = TRUE)) {
    notes <- c(notes, "Some design conditions did not converge in every replication.")
  }
  if (any(summary_tbl$MeanMinCategoryCount < 10, na.rm = TRUE)) {
    notes <- c(notes, "Some design conditions produced sparse score categories (< 10 observations).")
  }
  if (any(summary_tbl$MeanSeparation < 2, na.rm = TRUE)) {
    notes <- c(notes, "Some design conditions yielded low facet separation (< 2.0).")
  }
  if (any(grepl("^Mcse", names(summary_tbl)))) {
    notes <- c(notes, "MCSE columns summarize finite-replication uncertainty around the reported means and rates.")
  }
  if ("RecoveryComparableRate" %in% names(summary_tbl) &&
      any(summary_tbl$RecoveryComparableRate < 1, na.rm = TRUE)) {
    notes <- c(
      notes,
      "Recovery metrics are reported only for design rows where generator and fitted model contracts align; rows with generator-fit mismatches set recovery fields to NA."
    )
  }
  notes
}

design_eval_summarize_results <- function(results, rep_overview) {
  results_tbl <- tibble::as_tibble(results)
  rep_tbl <- tibble::as_tibble(rep_overview)

  design_summary <- tibble::tibble()
  if (nrow(results_tbl) > 0) {
    design_summary <- results_tbl |>
      dplyr::group_by(
        .data$design_id,
        .data$Facet,
        .data$n_person,
        .data$n_rater,
        .data$n_criterion,
        .data$raters_per_person
      ) |>
      dplyr::summarize(
        Reps = dplyr::n(),
        ConvergenceRate = mean(.data$Converged, na.rm = TRUE),
        McseConvergenceRate = simulation_mcse_proportion(.data$Converged),
        MeanSeparation = mean(.data$Separation, na.rm = TRUE),
        SdSeparation = design_eval_safe_sd(.data$Separation),
        McseSeparation = simulation_mcse_mean(.data$Separation),
        MeanReliability = mean(.data$Reliability, na.rm = TRUE),
        McseReliability = simulation_mcse_mean(.data$Reliability),
        MeanInfit = mean(.data$MeanInfit, na.rm = TRUE),
        McseInfit = simulation_mcse_mean(.data$MeanInfit),
        MeanOutfit = mean(.data$MeanOutfit, na.rm = TRUE),
        McseOutfit = simulation_mcse_mean(.data$MeanOutfit),
        MeanMisfitRate = mean(.data$MisfitRate, na.rm = TRUE),
        McseMisfitRate = simulation_mcse_mean(.data$MisfitRate),
        MeanSeverityRMSE = mean(.data$SeverityRMSE, na.rm = TRUE),
        McseSeverityRMSE = simulation_mcse_mean(.data$SeverityRMSE),
        MeanSeverityBias = mean(.data$SeverityBias, na.rm = TRUE),
        McseSeverityBias = simulation_mcse_mean(.data$SeverityBias),
        MeanSeverityRMSERaw = mean(.data$SeverityRMSERaw, na.rm = TRUE),
        McseSeverityRMSERaw = simulation_mcse_mean(.data$SeverityRMSERaw),
        MeanSeverityBiasRaw = mean(.data$SeverityBiasRaw, na.rm = TRUE),
        McseSeverityBiasRaw = simulation_mcse_mean(.data$SeverityBiasRaw),
        RecoveryComparableRate = mean(.data$RecoveryComparable, na.rm = TRUE),
        RecoveryBasis = paste(sort(unique(as.character(.data$RecoveryBasis))), collapse = "; "),
        MeanElapsedSec = mean(.data$ElapsedSec, na.rm = TRUE),
        McseElapsedSec = simulation_mcse_mean(.data$ElapsedSec),
        MeanMinCategoryCount = mean(.data$MinCategoryCount, na.rm = TRUE),
        McseMinCategoryCount = simulation_mcse_mean(.data$MinCategoryCount),
        .groups = "drop"
      ) |>
      dplyr::arrange(.data$Facet, .data$n_person, .data$n_rater, .data$n_criterion, .data$raters_per_person)
  }

  overview <- tibble::tibble(
    Designs = dplyr::n_distinct(rep_tbl$design_id),
    Replications = nrow(rep_tbl),
    SuccessfulRuns = sum(rep_tbl$RunOK, na.rm = TRUE),
    ConvergedRuns = sum(rep_tbl$Converged, na.rm = TRUE),
    MeanElapsedSec = design_eval_safe_mean(rep_tbl$ElapsedSec)
  )

  list(
    overview = overview,
    design_summary = design_summary,
    notes = design_eval_build_notes(design_summary)
  )
}

#' Evaluate MFRM design conditions by repeated simulation
#'
#' @param n_person Vector of person counts to evaluate.
#' @param n_rater Vector of rater counts to evaluate.
#' @param n_criterion Vector of criterion counts to evaluate.
#' @param raters_per_person Vector of rater assignments per person.
#' @param reps Number of replications per design condition.
#' @param score_levels Number of ordered score categories.
#' @param theta_sd Standard deviation of simulated person measures.
#' @param rater_sd Standard deviation of simulated rater severities.
#' @param criterion_sd Standard deviation of simulated criterion difficulties.
#' @param noise_sd Optional observation-level noise added to the linear predictor.
#' @param step_span Spread of step thresholds on the logit scale.
#' @param fit_method Estimation method passed to [fit_mfrm()].
#' @param model Measurement model passed to [fit_mfrm()].
#' @param step_facet Step facet passed to [fit_mfrm()] when `model = "PCM"`.
#'   When left `NULL`, the function inherits the generator step facet from
#'   `sim_spec` when available and otherwise defaults to `"Criterion"`.
#' @param maxit Maximum iterations passed to [fit_mfrm()].
#' @param quad_points Quadrature points for `fit_method = "MML"`.
#' @param residual_pca Residual PCA mode passed to [diagnose_mfrm()].
#' @param sim_spec Optional output from [build_mfrm_sim_spec()] or
#'   [extract_mfrm_sim_spec()] used as the base data-generating mechanism.
#'   When supplied, the design grid still varies `n_person`, `n_rater`,
#'   `n_criterion`, and `raters_per_person`, but latent-spread assumptions,
#'   thresholds, and other generator settings come from `sim_spec`.
#'   If `sim_spec` contains step-facet-specific thresholds, the design grid may
#'   not vary the number of levels for that step facet away from the
#'   specification.
#' @param seed Optional seed for reproducible replications.
#'
#' @details
#' This helper runs a compact Monte Carlo design study for common rater-by-item
#' many-facet settings.
#'
#' For each design condition, the function:
#' 1. generates synthetic data with [simulate_mfrm_data()]
#' 2. fits the requested MFRM with [fit_mfrm()]
#' 3. computes diagnostics with [diagnose_mfrm()]
#' 4. stores recovery and precision summaries by facet
#'
#' The result is intended for planning questions such as:
#' - how many raters are needed for stable rater separation?
#' - how does `raters_per_person` affect severity recovery?
#' - when do category counts become too sparse for comfortable interpretation?
#'
#' This is a **parametric simulation study**. It does not take one observed
#' design (for example, 4 raters x 30 persons x 3 criteria) and analytically
#' extrapolate what would happen under a different design (for example,
#' 2 raters x 40 persons x 5 criteria). Instead, you specify a design grid and
#' data-generating assumptions (latent spread, facet spread, thresholds, noise,
#' and scoring structure), and the function repeatedly generates synthetic data
#' under those assumptions.
#'
#' When you want the simulated conditions to resemble an existing study, use
#' substantive knowledge or estimates from that study to choose
#' `theta_sd`, `rater_sd`, `criterion_sd`, `score_levels`, and related
#' settings before running the design evaluation.
#'
#' When `sim_spec` is supplied, the function uses it as the explicit
#' data-generating mechanism. This is the recommended route when you want a
#' design study to stay close to a previously fitted run while still varying the
#' candidate sample sizes or rater-assignment counts.
#'
#' Recovery metrics are reported only when the generator and fitted model target
#' the same facet-parameter contract. In practice this means the same
#' `model`, and for `PCM`, the same `step_facet`. When these do not align,
#' recovery fields are set to `NA` and the output records the reason.
#'
#' @section Reported metrics:
#' Facet-level simulation results include:
#' - `Separation` (\eqn{G = \mathrm{SD_{adj}} / \mathrm{RMSE}}):
#'   how many statistically distinct strata the facet resolves.
#' - `Reliability` (\eqn{G^2 / (1 + G^2)}): analogous to Cronbach's
#'   \eqn{\alpha} for the reproducibility of element ordering.
#' - `Strata` (\eqn{(4G + 1) / 3}): number of distinguishable groups.
#' - Mean `Infit` and `Outfit`: average fit mean-squares across elements.
#' - `MisfitRate`: share of elements with \eqn{|\mathrm{ZSTD}| > 2}.
#' - `SeverityRMSE`: root-mean-square error of recovered parameters vs
#'   the known truth **after facet-wise mean alignment**, so that the
#'   usual Rasch/MFRM location indeterminacy does not inflate recovery
#'   error. This quantity is reported only when the generator and fitted model
#'   target the same facet-parameter contract.
#' - `SeverityBias`: mean signed recovery error after the same alignment;
#'   values near zero are expected. This is likewise omitted when the
#'   generator/fitted-model contract does not align.
#'
#' @section Interpreting output:
#' Start with `summary(x)$design_summary`, then plot one focal metric at a time
#' (for example rater `Separation` or criterion `SeverityRMSE`).
#'
#' Higher separation/reliability is generally better, whereas lower
#' `SeverityRMSE`, `MeanMisfitRate`, and `MeanElapsedSec` are preferable.
#'
#' When choosing among designs, look for the point where increasing
#' `n_person` or `raters_per_person` yields diminishing returns in
#' separation and RMSE---this identifies the cost-effective design
#' frontier.  `ConvergedRuns / reps` should be near 1.0; low
#' convergence rates indicate the design is too small for the chosen
#' estimation method.
#'
#' @section References:
#' The simulation logic follows the general Monte Carlo / operating-characteristic
#' framework described by Morris, White, and Crowther (2019) and the
#' ADEMP-oriented planning/reporting guidance summarized for psychology by
#' Siepe et al. (2024). In `mfrmr`, `evaluate_mfrm_design()` is a practical
#' many-facet design-planning wrapper rather than a direct reproduction of one
#' published simulation study.
#'
#' - Morris, T. P., White, I. R., & Crowther, M. J. (2019).
#'   *Using simulation studies to evaluate statistical methods*.
#'   Statistics in Medicine, 38(11), 2074-2102.
#' - Siepe, B. S., Bartos, F., Morris, T. P., Boulesteix, A.-L., Heck, D. W.,
#'   & Pawel, S. (2024). *Simulation studies for methodological research in
#'   psychology: A standardized template for planning, preregistration, and
#'   reporting*. Psychological Methods.
#'
#' @return An object of class `mfrm_design_evaluation` with components:
#' - `design_grid`: evaluated design conditions
#' - `results`: facet-level replicate results
#' - `rep_overview`: run-level status and timing
#' - `settings`: simulation settings
#' - `ademp`: simulation-study metadata (aims, DGM, estimands, methods, performance measures)
#' @seealso [simulate_mfrm_data()], [summary.mfrm_design_evaluation], [plot.mfrm_design_evaluation]
#' @examples
#' \donttest{
#' sim_eval <- evaluate_mfrm_design(
#'   n_person = c(30, 50),
#'   n_rater = 4,
#'   n_criterion = 4,
#'   raters_per_person = 2,
#'   reps = 1,
#'   maxit = 15,
#'   seed = 123
#' )
#' s_eval <- summary(sim_eval)
#' s_eval$design_summary[, c("Facet", "n_person", "MeanSeparation", "MeanSeverityRMSE")]
#' p_eval <- plot(sim_eval, facet = "Rater", metric = "separation", x_var = "n_person", draw = FALSE)
#' names(p_eval)
#' }
#' @export
evaluate_mfrm_design <- function(n_person = c(30, 50, 100),
                                 n_rater = c(3, 5),
                                 n_criterion = c(3, 5),
                                 raters_per_person = n_rater,
                                 reps = 10,
                                 score_levels = 4,
                                 theta_sd = 1,
                                 rater_sd = 0.35,
                                 criterion_sd = 0.25,
                                 noise_sd = 0,
                                 step_span = 1.4,
                                 fit_method = c("JML", "MML"),
                                 model = c("RSM", "PCM"),
                                 step_facet = NULL,
                                 maxit = 25,
                                 quad_points = 7,
                                 residual_pca = c("none", "overall", "facet", "both"),
                                 sim_spec = NULL,
                                 seed = NULL) {
  fit_method <- match.arg(fit_method)
  model <- match.arg(model)
  residual_pca <- match.arg(residual_pca)
  if (!is.null(sim_spec) && !inherits(sim_spec, "mfrm_sim_spec")) {
    stop("`sim_spec` must be output from build_mfrm_sim_spec() or extract_mfrm_sim_spec().", call. = FALSE)
  }
  reps <- as.integer(reps[1])
  if (!is.finite(reps) || reps < 1L) stop("`reps` must be >= 1.", call. = FALSE)

  design_grid <- expand.grid(
    n_person = as.integer(n_person),
    n_rater = as.integer(n_rater),
    n_criterion = as.integer(n_criterion),
    raters_per_person = as.integer(raters_per_person),
    KEEP.OUT.ATTRS = FALSE,
    stringsAsFactors = FALSE
  )
  design_grid <- design_grid[design_grid$raters_per_person <= design_grid$n_rater, , drop = FALSE]
  if (nrow(design_grid) == 0) {
    stop("No valid design rows remain after enforcing `raters_per_person <= n_rater`.", call. = FALSE)
  }
  design_grid$design_id <- sprintf("D%02d", seq_len(nrow(design_grid)))
  design_grid <- design_grid[, c("design_id", "n_person", "n_rater", "n_criterion", "raters_per_person")]

  generator_model <- if (is.null(sim_spec)) model else sim_spec$model
  generator_step_facet <- if (is.null(sim_spec)) if (identical(generator_model, "PCM")) "Criterion" else NA_character_ else sim_spec$step_facet
  generator_assignment <- if (is.null(sim_spec)) "design_dependent" else sim_spec$assignment
  fit_step_facet <- simulation_resolve_fit_step_facet(model, step_facet, generator_step_facet)
  recovery_contract <- simulation_recovery_contract(
    generator_model = generator_model,
    generator_step_facet = generator_step_facet,
    fitted_model = model,
    fitted_step_facet = fit_step_facet
  )

  seeds <- with_preserved_rng_seed(
    seed,
    sample.int(.Machine$integer.max, size = nrow(design_grid) * reps, replace = FALSE)
  )
  seed_idx <- 0L

  result_rows <- vector("list", nrow(design_grid) * reps * 3L)
  rep_rows <- vector("list", nrow(design_grid) * reps)
  result_idx <- 0L
  rep_idx <- 0L

  for (i in seq_len(nrow(design_grid))) {
    design <- design_grid[i, , drop = FALSE]
    for (rep in seq_len(reps)) {
      seed_idx <- seed_idx + 1L
      row_spec <- if (is.null(sim_spec)) {
        NULL
      } else {
        simulation_override_spec_design(
          sim_spec,
          n_person = design$n_person,
          n_rater = design$n_rater,
          n_criterion = design$n_criterion,
          raters_per_person = design$raters_per_person
        )
      }
      row_score_levels <- if (is.null(row_spec)) score_levels else row_spec$score_levels
      sim <- if (is.null(row_spec)) {
        simulate_mfrm_data(
          n_person = design$n_person,
          n_rater = design$n_rater,
          n_criterion = design$n_criterion,
          raters_per_person = design$raters_per_person,
          score_levels = score_levels,
          theta_sd = theta_sd,
          rater_sd = rater_sd,
          criterion_sd = criterion_sd,
          noise_sd = noise_sd,
          step_span = step_span,
          seed = seeds[seed_idx]
        )
      } else {
        simulate_mfrm_data(sim_spec = row_spec, seed = seeds[seed_idx])
      }

      t0 <- proc.time()[["elapsed"]]
      fit_args <- list(
        data = sim,
        person = "Person",
        facets = c("Rater", "Criterion"),
        score = "Score",
        method = fit_method,
        model = model,
        maxit = maxit
      )
      if (identical(model, "PCM")) fit_args$step_facet <- fit_step_facet
      if ("Weight" %in% names(sim)) fit_args$weight <- "Weight"
      if (identical(fit_method, "MML")) fit_args$quad_points <- quad_points

      fit <- tryCatch(do.call(fit_mfrm, fit_args), error = function(e) e)
      diag <- if (inherits(fit, "error")) fit else {
        tryCatch(
          diagnose_mfrm(fit, residual_pca = residual_pca),
          error = function(e) e
        )
      }
      elapsed <- proc.time()[["elapsed"]] - t0
      truth <- attr(sim, "mfrm_truth")

      rep_idx <- rep_idx + 1L
      rep_row <- tibble::tibble(
        design_id = design$design_id,
        rep = rep,
        n_person = design$n_person,
        n_rater = design$n_rater,
        n_criterion = design$n_criterion,
        raters_per_person = design$raters_per_person,
        Observations = nrow(sim),
        MinCategoryCount = min(tabulate(sim$Score, nbins = row_score_levels)),
        ElapsedSec = elapsed,
        RunOK = FALSE,
        Converged = FALSE,
        Error = NA_character_
      )

      if (inherits(fit, "error")) {
        rep_row$Error <- conditionMessage(fit)
        rep_rows[[rep_idx]] <- rep_row
        next
      }
      if (inherits(diag, "error")) {
        rep_row$Error <- conditionMessage(diag)
        rep_rows[[rep_idx]] <- rep_row
        next
      }

      converged <- isTRUE(as.logical(fit$summary$Converged[1]))
      rep_row$RunOK <- TRUE
      rep_row$Converged <- converged
      rep_rows[[rep_idx]] <- rep_row

      reliability_tbl <- tibble::as_tibble(diag$reliability)
      fit_tbl <- tibble::as_tibble(diag$fit)
      measure_tbl <- tibble::as_tibble(diag$measures)

      for (facet in reliability_tbl$Facet) {
        rel_row <- reliability_tbl[reliability_tbl$Facet == facet, , drop = FALSE]
        facet_fit <- fit_tbl[fit_tbl$Facet == facet, , drop = FALSE]
        facet_meas <- measure_tbl[measure_tbl$Facet == facet, , drop = FALSE]
        truth_vec <- design_eval_extract_truth(truth, facet)

        severity_rmse <- NA_real_
        severity_bias <- NA_real_
        severity_rmse_raw <- NA_real_
        severity_bias_raw <- NA_real_
        if (isTRUE(recovery_contract$comparable) &&
            !is.null(truth_vec) && nrow(facet_meas) > 0 &&
            "Level" %in% names(facet_meas) && "Estimate" %in% names(facet_meas)) {
          recovery <- design_eval_recovery_metrics(
            est_levels = facet_meas$Level,
            est_values = suppressWarnings(as.numeric(facet_meas$Estimate)),
            truth_vec = truth_vec
          )
          severity_rmse <- recovery$aligned_rmse
          severity_bias <- recovery$aligned_bias
          severity_rmse_raw <- recovery$raw_rmse
          severity_bias_raw <- recovery$raw_bias
        }

        misfit_rate <- NA_real_
        if (nrow(facet_fit) > 0) {
          z_in <- suppressWarnings(as.numeric(facet_fit$InfitZSTD))
          z_out <- suppressWarnings(as.numeric(facet_fit$OutfitZSTD))
          misfit_rate <- mean(abs(z_in) > 2 | abs(z_out) > 2, na.rm = TRUE)
        }

        result_idx <- result_idx + 1L
        result_rows[[result_idx]] <- tibble::tibble(
          design_id = design$design_id,
          rep = rep,
          Facet = facet,
          n_person = design$n_person,
          n_rater = design$n_rater,
          n_criterion = design$n_criterion,
          raters_per_person = design$raters_per_person,
          Observations = nrow(sim),
          MinCategoryCount = min(tabulate(sim$Score, nbins = row_score_levels)),
          ElapsedSec = elapsed,
          Converged = converged,
          GeneratorModel = generator_model,
          GeneratorStepFacet = generator_step_facet,
          FitModel = model,
          FitStepFacet = fit_step_facet,
          RecoveryComparable = recovery_contract$comparable,
          RecoveryBasis = recovery_contract$basis,
          Levels = suppressWarnings(as.integer(rel_row$Levels[1])),
          Separation = suppressWarnings(as.numeric(rel_row$Separation[1])),
          Strata = suppressWarnings(as.numeric(rel_row$Strata[1])),
          Reliability = suppressWarnings(as.numeric(rel_row$Reliability[1])),
          MeanInfit = suppressWarnings(as.numeric(rel_row$MeanInfit[1])),
          MeanOutfit = suppressWarnings(as.numeric(rel_row$MeanOutfit[1])),
          MisfitRate = misfit_rate,
          SeverityRMSE = severity_rmse,
          SeverityBias = severity_bias,
          SeverityRMSERaw = severity_rmse_raw,
          SeverityBiasRaw = severity_bias_raw
        )
      }
    }
  }

  results <- dplyr::bind_rows(result_rows[seq_len(result_idx)])
  rep_overview <- dplyr::bind_rows(rep_rows[seq_len(rep_idx)])
  ademp <- simulation_build_ademp(
    purpose = "Assess many-facet design conditions via repeated parametric simulation under explicit data-generating assumptions.",
    design_grid = design_grid,
    generator_model = generator_model,
    generator_step_facet = generator_step_facet,
    generator_assignment = generator_assignment,
    sim_spec = sim_spec,
    estimands = c(
      "Facet separation, reliability, and strata",
      "Mean infit/outfit and misfit rate",
      "Aligned facet-parameter recovery RMSE and bias",
      "Convergence rate and elapsed time"
    ),
    analysis_methods = list(
      fit_method = fit_method,
      fitted_model = model,
      maxit = maxit,
      quad_points = if (identical(fit_method, "MML")) quad_points else NA_integer_,
      residual_pca = residual_pca
    ),
    performance_measures = c(
      "Mean performance across replications",
      "MCSE for means and rates",
      "Convergence rate",
      "Sparse-category warning rate"
    )
  )

  structure(
    list(
      design_grid = design_grid,
      results = results,
      rep_overview = rep_overview,
      settings = list(
        reps = reps,
        score_levels = score_levels,
        theta_sd = theta_sd,
        rater_sd = rater_sd,
        criterion_sd = criterion_sd,
        noise_sd = noise_sd,
        step_span = step_span,
        fit_method = fit_method,
        model = model,
        step_facet = fit_step_facet,
        maxit = maxit,
        quad_points = quad_points,
        residual_pca = residual_pca,
        sim_spec = sim_spec,
        generator_model = generator_model,
        generator_step_facet = generator_step_facet,
        generator_assignment = generator_assignment,
        recovery_comparable = recovery_contract$comparable,
        recovery_basis = recovery_contract$basis,
        seed = seed
      ),
      ademp = ademp
    ),
    class = "mfrm_design_evaluation"
  )
}

#' Summarize a design-simulation study
#'
#' @param object Output from [evaluate_mfrm_design()].
#' @param digits Number of digits used in the returned numeric summaries.
#' @param ... Reserved for generic compatibility.
#'
#' @details
#' The summary emphasizes condition-level averages that are useful for practical
#' design planning, especially:
#' - convergence rate
#' - separation and reliability by facet
#' - severity recovery RMSE
#' - mean misfit rate
#'
#' @return An object of class `summary.mfrm_design_evaluation` with components:
#' - `overview`: run-level overview
#' - `design_summary`: aggregated design-by-facet metrics
#' - `ademp`: simulation-study metadata carried forward from the original object
#' - `notes`: short interpretation notes
#' @seealso [evaluate_mfrm_design()], [plot.mfrm_design_evaluation]
#' @examples
#' \donttest{
#' sim_eval <- evaluate_mfrm_design(
#'   n_person = c(30, 50),
#'   n_rater = 4,
#'   n_criterion = 4,
#'   raters_per_person = 2,
#'   reps = 1,
#'   maxit = 15,
#'   seed = 123
#' )
#' summary(sim_eval)
#' }
#' @export
summary.mfrm_design_evaluation <- function(object, digits = 3, ...) {
  if (!is.list(object) || is.null(object$results) || is.null(object$rep_overview)) {
    stop("`object` must be output from evaluate_mfrm_design().")
  }
  digits <- max(0L, as.integer(digits[1]))
  out <- design_eval_summarize_results(object$results, object$rep_overview)

  round_df <- function(df) {
    if (!is.data.frame(df) || nrow(df) == 0) return(df)
    num_cols <- vapply(df, is.numeric, logical(1))
    df[num_cols] <- lapply(df[num_cols], round, digits = digits)
    df
  }

  out$overview <- round_df(out$overview)
  out$design_summary <- round_df(out$design_summary)
  out$ademp <- object$ademp %||% NULL
  class(out) <- "summary.mfrm_design_evaluation"
  out
}

#' Plot a design-simulation study
#'
#' @param x Output from [evaluate_mfrm_design()].
#' @param facet Facet to visualize.
#' @param metric Metric to plot.
#' @param x_var Design variable used on the x-axis.
#' @param group_var Optional design variable used for separate lines.
#' @param draw If `TRUE`, draw with base graphics; otherwise return plotting data.
#' @param ... Reserved for generic compatibility.
#'
#' @details
#' This method is designed for quick design-planning scans rather than polished
#' publication graphics.
#'
#' Useful first plots are:
#' - rater `metric = "separation"` against `x_var = "n_person"`
#' - criterion `metric = "severityrmse"` against `x_var = "n_person"`
#'   when you want aligned recovery error rather than raw location shifts
#' - rater `metric = "convergencerate"` against `x_var = "raters_per_person"`
#'
#' @return If `draw = TRUE`, invisibly returns a plotting-data list. If
#'   `draw = FALSE`, returns that list directly.
#' @seealso [evaluate_mfrm_design()], [summary.mfrm_design_evaluation]
#' @examples
#' \donttest{
#' sim_eval <- evaluate_mfrm_design(
#'   n_person = c(30, 50),
#'   n_rater = 4,
#'   n_criterion = 4,
#'   raters_per_person = 2,
#'   reps = 1,
#'   maxit = 15,
#'   seed = 123
#' )
#' plot(sim_eval, facet = "Rater", metric = "separation", x_var = "n_person", draw = FALSE)
#' }
#' @export
plot.mfrm_design_evaluation <- function(x,
                                        facet = c("Rater", "Criterion", "Person"),
                                        metric = c("separation", "reliability", "infit", "outfit",
                                                   "misfitrate", "severityrmse", "severitybias",
                                                   "convergencerate", "elapsedsec", "mincategorycount"),
                                        x_var = c("n_person", "n_rater", "n_criterion", "raters_per_person"),
                                        group_var = NULL,
                                        draw = TRUE,
                                        ...) {
  if (!is.list(x) || is.null(x$results) || is.null(x$rep_overview)) {
    stop("`x` must be output from evaluate_mfrm_design().")
  }
  facet <- match.arg(facet)
  metric <- match.arg(metric)
  x_var <- match.arg(x_var)

  sum_obj <- design_eval_summarize_results(x$results, x$rep_overview)
  plot_tbl <- tibble::as_tibble(sum_obj$design_summary)
  if (nrow(plot_tbl) == 0) stop("No design-summary rows available for plotting.")
  plot_tbl <- plot_tbl[plot_tbl$Facet == facet, , drop = FALSE]
  if (nrow(plot_tbl) == 0) stop("No rows available for facet `", facet, "`.")

  metric_col <- design_eval_match_metric(metric)
  varying <- c("n_person", "n_rater", "n_criterion", "raters_per_person")
  varying <- varying[varying != x_var]
  if (is.null(group_var)) {
    cand <- varying[vapply(plot_tbl[varying], function(col) length(unique(col)) > 1L, logical(1))]
    group_var <- if (length(cand) > 0) cand[1] else NULL
  } else {
    if (!group_var %in% c("n_person", "n_rater", "n_criterion", "raters_per_person")) {
      stop("`group_var` must be one of the evaluated design variables.")
    }
    if (identical(group_var, x_var)) {
      stop("`group_var` must differ from `x_var`.")
    }
  }

  if (is.null(group_var)) {
    agg_tbl <- plot_tbl |>
      dplyr::group_by(.data[[x_var]]) |>
      dplyr::summarize(y = mean(.data[[metric_col]], na.rm = TRUE), .groups = "drop") |>
      dplyr::arrange(.data[[x_var]]) |>
      dplyr::mutate(group = "All designs")
  } else {
    agg_tbl <- plot_tbl |>
      dplyr::group_by(.data[[x_var]], .data[[group_var]]) |>
      dplyr::summarize(y = mean(.data[[metric_col]], na.rm = TRUE), .groups = "drop") |>
      dplyr::arrange(.data[[x_var]], .data[[group_var]]) |>
      dplyr::rename(group = dplyr::all_of(group_var))
  }

  out <- list(
    plot = "design_evaluation",
    facet = facet,
    metric = metric,
    metric_col = metric_col,
    x_var = x_var,
    group_var = group_var,
    data = agg_tbl
  )

  if (!isTRUE(draw)) return(out)

  groups <- unique(as.character(agg_tbl$group))
  cols <- grDevices::hcl.colors(max(1L, length(groups)), "Set 2")
  x_vals <- sort(unique(agg_tbl[[x_var]]))
  y_range <- range(agg_tbl$y, na.rm = TRUE)
  if (!all(is.finite(y_range))) stop("Selected metric has no finite values to plot.")

  graphics::plot(
    x = x_vals,
    y = rep(NA_real_, length(x_vals)),
    type = "n",
    xlab = x_var,
    ylab = metric_col,
    main = paste("Design simulation:", facet, metric_col),
    ylim = y_range
  )
  for (i in seq_along(groups)) {
    sub <- agg_tbl[as.character(agg_tbl$group) == groups[i], , drop = FALSE]
    sub <- sub[order(sub[[x_var]]), , drop = FALSE]
    graphics::lines(sub[[x_var]], sub$y, type = "b", lwd = 2, pch = 16 + (i - 1L) %% 5L, col = cols[i])
  }
  if (length(groups) > 1L) {
    graphics::legend("topleft", legend = groups, col = cols, lty = 1, lwd = 2, pch = 16 + (seq_along(groups) - 1L) %% 5L, bty = "n")
  }

  invisible(out)
}

#' Recommend a design condition from simulation results
#'
#' @param x Output from [evaluate_mfrm_design()] or [summary.mfrm_design_evaluation()].
#' @param facets Facets that must satisfy the planning thresholds.
#' @param min_separation Minimum acceptable mean separation.
#' @param min_reliability Minimum acceptable mean reliability.
#' @param max_severity_rmse Maximum acceptable severity recovery RMSE.
#' @param max_misfit_rate Maximum acceptable mean misfit rate.
#' @param min_convergence_rate Minimum acceptable convergence rate.
#' @param prefer Ranking priority among design variables. Earlier entries are
#'   optimized first when multiple designs pass.
#'
#' @details
#' This helper converts a design-study summary into a simple planning table.
#'
#' A design is marked as recommended when all requested facets satisfy all
#' selected thresholds simultaneously.
#' If multiple designs pass, the helper returns the smallest one according to
#' `prefer` (by default: fewer persons first, then fewer ratings per person,
#' then fewer raters, then fewer criteria).
#'
#' @section Typical workflow:
#' 1. Run [evaluate_mfrm_design()].
#' 2. Review [summary.mfrm_design_evaluation()] and [plot.mfrm_design_evaluation()].
#' 3. Use `recommend_mfrm_design(...)` to identify the smallest acceptable design.
#'
#' @return A list of class `mfrm_design_recommendation` with:
#' - `facet_table`: facet-level threshold checks
#' - `design_table`: design-level aggregated checks
#' - `recommended`: the first passing design after ranking
#' - `thresholds`: thresholds used in the recommendation
#' @seealso [evaluate_mfrm_design()], [summary.mfrm_design_evaluation], [plot.mfrm_design_evaluation]
#' @examples
#' \donttest{
#' sim_eval <- evaluate_mfrm_design(
#'   n_person = c(30, 50),
#'   n_rater = 4,
#'   n_criterion = 4,
#'   raters_per_person = 2,
#'   reps = 1,
#'   maxit = 15,
#'   seed = 123
#' )
#' rec <- recommend_mfrm_design(sim_eval)
#' rec$recommended
#' }
#' @export
recommend_mfrm_design <- function(x,
                                  facets = c("Rater", "Criterion"),
                                  min_separation = 2,
                                  min_reliability = 0.8,
                                  max_severity_rmse = 0.5,
                                  max_misfit_rate = 0.10,
                                  min_convergence_rate = 1,
                                  prefer = c("n_person", "raters_per_person", "n_rater", "n_criterion")) {
  if (inherits(x, "mfrm_design_evaluation")) {
    design_summary <- summary.mfrm_design_evaluation(x, digits = 6)$design_summary
  } else if (inherits(x, "summary.mfrm_design_evaluation")) {
    design_summary <- x$design_summary
  } else {
    stop("`x` must be output from evaluate_mfrm_design() or summary.mfrm_design_evaluation().")
  }

  design_summary <- tibble::as_tibble(design_summary)
  if (nrow(design_summary) == 0) stop("No design summary rows available.")

  facets <- unique(as.character(facets))
  missing_facets <- setdiff(facets, unique(design_summary$Facet))
  if (length(missing_facets) > 0) {
    stop("Requested facets not found in the design summary: ", paste(missing_facets, collapse = ", "))
  }

  prefer <- intersect(prefer, c("n_person", "raters_per_person", "n_rater", "n_criterion"))
  if (length(prefer) == 0) {
    stop("`prefer` must contain at least one valid design variable.")
  }

  facet_table <- design_summary |>
    dplyr::filter(.data$Facet %in% facets) |>
    dplyr::mutate(
      SeparationPass = is.finite(.data$MeanSeparation) & .data$MeanSeparation >= min_separation,
      ReliabilityPass = is.finite(.data$MeanReliability) & .data$MeanReliability >= min_reliability,
      SeverityRMSEPass = is.finite(.data$MeanSeverityRMSE) & .data$MeanSeverityRMSE <= max_severity_rmse,
      MisfitRatePass = is.finite(.data$MeanMisfitRate) & .data$MeanMisfitRate <= max_misfit_rate,
      ConvergencePass = is.finite(.data$ConvergenceRate) & .data$ConvergenceRate >= min_convergence_rate,
      Pass = .data$SeparationPass & .data$ReliabilityPass & .data$SeverityRMSEPass &
        .data$MisfitRatePass & .data$ConvergencePass
    ) |>
    dplyr::arrange(.data$Facet, .data$n_person, .data$n_rater, .data$n_criterion, .data$raters_per_person)

  design_table <- facet_table |>
    dplyr::group_by(
      .data$design_id,
      .data$n_person,
      .data$n_rater,
      .data$n_criterion,
      .data$raters_per_person
    ) |>
    dplyr::summarize(
      FacetsChecked = paste(.data$Facet, collapse = ", "),
      MinSeparation = min(.data$MeanSeparation, na.rm = TRUE),
      MinReliability = min(.data$MeanReliability, na.rm = TRUE),
      MaxSeverityRMSE = max(.data$MeanSeverityRMSE, na.rm = TRUE),
      MaxMisfitRate = max(.data$MeanMisfitRate, na.rm = TRUE),
      MinConvergenceRate = min(.data$ConvergenceRate, na.rm = TRUE),
      FacetsPassing = sum(.data$Pass, na.rm = TRUE),
      FacetsRequired = dplyr::n(),
      Pass = all(.data$Pass),
      .groups = "drop"
    )

  rank_vars <- c("Pass", prefer)
  design_table <- design_table |>
    dplyr::arrange(dplyr::desc(.data$Pass), !!!rlang::syms(prefer))

  recommended <- design_table |>
    dplyr::filter(.data$Pass) |>
    dplyr::slice_head(n = 1)

  structure(
    list(
      facet_table = facet_table,
      design_table = design_table,
      recommended = recommended,
      thresholds = list(
        facets = facets,
        min_separation = min_separation,
        min_reliability = min_reliability,
        max_severity_rmse = max_severity_rmse,
        max_misfit_rate = max_misfit_rate,
        min_convergence_rate = min_convergence_rate,
        prefer = prefer
      )
    ),
    class = "mfrm_design_recommendation"
  )
}

signal_eval_resolve_level <- function(value, prefix, n_levels, arg_name) {
  if (is.null(value)) {
    return(sprintf("%s%02d", prefix, n_levels))
  }
  if (is.numeric(value)) {
    idx <- as.integer(value[1])
    if (!is.finite(idx) || idx < 1L || idx > n_levels) {
      stop("`", arg_name, "` must be between 1 and ", n_levels, ".")
    }
    return(sprintf("%s%02d", prefix, idx))
  }
  value <- as.character(value[1])
  if (!nzchar(value) || is.na(value)) {
    stop("`", arg_name, "` must be a non-empty label or index.")
  }
  value
}

signal_eval_find_dif_row <- function(tbl, level, group1, group2) {
  tbl <- tibble::as_tibble(tbl)
  if (nrow(tbl) == 0) return(tbl)
  out <- tbl |>
    dplyr::filter(
      .data$Level == level,
      (.data$Group1 == group1 & .data$Group2 == group2) |
        (.data$Group1 == group2 & .data$Group2 == group1)
    ) |>
    dplyr::slice_head(n = 1)
  out
}

signal_eval_find_bias_row <- function(tbl, rater_level, criterion_level) {
  tbl <- tibble::as_tibble(tbl)
  if (nrow(tbl) == 0) return(tbl)
  out <- tbl |>
    dplyr::filter(
      .data$FacetA_Level == rater_level,
      .data$FacetB_Level == criterion_level
    ) |>
    dplyr::slice_head(n = 1)
  out
}

signal_eval_get_p <- function(tbl, adjusted_col = "p_adjusted", raw_col = "p_value") {
  tbl <- tibble::as_tibble(tbl)
  if (nrow(tbl) == 0) return(NA_real_)
  p_adj <- suppressWarnings(as.numeric(tbl[[adjusted_col]][1]))
  if (is.finite(p_adj)) return(p_adj)
  p_raw <- suppressWarnings(as.numeric(tbl[[raw_col]][1]))
  if (is.finite(p_raw)) return(p_raw)
  NA_real_
}

signal_eval_false_positive_rate <- function(flag_vec) {
  if (length(flag_vec) == 0) return(NA_real_)
  mean(flag_vec, na.rm = TRUE)
}

signal_eval_summary <- function(results, rep_overview) {
  results_tbl <- tibble::as_tibble(results)
  rep_tbl <- tibble::as_tibble(rep_overview)

  detection_summary <- tibble::tibble()
  if (nrow(results_tbl) > 0) {
    detection_summary <- results_tbl |>
      dplyr::group_by(
        .data$design_id,
        .data$n_person,
        .data$n_rater,
        .data$n_criterion,
        .data$raters_per_person,
        .data$DIFTargetLevel,
        .data$BiasTargetRater,
        .data$BiasTargetCriterion
      ) |>
      dplyr::summarize(
        Reps = dplyr::n(),
        ConvergenceRate = mean(.data$Converged, na.rm = TRUE),
        McseConvergenceRate = simulation_mcse_proportion(.data$Converged),
        DIFPower = mean(.data$DIFDetected, na.rm = TRUE),
        McseDIFPower = simulation_mcse_proportion(.data$DIFDetected),
        DIFClassificationPower = mean(.data$DIFClassDetected, na.rm = TRUE),
        McseDIFClassificationPower = simulation_mcse_proportion(.data$DIFClassDetected),
        MeanTargetContrast = mean(.data$DIFContrast, na.rm = TRUE),
        McseTargetContrast = simulation_mcse_mean(.data$DIFContrast),
        MeanTargetContrastAbs = mean(abs(.data$DIFContrast), na.rm = TRUE),
        McseTargetContrastAbs = simulation_mcse_mean(abs(.data$DIFContrast)),
        DIFFalsePositiveRate = mean(.data$DIFFalsePositiveRate, na.rm = TRUE),
        McseDIFFalsePositiveRate = simulation_mcse_mean(.data$DIFFalsePositiveRate),
        BiasScreenRate = mean(.data$BiasDetected, na.rm = TRUE),
        McseBiasScreenRate = simulation_mcse_proportion(.data$BiasDetected),
        MeanTargetBias = mean(.data$BiasSize, na.rm = TRUE),
        McseTargetBias = simulation_mcse_mean(.data$BiasSize),
        MeanAbsTargetBias = mean(abs(.data$BiasSize), na.rm = TRUE),
        McseAbsTargetBias = simulation_mcse_mean(abs(.data$BiasSize)),
        MeanTargetBiasT = mean(.data$BiasT, na.rm = TRUE),
        McseTargetBiasT = simulation_mcse_mean(.data$BiasT),
        BiasScreenMetricAvailabilityRate = mean(.data$BiasScreenMetricAvailable, na.rm = TRUE),
        McseBiasScreenMetricAvailabilityRate = simulation_mcse_proportion(.data$BiasScreenMetricAvailable),
        BiasScreenFalsePositiveRate = mean(.data$BiasScreenFalsePositiveRate, na.rm = TRUE),
        McseBiasScreenFalsePositiveRate = simulation_mcse_mean(.data$BiasScreenFalsePositiveRate),
        MeanElapsedSec = mean(.data$ElapsedSec, na.rm = TRUE),
        McseElapsedSec = simulation_mcse_mean(.data$ElapsedSec),
        .groups = "drop"
      ) |>
      dplyr::arrange(.data$n_person, .data$n_rater, .data$n_criterion, .data$raters_per_person)
  }

  overview <- tibble::tibble(
    Designs = dplyr::n_distinct(rep_tbl$design_id),
    Replications = nrow(rep_tbl),
    SuccessfulRuns = sum(rep_tbl$RunOK, na.rm = TRUE),
    ConvergedRuns = sum(rep_tbl$Converged, na.rm = TRUE),
    MeanElapsedSec = design_eval_safe_mean(rep_tbl$ElapsedSec)
  )

  notes <- character(0)
  if (nrow(detection_summary) > 0 && any(detection_summary$ConvergenceRate < 1, na.rm = TRUE)) {
    notes <- c(notes, "Some design conditions did not converge in every replication.")
  }
  if (nrow(detection_summary) > 0 && any(detection_summary$DIFPower < 0.8, na.rm = TRUE)) {
    notes <- c(notes, "Some design conditions showed DIF power below 0.80.")
  }
  if (nrow(detection_summary) > 0 && any(detection_summary$BiasScreenRate < 0.8, na.rm = TRUE)) {
    notes <- c(notes, "Some design conditions showed bias-screen hit rates below 0.80.")
  }
  if (nrow(detection_summary) > 0 && any(detection_summary$BiasScreenMetricAvailabilityRate < 1, na.rm = TRUE)) {
    notes <- c(notes, "Some design conditions did not yield usable bias-screening t/p metrics in every replication.")
  }
  notes <- c(
    notes,
    "Bias-side rates are screening summaries derived from `estimate_bias()` output and should not be interpreted as formal power or alpha-calibrated false-positive rates.",
    "MCSE columns summarize finite-replication uncertainty around the reported means and rates."
  )

  list(
    overview = overview,
    detection_summary = detection_summary,
    notes = notes
  )
}

signal_eval_metric_col <- function(signal, metric) {
  if (identical(signal, "dif")) {
    switch(
      metric,
      power = "DIFPower",
      false_positive = "DIFFalsePositiveRate",
      estimate = "MeanTargetContrast"
    )
  } else {
    switch(
      metric,
      power = "BiasScreenRate",
      false_positive = "BiasScreenFalsePositiveRate",
      estimate = "MeanTargetBias"
    )
  }
}

#' Evaluate DIF power and bias-screening behavior under known simulated signals
#'
#' @param n_person Vector of person counts to evaluate.
#' @param n_rater Vector of rater counts to evaluate.
#' @param n_criterion Vector of criterion counts to evaluate.
#' @param raters_per_person Vector of rater assignments per person.
#' @param reps Number of replications per design condition.
#' @param group_levels Group labels used for DIF simulation. The first two levels
#'   define the default reference and focal groups.
#' @param reference_group Optional reference group label used when extracting the
#'   target DIF contrast.
#' @param focal_group Optional focal group label used when extracting the target
#'   DIF contrast.
#' @param dif_level Target criterion level for the true DIF effect. Can be an
#'   integer index or a criterion label such as `"C04"`. Defaults to the last
#'   criterion level in each design.
#' @param dif_effect True DIF effect size added to the focal group on the target
#'   criterion.
#' @param bias_rater Target rater level for the true interaction-bias effect.
#'   Can be an integer index or a label such as `"R04"`. Defaults to the last
#'   rater level in each design.
#' @param bias_criterion Target criterion level for the true interaction-bias
#'   effect. Can be an integer index or a criterion label. Defaults to the last
#'   criterion level in each design.
#' @param bias_effect True interaction-bias effect added to the target
#'   `Rater x Criterion` cell.
#' @param score_levels Number of ordered score categories.
#' @param theta_sd Standard deviation of simulated person measures.
#' @param rater_sd Standard deviation of simulated rater severities.
#' @param criterion_sd Standard deviation of simulated criterion difficulties.
#' @param noise_sd Optional observation-level noise added to the linear predictor.
#' @param step_span Spread of step thresholds on the logit scale.
#' @param fit_method Estimation method passed to [fit_mfrm()].
#' @param model Measurement model passed to [fit_mfrm()].
#' @param step_facet Step facet passed to [fit_mfrm()] when `model = "PCM"`.
#'   When left `NULL`, the function inherits the generator step facet from
#'   `sim_spec` when available and otherwise defaults to `"Criterion"`.
#' @param maxit Maximum iterations passed to [fit_mfrm()].
#' @param quad_points Quadrature points for `fit_method = "MML"`.
#' @param residual_pca Residual PCA mode passed to [diagnose_mfrm()].
#' @param sim_spec Optional output from [build_mfrm_sim_spec()] or
#'   [extract_mfrm_sim_spec()] used as the base data-generating mechanism.
#'   When supplied, the design grid still varies `n_person`, `n_rater`,
#'   `n_criterion`, and `raters_per_person`, but latent spread, thresholds,
#'   and other generator settings come from `sim_spec`. The target DIF and
#'   interaction-bias signals specified in this function override any signal
#'   tables stored in `sim_spec`.
#' @param dif_method Differential-functioning method passed to [analyze_dff()].
#' @param dif_min_obs Minimum observations per group cell for [analyze_dff()].
#' @param dif_p_adjust P-value adjustment method passed to [analyze_dff()].
#' @param dif_p_cut P-value cutoff for counting a target DIF detection.
#' @param dif_abs_cut Optional absolute contrast cutoff used when counting a
#'   target DIF detection. When omitted, the effective default is `0.43` for
#'   `dif_method = "refit"` and `0` (no additional magnitude cutoff) for
#'   `dif_method = "residual"`.
#' @param bias_max_iter Maximum iterations passed to [estimate_bias()].
#' @param bias_p_cut P-value cutoff for counting a target bias screen-positive result.
#' @param bias_abs_t Absolute t cutoff for counting a target bias screen-positive result.
#' @param seed Optional seed for reproducible replications.
#'
#' @details
#' This function performs Monte Carlo design screening for two related tasks:
#' DIF detection via [analyze_dff()] and interaction-bias screening via
#' [estimate_bias()].
#'
#' For each design condition (combination of `n_person`, `n_rater`,
#' `n_criterion`, `raters_per_person`), the function:
#' 1. Generates synthetic data with [simulate_mfrm_data()]
#' 2. Injects one known Group \eqn{\times} Criterion DIF effect
#'    (`dif_effect` logits added to the focal group on the target criterion)
#' 3. Injects one known Rater \eqn{\times} Criterion interaction-bias
#'    effect (`bias_effect` logits)
#' 4. Fits and diagnoses the MFRM
#' 5. Runs [analyze_dff()] and [estimate_bias()]
#' 6. Records whether the injected signals were detected or screen-positive
#'
#' **Detection criteria**:
#' A DIF signal is counted as "detected" when the target contrast has
#' \eqn{p <} `dif_p_cut` **and**, when an absolute contrast cutoff is in
#' force, \eqn{|\mathrm{Contrast}| \ge} `dif_abs_cut`. For
#' `dif_method = "refit"`, `dif_abs_cut` is interpreted on the logit scale.
#' For `dif_method = "residual"`, the residual-contrast screening result is
#' used and the default is to rely on the significance test alone.
#'
#' Bias results are different: [estimate_bias()] reports `t` and `Prob.` as
#' screening metrics rather than formal inferential quantities. Here, a bias
#' cell is counted as **screen-positive** only when those screening metrics are
#' available and satisfy
#' \eqn{p <} `bias_p_cut` **and** \eqn{|t| \ge} `bias_abs_t`.
#'
#' **Power** is the proportion of replications in which the target signal
#' was correctly detected. For DIF this is a conventional power summary.
#' For bias, the primary summary is `BiasScreenRate`, a screening hit rate
#' rather than formal inferential power.
#'
#' **False-positive rate** is the proportion of non-target cells that were
#' incorrectly flagged. For DIF this is interpreted in the usual testing
#' sense. For bias, `BiasScreenFalsePositiveRate` is a screening rate and
#' should not be read as a calibrated inferential alpha level.
#'
#' **Default effect sizes**: `dif_effect = 0.6` logits corresponds to a
#' moderate criterion-linked differential-functioning effect; `bias_effect = -0.8`
#' logits represents a substantial rater-criterion interaction.  Adjust
#' these to match the smallest effect size of practical concern for your
#' application.
#'
#' This is again a **parametric simulation study**. The function does not
#' estimate a new design directly from one observed dataset. Instead, it
#' evaluates detection or screening behavior under user-specified design
#' conditions and known injected signals.
#'
#' If you want to approximate a real study, choose the design grid and
#' simulation settings so that they reflect the empirical context of interest.
#' For example, you may set `n_person`, `n_rater`, `n_criterion`,
#' `raters_per_person`, and the latent-spread arguments to values motivated by
#' an existing assessment program, then study how operating characteristics
#' change as those design settings vary.
#'
#' When `sim_spec` is supplied, the function uses it as the explicit
#' data-generating mechanism for the latent spreads, thresholds, and assignment
#' archetype, while still injecting the requested target DIF and bias effects
#' for each design condition.
#'
#' @section References:
#' The simulation logic follows the general Monte Carlo / operating-characteristic
#' framework described by Morris, White, and Crowther (2019) and the
#' ADEMP-oriented planning/reporting guidance summarized for psychology by
#' Siepe et al. (2024). In `mfrmr`, `evaluate_mfrm_signal_detection()` is a
#' many-facet screening helper specialized to DIF and interaction-bias use
#' cases; it is not a direct implementation of one published many-facet Rasch
#' simulation design.
#'
#' - Morris, T. P., White, I. R., & Crowther, M. J. (2019).
#'   *Using simulation studies to evaluate statistical methods*.
#'   Statistics in Medicine, 38(11), 2074-2102.
#' - Siepe, B. S., Bartos, F., Morris, T. P., Boulesteix, A.-L., Heck, D. W.,
#'   & Pawel, S. (2024). *Simulation studies for methodological research in
#'   psychology: A standardized template for planning, preregistration, and
#'   reporting*. Psychological Methods.
#'
#' @return An object of class `mfrm_signal_detection` with:
#' - `design_grid`: evaluated design conditions
#' - `results`: replicate-level detection results
#' - `rep_overview`: run-level status and timing
#' - `settings`: signal-analysis settings
#' - `ademp`: simulation-study metadata (aims, DGM, estimands, methods, performance measures)
#' @seealso [simulate_mfrm_data()], [evaluate_mfrm_design()], [analyze_dff()], [analyze_dif()], [estimate_bias()]
#' @examples
#' sig_eval <- suppressWarnings(evaluate_mfrm_signal_detection(
#'   n_person = 20,
#'   n_rater = 3,
#'   n_criterion = 3,
#'   raters_per_person = 2,
#'   reps = 1,
#'   maxit = 10,
#'   bias_max_iter = 1,
#'   seed = 123
#' ))
#' s_sig <- summary(sig_eval)
#' s_sig$detection_summary[, c("n_person", "DIFPower", "BiasScreenRate")]
#' @export
evaluate_mfrm_signal_detection <- function(n_person = c(30, 50, 100),
                                           n_rater = c(4),
                                           n_criterion = c(4),
                                           raters_per_person = n_rater,
                                           reps = 10,
                                           group_levels = c("A", "B"),
                                           reference_group = NULL,
                                           focal_group = NULL,
                                           dif_level = NULL,
                                           dif_effect = 0.6,
                                           bias_rater = NULL,
                                           bias_criterion = NULL,
                                           bias_effect = -0.8,
                                           score_levels = 4,
                                           theta_sd = 1,
                                           rater_sd = 0.35,
                                           criterion_sd = 0.25,
                                           noise_sd = 0,
                                           step_span = 1.4,
                                           fit_method = c("JML", "MML"),
                                           model = c("RSM", "PCM"),
                                           step_facet = NULL,
                                           maxit = 25,
                                           quad_points = 7,
                                           residual_pca = c("none", "overall", "facet", "both"),
                                           sim_spec = NULL,
                                           dif_method = c("residual", "refit"),
                                           dif_min_obs = 10,
                                           dif_p_adjust = "holm",
                                           dif_p_cut = 0.05,
                                           dif_abs_cut = 0.43,
                                           bias_max_iter = 2,
                                           bias_p_cut = 0.05,
                                           bias_abs_t = 2,
                                           seed = NULL) {
  dif_abs_cut_missing <- missing(dif_abs_cut)
  fit_method <- match.arg(fit_method)
  model <- match.arg(model)
  residual_pca <- match.arg(residual_pca)
  dif_method <- match.arg(dif_method)
  if (!is.null(sim_spec) && !inherits(sim_spec, "mfrm_sim_spec")) {
    stop("`sim_spec` must be output from build_mfrm_sim_spec() or extract_mfrm_sim_spec().", call. = FALSE)
  }
  reps <- as.integer(reps[1])
  if (!is.finite(reps) || reps < 1L) stop("`reps` must be >= 1.", call. = FALSE)
  dif_abs_cut <- as.numeric(dif_abs_cut[1])
  if (!is.finite(dif_abs_cut) || dif_abs_cut < 0) {
    stop("`dif_abs_cut` must be a single non-negative numeric value.", call. = FALSE)
  }
  dif_abs_cut_effective <- if (dif_abs_cut_missing && identical(dif_method, "residual")) {
    0
  } else {
    dif_abs_cut
  }

  group_levels <- unique(as.character(group_levels))
  group_levels <- group_levels[!is.na(group_levels) & nzchar(group_levels)]
  if (length(group_levels) < 2L) {
    stop("`group_levels` must contain at least two non-empty labels.", call. = FALSE)
  }
  if (is.null(reference_group)) reference_group <- group_levels[1]
  if (is.null(focal_group)) focal_group <- group_levels[2]
  if (!reference_group %in% group_levels || !focal_group %in% group_levels) {
    stop("`reference_group` and `focal_group` must be members of `group_levels`.", call. = FALSE)
  }
  if (identical(reference_group, focal_group)) {
    stop("`reference_group` and `focal_group` must differ.", call. = FALSE)
  }

  design_grid <- expand.grid(
    n_person = as.integer(n_person),
    n_rater = as.integer(n_rater),
    n_criterion = as.integer(n_criterion),
    raters_per_person = as.integer(raters_per_person),
    KEEP.OUT.ATTRS = FALSE,
    stringsAsFactors = FALSE
  )
  design_grid <- design_grid[design_grid$raters_per_person <= design_grid$n_rater, , drop = FALSE]
  if (nrow(design_grid) == 0) {
    stop("No valid design rows remain after enforcing `raters_per_person <= n_rater`.", call. = FALSE)
  }
  design_grid$design_id <- sprintf("S%02d", seq_len(nrow(design_grid)))
  design_grid <- design_grid[, c("design_id", "n_person", "n_rater", "n_criterion", "raters_per_person")]

  generator_model <- if (is.null(sim_spec)) model else sim_spec$model
  generator_step_facet <- if (is.null(sim_spec)) if (identical(generator_model, "PCM")) "Criterion" else NA_character_ else sim_spec$step_facet
  generator_assignment <- if (is.null(sim_spec)) "design_dependent" else sim_spec$assignment
  fit_step_facet <- simulation_resolve_fit_step_facet(model, step_facet, generator_step_facet)

  seeds <- with_preserved_rng_seed(
    seed,
    sample.int(.Machine$integer.max, size = nrow(design_grid) * reps, replace = FALSE)
  )
  seed_idx <- 0L

  result_rows <- vector("list", nrow(design_grid) * reps)
  rep_rows <- vector("list", nrow(design_grid) * reps)
  out_idx <- 0L

  for (i in seq_len(nrow(design_grid))) {
    design <- design_grid[i, , drop = FALSE]
    dif_target <- signal_eval_resolve_level(dif_level, "C", design$n_criterion, "dif_level")
    bias_rater_target <- signal_eval_resolve_level(bias_rater, "R", design$n_rater, "bias_rater")
    bias_criterion_target <- signal_eval_resolve_level(bias_criterion, "C", design$n_criterion, "bias_criterion")

    for (rep in seq_len(reps)) {
      seed_idx <- seed_idx + 1L
      dif_tbl <- tibble::tibble(
        Group = focal_group,
        Criterion = dif_target,
        Effect = as.numeric(dif_effect[1])
      )
      bias_tbl <- tibble::tibble(
        Rater = bias_rater_target,
        Criterion = bias_criterion_target,
        Effect = as.numeric(bias_effect[1])
      )

      row_spec <- if (is.null(sim_spec)) {
        NULL
      } else {
        simulation_override_spec_design(
          sim_spec,
          n_person = design$n_person,
          n_rater = design$n_rater,
          n_criterion = design$n_criterion,
          raters_per_person = design$raters_per_person,
          group_levels = group_levels,
          dif_effects = if (isTRUE(abs(as.numeric(dif_effect[1])) > 0)) dif_tbl else NULL,
          interaction_effects = if (isTRUE(abs(as.numeric(bias_effect[1])) > 0)) bias_tbl else NULL
        )
      }
      row_score_levels <- if (is.null(row_spec)) score_levels else row_spec$score_levels
      sim <- if (is.null(row_spec)) {
        simulate_mfrm_data(
          n_person = design$n_person,
          n_rater = design$n_rater,
          n_criterion = design$n_criterion,
          raters_per_person = design$raters_per_person,
          score_levels = score_levels,
          theta_sd = theta_sd,
          rater_sd = rater_sd,
          criterion_sd = criterion_sd,
          noise_sd = noise_sd,
          step_span = step_span,
          group_levels = group_levels,
          dif_effects = if (isTRUE(abs(as.numeric(dif_effect[1])) > 0)) dif_tbl else NULL,
          interaction_effects = if (isTRUE(abs(as.numeric(bias_effect[1])) > 0)) bias_tbl else NULL,
          seed = seeds[seed_idx]
        )
      } else {
        simulate_mfrm_data(sim_spec = row_spec, seed = seeds[seed_idx])
      }

      t0 <- proc.time()[["elapsed"]]
      fit_args <- list(
        data = sim,
        person = "Person",
        facets = c("Rater", "Criterion"),
        score = "Score",
        method = fit_method,
        model = model,
        maxit = maxit
      )
      if (identical(model, "PCM")) fit_args$step_facet <- fit_step_facet
      if ("Weight" %in% names(sim)) fit_args$weight <- "Weight"
      if (identical(fit_method, "MML")) fit_args$quad_points <- quad_points

      fit <- tryCatch(do.call(fit_mfrm, fit_args), error = function(e) e)
      diag <- if (inherits(fit, "error")) fit else {
        tryCatch(diagnose_mfrm(fit, residual_pca = residual_pca), error = function(e) e)
      }
      dif <- if (inherits(diag, "error")) diag else {
        tryCatch(
          analyze_dff(
            fit, diag,
            facet = "Criterion",
            group = "Group",
            data = sim,
            method = dif_method,
            min_obs = dif_min_obs,
            p_adjust = dif_p_adjust
          ),
          error = function(e) e
        )
      }
      bias <- if (inherits(diag, "error")) diag else {
        tryCatch(
          estimate_bias(fit, diag, facet_a = "Rater", facet_b = "Criterion", max_iter = bias_max_iter),
          error = function(e) e
        )
      }
      elapsed <- proc.time()[["elapsed"]] - t0
      converged <- !inherits(fit, "error") && isTRUE(as.logical(fit$summary$Converged[1]))

      err_msg <- character(0)
      if (inherits(fit, "error")) err_msg <- c(err_msg, conditionMessage(fit))
      if (inherits(diag, "error")) err_msg <- c(err_msg, conditionMessage(diag))
      if (inherits(dif, "error")) err_msg <- c(err_msg, conditionMessage(dif))
      if (inherits(bias, "error")) err_msg <- c(err_msg, conditionMessage(bias))
      run_ok <- length(err_msg) == 0L

      out_idx <- out_idx + 1L
      rep_rows[[out_idx]] <- tibble::tibble(
        design_id = design$design_id,
        rep = rep,
        n_person = design$n_person,
        n_rater = design$n_rater,
        n_criterion = design$n_criterion,
        raters_per_person = design$raters_per_person,
        Observations = nrow(sim),
        MinCategoryCount = min(tabulate(sim$Score, nbins = row_score_levels)),
        ElapsedSec = elapsed,
        RunOK = run_ok,
        Converged = converged,
        Error = if (length(err_msg) == 0L) NA_character_ else paste(unique(err_msg), collapse = " | ")
      )

      dif_target_row <- if (!inherits(dif, "error")) {
        signal_eval_find_dif_row(dif[["dif_table"]], dif_target, reference_group, focal_group)
      } else {
        tibble::tibble()
      }
      dif_target_p <- signal_eval_get_p(dif_target_row)
      dif_target_contrast <- if (nrow(dif_target_row) > 0) suppressWarnings(as.numeric(dif_target_row$Contrast[1])) else NA_real_
      dif_target_ets <- if (nrow(dif_target_row) > 0) as.character(dif_target_row$ETS[1]) else NA_character_
      dif_target_class <- if (nrow(dif_target_row) > 0) as.character(dif_target_row$Classification[1]) else NA_character_
      dif_target_class_system <- if (nrow(dif_target_row) > 0) as.character(dif_target_row$ClassificationSystem[1]) else NA_character_
      dif_detected <- is.finite(dif_target_p) && dif_target_p <= dif_p_cut &&
        is.finite(dif_target_contrast) && abs(dif_target_contrast) >= dif_abs_cut_effective
      dif_class_detected <- if (identical(dif_target_class_system, "ETS")) {
        !is.na(dif_target_ets) && dif_target_ets %in% c("B", "C")
      } else {
        identical(dif_target_class, "Screen positive")
      }

      dif_fp_rate <- NA_real_
      if (!inherits(dif, "error")) {
        dif_non_target <- tibble::as_tibble(dif[["dif_table"]]) |>
          dplyr::filter(
            .data$Level != dif_target,
            (.data$Group1 == reference_group & .data$Group2 == focal_group) |
              (.data$Group1 == focal_group & .data$Group2 == reference_group)
          ) |>
          dplyr::mutate(
            p_eval = dplyr::if_else(is.finite(.data$p_adjusted), .data$p_adjusted, .data$p_value),
            Flag = is.finite(.data$p_eval) & .data$p_eval <= dif_p_cut &
              is.finite(.data$Contrast) & abs(.data$Contrast) >= dif_abs_cut_effective
          )
        dif_fp_rate <- signal_eval_false_positive_rate(dif_non_target$Flag)
      }

      bias_target_row <- if (!inherits(bias, "error")) {
        signal_eval_find_bias_row(bias[["table"]], bias_rater_target, bias_criterion_target)
      } else {
        tibble::tibble()
      }
      bias_target_p <- if (nrow(bias_target_row) > 0) suppressWarnings(as.numeric(bias_target_row$`Prob.`[1])) else NA_real_
      bias_target_t <- if (nrow(bias_target_row) > 0) suppressWarnings(as.numeric(bias_target_row$t[1])) else NA_real_
      bias_target_size <- if (nrow(bias_target_row) > 0) suppressWarnings(as.numeric(bias_target_row$`Bias Size`[1])) else NA_real_
      bias_metric_available <- is.finite(bias_target_p) && is.finite(bias_target_t)
      bias_detected <- is.finite(bias_target_p) && bias_target_p <= bias_p_cut &&
        is.finite(bias_target_t) && abs(bias_target_t) >= bias_abs_t

      bias_fp_rate <- NA_real_
      if (!inherits(bias, "error")) {
        bias_non_target <- tibble::as_tibble(bias[["table"]]) |>
          dplyr::filter(!(.data$FacetA_Level == bias_rater_target & .data$FacetB_Level == bias_criterion_target)) |>
          dplyr::mutate(
            Flag = is.finite(.data$`Prob.`) & .data$`Prob.` <= bias_p_cut &
              is.finite(.data$t) & abs(.data$t) >= bias_abs_t
          )
        bias_fp_rate <- signal_eval_false_positive_rate(bias_non_target$Flag)
      }

      result_rows[[out_idx]] <- tibble::tibble(
        design_id = design$design_id,
        rep = rep,
        n_person = design$n_person,
        n_rater = design$n_rater,
        n_criterion = design$n_criterion,
        raters_per_person = design$raters_per_person,
        Observations = nrow(sim),
        ElapsedSec = elapsed,
        Converged = converged,
        DIFTargetLevel = dif_target,
        DIFContrast = dif_target_contrast,
        DIFP = dif_target_p,
        DIFClassificationSystem = dif_target_class_system,
        DIFClassification = dif_target_class,
        DIFETS = dif_target_ets,
        DIFDetected = dif_detected,
        DIFClassDetected = dif_class_detected,
        DIFFalsePositiveRate = dif_fp_rate,
        BiasTargetRater = bias_rater_target,
        BiasTargetCriterion = bias_criterion_target,
        BiasSize = bias_target_size,
        BiasP = bias_target_p,
        BiasT = bias_target_t,
        BiasScreenMetricAvailable = bias_metric_available,
        BiasDetected = bias_detected,
        BiasScreenFalsePositiveRate = bias_fp_rate
      )
    }
  }

  structure(
    list(
      design_grid = design_grid,
      results = dplyr::bind_rows(result_rows),
      rep_overview = dplyr::bind_rows(rep_rows),
      settings = list(
        group_levels = group_levels,
        reference_group = reference_group,
        focal_group = focal_group,
        dif_level = dif_level,
        dif_effect = dif_effect,
        bias_rater = bias_rater,
        bias_criterion = bias_criterion,
        bias_effect = bias_effect,
        reps = reps,
        score_levels = score_levels,
        theta_sd = theta_sd,
        rater_sd = rater_sd,
        criterion_sd = criterion_sd,
        noise_sd = noise_sd,
        step_span = step_span,
        fit_method = fit_method,
        model = model,
        step_facet = fit_step_facet,
        maxit = maxit,
        quad_points = quad_points,
        residual_pca = residual_pca,
        dif_method = dif_method,
        dif_min_obs = dif_min_obs,
        dif_p_adjust = dif_p_adjust,
        dif_p_cut = dif_p_cut,
        dif_abs_cut = dif_abs_cut_effective,
        dif_abs_cut_input = dif_abs_cut,
        bias_max_iter = bias_max_iter,
        bias_p_cut = bias_p_cut,
        bias_abs_t = bias_abs_t,
        sim_spec = sim_spec,
        generator_model = generator_model,
        generator_step_facet = generator_step_facet,
        generator_assignment = generator_assignment,
        seed = seed
      ),
      ademp = simulation_build_ademp(
        purpose = "Assess DIF detection and interaction-bias screening behavior under repeated parametric many-facet simulations with known injected targets.",
        design_grid = design_grid,
        generator_model = generator_model,
        generator_step_facet = generator_step_facet,
        generator_assignment = generator_assignment,
        sim_spec = sim_spec,
        estimands = c(
          "DIF target-flag rate and non-target flag rate",
          "Bias screening hit rate and screening false-positive rate",
          "Target contrast and target bias summaries",
          "Convergence rate and elapsed time"
        ),
        analysis_methods = list(
          fit_method = fit_method,
          fitted_model = model,
          dif_method = dif_method,
          bias_method = "estimate_bias_screening",
          maxit = maxit,
          quad_points = if (identical(fit_method, "MML")) quad_points else NA_integer_,
          residual_pca = residual_pca
        ),
        performance_measures = c(
          "Mean detection/screening summaries across replications",
          "MCSE for means and rates",
          "Convergence rate",
          "Bias-screen metric availability rate"
        )
      )
    ),
    class = "mfrm_signal_detection"
  )
}

#' Summarize a DIF/bias screening simulation
#'
#' @param object Output from [evaluate_mfrm_signal_detection()].
#' @param digits Number of digits used in numeric summaries.
#' @param ... Reserved for generic compatibility.
#'
#' @return An object of class `summary.mfrm_signal_detection` with:
#' - `overview`: run-level overview
#' - `detection_summary`: aggregated detection rates by design
#' - `ademp`: simulation-study metadata carried forward from the original object
#' - `notes`: short interpretation notes, including the bias-side screening caveat
#' @seealso [evaluate_mfrm_signal_detection()], [plot.mfrm_signal_detection]
#' @examples
#' sig_eval <- suppressWarnings(evaluate_mfrm_signal_detection(
#'   n_person = 20,
#'   n_rater = 3,
#'   n_criterion = 3,
#'   raters_per_person = 2,
#'   reps = 1,
#'   maxit = 10,
#'   bias_max_iter = 1,
#'   seed = 123
#' ))
#' summary(sig_eval)
#' @export
summary.mfrm_signal_detection <- function(object, digits = 3, ...) {
  if (!is.list(object) || is.null(object$results) || is.null(object$rep_overview)) {
    stop("`object` must be output from evaluate_mfrm_signal_detection().")
  }
  digits <- max(0L, as.integer(digits[1]))
  out <- signal_eval_summary(object$results, object$rep_overview)

  round_df <- function(df) {
    if (!is.data.frame(df) || nrow(df) == 0) return(df)
    num_cols <- vapply(df, is.numeric, logical(1))
    df[num_cols] <- lapply(df[num_cols], round, digits = digits)
    df
  }

  out$overview <- round_df(out$overview)
  out$detection_summary <- round_df(out$detection_summary)
  out$ademp <- object$ademp %||% NULL
  class(out) <- "summary.mfrm_signal_detection"
  out
}

signal_detection_metric_label <- function(signal, metric_col) {
  if (identical(signal, "dif")) {
    return(switch(
      metric_col,
      DIFPower = "DIF target-flag rate",
      DIFFalsePositiveRate = "DIF non-target flag rate",
      MeanDIFEstimate = "Mean target contrast",
      metric_col
    ))
  }
  switch(
    metric_col,
    BiasScreenRate = "Bias screening hit rate",
    BiasScreenFalsePositiveRate = "Bias screening false-positive rate",
    MeanBiasEstimate = "Mean bias estimate",
    metric_col
  )
}

#' Plot DIF/bias screening simulation results
#'
#' @param x Output from [evaluate_mfrm_signal_detection()].
#' @param signal Whether to plot DIF or bias screening results.
#' @param metric Metric to plot. For `signal = "bias"`, `metric = "power"`
#'   maps to the screening hit rate (`BiasScreenRate`).
#' @param x_var Design variable used on the x-axis.
#' @param group_var Optional design variable used for separate lines.
#' @param draw If `TRUE`, draw with base graphics; otherwise return plotting data.
#' @param ... Reserved for generic compatibility.
#'
#' @return If `draw = TRUE`, invisibly returns plotting data. If `draw = FALSE`,
#'   returns that plotting-data list directly. The returned list includes
#'   `display_metric` and `interpretation_note` so callers can label bias-side
#'   plots as screening summaries rather than formal power/error-rate displays.
#' @seealso [evaluate_mfrm_signal_detection()], [summary.mfrm_signal_detection]
#' @examples
#' sig_eval <- suppressWarnings(evaluate_mfrm_signal_detection(
#'   n_person = 20,
#'   n_rater = 3,
#'   n_criterion = 3,
#'   raters_per_person = 2,
#'   reps = 1,
#'   maxit = 10,
#'   bias_max_iter = 1,
#'   seed = 123
#' ))
#' plot(sig_eval, signal = "dif", metric = "power", x_var = "n_person", draw = FALSE)
#' @export
plot.mfrm_signal_detection <- function(x,
                                       signal = c("dif", "bias"),
                                       metric = c("power", "false_positive", "estimate",
                                                  "screen_rate", "screen_false_positive"),
                                       x_var = c("n_person", "n_rater", "n_criterion", "raters_per_person"),
                                       group_var = NULL,
                                       draw = TRUE,
                                       ...) {
  if (!is.list(x) || is.null(x$results) || is.null(x$rep_overview)) {
    stop("`x` must be output from evaluate_mfrm_signal_detection().")
  }
  signal <- match.arg(signal)
  metric <- match.arg(metric)
  if (identical(signal, "bias")) {
    metric <- switch(
      metric,
      screen_rate = "power",
      screen_false_positive = "false_positive",
      metric
    )
  }
  x_var <- match.arg(x_var)

  sum_obj <- signal_eval_summary(x$results, x$rep_overview)
  plot_tbl <- tibble::as_tibble(sum_obj$detection_summary)
  if (nrow(plot_tbl) == 0) stop("No detection-summary rows available for plotting.")

  metric_col <- signal_eval_metric_col(signal, metric)
  varying <- c("n_person", "n_rater", "n_criterion", "raters_per_person")
  varying <- varying[varying != x_var]
  if (is.null(group_var)) {
    cand <- varying[vapply(plot_tbl[varying], function(col) length(unique(col)) > 1L, logical(1))]
    group_var <- if (length(cand) > 0) cand[1] else NULL
  } else {
    if (!group_var %in% c("n_person", "n_rater", "n_criterion", "raters_per_person")) {
      stop("`group_var` must be one of the evaluated design variables.")
    }
    if (identical(group_var, x_var)) {
      stop("`group_var` must differ from `x_var`.")
    }
  }

  if (is.null(group_var)) {
    agg_tbl <- plot_tbl |>
      dplyr::group_by(.data[[x_var]]) |>
      dplyr::summarize(y = mean(.data[[metric_col]], na.rm = TRUE), .groups = "drop") |>
      dplyr::arrange(.data[[x_var]]) |>
      dplyr::mutate(group = "All designs")
  } else {
    agg_tbl <- plot_tbl |>
      dplyr::group_by(.data[[x_var]], .data[[group_var]]) |>
      dplyr::summarize(y = mean(.data[[metric_col]], na.rm = TRUE), .groups = "drop") |>
      dplyr::arrange(.data[[x_var]], .data[[group_var]]) |>
      dplyr::rename(group = dplyr::all_of(group_var))
  }

  out <- list(
    plot = "signal_detection",
    signal = signal,
    metric = metric,
    metric_col = metric_col,
    display_metric = signal_detection_metric_label(signal, metric_col),
    interpretation_note = if (identical(signal, "bias")) {
      "Bias-side rates summarize screening behavior from estimate_bias(); they are not formal inferential power or alpha estimates."
    } else {
      "DIF-side rates summarize target/non-target flagging behavior under the selected DFF method and threshold settings."
    },
    x_var = x_var,
    group_var = group_var,
    data = agg_tbl
  )
  if (!isTRUE(draw)) return(out)

  groups <- unique(as.character(agg_tbl$group))
  cols <- grDevices::hcl.colors(max(1L, length(groups)), "Set 2")
  x_vals <- sort(unique(agg_tbl[[x_var]]))
  y_range <- range(agg_tbl$y, na.rm = TRUE)
  if (!all(is.finite(y_range))) stop("Selected metric has no finite values to plot.")

  graphics::plot(
    x = x_vals,
    y = rep(NA_real_, length(x_vals)),
    type = "n",
    xlab = x_var,
    ylab = out$display_metric,
    main = if (identical(signal, "bias")) {
      paste("Bias screening simulation:", out$display_metric)
    } else {
      paste("DIF screening simulation:", out$display_metric)
    },
    ylim = y_range
  )
  for (i in seq_along(groups)) {
    sub <- agg_tbl[as.character(agg_tbl$group) == groups[i], , drop = FALSE]
    sub <- sub[order(sub[[x_var]]), , drop = FALSE]
    graphics::lines(sub[[x_var]], sub$y, type = "b", lwd = 2, pch = 16 + (i - 1L) %% 5L, col = cols[i])
  }
  if (length(groups) > 1L) {
    graphics::legend("topleft", legend = groups, col = cols, lty = 1, lwd = 2, pch = 16 + (seq_along(groups) - 1L) %% 5L, bty = "n")
  }

  invisible(out)
}

Try the mfrmr package in your browser

Any scripts or data that you put into this service are public.

mfrmr documentation built on March 31, 2026, 1:06 a.m.