simChef: Intensive Computational Experiments Made Easy

eval_feature_selection_err_funs

R Documentation

Evaluate and/or summarize feature selection errors.

Description

Evaluate various feature selection metrics, given the true feature support and the estimated feature support. eval_feature_selection_err() evaluates the various feature selection metrics for each experimental replicate separately. summarize_feature_selection_err() summarizes the various feature selection metrics across experimental replicates.

Usage

eval_feature_selection_err(
  fit_results,
  vary_params = NULL,
  nested_cols = NULL,
  truth_col,
  estimate_col = NULL,
  imp_col,
  group_cols = NULL,
  metrics = NULL,
  na_rm = FALSE
)

summarize_feature_selection_err(
  fit_results,
  vary_params = NULL,
  nested_cols = NULL,
  truth_col,
  estimate_col = NULL,
  imp_col,
  group_cols = NULL,
  metrics = NULL,
  na_rm = FALSE,
  summary_funs = c("mean", "median", "min", "max", "sd", "raw"),
  custom_summary_funs = NULL,
  eval_id = "feature_selection"
)

Arguments

`fit_results`	A tibble, as returned by `fit_experiment()`.
`vary_params`	A vector of `DGP` or `Method` parameter names that are varied in the `Experiment`.
`nested_cols`	(Optional) A character string or vector specifying the name of the column(s) in `fit_results` that need to be unnested before evaluating results. Default is `NULL`, meaning no columns in `fit_results` need to be unnested prior to computation.
`truth_col`	A character string identifying the column in `fit_results` with the true feature support data. Each element in this column should be an array of length `p`, where `p` is the number of features. Elements in this array should be binary with `TRUE` or `1` meaning the feature (corresponding to that slot) is in the support and `FALSE` or `0` meaning the feature is not in the support.
`estimate_col`	An (optional) character string identifying the column in `fit_results` with the estimated feature support data. Each element in this column should be an array of length `p`, where `p` is the number of features and the feature order aligns with that of `truth_col`. Elements in this array should be binary with `TRUE` or `1` meaning the feature (corresponding to that slot) is in the estimated support and `FALSE` or `0` meaning the feature is not in the estimated support. If `NULL` (default), the non-zero elements of `imp_col` are used as the estimated feature support.
`imp_col`	A character string identifying the column in `fit_results` with the estimated feature importance data. Each element in this column should be an array of length `p`, where `p` is the number of features and the feature order aligns with that of `truth_col`. Elements in this array should be numeric where a higher magnitude indicates a more important feature.
`group_cols`	(Optional) A character string or vector specifying the column(s) to group rows by before evaluating metrics. This is useful for assessing within-group metrics.
`metrics`	A `metric_set` object indicating the metrics to evaluate. See `yardstick::metric_set()` for more details. Default `NULL` will evaluate the following: number of true positives (`tp`), number of false positives (`fp`), sensitivity (`sens`), specificity (`spec`), positive predictive value (`ppv`), number of features in the estimated support (`pos`), number of features not in the estimated support (`neg`), AUROC (`roc_auc`), and AUPRC (`pr_auc`). If `na_rm = TRUE`, the number of NA values (`num_na`) is also computed.
`na_rm`	A `logical` value indicating whether `NA` values should be stripped before the computation proceeds.
`summary_funs`	Character vector specifying how to summarize evaluation metrics. Each element must be one of the built-in summary functions: "mean", "median", "min", "max", "sd", "raw".
`custom_summary_funs`	Named list of custom functions to summarize results. Names in the list should correspond to the name of the summary function. Each value in the list should be a function that takes one argument, namely the values of the evaluated metrics.
`eval_id`	Character string. ID to be used as a suffix when naming result columns. Default is `"feature_selection"`. If `NULL`, no ID is added to the column names.
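
To make the count-based default metrics concrete, here is a minimal base R sketch that computes them by hand for a single replicate. It is an illustration under stated assumptions, not simChef's implementation: the toy vectors and the variable names (`truth`, `importance`, `estimate`, `metric_values`) are hypothetical, and the fallback to non-zero importances mirrors the documented behavior of `estimate_col = NULL`.

```r
# Hypothetical toy data: three features, aligned across all vectors.
truth <- c(TRUE, FALSE, TRUE)   # true feature support
importance <- c(10, 0.3, 0)     # estimated feature importance scores
estimate <- NULL                # no explicit estimated support supplied

# With no estimated support, fall back to the non-zero importances.
if (is.null(estimate)) {
  estimate <- importance != 0
}

# Confusion-matrix counts.
tp <- sum(truth & estimate)     # in both the true and estimated support
fp <- sum(!truth & estimate)    # selected, but not in the true support
fn <- sum(truth & !estimate)    # in the true support, but not selected
tn <- sum(!truth & !estimate)   # in neither support

metric_values <- c(
  tp = tp,
  fp = fp,
  sens = tp / (tp + fn),        # sensitivity
  spec = tn / (tn + fp),        # specificity
  ppv = tp / (tp + fp),         # positive predictive value
  pos = sum(estimate),          # features in the estimated support
  neg = sum(!estimate)          # features not in the estimated support
)
print(metric_values)
```

AUROC and AUPRC are omitted from this sketch because they depend on the ranking of the importance scores rather than on the binary support alone.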

Value

The output of eval_feature_selection_err() is a tibble with the following columns:

.rep: Replicate ID.
.dgp_name: Name of DGP.
.method_name: Name of Method.
.metric: Name of the evaluation metric.
.estimate: Value of the evaluation metric.

as well as any columns specified by group_cols and vary_params.

The output of summarize_feature_selection_err() is a grouped tibble containing both identifying information and the feature selection results aggregated over experimental replicates. Specifically, the identifier columns include .dgp_name, .method_name, any columns specified by group_cols and vary_params, and .metric. In addition, there are results columns corresponding to the requested statistics in summary_funs and custom_summary_funs. These columns end in the suffix specified by eval_id.
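
As a rough illustration of what aggregating over replicates means, the base R sketch below summarizes a hand-made table in the long format described above. It is a conceptual sketch only and does not reproduce simChef's internals: the toy values are made up, the helper names (`summary_funs_list`, `summaries`, `eval_summary`) are hypothetical, and joining the function name and `eval_id` with an underscore is an assumption about the column naming.

```r
# Hypothetical per-replicate results in the long format described above.
eval_results <- data.frame(
  .rep = c(1, 2, 1, 2),
  .dgp_name = "DGP1",
  .method_name = "Method",
  .metric = c("sens", "sens", "spec", "spec"),
  .estimate = c(0.5, 1.0, 1.0, 0.5)
)

eval_id <- "feature_selection"  # suffix for the result columns

# Two built-in style summaries plus one custom function of a single argument.
summary_funs_list <- list(
  mean = mean,
  sd = sd,
  range = function(x) max(x) - min(x)
)

# Apply each summary function within every (DGP, method, metric) group.
summaries <- lapply(names(summary_funs_list), function(fun_name) {
  out <- aggregate(
    .estimate ~ .dgp_name + .method_name + .metric,
    data = eval_results,
    FUN = summary_funs_list[[fun_name]]
  )
  # Assumed naming scheme: <summary function>_<eval_id>.
  names(out)[names(out) == ".estimate"] <- paste(fun_name, eval_id, sep = "_")
  out
})

# Join the per-function tables on the identifier columns.
eval_summary <- Reduce(
  function(x, y) merge(x, y, by = c(".dgp_name", ".method_name", ".metric")),
  summaries
)
print(eval_summary)
```

Each row of `eval_summary` corresponds to one combination of identifier columns, with one results column per summary function.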

Examples

# generate example fit_results data for a feature selection problem
fit_results <- tibble::tibble(
  .rep = rep(1:2, times = 2),
  .dgp_name = c("DGP1", "DGP1", "DGP2", "DGP2"),
  .method_name = c("Method"),
  feature_info = lapply(
    1:4,
    FUN = function(i) {
      tibble::tibble(
        # feature names
        feature = c("featureA", "featureB", "featureC"),
        # true feature support
        true_support = c(TRUE, FALSE, TRUE),
        # estimated feature support
        est_support = c(TRUE, FALSE, FALSE),
        # estimated feature importance scores
        est_importance = c(10, runif(2, min = -2, max = 2))
      )
    }
  )
)

# evaluate feature selection (using all default metrics) for each replicate
eval_results <- eval_feature_selection_err(
  fit_results,
  nested_cols = "feature_info",
  truth_col = "true_support",
  estimate_col = "est_support",
  imp_col = "est_importance"
)
# summarize feature selection errors (using all default metrics) across replicates
eval_results_summary <- summarize_feature_selection_err(
  fit_results,
  nested_cols = "feature_info",
  truth_col = "true_support",
  estimate_col = "est_support",
  imp_col = "est_importance"
)

# evaluate/summarize feature selection errors using specific yardstick metrics
metrics <- yardstick::metric_set(yardstick::sens, yardstick::spec)
eval_results <- eval_feature_selection_err(
  fit_results,
  nested_cols = "feature_info",
  truth_col = "true_support",
  estimate_col = "est_support",
  imp_col = "est_importance",
  metrics = metrics
)
eval_results_summary <- summarize_feature_selection_err(
  fit_results,
  nested_cols = "feature_info",
  truth_col = "true_support",
  estimate_col = "est_support",
  imp_col = "est_importance",
  metrics = metrics
)

# summarize feature selection errors using a custom summary function
range_fun <- function(x) return(max(x) - min(x))
eval_results_summary <- summarize_feature_selection_err(
  fit_results,
  nested_cols = "feature_info",
  truth_col = "true_support",
  estimate_col = "est_support",
  imp_col = "est_importance",
  custom_summary_funs = list(range_feature_selection = range_fun)
)

Yu-Group/simChef documentation built on Feb. 27, 2025, 9:19 p.m.