simChef: Intensive Computational Experiments Made Easy

eval_feature_selection_curve_funs

R Documentation

Evaluate and/or summarize ROC or PR curves for feature selection.

Description

Evaluate the ROC or PR curves corresponding to the selected features, given the true feature support and the estimated feature importances. eval_feature_selection_curve() evaluates the ROC or PR curve for each experimental replicate separately. summarize_feature_selection_curve() summarizes the ROC or PR curve across experimental replicates.
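As a rough illustration of what such a curve represents, the base R sketch below builds ROC and PR coordinates by ranking features from most to least important and selecting the top k features for each k. The inputs `truth` and `imp` are made-up values, ties between scores are ignored, and this is not necessarily how simChef computes the curves internally.

```r
# Hypothetical inputs: true support and estimated importances for 5 features
truth <- c(TRUE, FALSE, TRUE, FALSE, FALSE)
imp <- c(0.9, 0.8, 0.6, 0.3, 0.1)

# Rank features from most to least important
ord <- order(imp, decreasing = TRUE)
truth_sorted <- truth[ord]

# Cumulative true/false positives when selecting the top k features
tp <- cumsum(truth_sorted)
fp <- cumsum(!truth_sorted)

# ROC coordinates, including the "select nothing" point at (0, 0)
roc <- data.frame(
  .threshold = c(Inf, imp[ord]),
  FPR = c(0, fp) / sum(!truth),
  TPR = c(0, tp) / sum(truth)
)

# PR coordinates for k = 1, ..., p selected features
pr <- data.frame(
  .threshold = imp[ord],
  recall = tp / sum(truth),
  precision = tp / seq_along(tp)
)
```

Each row corresponds to one importance threshold: a feature counts as selected when its importance is at least that threshold.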

Usage

eval_feature_selection_curve(
  fit_results,
  vary_params = NULL,
  nested_cols = NULL,
  truth_col,
  imp_col,
  group_cols = NULL,
  curve = c("ROC", "PR"),
  na_rm = FALSE
)

summarize_feature_selection_curve(
  fit_results,
  vary_params = NULL,
  nested_cols = NULL,
  truth_col,
  imp_col,
  group_cols = NULL,
  curve = c("ROC", "PR"),
  na_rm = FALSE,
  x_grid = seq(0, 1, by = 0.01),
  summary_funs = c("mean", "median", "min", "max", "sd", "raw"),
  custom_summary_funs = NULL,
  eval_id = ifelse(curve == "PR", "precision", "TPR")
)

Arguments

`fit_results`	A tibble, as returned by `fit_experiment()`.
`vary_params`	A vector of `DGP` or `Method` parameter names that are varied across the `Experiment`.
`nested_cols`	(Optional) A character string or vector specifying the name of the column(s) in `fit_results` that need to be unnested before evaluating results. Default is `NULL`, meaning no columns in `fit_results` need to be unnested prior to computation.
`truth_col`	A character string identifying the column in `fit_results` with the true feature support data. Each element in this column should be an array of length `p`, where `p` is the number of features. Elements in this array should be binary with `TRUE` or `1` meaning the feature (corresponding to that slot) is in the support and `FALSE` or `0` meaning the feature is not in the support.
`imp_col`	A character string identifying the column in `fit_results` with the estimated feature importance data. Each element in this column should be an array of length `p`, where `p` is the number of features and the feature order aligns with that of `truth_col`. Elements in this array should be numeric where a higher magnitude indicates a more important feature.
`group_cols`	(Optional) A character string or vector specifying the column(s) to group rows by before evaluating metrics. This is useful for assessing within-group metrics.
`curve`	Either "ROC" or "PR" indicating whether to evaluate the ROC or Precision-Recall curve.
`na_rm`	A `logical` value indicating whether `NA` values should be stripped before the computation proceeds.
`x_grid`	Vector of values between 0 and 1 at which to evaluate the ROC or PR curve. If `curve = "ROC"`, these are the FPR values at which to evaluate the TPR; if `curve = "PR"`, they are the recall values at which to evaluate the precision.
`summary_funs`	Character vector specifying how to summarize evaluation metrics. Each element must be one of the built-in summary functions: "mean", "median", "min", "max", "sd", "raw".
`custom_summary_funs`	Named list of custom functions to summarize results. Each name in the list is used as the name of the corresponding summary function. Each value should be a function that takes a single argument, the values of the evaluated metrics.
`eval_id`	Character string. ID to be used as a suffix when naming result columns. Defaults to "precision" if `curve = "PR"` and "TPR" otherwise. `NULL` does not add any ID to the column names.
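The shape expected for `custom_summary_funs` can be sketched in base R as follows. The function names `q90` and `range_width` and the vector `metric_values` are hypothetical, chosen only to show a named list of one-argument functions being applied to a set of metric values.

```r
# Hypothetical custom summaries: each takes one argument (the metric values)
custom_summary_funs <- list(
  q90 = function(x) unname(stats::quantile(x, 0.9)),
  range_width = function(x) max(x) - min(x)
)

# Made-up metric values, e.g., the TPR at one FPR value across five replicates
metric_values <- c(0.2, 0.4, 0.6, 0.8, 1.0)

# Apply every custom summary function to the same values
custom_results <- vapply(
  custom_summary_funs,
  function(f) f(metric_values),
  numeric(1)
)
```

The names of the list (`q90`, `range_width`) carry through to the names of `custom_results`, mirroring how the list names identify each summary.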

Value

The output of eval_feature_selection_curve() is a tibble with the following columns:

.rep: Replicate ID.
.dgp_name: Name of DGP.
.method_name: Name of Method.
curve_estimate: A list of tibbles with x and y coordinate values for the ROC/PR curve for the given experimental replicate. If curve = "ROC", the tibble has the columns .threshold, FPR, and TPR for the threshold, false positive rate, and true positive rate, respectively. If curve = "PR", the tibble has the columns .threshold, recall, and precision.

as well as any columns specified by group_cols and vary_params.
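Because curve_estimate is a list column, it usually needs to be unnested before plotting or further processing. The sketch below uses a small hand-built tibble, `mock_results`, with made-up values standing in for real output from eval_feature_selection_curve(). It assumes the tibble and tidyr packages are installed.

```r
library(tibble)
library(tidyr)

# Hand-built stand-in for eval_feature_selection_curve() output (made-up values)
mock_results <- tibble(
  .rep = 1:2,
  .dgp_name = "DGP1",
  .method_name = "Method",
  curve_estimate = list(
    tibble(.threshold = c(Inf, 0.9, 0.1), FPR = c(0, 0, 1), TPR = c(0, 1, 1)),
    tibble(.threshold = c(Inf, 0.7, 0.2), FPR = c(0, 0.5, 1), TPR = c(0, 0.5, 1))
  )
)

# Expand the list column so that each row is one point on one replicate's curve
curve_points <- unnest(mock_results, cols = curve_estimate)
```

After unnesting, the identifier columns are repeated for every point on the corresponding curve.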

The output of summarize_feature_selection_curve() is a grouped tibble containing both identifying information and the feature selection curve results aggregated over experimental replicates. Specifically, the identifier columns include .dgp_name, .method_name, and any columns specified by group_cols and vary_params. In addition, there are results columns corresponding to the requested statistics in summary_funs and custom_summary_funs. If curve = "ROC", these results columns include FPR and others that end in the suffix "_TPR". If curve = "PR", the results columns include recall and others that end in the suffix "_precision".
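One way to picture the aggregation is to evaluate each replicate's curve on a shared grid and then summarize pointwise across replicates. The base R sketch below does this with linear interpolation. The interpolation method, the toy curves in `rep_curves`, and the column names `mean_TPR` and `sd_TPR` are assumptions for illustration, not a statement of how simChef implements the summary.

```r
# Shared grid of FPR values (coarser than the default, for readability)
x_grid <- seq(0, 1, by = 0.25)

# Made-up ROC curves for two replicates
rep_curves <- list(
  data.frame(FPR = c(0, 0.5, 1), TPR = c(0, 1, 1)),
  data.frame(FPR = c(0, 0.5, 1), TPR = c(0, 0.5, 1))
)

# Interpolate each curve onto the grid: rows are grid points, columns are replicates
tpr_on_grid <- sapply(
  rep_curves,
  function(curve) stats::approx(curve$FPR, curve$TPR, xout = x_grid)$y
)

# Summarize pointwise across replicates
summary_tbl <- data.frame(
  FPR = x_grid,
  mean_TPR = rowMeans(tpr_on_grid),
  sd_TPR = apply(tpr_on_grid, 1, stats::sd)
)
```

Each row of `summary_tbl` describes the distribution of TPR values across replicates at one fixed FPR value.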

Examples

# generate example fit_results data for a feature selection problem
fit_results <- tibble::tibble(
  .rep = rep(1:2, times = 2),
  .dgp_name = c("DGP1", "DGP1", "DGP2", "DGP2"),
  .method_name = c("Method"),
  feature_info = lapply(
    1:4,
    FUN = function(i) {
      tibble::tibble(
        # feature names
        feature = c("featureA", "featureB", "featureC"),
        # true feature support
        true_support = c(TRUE, FALSE, TRUE),
        # estimated feature importance scores
        est_importance = c(10, runif(2, min = -2, max = 2))
      )
    }
  )
)

# evaluate feature selection ROC/PR curves for each replicate
roc_results <- eval_feature_selection_curve(
  fit_results,
  curve = "ROC",
  nested_cols = "feature_info",
  truth_col = "true_support",
  imp_col = "est_importance"
)
pr_results <- eval_feature_selection_curve(
  fit_results,
  curve = "PR",
  nested_cols = "feature_info",
  truth_col = "true_support",
  imp_col = "est_importance"
)
# summarize feature selection ROC/PR curves across replicates
roc_summary <- summarize_feature_selection_curve(
  fit_results,
  curve = "ROC",
  nested_cols = "feature_info",
  truth_col = "true_support",
  imp_col = "est_importance"
)
pr_summary <- summarize_feature_selection_curve(
  fit_results,
  curve = "PR",
  nested_cols = "feature_info",
  truth_col = "true_support",
  imp_col = "est_importance"
)

Yu-Group/simChef documentation built on Feb. 27, 2025, 9:19 p.m.