eval_feature_selection_curve_funs: Evaluate and/or summarize ROC or PR curves for feature...

eval_feature_selection_curve_funs    R Documentation

Evaluate and/or summarize ROC or PR curves for feature selection.

Description

Evaluate the ROC or PR curves corresponding to the selected features, given the true feature support and the estimated feature importances. eval_feature_selection_curve() evaluates the ROC or PR curve for each experimental replicate separately. summarize_feature_selection_curve() summarizes the ROC or PR curve across experimental replicates.

Usage

eval_feature_selection_curve(
  fit_results,
  vary_params = NULL,
  nested_cols = NULL,
  truth_col,
  imp_col,
  group_cols = NULL,
  curve = c("ROC", "PR"),
  na_rm = FALSE
)

summarize_feature_selection_curve(
  fit_results,
  vary_params = NULL,
  nested_cols = NULL,
  truth_col,
  imp_col,
  group_cols = NULL,
  curve = c("ROC", "PR"),
  na_rm = FALSE,
  x_grid = seq(0, 1, by = 0.01),
  summary_funs = c("mean", "median", "min", "max", "sd", "raw"),
  custom_summary_funs = NULL,
  eval_id = ifelse(curve == "PR", "precision", "TPR")
)

Arguments

fit_results

A tibble, as returned by fit_experiment().

vary_params

A vector of DGP or Method parameter names that are varied across the Experiment.

nested_cols

(Optional) A character string or vector specifying the name of the column(s) in fit_results that need to be unnested before evaluating results. Default is NULL, meaning no columns in fit_results need to be unnested prior to computation.

truth_col

A character string identifying the column in fit_results with the true feature support data. Each element in this column should be an array of length p, where p is the number of features. Elements in this array should be binary with TRUE or 1 meaning the feature (corresponding to that slot) is in the support and FALSE or 0 meaning the feature is not in the support.

imp_col

A character string identifying the column in fit_results with the estimated feature importance data. Each element in this column should be an array of length p, where p is the number of features and the feature order aligns with that of truth_col. Elements in this array should be numeric where a higher magnitude indicates a more important feature.

group_cols

(Optional) A character string or vector specifying the column(s) to group rows by before evaluating metrics. This is useful for assessing within-group metrics.

curve

Either "ROC" or "PR" indicating whether to evaluate the ROC or Precision-Recall curve.

na_rm

A logical value indicating whether NA values should be stripped before the computation proceeds.

x_grid

Vector of values between 0 and 1 at which to evaluate the ROC or PR curve. If curve = "ROC", the provided values are the FPR values at which to evaluate the TPR. If curve = "PR", they are the recall values at which to evaluate the precision.

summary_funs

Character vector specifying how to summarize evaluation metrics. Must choose from a built-in library of summary functions - elements of the vector must be one of "mean", "median", "min", "max", "sd", "raw".

custom_summary_funs

Named list of custom functions to summarize results. Names in the list should correspond to the name of the summary function. Values in the list should be a function that takes in one argument, that being the values of the evaluated metrics.

eval_id

Character string. ID to be used as a suffix when naming result columns. The default is "precision" if curve = "PR" and "TPR" otherwise. If NULL, no ID is added to the column names.
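
For custom_summary_funs, a minimal sketch of a valid named list is shown here. The function names q25 and range_width and the vector metric_values are hypothetical, chosen only for illustration. Each function takes a single argument, the vector of evaluated metric values.

```r
# Hypothetical custom summaries; the list names label the summary functions
custom_summary_funs <- list(
  # lower quartile of the metric values
  q25 = function(x) unname(quantile(x, probs = 0.25, na.rm = TRUE)),
  # width of the observed range of the metric values
  range_width = function(x) diff(range(x, na.rm = TRUE))
)

# Made-up metric values (e.g., TPR at one grid point across four replicates)
metric_values <- c(0.5, 0.75, 1, 1)

# Apply each custom summary function to the same vector of values
sapply(custom_summary_funs, function(f) f(metric_values))
```

Such a list could then be supplied as custom_summary_funs = custom_summary_funs in a call to summarize_feature_selection_curve().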

Value

The output of eval_feature_selection_curve() is a tibble with the following columns:

.rep

Replicate ID.

.dgp_name

Name of DGP.

.method_name

Name of Method.

curve_estimate

A list of tibbles with x and y coordinate values for the ROC/PR curve for the given experimental replicate. If curve = "ROC", the tibble has the columns .threshold, FPR, and TPR for the threshold, false positive rate, and true positive rate, respectively. If curve = "PR", the tibble has the columns .threshold, recall, and precision.

as well as any columns specified by group_cols and vary_params.
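
To make the columns of curve_estimate concrete, the base R sketch below traces an ROC curve from a true support vector and a vector of importance scores by selecting the top-k features for k = 0, ..., p. It is an illustration under simplifying assumptions (toy inputs, non-negative scores, no tied scores), not the package's implementation, and the object names truth, imp, and roc_sketch are hypothetical.

```r
# Hypothetical toy inputs for a single replicate with p = 4 features
truth <- c(TRUE, FALSE, TRUE, FALSE)  # true feature support
imp <- c(0.9, 0.4, 0.7, 0.1)          # estimated importances (no ties assumed)

# Rank features from most to least important
ord <- order(imp, decreasing = TRUE)
truth_sorted <- truth[ord]

# Selecting the top-k features (k = 0, ..., p) gives one curve point per k
roc_sketch <- data.frame(
  .threshold = c(Inf, imp[ord]),
  FPR = c(0, cumsum(!truth_sorted) / sum(!truth_sorted)),
  TPR = c(0, cumsum(truth_sorted) / sum(truth_sorted))
)
roc_sketch
```

The analogous PR sketch would replace FPR and TPR with recall (the same quantity as TPR) and precision (the fraction of selected features that are in the true support).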

The output of summarize_feature_selection_curve() is a grouped tibble containing both identifying information and the feature selection curve results aggregated over experimental replicates. Specifically, the identifier columns include .dgp_name, .method_name, and any columns specified by group_cols and vary_params. In addition, there are results columns corresponding to the requested statistics in summary_funs and custom_summary_funs. If curve = "ROC", these results columns include FPR and others that end in the suffix "_TPR". If curve = "PR", the results columns include recall and others that end in the suffix "_precision".
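
The aggregation over replicates can be pictured with the following base R sketch: each replicate's ROC curve is evaluated on a common x_grid of FPR values, and the TPR values are then summarized at each grid point. The linear interpolation via approx(), the tie rule, the toy curves, and the column names mean_TPR and sd_TPR are assumptions made for illustration; the package may interpolate and name columns differently.

```r
# Hypothetical per-replicate ROC curves, shaped like entries of curve_estimate
curves <- list(
  rep1 = data.frame(FPR = c(0, 0, 0.5, 1), TPR = c(0, 1, 1, 1)),
  rep2 = data.frame(FPR = c(0, 0.5, 0.5, 1), TPR = c(0, 0, 1, 1))
)
x_grid <- seq(0, 1, by = 0.25)

# Evaluate each curve on the grid; where an FPR value repeats,
# keep the highest TPR (an assumed convention for this sketch)
tpr_on_grid <- sapply(curves, function(cv) {
  approx(x = cv$FPR, y = cv$TPR, xout = x_grid, ties = max)$y
})

# Summarize across replicates at each grid point
summary_sketch <- data.frame(
  FPR = x_grid,
  mean_TPR = rowMeans(tpr_on_grid),
  sd_TPR = apply(tpr_on_grid, 1, sd)
)
summary_sketch
```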

See Also

Other feature_selection_funs: eval_feature_importance_funs, eval_feature_selection_err_funs, plot_feature_importance(), plot_feature_selection_curve(), plot_feature_selection_err()

Examples

# load simChef so that the evaluation functions below are available
library(simChef)

# generate example fit_results data for a feature selection problem
fit_results <- tibble::tibble(
  .rep = rep(1:2, times = 2),
  .dgp_name = c("DGP1", "DGP1", "DGP2", "DGP2"),
  .method_name = c("Method"),
  feature_info = lapply(
    1:4,
    FUN = function(i) {
      tibble::tibble(
        # feature names
        feature = c("featureA", "featureB", "featureC"),
        # true feature support
        true_support = c(TRUE, FALSE, TRUE),
        # estimated feature importance scores
        est_importance = c(10, runif(2, min = -2, max = 2))
      )
    }
  )
)

# evaluate feature selection ROC/PR curves for each replicate
roc_results <- eval_feature_selection_curve(
  fit_results,
  curve = "ROC",
  nested_cols = "feature_info",
  truth_col = "true_support",
  imp_col = "est_importance"
)
pr_results <- eval_feature_selection_curve(
  fit_results,
  curve = "PR",
  nested_cols = "feature_info",
  truth_col = "true_support",
  imp_col = "est_importance"
)
# summarize feature selection ROC/PR curves across replicates
roc_summary <- summarize_feature_selection_curve(
  fit_results,
  curve = "ROC",
  nested_cols = "feature_info",
  truth_col = "true_support",
  imp_col = "est_importance"
)
pr_summary <- summarize_feature_selection_curve(
  fit_results,
  curve = "PR",
  nested_cols = "feature_info",
  truth_col = "true_support",
  imp_col = "est_importance"
)


Yu-Group/simChef documentation built on March 25, 2024, 3:22 a.m.