extract_shap: Internal function for computing SHAP values.
In familiar: End-to-End Automated Machine Learning and Model Evaluation

extract_shap

R Documentation

Description

Computes SHAP values for feature values using a familiarEnsemble.

Usage

extract_shap(
  object,
  data,
  cl = NULL,
  features = NULL,
  n_sample_points = 20L,
  shap_tolerance = waiver(),
  shap_max_iterations = waiver(),
  shap_phi_0 = waiver(),
  ensemble_method = waiver(),
  evaluation_times = waiver(),
  sample_limit = waiver(),
  detail_level = waiver(),
  aggregate_results = waiver(),
  n_important_features = waiver(),
  is_pre_processed = FALSE,
  message_indent = 0L,
  verbose = FALSE,
  ...
)
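
Most arguments in the signature default to `waiver()`, a placeholder that signals "use the internal default", and several of them also accept a named list keyed by data element. The sketch below illustrates that resolution logic in base R. It is a sketch under stated assumptions: `make_waiver`, `is_waiver` and `resolve_setting` are hypothetical stand-ins and are not functions from familiar, whose own handling may differ.

```r
# Hypothetical stand-ins; these are NOT familiar's own helpers.
make_waiver <- function() structure(list(), class = "waiver")
is_waiver <- function(x) inherits(x, "waiver")

# Resolve a setting for one data element:
# - a waiver falls back to the documented default,
# - a named list is looked up by data element name,
# - any other value is used as provided.
resolve_setting <- function(value, default, element = "shap") {
  if (is_waiver(value)) return(default)
  if (is.list(value)) {
    if (element %in% names(value)) return(value[[element]])
    return(default)
  }
  value
}

# Waiver: the documented default of 0.05 is used.
shap_tolerance <- resolve_setting(make_waiver(), default = 0.05)

# Explicit value: overrides the documented default of 1000.
shap_max_iterations <- resolve_setting(500L, default = 1000L)

# Named list: only the entry for the "shap" data element is picked up.
sample_limit <- resolve_setting(
  list("sample_similarity" = 100, "shap" = 250),
  default = make_waiver()
)
```

A sentinel object is used instead of `NULL` because `NULL` can itself be a meaningful argument value, as with `cl = NULL` and `features = NULL` in the signature.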

Arguments

`object`	A `familiarEnsemble`, which is an ensemble of one or more `familiarModel` objects, or a `familiarDataElementPredictionTable` object that contains prediction data.
`data`	A `dataObject` object, `data.table` or `data.frame` that constitutes the data that are assessed.
`cl`	Cluster created using the `parallel` package. This cluster is then used to speed up computation through parallelisation.
`features`	Features for which SHAP values are computed. Defaults to all features in the model.
`n_sample_points`	Minimum number of values to sample for numeric features. By default, values are taken from the input dataset. If a feature has too few values within that dataset, additional values are drawn from the feature distribution stored with the model.
`shap_tolerance`	Relative tolerance for convergence of SHAP values. The tolerance is scaled with the range in SHAP values. Default: 0.05.
`shap_max_iterations`	Maximum number of iterations for convergence of SHAP values. Default: 1000.
`shap_phi_0`	Reference predicted value(s). Determined from data by default.
`ensemble_method`	Method for ensembling predictions from models for the same sample. Available methods are: `median` (default): Use the median of the predicted values as the ensemble value for a sample. `mean`: Use the mean of the predicted values as the ensemble value for a sample.
`evaluation_times`	One or more time points that are used in the analysis of survival problems when data have to be assessed at a set time, e.g. calibration. If not provided explicitly, this parameter is read from the settings used at creation of the underlying `familiarModel` objects. Only used for `survival` outcomes.
`sample_limit`	(optional) Set the upper limit of the number of samples that are used during evaluation steps. Cannot be fewer than 20. This setting can be specified per data element by providing a parameter value in a named list with data elements, e.g. `list("sample_similarity"=100, "permutation_vimp"=1000)`. This parameter can be set for the following data elements: `sample_similarity`, `shap`, `permutation_vimp`, and `ice_data`.
`detail_level`	(optional) Sets the level at which results are computed and aggregated. `ensemble`: Results are computed at the ensemble level, i.e. over all models in the ensemble. This means that, for example, bias-corrected estimates of model performance are assessed by creating (at least) 20 bootstraps and computing the model performance of the ensemble model for each bootstrap. `hybrid` (default): Results are computed at the level of models in an ensemble. This means that, for example, bias-corrected estimates of model performance are directly computed using the models in the ensemble. If there are at least 20 trained models in the ensemble, performance is computed for each model, in contrast to `ensemble` where performance is computed for the ensemble of models. If there are fewer than 20 trained models in the ensemble, bootstraps are created so that at least 20 point estimates can be made. `model`: Results are computed at the model level. This means that, for example, bias-corrected estimates of model performance are assessed by creating (at least) 20 bootstraps and computing the performance of the model for each bootstrap. Note that each level of detail has a different interpretation for bootstrap confidence intervals. For `ensemble` and `model` these are the confidence intervals for the ensemble and an individual model, respectively. That is, the confidence interval describes the range where an estimate produced by a respective ensemble or model trained on a repeat of the experiment may be found with the probability of the confidence level. For `hybrid`, it represents the range where any single model trained on a repeat of the experiment may be found with the probability of the confidence level. By definition, confidence intervals obtained using `hybrid` are at least as wide as those for `ensemble`. `hybrid` offers the correct interpretation if the goal of the analysis is to assess the result of a single, unspecified, model.
`hybrid` is generally computationally less expensive than `ensemble`, which in turn is somewhat less expensive than `model`. A non-default `detail_level` parameter can be specified for separate evaluation steps by providing a parameter value in a named list with data elements, e.g. `list("auc_data"="ensemble", "model_performance"="hybrid")`. This parameter can be set for the following data elements: `auc_data`, `decision_curve_analyis`, `model_performance`, `permutation_vimp`, `ice_data`, `prediction_data` and `confusion_matrix`. If results are computed from 10 samples or fewer, `ensemble` is automatically used. This prevents issues where evaluation steps do not have the required minimum number of samples for `hybrid` or `model`.
`aggregate_results`	(optional) Flag that signifies whether results should be aggregated during evaluation. If `estimation_type` is `bias_correction` or `bc`, aggregation leads to a single bias-corrected estimate. If `estimation_type` is `bootstrap_confidence_interval` or `bci`, aggregation leads to a single bias-corrected estimate with lower and upper boundaries of the confidence interval. This has no effect if `estimation_type` is `point`. The default value is `TRUE`, except when assessing model performance metrics, as the default violin plot requires the underlying data. As with `detail_level` and `estimation_type`, a non-default `aggregate_results` parameter can be specified for separate evaluation steps by providing a parameter value in a named list with data elements, e.g. `list("auc_data"=TRUE, "model_performance"=FALSE)`. This parameter exists for the same elements as `estimation_type`.
`n_important_features`	(optional) Set the number of features that are evaluated in evaluation steps. Cannot be 0 or fewer. This setting can be specified per data element by providing a parameter value in a named list with data elements, e.g. `list("ice_data"=10, "permutation_vimp"=5)`. This parameter can be set for the following data elements: `ice_data`, `permutation_vimp`, and `shap`.
`is_pre_processed`	Flag that indicates whether the data was already pre-processed externally, e.g. normalised and clustered. Only used if the `data` argument is a `data.table` or `data.frame`.
`message_indent`	Number of indentation steps for messages shown during computation and extraction of various data elements.
`verbose`	Flag to indicate whether feedback should be provided on the computation and extraction of various data elements.
`...`	Unused arguments.
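
The arguments `ensemble_method`, `shap_phi_0`, `shap_tolerance` and `shap_max_iterations` interact, and a small base R sketch may help to see how. The example below estimates SHAP values for one sample by permutation sampling, stopping once the standard error of every estimate falls within `shap_tolerance` times the range of the SHAP values, or when `shap_max_iterations` is reached. This is an illustration only: the toy models, the helper names (`predict_ensemble`, `sample_contribution`, `shap_estimate`), the batch size and the exact convergence rule are all assumptions, and the algorithm used inside familiar may differ.

```r
set.seed(42)

# Toy background data and a toy ensemble of two models (assumed).
background <- data.frame(x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200))
models <- list(
  function(d) 2.0 * d$x1 - 1.0 * d$x2,
  function(d) 2.2 * d$x1 - 0.8 * d$x2
)

# Combine the per-model predictions for each sample (median or mean).
predict_ensemble <- function(d, ensemble_method = "median") {
  ensemble_method <- match.arg(ensemble_method, c("median", "mean"))
  predictions <- sapply(models, function(model) model(d))
  predictions <- matrix(predictions, nrow = nrow(d))
  combine <- switch(ensemble_method, median = stats::median, mean = mean)
  apply(predictions, 1, combine)
}

# One Monte Carlo draw: start from a random background row and switch
# features to the values of x in random order, recording each change.
sample_contribution <- function(x, background) {
  features <- colnames(background)
  current <- background[sample.int(nrow(background), 1L), , drop = FALSE]
  contribution <- setNames(numeric(length(features)), features)
  previous_prediction <- predict_ensemble(current)
  for (feature in sample(features)) {
    current[[feature]] <- x[[feature]]
    new_prediction <- predict_ensemble(current)
    contribution[feature] <- new_prediction - previous_prediction
    previous_prediction <- new_prediction
  }
  contribution
}

# Average draws in batches until the assumed convergence rule is met.
shap_estimate <- function(x, background, shap_tolerance = 0.05,
                          shap_max_iterations = 1000L, batch_size = 50L) {
  features <- colnames(background)
  contributions <- matrix(
    numeric(0), nrow = 0, ncol = length(features),
    dimnames = list(NULL, features)
  )
  while (nrow(contributions) < shap_max_iterations) {
    n_batch <- min(batch_size, shap_max_iterations - nrow(contributions))
    batch <- t(replicate(n_batch, sample_contribution(x, background)))
    contributions <- rbind(contributions, batch)
    phi <- colMeans(contributions)
    standard_error <- apply(contributions, 2, stats::sd) / sqrt(nrow(contributions))
    # Tolerance is scaled with the range in SHAP values.
    if (isTRUE(all(standard_error <= shap_tolerance * diff(range(phi))))) break
  }
  list(phi = phi, n_iterations = nrow(contributions))
}

# Reference value: mean ensemble prediction over the background data.
shap_phi_0 <- mean(predict_ensemble(background))

x <- data.frame(x1 = 1, x2 = -1, x3 = 0.5)
result <- shap_estimate(x, background)
```

In this sketch, `result$phi` holds one estimate per feature, and the estimates sum approximately to `predict_ensemble(x) - shap_phi_0`. Because neither toy model uses `x3`, its estimated SHAP value is zero.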