partial_dependence: Partial Dependence
In effectplots: Effect Plots

Description

Calculates partial dependence (PD) for one or more features.

PD was introduced by Friedman (2001) to study the (main) effects of an ML model. The PD of a model f and variable X at a value g is obtained by replacing all X values in a reference dataset by g and then averaging the predictions of f over this modified data. Repeating this for different values of g shows how the average prediction of f changes with X while all other feature values are held constant (ceteris paribus).
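The replace-and-average recipe can be sketched in a few lines of base R. This is only an illustration for a numeric feature, not the package's .pd() implementation; the helper name pd_manual and the grid values are assumptions made for this example.

```r
# Hypothetical helper (not part of effectplots): manual partial dependence
# of model `fit` for a numeric feature `v`, evaluated at the values in `grid`.
pd_manual <- function(fit, v, data, grid) {
  vapply(grid, function(g) {
    data_mod <- data
    data_mod[[v]] <- g                      # replace all values of v by g
    mean(predict(fit, newdata = data_mod))  # average prediction over modified data
  }, numeric(1))
}

fit <- lm(Sepal.Length ~ ., data = iris)
grid <- c(1, 3, 5)  # arbitrary grid values chosen for illustration
pd <- pd_manual(fit, "Petal.Length", iris, grid)
pd
```

For an additive linear model such as this one, the resulting PD curve is a straight line whose slope equals the fitted coefficient of the feature.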

This function is a convenience wrapper around feature_effects(), which calls the barebone implementation .pd() to calculate PD. As grid points, it uses the arithmetic mean of X per bin (bins are specified by breaks), optionally weighted by w.
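How such grid points could arise is sketched below in base R. This is a rough approximation under stated assumptions, not the package's internal binning code: breaks are taken from hist() with Sturges' rule, outlier handling is ignored, and the weight vector w is a hypothetical placeholder with equal weights.

```r
x <- iris$Petal.Length
w <- rep(1, length(x))  # hypothetical case weights (all equal here)

# Bin breaks via Sturges' rule, as computed by base R's hist()
br <- hist(x, breaks = "Sturges", plot = FALSE)$breaks
bins <- cut(x, breaks = br, include.lowest = TRUE, right = TRUE)

# Weighted mean of x within each non-empty bin: candidate grid points
grid <- vapply(
  split(seq_along(x), bins, drop = TRUE),
  function(i) weighted.mean(x[i], w[i]),
  numeric(1)
)
grid
```

Using bin means rather than bin midpoints places each grid point where the data in that bin actually lie.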

Usage

partial_dependence(object, ...)

## Default S3 method:
partial_dependence(
  object,
  v,
  data,
  pred_fun = stats::predict,
  trafo = NULL,
  which_pred = NULL,
  w = NULL,
  breaks = "Sturges",
  right = TRUE,
  discrete_m = 13L,
  outlier_iqr = 2,
  pd_n = 500L,
  seed = NULL,
  ...
)

## S3 method for class 'ranger'
partial_dependence(
  object,
  v,
  data,
  pred_fun = NULL,
  trafo = NULL,
  which_pred = NULL,
  w = NULL,
  breaks = "Sturges",
  right = TRUE,
  discrete_m = 13L,
  outlier_iqr = 2,
  pd_n = 500L,
  seed = NULL,
  ...
)

## S3 method for class 'explainer'
partial_dependence(
  object,
  v = colnames(data),
  data = object$data,
  pred_fun = object$predict_function,
  trafo = NULL,
  which_pred = NULL,
  w = object$weights,
  breaks = "Sturges",
  right = TRUE,
  discrete_m = 13L,
  outlier_iqr = 2,
  pd_n = 500L,
  seed = NULL,
  ...
)

## S3 method for class 'H2OModel'
partial_dependence(
  object,
  data,
  v = object@parameters$x,
  pred_fun = NULL,
  trafo = NULL,
  which_pred = NULL,
  w = object@parameters$weights_column$column_name,
  breaks = "Sturges",
  right = TRUE,
  discrete_m = 13L,
  outlier_iqr = 2,
  pd_n = 500L,
  seed = NULL,
  ...
)

Arguments

`object`	Fitted model.
`...`	Further arguments passed to `pred_fun()`, e.g., `type = "response"` in a `glm()` or (typically) `prob = TRUE` in classification models.
`v`	Variable names to calculate statistics for.
`data`	Matrix or data.frame.
`pred_fun`	Prediction function, by default `stats::predict`. The function takes three arguments (names irrelevant): `object`, `data`, and `...`.
`trafo`	How should predictions be transformed? A function or `NULL` (default). Examples are `log` (to switch to link scale) or `exp` (to switch from link scale to the original scale). Applied after `which_pred`.
`which_pred`	If the predictions are multivariate: which column to pick (integer or column name). By default `NULL` (picks last column). Applied before `trafo`.
`w`	Optional vector with case weights. Can also be a column name in `data`. Having observations with non-positive weight is equivalent to excluding them.
`breaks`	An integer, vector, or "Sturges" (the default) used to determine bin breaks of continuous features. Values outside the total bin range are placed in the outermost bins. To allow varying values of `breaks` across features, `breaks` can be a list of the same length as `v`, or a named list with breaks for certain variables.
`right`	Should bins be right-closed? The default is `TRUE`. Vectorized over `v`. Only relevant for continuous features.
`discrete_m`	Numeric features with up to this number of unique values should not be binned but rather treated as discrete. The default is 13. Vectorized over `v`.
`outlier_iqr`	If `breaks` is an integer or "Sturges", the breaks of a continuous feature are calculated without taking into account feature values outside quartiles +- `outlier_iqr` * IQR (where <= 9997 values are used to calculate the quartiles). To let the breaks cover the full data range, set `outlier_iqr` to 0 or `Inf`. Vectorized over `v`.
`pd_n`	Size of the data used for calculating partial dependence. The default is 500. For larger `data` (and `w`), `pd_n` rows are randomly sampled. Each variable specified by `v` uses the same sample. Set to 0 to omit PD calculations.
`seed`	Optional integer random seed used for (1) partial dependence: selecting the background data if `n > pd_n`, and (2) calculating breaks: the bin range is determined without values outside quartiles +- 2 IQR, using a sample of <= 9997 observations to calculate the quartiles.
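
The documented order of `which_pred` and `trafo` (column selection first, transformation second, last column by default) can be illustrated with a small base R sketch. The helper post_process and the example prediction matrix are hypothetical and only mimic the described behavior; they are not the package's internal code.

```r
# Hypothetical helper: select a prediction column, then transform it
post_process <- function(pred, which_pred = NULL, trafo = NULL) {
  if (is.matrix(pred) || is.data.frame(pred)) {
    if (is.null(which_pred)) which_pred <- ncol(pred)  # default: last column
    pred <- pred[, which_pred]
  }
  if (!is.null(trafo)) pred <- trafo(pred)  # applied after column selection
  pred
}

# Made-up class probabilities for two observations
p <- cbind(no = c(0.8, 0.4), yes = c(0.2, 0.6))
res_default <- post_process(p)                                  # column "yes"
res_log_no <- post_process(p, which_pred = "no", trafo = log)   # log of column "no"
```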

Value

A list (of class "EffectData") with a data.frame per feature having columns:

bin_mid: Bin mid points. In the plots, the bars are centered around these.
bin_width: Absolute width of the bin. In the plots, these equal the bar widths.
bin_mean: For continuous features, the (possibly weighted) average feature value within bin. For discrete features equivalent to bin_mid.
N: The number of observations within bin.
weight: The weight sum within bin. When w = NULL, equivalent to N.
Different statistics, depending on the function call.

Use single bracket subsetting to select part of the output. Note that each data.frame carries an attribute "discrete" indicating whether the feature is discrete or continuous. This attribute might be lost when you manually modify the data.frames.

Methods (by class)

partial_dependence(default): Default method.
partial_dependence(ranger): Method for ranger models.
partial_dependence(explainer): Method for DALEX explainers.
partial_dependence(H2OModel): Method for H2O models.

References

Friedman, Jerome H. 2001. Greedy Function Approximation: A Gradient Boosting Machine. Annals of Statistics 29 (5): 1189-1232. doi:10.1214/aos/1013203451.

Examples

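# Fit a linear model and compute PD for a single feature
fit <- lm(Sepal.Length ~ ., data = iris)
M <- partial_dependence(fit, v = "Species", data = iris)
M |> plot()

# PD for all predictors (every column except the response)
M2 <- partial_dependence(fit, v = colnames(iris)[-1], data = iris)
plot(M2, share_y = "all")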

effectplots documentation built on April 12, 2025, 2:13 a.m.