ALE | R Documentation |
An ALE S7 object contains ALE data and statistics. For details, see vignette('ale-intro') or the details and examples below.
ALE(
  model,
  x_cols = list(d1 = TRUE),
  data = NULL,
  y_col = NULL,
  ...,
  exclude_cols = NULL,
  parallel = "all",
  model_packages = NULL,
  output_stats = TRUE,
  output_boot_data = FALSE,
  pred_fun = function(object, newdata, type = pred_type) {
    stats::predict(object = object, newdata = newdata, type = type)
  },
  pred_type = "response",
  p_values = "auto",
  aler_alpha = c(0.01, 0.05),
  max_num_bins = 10,
  boot_it = 0,
  boot_alpha = 0.05,
  boot_centre = "mean",
  seed = 0,
  y_type = NULL,
  sample_size = 500,
  silent = FALSE,
  .bins = NULL
)
model |
model object. Required. Model for which ALE should be calculated. May be any kind of R object that can make predictions from data. |
x_cols , exclude_cols |
character, list, or formula. Column names from |
data |
dataframe. Dataset from which to create predictions for the ALE. It should normally be the same dataset on which |
y_col |
character(1). Name of the outcome target label (y) variable. If not provided, |
... |
not used. Inserted to require explicit naming of subsequent arguments. |
parallel |
non-negative integer(1) or character(1) in c("all", "all but one"). Number of parallel threads (workers or tasks) for parallel execution of the constructor. The default "all" uses all available physical and logical CPU cores. "all but one" uses only physical cores and reserves one core for the system. Set |
model_packages |
character. Character vector of names of packages that |
output_stats |
logical(1). If |
output_boot_data |
logical(1). If |
pred_fun , pred_type |
function,character(1). |
p_values |
instructions for calculating p-values. Possible values are:
|
aler_alpha |
numeric(2) from 0 to 1. Thresholds for p-values ("alpha") for confidence interval ranges for the ALER band if |
max_num_bins |
positive integer(1). Maximum number of ALE bins for numeric |
boot_it |
non-negative integer(1). Number of bootstrap iterations for data-only bootstrapping on ALE data. This is appropriate for models that have been developed with cross-validation. For models that have not been validated, full-model bootstrapping should be used instead with a |
boot_alpha |
numeric(1) from 0 to 1. When ALE is bootstrapped ( |
boot_centre |
character(1) in c('mean', 'median'). When bootstrapping, the main estimate for the ALE y value is considered to be |
seed |
integer(1). Random seed. Supply this between runs to ensure that identical random ALE data is generated each time when bootstrapping. Without bootstrapping, ALE is a deterministic algorithm that should give identical results each time regardless of the seed specified. However, with parallel processing enabled (as it is by default), results are reproducible only on the exact same computing setup. For reproducible results across different computers, turn off parallelization with |
y_type |
character(1) in c('binary', 'numeric', 'categorical', 'ordinal'). Datatype of the y (outcome) variable. Normally determined automatically; only provide if an error message for a complex non-standard model requires it. |
sample_size |
non-negative integer(1). Size of the sample of |
silent |
logical(1), default |
.bins |
Internal use only. List of ALE bin and n count vectors. If provided, these vectors will be used to set the intervals of the ALE x axis for each variable. By default ( |
An object of class ALE with properties effect and params.
effect: Stores the ALE data and, optionally, ALE statistics and bootstrap data for one or more categories.
params: The parameters used to calculate the ALE data. These include most of the arguments used to construct the ALE object, with either the values provided by the user or the defaults if the user did not change them. They also include several objects that are created within the constructor. These extra objects are described here, along with those parameters that are stored in a different form from the arguments:
* `max_d`: the highest dimension of ALE data present. If only 1D ALE is present, then `max_d == 1`. If even one 2D ALE element is present (even with no 1D), then `max_d == 2`.
* `requested_x_cols`, `ordered_x_cols`: `requested_x_cols` is the resolved list of `x_cols` as requested by the user (that is, `x_cols` minus `exclude_cols`). `ordered_x_cols` is the same set of `x_cols` but arranged in the internal storage order.
* `y_cats`: categories for categorical classification models. For non-categorical models, this is the same as `y_col`.
* `y_type`: high-level datatype of the y outcome variable.
* `y_summary`: summary statistics of y values used for the ALE calculation. These statistics are based on the actual values of `y_col` unless `y_type` is a probability or other value that is constrained in the `[0, 1]` range, in which case `y_summary` is based on the predictions of `y_col` from `model` on the `data`. `y_summary` is a named numeric matrix. For most outcomes with a single value per predicted row, there is just one column with the same name as `y_col`. For categorical y outcomes, there is one column for each category in `y_cats` plus an additional column with the same name as `y_col`; this is the mean of the categorical columns. Most rows are named for the percentile of the y values. E.g., the '5%' row is the 5th percentile of y values. The following named rows have special meanings:
    * `min`, `mean`, `max`: the minimum, mean, and maximum y values, respectively. Note that the median is `50%`, the 50th percentile.
    * `aler_lo_lo`, `aler_lo`, `aler_hi`, `aler_hi_hi`: When p-values are present, `aler_lo` and `aler_hi` are the inner lower and upper confidence intervals of `y_col` values with respect to the median (`50%`); `aler_lo_lo` and `aler_hi_hi` are the outer confidence intervals. See the documentation for the `aler_alpha` argument to understand how these are determined. Without p-values, these elements are absent.
* `model`: selected elements that describe the `model` that the `ALE` object interprets.
* `data`: selected elements that describe the `data` used to produce the `ALE` object. To avoid the large size of duplicating `data` entirely, only a sample of the size of the `sample_size` argument is retained.
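To make the layout of `y_summary` concrete, here is a minimal base R sketch that builds a matrix with the same kind of row and column names. It is an illustration only, not the package's implementation: the name `y_summary_like`, the `mtcars` data, and the particular percentiles are assumptions chosen for the example.

```r
# Illustrative outcome variable (stand-in for y_col)
y <- mtcars$mpg

# Percentiles chosen for illustration; the package's exact set may differ
probs <- c(0.05, 0.25, 0.5, 0.75, 0.95)

# One-column named numeric matrix: rows are statistics, the column is y_col
y_summary_like <- matrix(
  c(min(y), stats::quantile(y, probs), mean(y), max(y)),
  ncol = 1,
  dimnames = list(
    c("min", "5%", "25%", "50%", "75%", "95%", "mean", "max"),
    "mpg"
  )
)

# The median is read from the '50%' row
median_mpg <- y_summary_like["50%", "mpg"]
```

Assuming `params` behaves like a list, the real matrix of an ALE object would be indexed by row and column name in the same way.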
The calculation of ALE requires modifying several values of the original data. Thus, ALE() needs direct access to the predict function for the model. By default, ALE() uses a generic default predict function of the form predict(object, newdata, type) with the default prediction type of 'response'. If, however, the desired prediction values are not generated with that format, the user must specify what they want. Very often, the only modification needed is to change the prediction type to some other value by setting the pred_type argument (e.g., to 'prob' to generate classification probabilities). But if the desired predictions need a different function signature, then the user must create a custom prediction function and pass it to pred_fun. The requirements for this custom function are:
It must take three required arguments and nothing else:
* object: a model
* newdata: a dataframe or compatible table type such as a tibble or data.table
* type: a string; it should usually be specified as type = pred_type
These argument names follow the R convention for the generic stats::predict() function.
It must return a vector or matrix of numeric values as the prediction.
You can see an example below of a custom prediction function.
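As a further illustration of these requirements, the sketch below defines a custom prediction function for a base R logistic regression and checks that it returns a plain numeric vector. The names `glm_am` and `prob_predict` and the use of `mtcars` are assumptions made for this example only.

```r
# Small logistic regression on a built-in dataset (illustrative only)
glm_am <- stats::glm(am ~ wt + hp, data = mtcars, family = stats::binomial())

# Custom prediction function with exactly three arguments:
# object, newdata, and type (conventionally defaulting to pred_type)
prob_predict <- function(object, newdata, type = pred_type) {
  # unname() strips the row names so that a plain numeric vector is returned
  unname(stats::predict(object, newdata = newdata, type = type))
}

# Called directly here, so type is given explicitly
preds <- prob_predict(glm_am, mtcars, type = "response")
```

Such a function would then be supplied as pred_fun = prob_predict together with the matching pred_type, following the same pattern as the custom_predict example in the examples section.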
For details about the ALE-based statistics (ALED, ALER, NALED, and NALER), see vignette('ale-statistics'). For general details about the calculation of p-values, see ALEpDist(). Here, we clarify the automatic calculation of p-values with the ALE() constructor.
As explained in the documentation above for the p_values argument, the default p_values = "auto" will try to automatically create a fast surrogate ALEpDist object. However, this happens only if statistics are requested (the default, output_stats = TRUE) and bootstrapping is also requested (not the default; boot_it must be greater than 0). Statistics must be requested because p-values are not needed without them. The bootstrapping requirement, however, is a pragmatic design choice. The challenge is that creating an ALEpDist object can be slow. (Even the fast surrogate option rarely takes less than 10 seconds, even with parallelization.) Thus, to optimize speed, p-values will not be calculated unless requested. However, if the user requests bootstrapping (which is slower than not requesting it), it can be assumed that they are willing to sacrifice some speed for the sake of greater precision in their ALE analysis; thus, extra time is taken to at least create a relatively faster surrogate ALEpDist object.
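The rule for p_values = "auto" described above can be summarized as a small predicate. This is only a sketch of the documented condition; `auto_p_values_created` is a hypothetical helper and not a function of the {ale} package.

```r
# Hypothetical helper (not part of {ale}): does p_values = "auto" lead to a
# surrogate ALEpDist object being created, per the rule described above?
auto_p_values_created <- function(output_stats = TRUE, boot_it = 0) {
  # Both conditions must hold: statistics requested AND bootstrapping requested
  isTRUE(output_stats) && boot_it > 0
}

default_case   <- auto_p_values_created()                     # defaults: no bootstrapping
bootstrap_case <- auto_p_values_created(boot_it = 100)        # bootstrapping requested
no_stats_case  <- auto_p_values_created(output_stats = FALSE, boot_it = 100)
```

With the default arguments no p-values are created, because boot_it = 0; requesting bootstrapping while keeping output_stats = TRUE is what triggers the surrogate.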
Parallel processing using the {furrr} framework is enabled by default. The number of parallel threads (workers or cores) is specified with the parallel argument. By default (parallel = "all"), it will use all the available physical and logical CPU cores. However, if the procedure is very slow (with a large dataset and slow prediction algorithm), you might want to set parallel = "all but one", which will only use faster physical cores and reserve one physical core so that your computer does not slow down as you continue working on other tasks while the procedure runs. To disable parallel processing, set parallel = 0.
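To see the difference between physical and logical cores on a given machine, base R's parallel package can be queried directly. This sketch is for orientation only: the package may count cores by other means, and `workers_all_but_one` is an illustrative name for an approximation of what "all but one" would imply.

```r
# detectCores() can return NA on platforms where the count is unknown
logical_cores  <- parallel::detectCores(logical = TRUE)
physical_cores <- parallel::detectCores(logical = FALSE)

# Rough analogue of parallel = "all but one": physical cores minus one,
# but never fewer than one worker (an assumption for illustration)
workers_all_but_one <- if (is.na(physical_cores)) {
  1L
} else {
  max(physical_cores - 1L, 1L)
}
```

If the two counts are equal, the machine exposes no additional logical cores, and the two settings differ only by the one reserved core.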
The {ale} package should be able to automatically recognize and load most packages that are needed, but with parallel processing enabled (which is the default), some packages might not be properly loaded. This problem might be indicated by an unusual error message that mentions "progress interrupted" or "future", especially if such errors appear after the progress bars begin displaying (assuming you did not disable progress bars with silent = TRUE). In that case, first try disabling parallel processing with parallel = 0. If that resolves the problem, then to get faster parallel processing to work, try adding all the package names needed for the model to the model_packages argument, e.g., model_packages = c('tidymodels', 'mgcv').
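The first troubleshooting step above (retry without parallel processing) can be sketched as a generic wrapper. Both `retry_without_parallel` and `fake_fit` are hypothetical names invented for this illustration; `fake_fit` merely simulates a call that fails in parallel so that the sketch runs without any model.

```r
# Hypothetical helper (not part of {ale}): run a fit in parallel and, if it
# fails, retry serially so that the underlying problem can be isolated.
retry_without_parallel <- function(fit_fun, parallel = "all") {
  tryCatch(
    fit_fun(parallel = parallel),
    error = function(e) {
      message("Parallel run failed: ", conditionMessage(e))
      message("Retrying with parallel = 0")
      fit_fun(parallel = 0)
    }
  )
}

# Stand-in for something like function(parallel) ALE(model, parallel = parallel);
# it fails unless parallel processing is disabled
fake_fit <- function(parallel) {
  if (!identical(parallel, 0)) stop("future: progress interrupted")
  "fitted serially"
}

result <- retry_without_parallel(fake_fit)
```

If the serial retry succeeds, the likely remedy is to list the model's packages in model_packages and re-enable parallel processing.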
For time-to-event (survival) models, set the following arguments:
* y_col must be set to the name of the binary event column.
* Include the time column in the exclude_cols argument so that its ALE will not be calculated, e.g., exclude_cols = 'time'. This is not essential, but if the time column is not excluded, its ALE effect will always be exactly zero because time is part of the outcome of a time-to-event model, not a predictor; calculating it is a waste of time.
* pred_type must be specified according to the desired type argument for the predict() method of the time-to-event algorithm (e.g., "risk", "survival", "time", etc.).
* pred_fun might work fine without modification as long as the settings above are configured. However, for non-standard time-to-event models, a custom pred_fun as specified above might be needed.
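Because pred_type is passed through to the model's predict() method, it can help to check first which prediction types that method returns as numeric vectors. The sketch below does this for a Cox model, assuming the {survival} package is installed; the model formula and the `veteran` dataset are choices made for this illustration, not something prescribed by {ale}.

```r
if (requireNamespace("survival", quietly = TRUE)) {
  vet <- survival::veteran

  # Cox proportional hazards model: 'time' and 'status' form the outcome
  cox_vet <- survival::coxph(
    survival::Surv(time, status) ~ karno + age + celltype,
    data = vet
  )

  # Candidate pred_type values, checked directly on the predict() method
  pred_lp   <- predict(cox_vet, newdata = vet, type = "lp")    # linear predictor
  pred_risk <- predict(cox_vet, newdata = vet, type = "risk")  # relative risk
}
```

For a model like this, the settings listed above would correspond to y_col = 'status', exclude_cols = 'time', and pred_type = 'risk'.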
Progress bars are implemented with the {progressr} package. For details on customizing the progress bars, see the introduction to the {progressr} package. To disable progress bars when calling a function in the ale package, set silent = TRUE.
Okoli, Chitu. 2023. “Statistical Inference Using Machine Learning and Classical Techniques Based on Accumulated Local Effects (ALE).” arXiv. doi:10.48550/arXiv.2310.09877.
# Sample 1000 rows from the ggplot2::diamonds dataset (for a simple example)
set.seed(0)
diamonds_sample <- ggplot2::diamonds[sample(nrow(ggplot2::diamonds), 1000), ]
# Create a GAM model with flexible curves to predict diamond price
# Smooth all numeric variables and include all other variables
gam_diamonds <- mgcv::gam(
  price ~ s(carat) + s(depth) + s(table) + s(x) + s(y) + s(z) +
    cut + color + clarity +
    ti(carat, by = clarity), # a 2D interaction
  data = diamonds_sample
)
summary(gam_diamonds)
# Simple ALE without bootstrapping: by default, all 1D ALE effects
ale_gam_diamonds <- ALE(gam_diamonds)
# Simple printing of all plots
plot(ale_gam_diamonds)
# Bootstrapped ALE
# This can be slow, since bootstrapping runs the algorithm boot_it times
# Create ALE with 100 bootstrap samples
ale_gam_diamonds_boot <- ALE(
  gam_diamonds,
  # request all 1D ALE effects and only the carat:clarity 2D effect
  list(d1 = TRUE, d2 = 'carat:clarity'),
  boot_it = 100
)
# More advanced plot manipulation
ale_plots <- plot(ale_gam_diamonds_boot) # Create an ALEPlots object
# Print the plots: First page prints 1D ALE; second page prints 2D ALE
ale_plots # or print(ale_plots) to be explicit
# Extract specific plots (as lists of ggplot objects)
get(ale_plots, 'carat') # extract a specific 1D plot
get(ale_plots, 'carat:clarity') # extract a specific 2D plot
get(ale_plots, type = 'effect') # ALE effects plot
# See help(get.ALEPlots) for more options, such as for categorical plots
# If the predict function you want is non-standard, you may define a
# custom predict function. It must return a single numeric vector.
custom_predict <- function(object, newdata, type = pred_type) {
  predict(object, newdata, type = type, se.fit = TRUE)$fit
}
ale_gam_diamonds_custom <- ALE(
  gam_diamonds,
  pred_fun = custom_predict, pred_type = 'link'
)
# Plot the ALE data
plot(ale_gam_diamonds_custom)
# How to retrieve specific types of ALE data from an ALE object.
ale_diamonds_with_boot_data <- ALE(
  gam_diamonds,
  # For detailed options for x_cols, see examples at resolve_x_cols()
  x_cols = ~ carat + cut + clarity + carat:clarity + color:depth,
  output_boot_data = TRUE,
  boot_it = 10 # just for demonstration
)
# See ?get.ALE for details on the various kinds of data that may be retrieved.
get(ale_diamonds_with_boot_data, ~ carat + color:depth) # default ALE data
get(ale_diamonds_with_boot_data, what = 'boot_data') # raw bootstrap data
get(ale_diamonds_with_boot_data, stats = 'estimate') # summary statistics
get(ale_diamonds_with_boot_data, stats = c('aled', 'naled'))
get(ale_diamonds_with_boot_data, stats = 'all')
get(ale_diamonds_with_boot_data, stats = 'conf_regions')
get(ale_diamonds_with_boot_data, stats = 'conf_sig')