create_p_funs | R Documentation |
Calculating p-values is not trivial for ALE statistics because ALE is
non-parametric and model-agnostic. Because ALE is non-parametric (that is,
it does not assume any particular distribution of data), the ale
package
generates p-values by calculating ALE for many random variables; this makes the
procedure somewhat slow. For this reason, they are not calculated by default;
they must be explicitly requested. Because the ale
package is model-agnostic (that is, it
works with any kind of R model), the ale()
function cannot always automatically
manipulate the model object to create the p-values. It can only do so for
models that follow the standard R statistical modelling conventions, which
includes almost all built-in R algorithms (like stats::lm()
and stats::glm()
) and many widely
used statistics packages (like mgcv
and survival
), but which excludes most
machine learning algorithms (like tidymodels
and caret
). For non-standard
algorithms, the user needs to do a little work to help the ale function
correctly manipulate its model object:
The full model call must be passed as a character string in the argument 'random_model_call_string', with two slight modifications as follows.
In the formula that specifies the model, you must add a variable named
'random_variable'. This corresponds to the random variables that create_p_funs()
will use to estimate p-values.
The dataset on which the model is trained must be named 'rand_data'. This corresponds to the modified datasets that will be used to train the random variables.
See the example below for how this is implemented.
create_p_funs(
data,
model,
...,
parallel = parallel::detectCores(logical = FALSE) - 1,
model_packages = as.character(NA),
random_model_call_string = NULL,
random_model_call_string_vars = character(),
y_col = NULL,
pred_fun = function(object, newdata, type = pred_type) {
stats::predict(object =
object, newdata = newdata, type = type)
},
pred_type = "response",
rand_it = 1000,
silent = FALSE,
.testing_mode = FALSE
)
data |
See documentation for |
model |
See documentation for |
... |
not used. Inserted to require explicit naming of subsequent arguments. |
parallel |
See documentation for |
model_packages |
See documentation for |
random_model_call_string |
character string. If NULL, |
random_model_call_string_vars |
See documentation for |
y_col |
See documentation for |
pred_fun , pred_type |
See documentation for |
rand_it |
non-negative integer length 1. Number of times that the model should be retrained with a new random variable. The default of 1000 should give reasonably stable p-values. It can be reduced as low as 100 for faster test runs. |
silent |
See documentation for |
.testing_mode |
logical. Internal use only. |
The return value is a list of class c('p_funs', 'ale', 'list'
) with an
ale_version
attribute whose value is the version of the ale
package used to
create the object. See examples for an illustration of how to inspect this list.
Its elements are:
value_to_p
: a list of functions named for each each available ALE statistic.
Each function signature is function(x)
where x is a numeric. The function returns
the p-value (minimum 0; maximum 1) for the respective statistic based on the random variable analysis.
For an input x that returns p, its interpretation is that p% of random variables
obtained the same or higher statistic value. For example, to get the p-value
of a NALED of 4.2, enter p_funs$value_to_p(4.2)
. A return value of 0.03 means
that only 3% of random variables obtained a NALED greater than or equal to 4.2.
p_to_random_value
: a list of functions named for each each available ALE statistic.
These are the inverse functions of value_to_p
. The signature is function(p)
where p is a numeric from 0 to 1. The function returns the numeric value of the
random variable statistic that would yield the provided p-value.
For an input p that returns x, its interpretation is that p% of random variables
obtained the same or higher statistic value. For example, to get the random
variable ALED for the 0.05 p-value, enter p_funs$p_to_random_value(0.05)
.
A return value of 102 means that only 5% of random variables obtained an ALED
greater than or equal to 102.
rand_stats
: a tibble whose rows are each of the rand_it
iterations of the
random variable analysis and whose columns are the ALE statistics obtained for
each random variable.
residuals
: the actual y_col
values from data
minus the predicted
values from the model
(without random variables) on the data
.
residual_distribution
: the closest estimated distribution for the residuals
as determined by univariateML::rml()
. This is the distribution used to generate
all the random variables.
The ale
package takes a literal frequentist approach to the calculation of
p-values. That is, it literally retrains the model 1000 times, each time
modifying it by adding a distinct random variable to the model.
(The number of iterations is customizable
with the rand_it
argument.) The ALEs and ALE statistics are calculated for
each random variable. The percentiles of the distribution of these
random-variable ALEs are then used to determine p-values for non-random variables.
Thus, p-values are interpreted as the frequency of random variable ALE statistics
that exceed the value of ALE statistic of the actual variable in question.
The specific steps are as follows:
The residuals of the original model trained on the training data are calculated (residuals are the actual y target value minus the predicted values).
The closest distribution of the residuals is detected with
univariateML::model_select()
.
1000 new models are trained by generating a random variable each time with
univariateML::rml()
and then training a new model with that random variable
added.
The ALEs and ALE statistics are calculated for each random variable.
For each ALE statistic, the empirical cumulative distribution function
(from stats::ecdf()
) is used to create a function to determine p-values
according to the distribution of the random variables' ALE statistics.
Okoli, Chitu. 2023. “Statistical Inference Using Machine Learning and Classical Techniques Based on Accumulated Local Effects (ALE).” arXiv. https://arxiv.org/abs/2310.09877.
# Sample 1000 rows from the ggplot2::diamonds dataset (for a simple example)
set.seed(0)
diamonds_sample <- ggplot2::diamonds[sample(nrow(ggplot2::diamonds), 1000), ]
# Create a GAM model with flexible curves to predict diamond price
# Smooth all numeric variables and include all other variables
gam_diamonds <- mgcv::gam(
price ~ s(carat) + s(depth) + s(table) + s(x) + s(y) + s(z) +
cut + color + clarity,
data = diamonds_sample
)
summary(gam_diamonds)
# Create p-value functions
pf_diamonds <- create_p_funs(
diamonds_sample,
gam_diamonds,
# only 100 iterations for a quick demo; but usually should remain at 1000
rand_it = 100,
)
# Examine the structure of the returned object
str(pf_diamonds)
# In RStudio: View(pf_diamonds)
# Calculate ALEs with p-values
ale_gam_diamonds <- ale(
diamonds_sample,
gam_diamonds,
p_values = pf_diamonds
)
# Plot the ALE data. The horizontal bands in the plots use the p-values.
ale_gam_diamonds$plots |>
patchwork::wrap_plots()
# For non-standard models that give errors with the default settings,
# you can use 'random_model_call_string' to specify a model for the estimation
# of p-values from random variables as in this example.
# See details above for an explanation.
pf_diamonds <- create_p_funs(
diamonds_sample,
gam_diamonds,
random_model_call_string = 'mgcv::gam(
price ~ s(carat) + s(depth) + s(table) + s(x) + s(y) + s(z) +
cut + color + clarity + random_variable,
data = rand_data
)',
# only 100 iterations for a quick demo; but usually should remain at 1000
rand_it = 100,
)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.