```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
set.seed(2022)
old_digits <- options(digits = 2)
```

The Conditional Predictive Impact (CPI) is a general test for conditional independence in supervised learning. It provides a conditional variable importance measure that can be applied to any supervised learning algorithm and loss function.

As a first example, we calculate the CPI for a random forest on the wine data with 5-fold cross-validation:

```r
library(mlr3)
library(mlr3learners)
library(cpi)

cpi(task = tsk("wine"),
    learner = lrn("classif.ranger", predict_type = "prob", num.trees = 10),
    resampling = rsmp("cv", folds = 5))
```

The result is a CPI value for each feature, i.e. how much the loss changed when the feature was replaced with its knockoff version, together with the corresponding standard errors, test statistics, p-values and confidence intervals.
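To make this quantity concrete, the following base-R sketch computes the per-feature loss differences by hand. It is not the package's implementation: the simulated data, the linear model and the names `x_tilde` and `cpi_manual` are assumptions for illustration. The features are drawn independently on purpose, because in that special case a fresh independent draw is a valid knockoff.

```r
set.seed(1)
n <- 500

# ASSUMPTION: independent standard-normal features, only x1 affects y
x <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
y <- 2 * x$x1 + rnorm(n)
dat <- cbind(x, y = y)

train <- 1:250
test <- 251:500
fit <- lm(y ~ ., data = dat[train, ])

# Knockoffs for the test data: independent draws from the same marginals
# (valid here only because the features are mutually independent)
x_tilde <- data.frame(x1 = rnorm(length(test)),
                      x2 = rnorm(length(test)),
                      x3 = rnorm(length(test)))

# Per-sample squared-error loss on the original test data
loss_orig <- (dat$y[test] - predict(fit, newdata = x[test, ]))^2

cpi_manual <- sapply(names(x), function(j) {
  x_ko <- x[test, ]
  x_ko[[j]] <- x_tilde[[j]]                  # replace one feature by its knockoff
  loss_ko <- (dat$y[test] - predict(fit, newdata = x_ko))^2
  delta <- loss_ko - loss_orig               # per-sample loss difference
  c(CPI = mean(delta),
    SE = sd(delta) / sqrt(length(delta)),
    p.value = t.test(delta, alternative = "greater")$p.value)
})

round(t(cpi_manual), 3)
```

In this sketch only `x1` should show a clearly positive mean loss difference, since the other two features do not enter the data-generating model.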

The task, learner and resampling strategy are specified with the *mlr3* package, which provides a unified interface for machine learning tasks and makes it easy to change these components. For example, we can switch to regularized logistic regression with a simple holdout as the resampling strategy:

```r
cpi(task = tsk("wine"),
    learner = lrn("classif.glmnet", predict_type = "prob", lambda = 0.01),
    resampling = rsmp("holdout"))
```

We refer to the mlr3 book for a full introduction and reference.

The loss function used by the `cpi()` function is specified with `measure`. By default, the mean squared error (MSE) is used for regression and log-loss for classification. In *mlr3*, this corresponds to the measures `"regr.mse"` and `"classif.logloss"`. We re-run the example above with simple classification error (ce):

```r
cpi(task = tsk("wine"),
    learner = lrn("classif.glmnet", lambda = 0.01),
    resampling = rsmp("holdout"),
    measure = msr("classif.ce"))
```

Here we see more CPI values of exactly 0 because the classification error only changes when a predicted class flips; it is less sensitive to small changes in the predictions and hence results in lower power.
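The difference in sensitivity can be seen in a small base-R sketch. The probabilities below are made up for illustration, and the helper functions `logloss` and `ce` are hypothetical stand-ins for the corresponding *mlr3* measures, evaluated per sample for a binary outcome.

```r
# Four hypothetical test samples, all with true class 1
truth <- c(1, 1, 1, 1)
p_orig <- c(0.90, 0.80, 0.70, 0.60)  # predicted P(class 1), original feature
p_ko <- c(0.85, 0.70, 0.65, 0.55)    # predicted P(class 1), knockoff feature

# Per-sample log-loss and classification error (threshold 0.5)
logloss <- function(p, y) -(y * log(p) + (1 - y) * log(1 - p))
ce <- function(p, y) as.numeric((p > 0.5) != (y == 1))

delta_logloss <- logloss(p_ko, truth) - logloss(p_orig, truth)
delta_ce <- ce(p_ko, truth) - ce(p_orig, truth)

delta_logloss  # strictly positive: every probability moved away from the truth
delta_ce       # all zero: no predicted class changed
```

Every prediction got worse under the knockoff, yet the classification error registers no change because no probability crossed the threshold.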

The CPI offers several statistical tests to be calculated: The *t*-test (`"t"`, default), Wilcoxon signed-rank test (`"wilcox"`), binomial test (`"binom"`), Fisher permutation test (`"fisher"`) and Bayesian testing (`"bayes"`) with the package *BEST*. For example, we re-run the first example with Fisher's permutation test:

```r
cpi(task = tsk("wine"),
    learner = lrn("classif.ranger", predict_type = "prob", num.trees = 10),
    resampling = rsmp("cv", folds = 5),
    test = "fisher")
```
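All of the frequentist tests listed above can be understood as one-sided tests on the per-sample loss differences. The base-R sketch below applies them to a simulated vector `delta`; the data are hypothetical, and the sign-flipping scheme with `B` resamples is one common way to carry out a Fisher-type permutation test on paired differences, not necessarily the exact procedure used inside `cpi()`.

```r
set.seed(1)

# ASSUMPTION: hypothetical per-sample loss differences (knockoff minus original)
delta <- rnorm(100, mean = 0.3, sd = 1)

# One-sided tests of "the loss increases when the feature is replaced"
p_t <- t.test(delta, alternative = "greater")$p.value
p_wilcox <- wilcox.test(delta, alternative = "greater")$p.value
p_binom <- binom.test(sum(delta > 0), length(delta),
                      alternative = "greater")$p.value

# Permutation test by random sign flips: under the null hypothesis the
# differences are symmetric around zero, so their signs are exchangeable
B <- 1999
obs <- mean(delta)
perm <- replicate(B, mean(delta * sample(c(-1, 1), length(delta), replace = TRUE)))
p_fisher <- (sum(perm >= obs) + 1) / (B + 1)

c(t = p_t, wilcox = p_wilcox, binom = p_binom, fisher = p_fisher)
```

The tests differ in the assumptions they place on the distribution of the differences, which is why they can give different p-values on the same data.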

The CPI relies on a valid knockoff sampler for the data to be analyzed. By default, second-order Gaussian knockoffs from the package *knockoff* are used. However, any other knockoff sampler can be used by changing the `knockoff_fun` or the `x_tilde` argument in the `cpi()` function. Here, `knockoff_fun` expects a function taking a `data.frame` with the original data as input and returning a `data.frame` with the knockoffs. For example, we use sequential knockoffs from the *seqknockoff* package^[*seqknockoff* is not on CRAN yet; available here: https://github.com/kormama1/seqknockoff]:

```r
mytask <- as_task_regr(iris, target = "Petal.Length")
cpi(task = mytask,
    learner = lrn("regr.ranger", num.trees = 10),
    resampling = rsmp("cv", folds = 5),
    knockoff_fun = seqknockoff::knockoffs_seq)
```
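The interface described above, a function from `data.frame` to `data.frame`, can be illustrated with a deliberately simple base-R sketch. The function `permute_knockoffs` is hypothetical and not part of any package. It permutes each column independently, which is only a reasonable knockoff sampler if the features are mutually independent; it is shown to illustrate the expected input and output, not as a recommended sampler.

```r
set.seed(1)

# Hypothetical knockoff sampler: takes a data.frame, returns a data.frame
# of the same shape with each column permuted independently.
# ASSUMPTION: only sensible when the features are mutually independent.
permute_knockoffs <- function(x) {
  x_tilde <- as.data.frame(lapply(x, sample))
  names(x_tilde) <- names(x)
  x_tilde
}

# Feature data without the target Petal.Length and without Species
x <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Width")]
x_tilde <- permute_knockoffs(x)

str(x_tilde)
```

A function of this shape could then be supplied as `knockoff_fun = permute_knockoffs`, in the same way as `seqknockoff::knockoffs_seq` in the example above.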

The `x_tilde` argument directly takes the knockoff data:

```r
library(seqknockoff)

x_tilde <- knockoffs_seq(iris[, -3])
mytask <- as_task_regr(iris, target = "Petal.Length")
cpi(task = mytask,
    learner = lrn("regr.ranger", num.trees = 10),
    resampling = rsmp("cv", folds = 5),
    x_tilde = x_tilde)
```

Instead of calculating the CPI for each feature separately, we can also calculate it for groups of features by replacing data of whole groups with the respective knockoff data. In `cpi()` this can be done with the `groups` argument:

```r
cpi(task = tsk("iris"),
    learner = lrn("classif.glmnet", predict_type = "prob", lambda = 0.01),
    resampling = rsmp("holdout"),
    groups = list(Sepal = 1:2, Petal = 3:4))
```
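The idea of replacing a whole group at once can be sketched in base R. As before, this is an illustration under stated assumptions rather than the package's code: the data are simulated with mutually independent features, so that independent draws serve as valid knockoffs, and the group names `A` and `B` and the object `group_cpi` are hypothetical.

```r
set.seed(1)
n <- 400
feature_names <- c("a1", "a2", "b1", "b2")

# ASSUMPTION: independent features; only group A (a1, a2) carries signal
x <- as.data.frame(matrix(rnorm(n * 4), ncol = 4,
                          dimnames = list(NULL, feature_names)))
y <- x$a1 + x$a2 + rnorm(n)

train <- 1:200
test <- 201:400
fit <- lm(y ~ ., data = cbind(x, y = y)[train, ])

# Knockoffs for the test data (valid only because features are independent)
x_tilde <- as.data.frame(matrix(rnorm(length(test) * 4), ncol = 4,
                                dimnames = list(NULL, feature_names)))

groups <- list(A = c("a1", "a2"), B = c("b1", "b2"))
loss_orig <- (y[test] - predict(fit, newdata = x[test, ]))^2

group_cpi <- sapply(groups, function(g) {
  x_ko <- x[test, ]
  x_ko[, g] <- x_tilde[, g]   # replace all features of the group together
  mean((y[test] - predict(fit, newdata = x_ko))^2 - loss_orig)
})

group_cpi
```

Here the mean loss difference for group `A` should be clearly positive, while the one for group `B` should stay close to zero.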

For parallel execution, we need to register a parallel backend. Parallelization is performed over the features, i.e. the CPI for each feature is calculated in parallel. For example:

```r
doParallel::registerDoParallel(4)
cpi(task = tsk("wine"),
    learner = lrn("classif.ranger", predict_type = "prob", num.trees = 10),
    resampling = rsmp("cv", folds = 5))
```

```
options(old_digits)
```
