Introduction to the cpi package
In cpi: Conditional Predictive Impact

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
set.seed(2022)
old_digits <- options(digits=2)

Get started

The Conditional Predictive Impact (CPI) is a general test for conditional independence in supervised learning algorithms. It implements a conditional variable importance measure which can be applied to any supervised learning algorithm and loss function.

As a first example, we calculate the CPI for a random forest on the wine data with 5-fold cross validation:

library(mlr3)
library(mlr3learners)
library(cpi)

cpi(task = tsk("wine"), 
    learner = lrn("classif.ranger", predict_type = "prob", num.trees = 10),
    resampling = rsmp("cv", folds = 5))

The result is a CPI value for each feature, i.e. how much did the loss function change when the feature was replaced with its knockoff version, with corresponding standard errors, test statistics, p-values and confidence interval.

Interface with mlr3

The task, learner and resampling strategy are specified with the mlr3 package, which provides a unified interface for machine learning tasks and makes it quite easy to change these components. For example, we can change to regularized logistic regression and a simple holdout as resampling strategy:

cpi(task = tsk("wine"), 
    learner = lrn("classif.glmnet", predict_type = "prob", lambda = 0.01),
    resampling = rsmp("holdout"))

We refer to the mlr3 book for full introduction and reference.

The loss function used by the cpi() function is specified with measure. By default, the mean squared error (MSE) is used for regression and log-loss for classification. In mlr3, this corresponds to the measures "regr.mse" and "classif.logloss". We re-run the example above with simple classification error (ce):

cpi(task = tsk("wine"), 
    learner = lrn("classif.glmnet", lambda = 0.01),
    resampling = rsmp("holdout"), 
    measure = msr("classif.ce"))

Here we see more 0 CPI values because the classification error is less sensitive to small changes and hence results in lower power.

Statistical testing

The CPI offers several statistical tests to be calculated: The t-test ("t", default), Wilcoxon signed-rank test ("wilcox"), binomial test ("binom"), Fisher permutation test ("fisher") and Bayesian testing ("bayes") with the package BEST. For example, we re-run the first example with Fisher's permutation test:

cpi(task = tsk("wine"), 
    learner = lrn("classif.ranger", predict_type = "prob", num.trees = 10),
    resampling = rsmp("cv", folds = 5), 
    test = "fisher")

Knockoff procedures

The CPI relies on a valid knockoff sampler for the data to be analyzed. By default, second-order Gaussian knockoffs from the package knockoff are used. However, any other knockoff sampler can be used by changing the knockoff_fun or the x_tilde argument in the cpi() function. Here, knockoff_fun expects a function taking a data.frame with the original data as input and returning a data.frame with the knockoffs. For example, we use sequential knockoffs from the seqknockoff package^[seqknockoff is not on CRAN yet; available here: https://github.com/kormama1/seqknockoff]:

mytask <- as_task_regr(iris, target = "Petal.Length")
cpi(task = mytask, learner = lrn("regr.ranger", num.trees = 10), 
    resampling = rsmp("cv", folds = 5), 
    knockoff_fun = seqknockoff::knockoffs_seq)

The x_tilde argument directly takes the knockoff data:

library(seqknockoff)
x_tilde <- knockoffs_seq(iris[, -3])
mytask <- as_task_regr(iris, target = "Petal.Length")
cpi(task = mytask, learner = lrn("regr.ranger", num.trees = 10), 
    resampling = rsmp("cv", folds = 5), 
    x_tilde = x_tilde)

Group CPI

Instead of calculating the CPI for each feature separately, we can also calculate it for groups of features by replacing data of whole groups with the respective knockoff data. In cpi() this can be done with the groups argument:

cpi(task = tsk("iris"), 
    learner = lrn("classif.glmnet", predict_type = "prob", lambda = 0.01),
    resampling = rsmp("holdout"), 
    groups = list(Sepal = 1:2, Petal = 3:4))

Parallelization

For parallel execution, we need to register a parallel backend. Parallelization will be performed by the features, i.e. the CPI for each feature will be calculated in parallel. For example:

doParallel::registerDoParallel(4)
cpi(task = tsk("wine"), 
    learner = lrn("classif.ranger", predict_type = "prob", num.trees = 10),
    resampling = rsmp("cv", folds = 5))