```{r}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```
The `ffcr` R package constructs two families of transparent classification models: fast-and-frugal trees and tallying models. The book *Classification in the Wild: The Science and Art of Transparent Decision Making* [@katsikopoulos2021] describes these models, their applications, and the algorithms to construct them in detail.
A fast-and-frugal tree is a decision tree with a simple structure: at each node, one branch exits the tree while the other continues to the next node, until the final node is reached. A tallying model gives all pieces of evidence the same weight. The package contains two main functions: `fftree` to train fast-and-frugal trees and `tally` to train tallying models.
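As a conceptual illustration of tallying (a toy sketch of ours, not package code), a tallying model simply counts how many cues point toward the positive class and compares the count to a threshold:

```r
# Toy sketch, not part of ffcr: tally binary cues with equal weights and
# predict the positive class once the count reaches a threshold.
tally_by_hand <- function(cues, threshold = 2) {
  score <- sum(cues)  # every cue contributes a weight of exactly 1
  if (score >= threshold) "positive" else "negative"
}
tally_by_hand(c(TRUE, FALSE, TRUE, TRUE))  # three positive cues -> "positive"
```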
To illustrate the functionality of the package, we use the Liver data set [@ramana2011], which we obtained from the UCI Machine Learning Repository [@dua2017]. It contains 579 patients, of whom 414 have a liver condition and the other 165 do not. We predict which patients have a liver condition from medical measurements and the patients' age and gender.
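A quick look at the data (assuming the data set ships with the package under the name `liver`, as the examples below suggest):

```r
# Inspect the class distribution; the label names below are taken from the
# weighting example later in this vignette.
library(ffcr)
dim(liver)              # 579 rows: one per patient
table(liver$diagnosis)  # 414 "Liver disease" vs. 165 "No liver disease"
```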
The `fftree` function offers three different methods to train fast-and-frugal trees, named basic, greedy, and cross-entropy.^[In the book [@katsikopoulos2021], we refer to cross-entropy optimization [@rubinstein1999] as the best-fit method. It does not guarantee to find the best possible tree but often produces more accurate trees than the other methods.]
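The method is chosen via the `method` argument; assuming the method names are passed verbatim as strings, the two non-default methods would be selected like this:

```r
# Assumed invocation of the two non-default training methods; the greedy
# method (shown below) is the default.
fftree(liver, method = "basic", max_depth = 4)
fftree(liver, method = "cross-entropy", max_depth = 4)
```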
We train a fast-and-frugal tree on the Liver data set. When the first column of a data set contains the class labels that we want to predict, we can simply pass the data set as the first argument. We limit the size of the tree to at most four nodes. The greedy method is the default algorithm. It is fast and usually produces accurate trees.
```{r}
library(ffcr)
model <- fftree(liver, method = "greedy", max_depth = 4)
```
Alternatively, we can call the `fftree` function using the formula syntax. Here we train the fast-and-frugal tree using only a few selected features.
```{r}
fftree(diagnosis ~ age + albumin + proteins + aspartate, data = liver,
       max_depth = 4)
```
Printing the model shows the structure of the tree and its fitting performance on the data set. In addition to standard performance measures such as accuracy and the F1 score, the output also reports the depth of the tree, the number of unique features on which the tree splits, and the frugality, that is, the average number of nodes visited until a prediction is made.
```{r}
print(model)
```
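To make the frugality measure concrete, here is a toy calculation with made-up exit proportions (not output of the package):

```r
# Hypothetical example: if 60% of cases exit the tree at node 1, 25% at
# node 2, 10% at node 3, and 5% at node 4, the frugality is the expected
# number of nodes visited until a prediction is made.
sum(c(0.60, 0.25, 0.10, 0.05) * 1:4)  # = 1.6 nodes on average
```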
To visualize the tree, we use

```{r}
plot(model)
```
The sensitivity of this tree is very high, while the specificity is low: the tree nearly always predicts liver disease. This yields an accurate tree when the total number of misclassifications is the relevant performance metric. To increase the specificity of the tree, we can weight the observations such that both classes of patients receive the same share. Let p be the proportion of patients that have liver disease. We weight the patients with liver disease by 1 - p and the patients without the disease by p. With these weights, the specificity increases substantially:
```{r}
p <- sum(liver$diagnosis == "Liver disease") / nrow(liver)
weights <- c("No liver disease" = p, "Liver disease" = 1 - p)
model <- fftree(liver, weights = weights, cv = TRUE, max_depth = 4)
model
```
By default, the `fftree` function fits a fast-and-frugal tree to the complete data set. Here we have set `cv = TRUE` to additionally estimate the predictive performance of the tree using 10-fold cross-validation.
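To see what this entails, here is a minimal manual sketch of 10-fold cross-validation, using only the `fftree` and `predict` calls shown in this vignette; the fold assignment and accuracy computation are our own code, and `cv = TRUE` performs all of this internally:

```r
# Manual 10-fold cross-validation sketch; the package does this internally
# when cv = TRUE. Fold splitting and accuracy computation are our own code.
set.seed(1)
folds <- sample(rep(1:10, length.out = nrow(liver)))
accuracy <- numeric(10)
for (k in 1:10) {
  fit <- fftree(liver[folds != k, ], weights = weights, max_depth = 4)
  predicted <- predict(fit, newdata = liver[folds == k, ], type = "response")
  accuracy[k] <- mean(predicted == liver$diagnosis[folds == k])
}
mean(accuracy)  # average out-of-sample accuracy across the 10 folds
```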
To make predictions with a fast-and-frugal tree, we can use the `predict` function. It returns either the class labels (`type = "response"`) or the performance across the observations (`type = "metric"`). Note that for the latter, the class labels need to be included in the data that is passed to the `predict` function.
```{r}
model <- fftree(diagnosis ~ ., data = liver[1:300, ], weights = c(1 - p, p),
                max_depth = 4)
predict(model, newdata = liver[301:310, ], type = "response")
predict(model, newdata = liver[301:nrow(liver), ], type = "metric")
```
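The class-label predictions can also be inspected directly, for instance by cross-tabulating them against the observed labels (our own post-processing in base R, not a package feature):

```r
# Confusion matrix on the hold-out patients, built from the "response"
# predictions; this is plain base R, not part of ffcr.
predicted <- predict(model, newdata = liver[301:nrow(liver), ], type = "response")
table(predicted = predicted, observed = liver$diagnosis[301:nrow(liver)])
```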
The package implements two different methods to train tallying models: basic and cross-entropy. As for fast-and-frugal trees, we set a maximum size (`max_size`) of four to obtain a simple model, and we weight the observations to make sure that the tallying model strikes a good balance between sensitivity and specificity. The `predict` function is used in the same way as for fast-and-frugal trees.

<!--- The regression method only works on binary features. If the data contains numeric variables, these are split such that the Gini impurity is minimized. On the dichotomized data, a lasso regression model is trained. The tallying model does not use the actual weights of the regression model but only their signs (-1, 1). The size of the tallying model, that is, the number of features with nonzero weights, is determined by the maximum_size parameter (default = 6). The tallying model uses those features that are set to zero latest with an increasing degree of regularization.

The cross-entropy method is very similar to the cross-entropy method of the fast-and-frugal trees. It optimizes the threshold at which each feature is split, the direction of the feature, and which features to include in the tallying model. As for fast-and-frugal trees, the optimization can be tweaked using the `cross_entropy_control` function.

Training a tallying model and using it for prediction works just as it does for fast-and-frugal trees. --->
```{r}
p <- sum(liver$diagnosis == "Liver disease") / nrow(liver)
weights <- c("No liver disease" = p, "Liver disease" = 1 - p)
model <- tally(diagnosis ~ ., data = liver[1:300, ], weights = weights,
               max_size = 4)
model
predict(model, newdata = liver[301:nrow(liver), ], type = "metric")
```
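As with `fftree`, the formula interface can be used to restrict the candidate features; assuming `tally` accepts the same formula syntax as in the example above, a feature-restricted model would be trained like this:

```r
# Assumed analogue of the fftree formula example: train a tallying model on
# a few selected features only.
tally(diagnosis ~ age + albumin + proteins + aspartate, data = liver,
      max_size = 4)
```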