Creating FFTs with FFTrees()

knitr::opts_chunk$set(collapse = FALSE, 
                      comment = "#>", 
                      prompt = FALSE,
                      tidy = FALSE,
                      echo = TRUE, 
                      message = FALSE,
                      warning = FALSE,
                      # Default figure options:
                      dpi = 100,
                      fig.align = 'center',
                      fig.height = 6.0,
                      fig.width  = 6.5, 
                      out.width = "580px")
library(FFTrees)

Details on the FFTrees() function

This vignette starts by building a fast-and-frugal tree (FFT) from the heartdisease data (also used in the Tutorial: FFTs for heart disease and @phillips2017FFTrees), but then explores additional aspects of the FFTrees() function.

The goal of the FFTrees() function is to create FFTs from data and record all details of the problem specification, tree definitions, the classification process, and performance measures in an FFTrees object. FFTrees() can handle two sets of data (for training vs. testing) and creates a range of FFTs, each with distinct process and performance characteristics. As a result, evaluating the function may take some time (typically a few seconds) and the structure of the resulting FFTrees object is quite complex. But as FFTrees() is at the heart of the FFTrees package, it pays to understand its arguments and the structure of an FFTrees object.

Example: Predicting heart disease

An FFT generally addresses binary classification problems: it attempts to classify the outcomes of a criterion variable into one of two classes (i.e., True or False) based on a range of potential predictor variables (also known as cues or features). A corresponding problem from the domain of clinical diagnostics is:

knitr::include_graphics("../inst/CoronaryArtery.jpg")

To address this problem, the FFTrees package includes the heartdisease data. But rather than using the full dataset to fit our FFTs, we have split the data into a training set (heart.train) and a test set (heart.test). Here is a peek at the corresponding data frames:

# Training data: 
head(heart.train)

# Testing data:
head(heart.test)

The critical dependent variable (or binary criterion variable) is diagnosis. This variable indicates whether a patient has heart disease (diagnosis = TRUE) or not (diagnosis = FALSE). All other variables in the dataset (e.g., sex, age, and several biological measurements) can be used as predictors (also known as cues).

Creating trees with FFTrees()

To illustrate the difference between fitting and prediction, we will train the FFTs on heart.train and test their prediction performance on heart.test. Note that we can also automate the training / test split using the train.p argument in FFTrees(). Setting train.p to a proportion (e.g., train.p = .50) randomly assigns that proportion of the original data to the training set and uses the remaining cases for testing.
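As a rough illustration of what such a random split amounts to, here is a minimal base R sketch. It is not the package's internal code: the helper split_data() and the toy data frame are hypothetical and were introduced only for this example.

```r
# Hypothetical helper (not part of FFTrees): randomly split a data frame
# into a training and a test set, given a training proportion train.p.
split_data <- function(data, train.p = 0.5) {
  n        <- nrow(data)
  n_train  <- round(n * train.p)              # number of training cases
  ix_train <- sample.int(n, size = n_train)   # random row indices for training
  ix_test  <- setdiff(seq_len(n), ix_train)   # all remaining rows
  list(train = data[ix_train, , drop = FALSE],
       test  = data[ix_test,  , drop = FALSE])
}

# Toy data (made up for illustration):
set.seed(123)
toy <- data.frame(id = 1:10, diagnosis = rep(c(TRUE, FALSE), 5))

parts <- split_data(toy, train.p = 0.7)
nrow(parts$train)  # 7 training cases
nrow(parts$test)   # 3 test cases
```

Because the split is random, setting a seed beforehand keeps the training and test sets reproducible across runs.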

To create a set of FFTs, we use the FFTrees() function to create a new FFTrees object called heart.fft. Here, we specify diagnosis as the binary criterion (or dependent variable), and include all other (independent) variables with formula = diagnosis ~ .:

# Create an FFTrees object called heart.fft predicting diagnosis: 
heart.fft <- FFTrees(formula = diagnosis ~.,
                     data = heart.train,
                     data.test = heart.test)

If we wanted to consider only specific variables for our trees, like sex and age, we could specify formula = diagnosis ~ age + sex.

Elements of an FFTrees object

The FFTrees() function returns an object of the FFTrees class. An FFTrees object contains many elements. We can list them by printing their names:

# See the elements of an FFTrees object:
names(heart.fft)

Inspecting these elements provides a wealth of information on the range of FFTs contained in the current FFTrees object:

  1. criterion_name: The name of the (predicted) binary criterion variable.

  2. cue_names: The names of all potential cue variables (predictors) in the data.

  3. formula: The formula used to create the FFTrees object.

  4. trees: Information on all trees contained in the object, with list elements that specify their number (n), the best tree, as well as tree definitions, verbal descriptions (inwords), decisions, and the performance characteristics (stats and level_stats) of each FFT.

  5. data: The datasets used to train and test the FFTs (if applicable).

  6. params: The parameters used for constructing FFTs (their current number is given by length(names(heart.fft$params))).

  7. competition: Models and statistics of alternative classification algorithms (for test, train, and models).

  8. cues: Information on all cue variables (predictors), with list elements that specify their thresholds and performance stats when training FFTs.

Basic performance characteristics of FFTs

We can view basic information about the FFTrees object by printing its name. As the default tree construction algorithm (ifan) creates multiple trees with different exit structures, an FFTrees object typically contains many FFTs.

When printing an FFTrees object, we automatically see the performance on the training data (i.e., for fitting, rather than prediction) and obtain information about the tree with the highest value of the goal statistic. By default, the goal is set to weighted accuracy (wacc):

# Training performance of the best tree (on "train" data, given current goal):
heart.fft  # same as: print(heart.fft, data = "train")

Before interpreting any model output, we need to carefully distinguish between an FFT's "Training" (for fitting training data) and "Prediction" performance (for new test data). Unless we explicitly ask for print(heart.fft, data = "test"), the output of printing heart.fft will report on the fitting phase (i.e., data = "train" by default). To see the corresponding prediction performance, we could alternatively ask for:

# Prediction performance of the best training tree (on "test" data): 
print(heart.fft, data = "test")

When evaluating an FFT for either training or test data, we obtain a wide range of measures.

After some general information on the FFTrees object, we see a verbal Definition of the best FFT (FFT #1). Key information for evaluating an FFT's performance is contained in the Accuracy panel (for either training or prediction). Here is a description of the frequency counts and corresponding statistics provided:

| Statistic | Long name | Definition |
|:----- |:--------- |:----------------------------------|
| Frequencies: | | |
| hi | Number of hits | $N(\text{Decision} = 1 \land \text{Truth} = 1)$ |
| mi | Number of misses | $N(\text{Decision} = 0 \land \text{Truth} = 1)$ |
| fa | Number of false alarms | $N(\text{Decision} = 1 \land \text{Truth} = 0)$ |
| cr | Number of correct rejections | $N(\text{Decision} = 0 \land \text{Truth} = 0)$ |
| N | Number of cases | The total number of cases considered. |
| Probabilities: | | |
| acc | Accuracy | The percentage of cases that were correctly classified. |
| ppv | Positive predictive value | The percentage (or conditional probability) of positive decisions being correct (i.e., true positive cases). |
| npv | Negative predictive value | The percentage (or conditional probability) of negative decisions being correct (i.e., true negative cases). |
| wacc | Weighted accuracy | The weighted average of sensitivity and specificity, where sensitivity is weighted by sens.w (by default, sens.w = .50). |
| sens | Sensitivity | The percentage (or conditional probability) of true positive cases being correctly classified. |
| spec | Specificity | The percentage (or conditional probability) of true negative cases being correctly classified. |
| Frugality: | | |
| mcu | Mean cues used | On average, how many cues were needed to classify a case? |
| pci | Percent cues ignored | The percentage of cue information that was ignored when classifying cases with a given tree. This is identical to 1 - mcu / cues.n, where cues.n is the total number of cues in the data. |

Table 1: Description of FFTs' basic frequencies and corresponding accuracy and speed/frugality statistics.
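To make the relation between the four frequency counts and the probability statistics in Table 1 concrete, here is a minimal base R sketch. The helper acc_stats() is hypothetical (it is not a function of the FFTrees package) and the counts are made up for illustration.

```r
# Hypothetical helper (not part of FFTrees): derive the probability
# statistics of Table 1 from the four frequency counts.
acc_stats <- function(hi, mi, fa, cr, sens.w = 0.5) {
  n    <- hi + mi + fa + cr
  sens <- hi / (hi + mi)   # proportion of true positive cases detected
  spec <- cr / (cr + fa)   # proportion of true negative cases rejected
  c(acc  = (hi + cr) / n,
    ppv  = hi / (hi + fa),
    npv  = cr / (cr + mi),
    wacc = sens.w * sens + (1 - sens.w) * spec,
    sens = sens,
    spec = spec)
}

# Made-up counts for illustration:
stats <- acc_stats(hi = 40, mi = 10, fa = 20, cr = 30)
round(stats, 2)
```

With the default sens.w = .50, wacc reduces to the simple mean of sensitivity and specificity (often called balanced accuracy).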

To obtain the same information for another FFT of an FFTrees object, we can call print() with a numeric tree parameter. For instance, the following expression would provide the basic performance characteristics of Tree 3:

# Performance of alternative FFTs (Tree 3) in an FFTrees object: 
print(heart.fft, tree = 3, data = "test")

Alternatively, we could visualize the same tree and its performance characteristics by calling plot(heart.fft, tree = 3, data = "test").

See the Accuracy statistics vignette for details on the accuracy statistics used throughout the FFTrees package.

Cue performance information

Each cue has a decision threshold (regardless of whether or not the cue is actually used in a tree) that maximizes the goal value of that cue when it is applied to the entire training dataset. We can obtain cue accuracy statistics for the calculated decision thresholds from the cues$stats list. If the object has test data, we can also see the marginal cue accuracies in the test data (using the thresholds calculated from the training data). Here are the statistics for the training data:

# Decision thresholds and marginal classification training accuracies for each cue: 
heart.fft$cues$stats$train
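The idea of a threshold that maximizes a cue's goal value can be sketched in a few lines of base R. This is only an illustration under simplifying assumptions (a numeric cue, the fixed direction "cue > threshold", and wacc as the goal); it is not the algorithm used by the package. The helper best_threshold() and the data are hypothetical.

```r
# Hypothetical helper (not the package's algorithm): find the threshold t
# for which the rule "cue > t" maximizes weighted accuracy (wacc).
best_threshold <- function(cue, criterion, sens.w = 0.5) {
  candidates <- sort(unique(cue))
  wacc <- sapply(candidates, function(t) {
    decision <- cue > t
    sens <- mean(decision[criterion])     # hit rate among TRUE cases
    spec <- mean(!decision[!criterion])   # correct rejections among FALSE cases
    sens.w * sens + (1 - sens.w) * spec
  })
  best <- which.max(wacc)
  c(threshold = candidates[best], wacc = wacc[best])
}

# Made-up data for illustration:
age       <- c(35, 42, 50, 58, 61, 67, 70)
diagnosis <- c(FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE)

best_threshold(age, diagnosis)  # threshold = 50, wacc = 0.875
```

In this toy example, the rule "age > 50" detects all three positive cases and correctly rejects three of the four negative cases.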

We can also visualize the cue accuracies for the training data (in ROC space, i.e., showing each cue's hit rate by its false alarm rate) by calling plot() with the what = "cues" argument. This will show the sensitivities and specificities for each cue, with the top five cues highlighted and listed:

# Visualize individual cue accuracies: 
plot(heart.fft, what = "cues", 
     main = "Cue accuracy for heartdisease")

See the Plotting FFTrees objects vignette for more details on visualizing cue accuracies and FFTs.

Tree definitions

The trees$definitions data frame contains the definitions (cues, classes, exits, thresholds, and directions) of all trees in an FFTrees object. The combination of these five pieces of information (as well as their order) defines how a tree makes decisions:

# See the definitions of all trees:
heart.fft$trees$definitions

The levels of a tree definition are separated by semicolons (;). To understand how to read these definitions, let's start with Tree 1, the tree with the highest weighted accuracy during training.

From this information, we can understand and verbalize Tree 1 as follows:

  1. If thal is equal to either rd or fd, predict a positive criterion value.
  2. Otherwise, if cp is not equal to a, predict a negative value.
  3. Otherwise, if ca is greater than 0, predict a positive value,
    else predict a negative value.
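The three steps above can be sketched as a nested conditional in base R. This is a hypothetical re-implementation for illustration only (the package applies trees through its own functions), and the example cases are made up rather than taken from heartdisease.

```r
# Hypothetical re-implementation of Tree 1 (for illustration only):
tree1_decide <- function(thal, cp, ca) {
  ifelse(thal %in% c("rd", "fd"), TRUE,    # 1st node: exit with a positive decision
  ifelse(cp != "a",               FALSE,   # 2nd node: exit with a negative decision
         ca > 0))                          # final node: exits in both directions
}

# Made-up cases (cue values are not taken from heartdisease):
cases <- data.frame(thal = c("rd", "normal", "normal", "normal"),
                    cp   = c("a",  "np",     "a",      "a"),
                    ca   = c(0,    2,        1,        0),
                    stringsAsFactors = FALSE)

with(cases, tree1_decide(thal, cp, ca))  # TRUE FALSE TRUE FALSE
```

Note that the second case is classified without ever looking at ca: a case leaves the tree at the first node that offers a matching exit, which is what makes FFTs frugal.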

Note that heart.fft$trees$definitions also reveals that Tree 3 and Tree 5 use the same cues and cue directions as Tree 1. However, they differ in the exit structures of the first and second cues (or nodes).

Applying the inwords() function to an FFTrees object returns a verbal description of a tree. For instance, to obtain a verbal description of the tree with the highest training accuracy (i.e., Tree #1), we can ask for:

# Describe the best training tree (i.e., Tree #1):
inwords(heart.fft, tree = 1)

Accuracy statistics of FFTs

The performance of an FFT on a specific dataset is characterized by a range of accuracy statistics. Here are the training statistics for all trees in heart.fft:

# Training statistics for all trees:
heart.fft$trees$stats$train

The corresponding statistics for the test data are:

# Testing statistics for all trees:
heart.fft$trees$stats$test

See the Accuracy statistics vignette for the definitions of accuracy statistics used throughout the FFTrees package.

Classification decisions

The decisions list (trees$decisions) contains the raw classification decisions for each tree and each case, as well as detailed information on the costs of each classification.

For instance, here are the decisions made by Tree 1 on the training data:

# Inspect the decisions of Tree 1:
heart.fft$trees$decisions$train$tree_1

Predicting new cases with predict()

Once we have created an FFTrees object, we can use it to classify new data using predict().

First, we'll use the heart.fft object to make predictions for cases 1 through 10 of the heartdisease dataset. By default, the tree with the best training wacc value is used to predict the value of the binary criterion variable:

# Predict classes for new data from the best training tree: 
predict(heart.fft,
        newdata = heartdisease[1:10, ])

If we wanted to use an alternative FFT of an FFTrees object for predicting the criterion outcomes of new data, we could specify its number in the tree argument of the predict() function.

To predict class probabilities, we can include the type = "prob" argument. This will return a matrix of class probabilities, where the first column indicates the probability of a case being classified as 0 / FALSE, and the second column indicates the probability of a case being classified as 1 / TRUE:

# Predict class probabilities for new data from the best training tree:
predict(heart.fft,
        newdata = heartdisease,
        type = "prob")

Use type = "both" to get both classification and probability predictions for cases:

# Predict both classes and probabilities:
predict(heart.fft,
        newdata = heartdisease,
        type = "both")

Visualizing trees

Once we have created an FFTrees object using the FFTrees() function, we can visualize the tree (and ROC curves) using the plot() function. The following code will visualize the best training tree applied to the test data:

# Plot the best training tree, applied to the test data:
plot(heart.fft,
     data = "test",
     main = "Heart Disease",
     decision.labels = c("Healthy", "Disease"))

See the Plotting FFTrees objects vignette for more details on visualizing trees.

Manually defining an FFT

We can also design a specific FFT and apply it to a dataset by using the my.tree argument of FFTrees(). To do so, we specify the FFT as a sentence, making sure to correctly spell the cue names as they appear in the data. Sets of factor cues can be specified using (curly) brackets.

For example, we can manually define an FFT by using the sentence:

# Manually define a tree using the my.tree argument:
myheart_fft <- FFTrees(diagnosis ~., 
                       data = heartdisease, 
                       my.tree = "If chol > 300, predict True. 
                                  If thal = {fd,rd}, predict False. 
                                  Otherwise, predict True")

Here is a plot of the resulting FFT:

plot(myheart_fft, 
     main = "Specifying a manual FFT")

As we can see, the performance of this particular tree is pretty terrible, but this should motivate you to build better FFTs yourself!

In addition to defining an FFT from its verbal description, we can edit and define sets of FFT definitions and evaluate them on data. See the Manually specifying FFTs vignette for details on editing, modifying, and evaluating specific FFTs.

Vignettes

Here is a complete list of the vignettes available in the FFTrees package:

|   | Vignette | Description |
|--:|:------------------------------|:-------------------------------------------------|
|   | Main guide: FFTrees overview | An overview of the FFTrees package |
| 1 | Tutorial: FFTs for heart disease | An example of using FFTrees() to model heart disease diagnosis |
| 2 | Accuracy statistics | Definitions of accuracy statistics used throughout the package |
| 3 | Creating FFTs with FFTrees() | Details on the main FFTrees() function |
| 4 | Manually specifying FFTs | How to directly create FFTs without using the built-in algorithms |
| 5 | Visualizing FFTs | Plotting FFTrees objects, from full trees to icon arrays |
| 6 | Examples of FFTs | Examples of FFTs from different datasets contained in the package |

References




FFTrees documentation built on June 7, 2023, 5:56 p.m.