options(digits = 3) knitr::opts_chunk$set(echo = TRUE, fig.width = 7.5, fig.height = 7.5, dpi = 100, out.width = "600px", fig.align='center', message = FALSE)

```
library(FFTrees)
```

The `FFTrees()`

function is at the heart of the `FFTrees`

package. The function takes a training dataset as an argument, and generates several fast-and-frugal trees which attempt to classify cases into one of two classes (True or False) based on cues (aka., features).

knitr::include_graphics("../inst/CoronaryArtery.jpg")

We'll create FFTrees for heartdisease diagnosis data. The full dataset is stored as `heartdisease`

. For modelling purposes, I've split the data into a training (`heart.train`

), and test (`heart.test`

) dataframe. Here's how they look:

# Training data head(heartdisease) # Test data head(heartdisease)

The critical dependent variable is `diagnosis`

which indicates whether a patient has heart disease (`diagnosis = 1`

) or not (`diagnosis = 0`

). The other variables in the dataset (e.g.; sex, age, and several biological measurements) will be used as predictors (aka., cues).

`FFTrees()`

We will train the FFTs on `heart.train`

, and test their prediction performance in `heart.test`

. Note that you can also automate the training / test split using the `train.p`

argument in `FFTrees()`

. This will randomly split `train.p`

\% of the original data into a training set.

To create a set of FFTs, use `FFTrees()`

. We'll create a new FFTrees object called `heart.fft`

using the `FFTrees()`

function. We'll specify `diagnosis`

as the (binary) dependent variable, and include all independent variables with `formula = diagnosis ~ .`

# Create an FFTrees object called heart.fft predicting diagnosis heart.fft <- FFTrees(formula = diagnosis ~., data = heart.train, data.test = heart.test)

- If we wanted to only consider specific variables, like sex and age, for the trees we could do this by specifying
`formula = diagnosis ~ age + sex`

`FFTrees()`

returns an object with the FFTrees class. There are many elements in an FFTrees object, here are their names:

# Print the names of the elements of an FFTrees object names(heart.fft)

`formula`

: The formula used to create the FFTrees object.`data.desc`

: Basic information about the datasets.`cue.accuracies`

: Thresholds and marginal accuracies for each cue.`tree.definitions`

: Definitions of all trees in the object.`tree.stats`

: Classification statistics for all trees (tree definitions are also included here).`level.stats`

: Cumulative classification statistics for each level of each tree.`decision`

: Classification decisions for each case (row) for each tree (column).`levelout`

: The level at which each case (row) is classified for each tree (column).`auc`

: Area under the curve statistics`params`

: Parameters used in tree construction`comp`

: Models and statistics for alternative classification algorithms.

You can view basic information about the FFTrees object by printing its name. The default tree construction algorithm `ifan`

creates multiple trees with different exit structures. When printing an FFTrees object, you will see information about the tree with the highest value of the `goal`

statistic. By default, `goal`

is weighed accuracy `wacc`

:

```
# Print the object, with details about the tree with the best training wacc values
heart.fft
```

Here is a description of each statistic:

| statistic| long name | definition|
|:-----|:---------|:----------------------------------|
| `n`

|N | Number of cases|
| `mcu`

| Mean cues used| On average, how many cues were needed to classify cases? In other words, what percent of the available information was used on average.|
| `pci`

| Percent cues ignored| The percent of data that was *ignored* when classifying cases with a given tree. This is identical to the `mcu / cues.n`

, where `cues.n`

is the total number of cues in the data.|
| `sens`

| Sensitivity| The percentage of true positive cases correctly classified.|
| `spec`

| Specificity| The percentage of true negative cases correctly classified.|
| `acc`

| Accuracy | The percentage of cases that were correctly classified.|
| `wacc`

| Weighted Accuracy |Weighted average of sensitivity and specificity, where sensitivity is weighted by `sens.w`

(by default, `sens.w = .5`

) |

Each tree has a decision threshold for each cue (regardless of whether or not it is actually used in the tree) that maximizes the `goal`

value of that cue when it is applied to the entire training dataset. You can obtain cue accuracy statistics using the calculated decision thresholds from the `cue.accuracies`

list. If the object has test data, you can see the marginal cue accuracies in the test dataset (using the thresholds calculated from the training data):

# Show decision thresholds and marginal classification training accuracies for each cue heart.fft$cues$stats$train

You can also view the cue accuracies in an ROC plot with `plot()`

combined with the `what = "cues"`

argument. This will show the sensitivities and specificities for each cue, with the top 5 cues highlighted.

# Visualize individual cue accuracies plot(heart.fft, main = "Heartdisease Cue Accuracy", what = "cues")

The `tree.definitions`

dataframe contains definitions (cues, classes, exits, thresholds, and directions) of all trees in the object. The combination of these 5 pieces of information (as well as their order), define how a tree makes decisions.

# Print the definitions of all trees heart.fft$trees$definitions

To understand how to read these definitions, let's start by understanding tree `r heart.fft$trees$best$train`

, the tree with the highest training weighted accuracy

Separate levels in tree definitions are separated by colons `;`

. For example, tree 4 has 3 cues in the order `thal`

, `cp`

, `ca`

. The classes of the cues are `c`

(character), `c`

and `n`

(numeric). The decision exits for the cues are 1 (positive), 0 (negative), and 0.5 (both positive and negative). This means that the first cue only makes positive decisions, the second cue only makes negative decisions, and the third cue makes *both* positive and negative decisions.

The decision thresholds are `rd`

and `fd`

for the first cue, `a`

for the second cue, and `0`

for the third cue while the cue directions are `=`

for the first cue, `=`

for the second cue, and `>`

for the third cue. Note that cue directions indicate how the tree *would* make positive decisions *if* it had a positive exit for that cue. If the tree has a positive exit for the given cue, then cases that satisfy this threshold and direction are classified as positive. However, if the tree has only a negative exit for a given cue, then cases that do *not* satisfy the given thresholds are classified as negative.

From this, we can understand tree #4 verbally as follows:

*If thal is equal to either rd or fd, predict positive.*
*Otherwise, if cp is not equal to a, predict negative.*
*Otherwise, if ca is greater than 0, predict positive, otherwise, predict negative.*

You can use the `inwords()`

function to automatically return a verbal description of the tree with the highest training accuracy in an FFTrees object:

# Describe the best training tree inwords(heart.fft, tree = 1)

Here are the training statistics for all trees

# Print training statistics for all trees heart.fft$trees$stats$train

The `decision`

list contains the raw classification decisions for each tree for each case.

Here are is how decisions were made based on tree 1

# Look at the tree decisisions heart.fft$trees$decisions$train$tree_1

`predict()`

Once you've created an FFTrees object, you can use it to predict new data using `predict()`

. In this example, I'll use the `heart.fft`

object to make predictions for cases 1 through 50 in the heartdisease dataset. By default, the tree with the best training `wacc`

values is used.

# Predict classes for new data from the best training tree predict(heart.fft, newdata = heartdisease[1:10,])

To predict class probabilities, include the `type = "prob"`

argument, this will return a matrix of class predictions, where the first column indicates 0 / FALSE, and the second column indicates 1 / TRUE.

# Predict class probabilities for new data from the best training tree predict(heart.fft, newdata = heartdisease, type = "prob")

Use type = "both" to get both classification and probability predictions for cases

# Predict classes and probabilities predict(heart.fft, newdata = heartdisease, type = "both")

- See the vignette Plotting FFTrees objects for more details on visualizing trees.

Once you've created an FFTrees object using `FFTrees()`

you can visualize the tree (and ROC curves) using `plot()`

. The following code will visualize the best training tree applied to the test data:

plot(heart.fft, main = "Heart Disease", decision.labels = c("Healthy", "Disease"))

`my.tree`

- For complete details on specifying an FFT with
`my.tree`

, look at the vignette Specifying FFTs directly.

You can also define a specific FFT to apply to a dataset using the `my.tree`

argument. To do so, specify the FFT as a sentence, making sure to spell the cue names correctly as the appear in the data. Specify sets of factor cues using brackets. In the example below, I'll manually define an FFT using the sentence `"If chol > 300, predict True. If thal = {fd,rd}, predict False. Otherwise, predict True"`

# Define a tree manually using the my.tree argument myheart.fft <- FFTrees(diagnosis ~., data = heartdisease, my.tree = "If chol > 300, predict True. If thal = {fd,rd}, predict False. Otherwise, predict True") # Here is the result plot(myheart.fft, main = "Specifying an FFT manually")

As you can see, this FFT was pretty terrible

**Any scripts or data that you put into this service are public.**

Embedding an R snippet on your website

Add the following code to your website.

For more information on customizing the embed code, read Embedding Snippets.