```
knitr::opts_chunk$set(echo = TRUE)
library(C50)
library(modeldata)
```

The `C50` package contains an interface to the C5.0 classification model. The two main modes for this model are:

- a basic tree-based model
- a rule-based model

Many of the details of this model can be found in Quinlan (1993), although the model has new features that are described in Kuhn and Johnson (2013). The main public resource on this model is the RuleQuest website.

To demonstrate a simple model, we'll use the credit data that can be accessed in the `modeldata` package:

```
library(modeldata)
data(credit_data)
```

The outcome is in a column called `Status` and, to demonstrate a simple model, the `Home` and `Seniority` predictors will be used.

```
vars <- c("Home", "Seniority")
str(credit_data[, c(vars, "Status")])

# a simple split
set.seed(2411)
in_train   <- sample(1:nrow(credit_data), size = 3000)
train_data <- credit_data[ in_train, ]
test_data  <- credit_data[-in_train, ]
```

To fit a simple classification tree model, we can start with the non-formula method:

```
library(C50)
tree_mod <- C5.0(x = train_data[, vars], y = train_data$Status)
tree_mod
```
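As an aside, `C5.0` also has a formula interface. A minimal sketch of an equivalent fit (this call is not shown in the original vignette):

```
# Equivalent fit via the formula method (illustrative sketch)
tree_mod_f <- C5.0(Status ~ Home + Seniority, data = train_data)
tree_mod_f
```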

To understand the model, the `summary` method can be used to get the default `C5.0` command-line output:

```
summary(tree_mod)
```

A graphical method for examining the model can be generated by the `plot` method:

```
plot(tree_mod)
```

A variety of options are outlined in the documentation for the `C5.0Control` function.
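For illustration, a control object is passed through the `control` argument. A minimal sketch using two options that `C5.0Control` provides (this particular example is not from the original vignette):

```
# Sketch: require more samples per split (minCases) and enable
# predictor winnowing before the tree is grown
ctrl_mod <- C5.0(x = train_data[, vars], y = train_data$Status,
                 control = C5.0Control(minCases = 20, winnow = TRUE))
ctrl_mod
```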

Another option that can be used is the `trials` argument, which enables a boosting procedure. This method is more similar to AdaBoost than to more statistical approaches such as stochastic gradient boosting.

For example, using three iterations of boosting:

```
tree_boost <- C5.0(x = train_data[, vars], y = train_data$Status, trials = 3)
summary(tree_boost)
```

Note that the counting is zero-based. The `plot` method can also show a specific tree in the ensemble using the `trial` option.
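For instance, to display the last of the three boosted trees (a minimal sketch; because of the zero-based counting, the third tree is `trial = 2`):

```
# Plot the third boosting iteration (trials are indexed from zero)
plot(tree_boost, trial = 2)
```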

C5.0 can create an initial tree model and then decompose the tree structure into a set of mutually exclusive rules. These rules can then be pruned and modified into a smaller set of *potentially* overlapping rules. The rules can be created using the `rules` option:

```
rule_mod <- C5.0(x = train_data[, vars], y = train_data$Status, rules = TRUE)
rule_mod
summary(rule_mod)
```

Note that no pruning was warranted for this model.

There is no `plot` method for rule-based models.

The `predict` method can be used to get hard class predictions or class probability estimates (called "confidence values" in the documentation).

```
predict(rule_mod, newdata = test_data[1:3, vars])
predict(tree_boost, newdata = test_data[1:3, vars], type = "prob")
```

A cost matrix can also be used to emphasize certain classes over others. For example, to get more of the "bad" samples correct:

```
cost_mat <- matrix(c(0, 2, 1, 0), nrow = 2)
rownames(cost_mat) <- colnames(cost_mat) <- c("bad", "good")
cost_mat

cost_mod <- C5.0(x = train_data[, vars], y = train_data$Status,
                 costs = cost_mat)
summary(cost_mod)

# more samples predicted as "bad"
table(predict(cost_mod, test_data[, vars]))

# than previously
table(predict(tree_mod, test_data[, vars]))
```
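To make the shift concrete, the predictions can also be cross-tabulated against the observed classes in the test set. A minimal sketch, not part of the original vignette:

```
# Confusion matrices for the cost-sensitive and the original tree model
table(predicted = predict(cost_mod, test_data[, vars]),
      observed  = test_data$Status)
table(predicted = predict(tree_mod, test_data[, vars]),
      observed  = test_data$Status)
```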

