```r
library("tidyrules")
library("dplyr")
library("C50")
library("pander")

# build model
c5_model <- C5.0(Species ~ ., data = iris, rules = TRUE)

# extract rules in a tidy tibble
tidy_rules <- tidyRules(c5_model)

# view tidy_rules
tidy_rules %>%
  select(-c(rule_number, trial_number)) %>%
  pandoc.table()
```
Filter rules based on `RHS`, `support`, `confidence` or `lift`:
```r
# Example 1: filter rules based on support
tidy_rules %>%
  filter(support >= 48) %>%
  select(LHS, RHS)

# Example 2: filter rules based on RHS
tidy_rules %>%
  filter(RHS == "virginica") %>%
  select(LHS, support, confidence, lift)
```
Use a tidyrule in a `filter()` call:
```r
iris %>%
  filter(eval(parse(text = tidy_rules[3, "LHS"]))) %>% # filter using a C5 rule
  count(Species)
```
## C5.0
In this example we use the attrition data from the modeldata package. This illustration shows how to extract rules from a C5.0 model and apply `filter()` based on tidyrules.
```r
# loading packages
library("tidyrules")
library("C50")
library("dplyr")

# load attrition data
data("attrition", package = "modeldata")
attrition <- as_tibble(attrition)
glimpse(attrition)
```
As you can see, there are 31 variables and 1470 observations in this data-set. Our aim is to predict Attrition using the rest of the variables. Let us build a C5.0 model first.
```r
# our C5 model
c5_att <- C5.0(Attrition ~ ., data = attrition, rules = TRUE)

# sample rules from C5
c5_att$output %>%
  stringr::str_sub(start = 194L, end = 578L) %>%
  writeLines()
```
We get nice, human-readable rules. The problem with the C5.0 summary is that you can only read it to get a feel for how predictions are made. The hard part comes when you want to explore your data further and dig deeper: say you want to know which rules give high lift and confidence, or which rules cover a major sub-population. If your model produces too many rules, going through each one in the summary to identify the best rules is tedious.
What if we had all the rules in a tidy table format, so that we could easily use them on the data? Let's get that done using `tidyRules`.
```r
# extract rules to a tidy tibble
tr_att <- tidyRules(c5_att)
tr_att
```
Important columns to notice in the `tidyRules` output:

- `LHS`: the rule.
- `RHS`: the predicted class.
- `support`: number of observations covered by the rule.
- `confidence`: prediction accuracy for the respective class (Laplace correction is applied by default).
- `lift`: the rule's estimated accuracy divided by the relative frequency of the predicted class in the training set.

Let's have a look at the first five rules:
```r
tr_att %>%
  head(5) %>%
  select(LHS, RHS) %>%
  pandoc.table(split.cells = 60)
```
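As a quick sanity check on the `confidence` and `lift` definitions above, here is a back-of-the-envelope calculation. The counts below are illustrative, not taken from this model (shown in Python for brevity; the arithmetic is language-agnostic):

```python
# Hypothetical rule statistics (illustrative numbers only)
n_covered = 100      # observations covered by the rule (support)
n_correct = 95       # of those, correctly predicted
n_classes = 2        # number of target classes, for the Laplace correction
n_total = 1470       # training-set size
n_class_total = 237  # training observations of the predicted class

# confidence with Laplace correction: (correct + 1) / (covered + classes)
confidence = (n_correct + 1) / (n_covered + n_classes)

# base rate of the predicted class in the training data
base_rate = n_class_total / n_total

# lift: rule accuracy relative to the class's base rate
lift = confidence / base_rate

print(round(confidence, 3), round(lift, 2))
```

A lift well above 1 (as here) means the rule predicts its class far more accurately than guessing the class at its base rate.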
Now all the rules are in `tibble` (a tidy form of `data.frame`) format. Let us look at the rules which favor Attrition equal to "No", arranged by support.
```r
rules_example_1 <- tr_att %>%
  filter(RHS == "No") %>%
  arrange(desc(support))

rules_example_1
```
Let's use a rule within `filter()`. Say one needs to pick the rule with the largest `support` for predicted Attrition "Yes".
```r
# filter a rule with conditions
large_support_rule <- tr_att %>%
  filter(RHS == "Yes") %>%
  top_n(1, wt = support) %>%
  pull(LHS)

# parseable rule
parseable_rule <- parse(text = large_support_rule)

# apply filter on the data frame using the parseable rule
attrition %>%
  filter(eval(parseable_rule))
```
Rules can also be extracted in python and SQL syntax:

```r
tr_att_python <- tidyRules(c5_att, language = "python")
tr_att_sql <- tidyRules(c5_att, language = "sql")

head(tr_att_python$LHS)
head(tr_att_sql$LHS)
```
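A python-flavoured LHS string could then be applied to a pandas data frame. This is a hedged sketch: the data frame and the rule string below are hypothetical examples of that style, not output from the model above.

```python
import pandas as pd

# Hypothetical attrition-style data frame
df = pd.DataFrame({
    "OverTime": ["Yes", "No", "Yes", "No"],
    "MonthlyIncome": [2500, 6000, 3000, 8000],
})

# A hypothetical rule string in python boolean syntax
rule = "(OverTime == 'Yes') and (MonthlyIncome <= 3500)"

# DataFrame.query evaluates python-style boolean expressions row-wise
covered = df.query(rule)
print(len(covered))  # number of rows covered by the rule
```

This mirrors the R pattern `filter(eval(parse(text = ...)))` shown earlier: the rule travels as a string and is evaluated against the data where needed.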
## rpart
In this example we will use the BreastCancer data from the mlbench package.
```r
library("tidyrules")
library("dplyr")
library("rpart")

# BreastCancer data
data(BreastCancer, package = "mlbench")

bc_train <- BreastCancer %>%
  select(-Id) %>%
  mutate_if(is.ordered, function(x) factor(x, ordered = FALSE))

rpart_bc <- rpart(Class ~ ., data = bc_train)
```
NOTE: do not forget to convert all `ordered` features to `factor` type before training the model.
One can visualize the rpart decision tree using the `prp` function from the `rpart.plot` package.
```r
library("rpart.plot")
prp(rpart_bc)
```
The above tree visual is nice for getting a feel of how the decision tree splits at each node. But if you want to pick a terminal node, it is tedious, since one has to enter the respective filter manually (imagine a situation where you have hundreds of features and a huge tree!). To get rid of this problem, one can use tidyrules to make life easier.
Let's extract rules from the `rpart` object and use those rules to pick out terminal nodes.
```r
# tidyrule extract
rules_bc <- tidyRules(rpart_bc)
rules_bc

# filter the data using a rule
bc_train %>%
  filter(eval(parse(text = rules_bc[5, "LHS"]))) %>%
  as_tibble()
```
## Cubist

In this example, rule extraction from a regression model (a cubist model) is illustrated. We will use the AmesHousing dataset.
```r
library("tidyrules")
library("dplyr")
library("Cubist")

# ames housing data set
ames <- AmesHousing::make_ames()

cubist_ames <- cubist(
  x = ames[, setdiff(colnames(ames), c("Sale_Price"))],
  y = log10(ames[["Sale_Price"]]),
  committees = 3
)

# rule extract
rules_ames <- tidyRules(cubist_ames)
rules_ames
```
Notice that for cubist rules we have `mean`, `min`, `max` and `error` instead of `confidence` and `lift`. Here `mean`, `min` and `max` are calculated from the predicted values of the observations covered by each rule.
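As a rough sketch of how per-rule summaries like these could be computed, consider the predicted values of the observations a single rule covers (the numbers below are made up, shown in Python for brevity):

```python
# Hypothetical log10(Sale_Price) predictions for observations covered by one rule
preds = [5.02, 5.10, 4.95, 5.20, 5.05]

# per-rule summaries over the covered observations
rule_mean = sum(preds) / len(preds)
rule_min = min(preds)
rule_max = max(preds)

print(round(rule_mean, 3), rule_min, rule_max)
```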