```r
library("tidyrules")
library("dplyr")
library("C50")
library("pander")

# build model
c5_model <- C5.0(Species ~ ., data = iris, rules = TRUE)

# extract rules into a tidy tibble
tidy_rules <- tidyRules(c5_model)

# view tidy_rules
tidy_rules %>%
  select(-c(rule_number, trial_number)) %>%
  pandoc.table()
```
Rules can be filtered based on support, or on the predicted class (RHS):
```r
# Example 1: filter rules based on support
tidy_rules %>%
  filter(support >= 48) %>%
  select(LHS, RHS)

# Example 2: filter rules based on RHS
tidy_rules %>%
  filter(RHS == "virginica") %>%
  select(LHS, support, confidence, lift)
```
A tidyrule can be used directly inside a filter() call:
```r
# filter using a C5 rule
iris %>%
  filter(eval(parse(text = tidy_rules[3, "LHS"]))) %>%
  count(Species)
```
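To see what `filter(eval(parse(...)))` is doing under the hood, here is a minimal base-R sketch. The rule string below is hand-written for illustration; tidyRules produces strings of the same shape, but this exact rule is not taken from the model above:

```r
# a rule condition stored as a plain string (hand-written example)
rule <- "Petal.Length > 1.9 & Petal.Width <= 1.7"

# parse() turns the string into an unevaluated expression;
# eval() evaluates it with the data frame's columns in scope
keep <- eval(parse(text = rule), envir = iris)

matched <- iris[keep, ]
table(matched$Species)
```

Because the condition is just R code stored as text, any rule from the LHS column can be evaluated against the data this way.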
In this example we use the attrition data from the modeldata package. This illustration shows how to extract rules from a C5.0 model and apply them to the data using tidyrules.
```r
# loading packages
library("tidyrules")
library("C50")
library("dplyr")

# load attrition data
data("attrition", package = "modeldata")
attrition <- as_tibble(attrition)
glimpse(attrition)
```
As you can see, this data-set has 31 variables and 1470 observations. Our aim is to predict Attrition using the remaining variables. Let us build a C5.0 model first.
```r
# our C5 model
c5_att <- C5.0(Attrition ~ ., data = attrition, rules = TRUE)

# sample rules from C5
c5_att$output %>%
  stringr::str_sub(start = 194L, end = 578L) %>%
  writeLines()
```
We get nice, human-readable rules. The problem with the C5.0 summary is that you can only read it to get a feel for how predictions are made from the rules. The hard part comes when you want to explore your data further: say you want to find the rules with the highest lift and confidence, or the rules that cover the largest sub-populations. If your model produces many rules, going through each one in the printed summary to identify the best is tedious.
What if we had all the rules in a tidy table format, so that we could easily use them on the data? Let's get it done using tidyRules:
```r
# extract rules into a tidy tibble
tr_att <- tidyRules(c5_att)
tr_att
```
Important columns to notice in the tidyRules output:

- RHS: predicted class.
- support: number of observations covered by the rule.
- confidence: prediction accuracy for the respective class (Laplace correction is applied by default).
- lift: the rule's estimated accuracy divided by the relative frequency of the predicted class in the training set.
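As a quick sanity check on the lift definition, here is a small arithmetic sketch. The numbers 0.84 and 0.16 are made up for illustration, not taken from this model:

```r
# hypothetical rule: predicts "Yes" with confidence 0.84,
# while "Yes" makes up 16% of the training data
confidence <- 0.84
class_frequency <- 0.16

# lift = rule accuracy / relative frequency of the predicted class
lift <- confidence / class_frequency
lift
# 5.25
```

A lift well above 1 means the rule identifies the class far better than guessing from the base rate alone.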
Let's have a look at the first five rules:

```r
tr_att %>%
  head(5) %>%
  select(LHS, RHS) %>%
  pandoc.table(split.cells = 60)
```
Now, all the rules are in a tibble (a tidy form of a dataframe). Let us look at the rules which favor Attrition equal to "No", arranged by support:
```r
rules_example_1 <- tr_att %>%
  filter(RHS == "No") %>%
  arrange(desc(support))

rules_example_1
```
Let's use a rule within filter(). Say one needs to pick the rule with the largest support for predicted Attrition "Yes".
```r
# pick the rule with the largest support for "Yes"
large_support_rule <- tr_att %>%
  filter(RHS == "Yes") %>%
  top_n(1, wt = support) %>%
  pull(LHS)

# parseable rule
parseable_rule <- parse(text = large_support_rule)

# apply filter on the data frame using the parseable rule
attrition %>%
  filter(eval(parseable_rule))
```
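The same parse-and-evaluate pattern scales to many rules at once, for example to recompute the support of every rule. A base-R sketch on iris, with hand-written rule strings standing in for a tidyRules LHS column:

```r
# hand-written rule strings; a tidyRules LHS column has the same shape
rules <- c("Sepal.Length < 5.5",
           "Petal.Width > 1.7",
           "Sepal.Width >= 3 & Petal.Length < 5")

# number of rows each rule covers, i.e. its support
support <- vapply(
  rules,
  function(r) sum(eval(parse(text = r), envir = iris)),
  integer(1)
)
support
```

From here it is one step to, say, keeping only rules whose support exceeds a threshold.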
Rules can also be extracted in python- and SQL-compatible syntax:

```r
tr_att_python <- tidyRules(c5_att, language = "python")
tr_att_sql <- tidyRules(c5_att, language = "sql")

head(tr_att_python$LHS)
head(tr_att_sql$LHS)
```
In this example we will be using the BreastCancer data from the mlbench package.
```r
library("tidyrules")
library("dplyr")
library("rpart")

# BreastCancer data
data(BreastCancer, package = "mlbench")

bc_train <- BreastCancer %>%
  select(-Id) %>%
  mutate_if(is.ordered, function(x) factor(x, ordered = FALSE))

rpart_bc <- rpart(Class ~ ., data = bc_train)
```
NOTE: do not forget to convert all ordered features to unordered factors before training the model.
One could visualize the rpart decision tree using the prp function from the rpart.plot package.
The tree visual is a nice way to get a feel for how the decision tree splits at each node. But if you want to pick out a terminal node, it is tedious and error-prone, since you have to write the corresponding filter manually (imagine a situation with hundreds of features and a huge tree!). To get rid of this problem, one can use tidyrules to make life easier.
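For comparison, base rpart already exposes the manual route that tidyrules automates: path.rpart() lists the splits leading to a given node. A small sketch on a tree fitted to iris (fitted here purely for illustration):

```r
library(rpart)

# small illustrative tree
fit <- rpart(Species ~ ., data = iris)

# node numbers of the terminal nodes (leaves)
leaves <- as.integer(rownames(fit$frame)[fit$frame$var == "<leaf>"])

# splits along the path to the first leaf; this is the condition
# tidyRules hands you as a ready-made filter string instead
path.rpart(fit, nodes = leaves[1], print.it = FALSE)
```

The returned list gives the split conditions as text, but unlike a tidyRules LHS they are not directly usable inside filter() without further assembly.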
Let's extract rules from the rpart object and use them to pick out terminal nodes.
```r
# tidyrule extract
rules_bc <- tidyRules(rpart_bc)
rules_bc

# filter the data using a rule
bc_train %>%
  filter(eval(parse(text = rules_bc[5, "LHS"]))) %>%
  as_tibble()
```
In this example, rule extraction from a regression model (a cubist model) is illustrated. We will be using the AmesHousing dataset.
```r
library("tidyrules")
library("dplyr")
library("Cubist")

# ames housing data set
ames <- AmesHousing::make_ames()

cubist_ames <- cubist(
  x = ames[, setdiff(colnames(ames), c("Sale_Price"))],
  y = log10(ames[["Sale_Price"]]),
  committees = 3
)

# rule extract
rules_ames <- tidyRules(cubist_ames)
rules_ames
```
Notice that, for cubist rules, per-rule summary columns such as min and max are calculated from the predicted values covered by each rule.
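What such per-rule summary columns mean can be sketched in base R: take the rows a rule's LHS covers and summarise the numeric response over them. The rule string and the use of mtcars below are illustrative stand-ins, not Cubist internals:

```r
# hand-written rule condition on the built-in mtcars data
rule_lhs <- "wt > 3 & hp <= 150"

covered <- mtcars[eval(parse(text = rule_lhs), envir = mtcars), ]

# per-rule summaries of the response (mpg), analogous to the
# summary columns in a tidyRules cubist tibble
c(support = nrow(covered),
  mean    = mean(covered$mpg),
  min     = min(covered$mpg),
  max     = max(covered$mpg))
```

The min/max pair tells you the range of the response inside a rule's coverage, which is useful for judging how tight a regression rule is.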