knitr::opts_chunk$set(echo = TRUE)
library(rulefit)
data(titanic, package="binnr")

Installation

devtools::install_git("https://GravesEE@gitlab.ins.risk.regn.net/minneapolis-r-packages/rulefit.git")

Usage

Creating a RuleFit Model

A RuleFit model uses a tree ensemble to generate its rules. As such, a tree ensemble model must be provided to the rulefit function. This funciton returns a RuleFit object which can be used to mine rules and train rule ensembles.

mod <- gbm.fit(titanic[-1], titanic$Survived, distribution="bernoulli",
  interaction.depth=3, shrinkage=0.1, verbose = FALSE)

rf <- rulefit(mod, n.trees=100)

print(rf)

the rulefit function wraps a gbm model in a class that manages rule construction and model fitting. The rules are generated immediately but the model is not fit until the train function is called.

head(rf$rules)

For ease of programming every internal node is generated -- even the root node. That is why the first rule listed above is empty. Root nodes are not splits. This was a design decision and does not affect how the package is used in practice.

Training

Training a RuleFit model is as easy as calling the train method. The train method uses the cv.glmnet function from the glmnet package and accepts all of the same arguments.

Common Arguments

Argument | Purpose ---|--- x | Dataset of predictors that should match what was used for training the ensemble. y | Target variable to train against. family | What is the distribution of the target? Binomial for 0/1 variables. alpha | Penatly mixing parameter. LASSO regression uses the default of 0. nfolds | How many k-folds to train the model with. Defaults to 5. dfmax | How many variables should the final model have? parallel | TRUE/FALSE to build kfold models in parallel. Requires a backend.

fit <- train(rf, titanic[-1], y = titanic$Survived, family="binomial")

Bagging

Training the model on repeated, random samples with replacement can generate better parameter estimates. This is known as bagging.

library(doSNOW)

cl <- makeCluster(3)
registerDoSNOW(cl)

fit <- train(rf, titanic[-1], y = titanic$Survived, bag = 20, parallel = TRUE, 
  family="binomial")

stopCluster(cl)

Predicting

Once a RuleFit model is trained. Predictions can be produced by calling the predict method. As with the train function, predict also takes arguments accepted by predict.cv.glmnet. The most important of which is the lambda parameter, s. The default is to use s="lambda.min" which minimizes the out-of-fold error.

Both a score as well as a sparse matrix of rules can be predicted.

p_rf <- predict(fit, newx = titanic[-1], s="lambda.1se")

head(p_rf)

The out-of-fold predictions can also be extracted if the model was trained with keep=TRUE. Again, this is working with the cv.glmnet API. There is nothing magical going on here:

p_val <- fit$fit$fit.preval[,match(fit$fit$lambda.1se, fit$fit$lambda)]

Comparing RuleFit dev & val to GBM

p_gbm <- predict(mod, titanic[-1], n.trees = gbm.perf(mod, plot.it = F))

roc_rf <- pROC::roc(titanic$Survived, -p_rf)
roc_val <- pROC::roc(titanic$Survived, -p_val)
roc_gbm <- pROC::roc(titanic$Survived, -p_gbm)

plot(roc_rf)
par(new=TRUE)
plot(roc_val, col="blue")
par(new=TRUE)
plot(roc_gbm, col="red")

Rule Summary

RuleFit also provides a summary method to inspect and measure the coverage of fitted rules.

fit_summary <- summary(fit, s="lambda.1se", dedup=TRUE)
head(fit_summary)

Variable Importance

Like other tree ensemble techniques, variable importance can be calculated. This is different than the rule importance. Variable importance corresponds to the input variables used to generate the rules.

imp <- importance(fit, titanic[-1], s="lambda.1se")
plot(imp)


Zelazny7/rulefit documentation built on May 14, 2019, 8:20 a.m.