```r
knitr::opts_chunk$set(echo = TRUE)
library(rulefit)
data(titanic, package="binnr")
```

```r
devtools::install_git("https://GravesEE@gitlab.ins.risk.regn.net/minneapolis-r-packages/rulefit.git")
```
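The install command above relies on the devtools package; if it is not already available it can be installed from CRAN first:

```r
install.packages("devtools")
```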
A RuleFit model uses a tree ensemble to generate its rules, so a tree ensemble model must be provided to the `rulefit` function. This function returns a RuleFit object which can be used to mine rules and train rule ensembles.
```r
library(gbm)  # gbm.fit() is provided by the gbm package

# fit a boosted tree ensemble from which candidate rules will be mined
mod <- gbm.fit(titanic[-1], titanic$Survived, distribution="bernoulli",
  interaction.depth=3, shrinkage=0.1, verbose = FALSE)

rf <- rulefit(mod, n.trees=100)
print(rf)
```
The `rulefit` function wraps a gbm model in a class that manages rule construction and model fitting. The rules are generated immediately, but the model is not fit until the `train` function is called.
```r
head(rf$rules)
```
For ease of programming, every internal node generates a rule, even the root node. That is why the first rule listed above is empty: root nodes are not splits. This was a design decision and does not affect how the package is used in practice.
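As a quick check, the empty root rule and the total number of mined rules can be inspected directly. This is a sketch that assumes `rf$rules` behaves like an ordinary list or character vector, as the `head()` call above suggests:

```r
# the first rule comes from a root node and is therefore empty
rf$rules[[1]]

# total number of rules mined from the ensemble
length(rf$rules)
```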
Training a RuleFit model is as easy as calling the `train` method. The `train` method uses the `cv.glmnet` function from the glmnet package and accepts all of the same arguments.
Argument | Purpose
---|---
x | Dataset of predictors that should match what was used for training the ensemble.
y | Target variable to train against.
family | What is the distribution of the target? Binomial for 0/1 variables.
alpha | Penalty mixing parameter. LASSO regression uses the default of 1.
nfolds | How many k-folds to train the model with. Defaults to 5.
dfmax | How many variables should the final model have?
parallel | TRUE/FALSE to build k-fold models in parallel. Requires a registered backend.
```r
fit <- train(rf, titanic[-1], y = titanic$Survived, family="binomial")
```
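The remaining arguments from the table above are passed through in the same way. The sketch below assumes they are forwarded to `cv.glmnet` as described; the specific values are illustrative only:

```r
# illustrative only: pass cv.glmnet-style controls through train()
fit_small <- train(rf, titanic[-1], y = titanic$Survived, family = "binomial",
  alpha = 1,    # LASSO penalty
  nfolds = 5,   # number of cross-validation folds
  dfmax = 10)   # cap the number of rules kept in the final model
```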
Training the model on repeated, random samples with replacement can generate better parameter estimates. This is known as bagging.
```r
library(doSNOW)

cl <- makeCluster(3)
registerDoSNOW(cl)

fit <- train(rf, titanic[-1], y = titanic$Survived, bag = 20, parallel = TRUE,
  family="binomial")

stopCluster(cl)
```
Once a RuleFit model is trained, predictions can be produced by calling the `predict` method. As with the `train` function, `predict` also takes arguments accepted by `predict.cv.glmnet`. The most important of these is the lambda parameter, `s`. The default is `s="lambda.min"`, which minimizes the out-of-fold error.
Both a score and a sparse matrix of rules can be predicted.
```r
p_rf <- predict(fit, newx = titanic[-1], s="lambda.1se")
head(p_rf)
```
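For comparison, the same call can be made with the default `s="lambda.min"`, which scores against the lambda that minimizes the out-of-fold error. This sketch simply mirrors the call above:

```r
# same call, using the lambda that minimizes out-of-fold error
p_rf_min <- predict(fit, newx = titanic[-1], s = "lambda.min")
head(p_rf_min)
```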
The out-of-fold predictions can also be extracted if the model was trained with `keep=TRUE`. Again, this is working with the `cv.glmnet` API. There is nothing magical going on here:
```r
p_val <- fit$fit$fit.preval[, match(fit$fit$lambda.1se, fit$fit$lambda)]
```
```r
p_gbm <- predict(mod, titanic[-1], n.trees = gbm.perf(mod, plot.it = F))

roc_rf  <- pROC::roc(titanic$Survived, -p_rf)
roc_val <- pROC::roc(titanic$Survived, -p_val)
roc_gbm <- pROC::roc(titanic$Survived, -p_gbm)

plot(roc_rf)
par(new=TRUE)
plot(roc_val, col="blue")
par(new=TRUE)
plot(roc_gbm, col="red")
```
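To attach a number to each curve, the area under the curve can be pulled from the same objects; `pROC::auc` works directly on the `roc` objects created above:

```r
# AUC for the RuleFit score, the out-of-fold predictions, and the raw gbm model
pROC::auc(roc_rf)
pROC::auc(roc_val)
pROC::auc(roc_gbm)
```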
RuleFit also provides a summary method to inspect and measure the coverage of fitted rules.
```r
fit_summary <- summary(fit, s="lambda.1se", dedup=TRUE)
head(fit_summary)
```
As with other tree ensemble techniques, variable importance can be calculated. This is different from rule importance: variable importance corresponds to the input variables used to generate the rules.
```r
imp <- importance(fit, titanic[-1], s="lambda.1se")
plot(imp)
```