```r
knitr::opts_chunk$set(echo = TRUE)

library(rulefit)
data(titanic, package = "binnr")
```
```r
devtools::install_git("https://GravesEE@gitlab.ins.risk.regn.net/minneapolis-r-packages/rulefit.git")
```
A RuleFit model uses a tree ensemble to generate its rules, so a tree ensemble model must be provided to the `rulefit` function. This function returns a RuleFit object which can be used to mine rules and train rule ensembles.
```r
library(gbm)  # gbm.fit comes from the gbm package

mod <- gbm.fit(titanic[-1], titanic$Survived, distribution = "bernoulli",
               interaction.depth = 3, shrinkage = 0.1, verbose = FALSE)

rf <- rulefit(mod, n.trees = 100)
print(rf)
```
The `rulefit` function wraps a gbm model in a class that manages rule construction and model fitting. The rules are generated immediately, but the model is not fit until the `train` function is called.
```r
head(rf$rules)
```
For ease of programming, a rule is generated for every node in the ensemble, even the root nodes. That is why the first rule listed above is empty: a root node is not a split. This was a design decision and does not affect how the package is used in practice.
Training a RuleFit model is as easy as calling the `train` method. The `train` method uses the `cv.glmnet` function from the glmnet package and accepts all of the same arguments.
Argument | Purpose
---|---
x | Dataset of predictors that should match what was used for training the ensemble.
y | Target variable to train against.
family | What is the distribution of the target? Binomial for 0/1 variables.
alpha | Penalty mixing parameter. The glmnet default of 1 gives LASSO regression; 0 gives ridge.
nfolds | How many k-folds to train the model with. Defaults to 5.
dfmax | How many variables should the final model have?
parallel | TRUE/FALSE to build k-fold models in parallel. Requires a backend.
```r
fit <- train(rf, titanic[-1], y = titanic$Survived, family = "binomial")
```
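The remaining arguments in the table are passed through in the same way. A minimal sketch (the values here are purely illustrative, not recommendations):

```r
# Illustrative only: alpha = 0 swaps the LASSO penalty for ridge, dfmax
# caps how many terms the final model may keep, and nfolds controls the
# cross-validation; all are forwarded to cv.glmnet.
fit_ridge <- train(rf, titanic[-1], y = titanic$Survived, family = "binomial",
                   alpha = 0, nfolds = 10, dfmax = 10)
```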
Training the model on repeated, random samples with replacement can generate better parameter estimates. This is known as bagging.
```r
library(doSNOW)

cl <- makeCluster(3)
registerDoSNOW(cl)

fit <- train(rf, titanic[-1], y = titanic$Survived, bag = 20, parallel = TRUE,
             family = "binomial")

stopCluster(cl)
```
Once a RuleFit model is trained, predictions can be produced by calling the `predict` method. As with the `train` function, `predict` also takes arguments accepted by `predict.cv.glmnet`. The most important of these is the lambda parameter, `s`. The default is `s = "lambda.min"`, which minimizes the out-of-fold error.
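A common alternative is `s = "lambda.1se"`, the largest lambda whose cross-validated error is within one standard error of the minimum, which gives a sparser model. A minimal sketch comparing scores under the two choices:

```r
# Score the training data at both standard cv.glmnet lambda choices
# and check how closely the two sets of predictions agree.
p_min <- predict(fit, newx = titanic[-1], s = "lambda.min")
p_1se <- predict(fit, newx = titanic[-1], s = "lambda.1se")
cor(as.vector(p_min), as.vector(p_1se))
```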
Both a score and a sparse matrix of rules can be predicted.
```r
p_rf <- predict(fit, newx = titanic[-1], s = "lambda.1se")
head(p_rf)
```
The out-of-fold predictions can also be extracted if the model was trained with `keep = TRUE`. Again, this is just working with the `cv.glmnet` API; there is nothing magical going on here.
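The training calls above did not set `keep`, so as a minimal sketch (assuming `train` forwards `keep` to `cv.glmnet` like its other arguments), first refit with it enabled:

```r
# keep = TRUE asks cv.glmnet to retain the prevalidated out-of-fold
# predictions (fit.preval); assumes train passes the argument through.
fit <- train(rf, titanic[-1], y = titanic$Survived, family = "binomial", keep = TRUE)
```

The prevalidated predictions can then be pulled out at the chosen lambda: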
```r
p_val <- fit$fit$fit.preval[, match(fit$fit$lambda.1se, fit$fit$lambda)]
```
```r
p_gbm <- predict(mod, titanic[-1], n.trees = gbm.perf(mod, plot.it = FALSE))

roc_rf  <- pROC::roc(titanic$Survived, -p_rf)
roc_val <- pROC::roc(titanic$Survived, -p_val)
roc_gbm <- pROC::roc(titanic$Survived, -p_gbm)

plot(roc_rf)
par(new = TRUE)
plot(roc_val, col = "blue")
par(new = TRUE)
plot(roc_gbm, col = "red")
```
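The overlaid ROC curves can be hard to compare by eye; calling `pROC::auc` on the same objects gives a numeric summary:

```r
# Area under each ROC curve: the RuleFit fit, its out-of-fold
# predictions, and the underlying gbm ensemble.
pROC::auc(roc_rf)
pROC::auc(roc_val)
pROC::auc(roc_gbm)
```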
RuleFit also provides a summary method to inspect and measure the coverage of fitted rules.
```r
fit_summary <- summary(fit, s = "lambda.1se", dedup = TRUE)
head(fit_summary)
```
Like other tree ensemble techniques, variable importance can be calculated. This is different from rule importance: variable importance corresponds to the input variables used to generate the rules.
```r
imp <- importance(fit, titanic[-1], s = "lambda.1se")
plot(imp)
```