treezy


Makes handling output from decision trees easy. Treezy.

Decision trees are widely used in statistics and data science, but extracting information from a fitted tree can be tricky, which makes trees awkward to use as one step in a larger analysis pipeline.

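treezy makes it easy to:

- explore variable importance with importance_table and importance_plot
- calculate residual sums of squares for rpart and randomForest models
- plot partial effects with gg_partial_plot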

The data structures created in treezy, such as importance_table, are making their way over to the broomstick package, a member of the broom family that focuses specifically on decision trees. Decision trees produce different output from many of the (many!) packages and analyses that broom deals with. I am interested in feedback, so please feel free to file an issue if you have any problems!


Installation

# install.packages("remotes")
remotes::install_github("njtierney/treezy")

Example usage

Explore variable importance with importance_table and importance_plot

rpart

library(treezy)
library(rpart)

fit_rpart_kyp <- rpart(Kyphosis ~ ., data = kyphosis)
# default way of looking at variable importance in rpart
fit_rpart_kyp$variable.importance

# with treezy

importance_table(fit_rpart_kyp)

importance_plot(fit_rpart_kyp)

# extend and modify
library(ggplot2)
importance_plot(fit_rpart_kyp) + 
    theme_bw() + 
    labs(title = "My Importance Scores",
         subtitle = "For a CART Model")
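
To make the idea of a tidy importance table concrete, here is a minimal sketch that builds a comparable data frame by hand from the `variable.importance` element that rpart stores in the fitted object. The column names `variable` and `importance` and the object name `imp_df` are assumptions chosen for illustration; the source does not show the exact structure that `importance_table()` returns.

```r
library(rpart)

# Same model as above: a classification tree on the kyphosis data
fit_rpart_kyp <- rpart(Kyphosis ~ ., data = kyphosis)

# rpart stores importance as a named numeric vector
imp <- fit_rpart_kyp$variable.importance

# Reshape into one row per variable (illustrative column names)
imp_df <- data.frame(
  variable   = names(imp),
  importance = unname(imp),
  row.names  = NULL
)

# Sort so the most important variable comes first
imp_df <- imp_df[order(imp_df$importance, decreasing = TRUE), ]

imp_df
```

A data frame in this shape can be passed straight to ggplot2 or joined with other model summaries, which is the kind of pipeline step that a raw named vector makes awkward.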

randomForest

library(randomForest)
set.seed(131)
fit_rf_ozone <- randomForest(Ozone ~ ., 
                             data = airquality, 
                             mtry=3,
                             importance=TRUE, 
                             na.action=na.omit)

fit_rf_ozone

## show the "importance" of variables: higher values mean more important

# randomForest has a dedicated importance() function, whereas rpart
# only stores the scores in the fitted object
importance(fit_rf_ozone)

## use importance_table
importance_table(fit_rf_ozone)

# now plot it
importance_plot(fit_rf_ozone)

Calculate residual sums of squares for rpart and randomForest

# CART
rss(fit_rpart_kyp)

# randomForest
rss(fit_rf_ozone)
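
For reference, the residual sum of squares of a regression model is the sum of squared differences between observed and predicted values. The sketch below computes it by hand for a regression tree fitted with rpart on the built-in `mtcars` data. The model and the names `fit_rpart_mpg` and `rss_manual` are assumptions for illustration, and this is not necessarily how treezy's `rss()` is implemented internally.

```r
library(rpart)

# A regression tree (numeric response), fitted on a built-in dataset
fit_rpart_mpg <- rpart(mpg ~ ., data = mtcars)

# Predictions on the training data
pred <- predict(fit_rpart_mpg, newdata = mtcars)

# Residual sum of squares: sum of squared (observed - predicted)
rss_manual <- sum((mtcars$mpg - pred)^2)

rss_manual
```

The same value can be obtained from `sum(residuals(fit_rpart_mpg)^2)`, which is a quick way to check the calculation.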

Plot partial effects

Using gbm.step from the dismo package

# using gbm.step from the dismo package
library(gbm)
library(dismo)
# load data
data(Anguilla_train)

anguilla_train <- Anguilla_train[1:200,]

# fit model
angaus_tc_5_lr_01 <- gbm.step(data = anguilla_train,
                              gbm.x = 3:14,
                              gbm.y = 2,
                              family = "bernoulli",
                              tree.complexity = 5,
                              learning.rate = 0.01,
                              bag.fraction = 0.5)
gg_partial_plot(angaus_tc_5_lr_01,
                var = c("SegSumT",
                        "SegTSeas"))

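A partial effect plot shows how the model's average prediction changes as one predictor varies while the others keep their observed values. The sketch below illustrates that general idea with a small rpart regression tree on `mtcars`, so that it runs without gbm or dismo. The helper `partial_dependence()` is hypothetical and written only for this example; it is not part of treezy, and `gg_partial_plot()` may compute its values differently.

```r
library(rpart)

# Small regression tree used only to illustrate the calculation
fit <- rpart(mpg ~ wt + hp + disp, data = mtcars)

# Hypothetical helper: average prediction over the data for each
# value of `var` on a grid, holding other predictors as observed
partial_dependence <- function(model, data, var, n_grid = 20) {
  grid <- seq(min(data[[var]]), max(data[[var]]), length.out = n_grid)
  avg_pred <- vapply(grid, function(v) {
    data_v <- data
    data_v[[var]] <- v            # set every row to the same value
    mean(predict(model, newdata = data_v))
  }, numeric(1))
  data.frame(value = grid, prediction = avg_pred)
}

pd_wt <- partial_dependence(fit, mtcars, "wt")
head(pd_wt)
```

Plotting `prediction` against `value` as a line gives a basic partial effect curve for that variable.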
Known issues

Future work

Acknowledgements

Credit for the name, "treezy", goes to @MilesMcBain, thanks Miles!



njtierney/treezy documentation built on Oct. 10, 2019, 1:08 a.m.