Introduction to the Package tree.bins"

knitr::opts_chunk$set(collapse = T, comment = "#>")
options(tibble.print_min = 4L, tibble.print_max = 4L)
#knitr::opts_chunk$set(error = TRUE)
library(dplyr)
library(rpart)
library(ggplot2)
library(tree.bins)

Introduction

When working with large data sets, there may be a need to recategorize the factors by some criterion. The tree.bins package allows users to recategorize these variables through a decision tree method derived from the rpart() function of the rpart library. The tree.bins() function is especially useful if the data set contains several factor class variables, which many of those variables contain an abnormal amount of levels. The intended purpose of the library is to recategorize predictors in order to limit the number of dummy variables created when applying a statistical method to model a response. This document illustrates a typical problem where the tree.bins library would be used and how it would be used.

Pre-Categorization: Typical Variable for Consideration

This section illustrates a typical variable that could be considered for recategorization

Visualization of Candidate Variable

I use a subset of the Ames data set to illustrate. The below chunk illustrates the average home sale price of each Neighborhood.

AmesSubset %>% 
  select(SalePrice, Neighborhood) %>% 
  group_by(Neighborhood) %>% 
  summarise(AvgPrice = mean(SalePrice)/1000) %>% 
  ggplot(aes(x = reorder(Neighborhood, -AvgPrice), y = AvgPrice, fill = Neighborhood)) +
  geom_bar(stat = "identity") + 
  labs(x = "Neighborhoods", y = "Avg Price (in thousands)", 
       title = paste0("Average Home Prices of Neighborhoods") , fill = "Neighborhoods") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) 

Notice that many neighborhoods observe the same average sale price. This indicates that we could recategorize the neighborhoods variable into fewer levels.

Statistical Method Implementation of Candidate Variable

The following illustrates the results of using a statistical learning method – linear regression for this example – on a categorical variable with several levels.

fit <- lm(formula = SalePrice ~ Neighborhood, data = AmesSubset)
summary(fit)

Notice that there are multiple dummy variables being created to capture the different levels found within the Neighborhoods variable.

Visualizing the Bins Created by a Decision Tree

The below steps illustrate how rpart() categorizes the different levels of Neighborhoods into the separate leaves. These leaves are used to generate the mappings extracted within tree.bins() to recategorize the current data.

d.tree = rpart(formula = SalePrice ~ Neighborhood, data = AmesSubset)
rpart.plot::rpart.plot(d.tree)

These 5 categories is what tree.bins() will use to recategorize the variable Neighborhood.

Post-Categorization: Typical Variable for Consideration

This section illustrates the result of using tree.bins() to recategorize a typical variable.

Recategorization of Candidate Variable

Continuing from the above example, we can clearly identify that there are similarities in many of the levels within the Neighborhoods variable in relation to the response. To limit the number of dummy variables that are created in a statistical learning method, we would like to group the categories that display similar associations with the responses into one bin. We could create visualizations to identify these similarities in levels for each variable, but it would an extremely tedious task not to mention subjective to the analyst.

A better method would be to use the rules that are generated from a decision tree. This can be accomplished by using the rpart() function in the rpart library. However, this task remains tedious, especially when there are numerous factor class variables. The tree.bins() function allows the user to iteratively recategorize each factor level variable for the specified data set.

sample.df <- AmesSubset %>% select(Neighborhood, MS.Zoning, SalePrice)
binned.df <- tree.bins(data = sample.df, y = SalePrice, bin.nm = "bin#.", control = rpart.control(cp = .01), return = "new.fctrs")
levels(sample.df$Neighborhood) #current levels of Neighborhood
unique(binned.df$Neighborhood) #new levels of Neighborhood

The Different Return Options of tree.bins()

Depending on what is the most useful information to the user, tree.bins() can return either the recategorized data.frame or a list comprised of lookup tables. The lookup tables contain the old to new value mappings generated by tree.bins().

The "new.fctrs" returns the recategorized data.frame

head(binned.df)

The "lkup.list" returns a list of the lookup tables

lookup.list <- tree.bins(data = sample.df, y = SalePrice, bin.nm = "bin#.", control = rpart.control(cp = .01), return = "lkup.list")
head(lookup.list[[1]])

Using the bin.oth() Function

Using tree.bins() the user will be able to recategorize factor class variables of one particular data.frame. Let’s assume, that down the road, they obtain a similar dataset that contains the same old categorical convention. In this case, a user may want to recategorize this new data.frame by the same lookup tables that were generated from the first data.frame. In this case, being able to bin other data.frames with the same lookup table would be quite useful. The example below takes in a subset of the AmesSubset data and returns a data.frame recategorized by the lookup list generated from the tree.bins() function.

oth.binned.df <- bin.oth(list = lookup.list, data = sample.df)
head(oth.binned.df)


Try the tree.bins package in your browser

Any scripts or data that you put into this service are public.

tree.bins documentation built on May 2, 2019, 12:20 p.m.