knitr::opts_chunk$set(collapse = T, comment = "#>") options(tibble.print_min = 4L, tibble.print_max = 4L) #knitr::opts_chunk$set(error = TRUE) library(dplyr) library(rpart) library(ggplot2) library(tree.bins)
When working with large data sets, there may be a need to recategorize the factors by some criterion. The tree.bins package allows users to recategorize these variables through a decision tree method derived from the rpart() function of the rpart library. The tree.bins() function is especially useful if the data set contains several factor class variables, which many of those variables contain an abnormal amount of levels. The intended purpose of the library is to recategorize predictors in order to limit the number of dummy variables created when applying a statistical method to model a response. This document illustrates a typical problem where the tree.bins library would be used and how it would be used.
This section illustrates a typical variable that could be considered for recategorization
I use a subset of the Ames data set to illustrate. The below chunk illustrates the average home sale price of each Neighborhood.
AmesSubset %>% select(SalePrice, Neighborhood) %>% group_by(Neighborhood) %>% summarise(AvgPrice = mean(SalePrice)/1000) %>% ggplot(aes(x = reorder(Neighborhood, -AvgPrice), y = AvgPrice, fill = Neighborhood)) + geom_bar(stat = "identity") + labs(x = "Neighborhoods", y = "Avg Price (in thousands)", title = paste0("Average Home Prices of Neighborhoods") , fill = "Neighborhoods") + theme_bw() + theme(axis.text.x = element_text(angle = 90, hjust = 1))
Notice that many neighborhoods observe the same average sale price. This indicates that we could recategorize the neighborhoods variable into fewer levels.
The following illustrates the results of using a statistical learning method – linear regression for this example – on a categorical variable with several levels.
fit <- lm(formula = SalePrice ~ Neighborhood, data = AmesSubset) summary(fit)
Notice that there are multiple dummy variables being created to capture the different levels found within the Neighborhoods variable.
The below steps illustrate how rpart() categorizes the different levels of Neighborhoods into the separate leaves. These leaves are used to generate the mappings extracted within tree.bins() to recategorize the current data.
d.tree = rpart(formula = SalePrice ~ Neighborhood, data = AmesSubset) rpart.plot::rpart.plot(d.tree)
These 5 categories is what tree.bins() will use to recategorize the variable Neighborhood.
This section illustrates the result of using tree.bins() to recategorize a typical variable.
Continuing from the above example, we can clearly identify that there are similarities in many of the levels within the Neighborhoods variable in relation to the response. To limit the number of dummy variables that are created in a statistical learning method, we would like to group the categories that display similar associations with the responses into one bin. We could create visualizations to identify these similarities in levels for each variable, but it would an extremely tedious task not to mention subjective to the analyst.
A better method would be to use the rules that are generated from a decision tree. This can be accomplished by using the rpart() function in the rpart library. However, this task remains tedious, especially when there are numerous factor class variables. The tree.bins() function allows the user to iteratively recategorize each factor level variable for the specified data set.
sample.df <- AmesSubset %>% select(Neighborhood, MS.Zoning, SalePrice) binned.df <- tree.bins(data = sample.df, y = SalePrice, bin.nm = "bin#.", control = rpart.control(cp = .01), return = "new.fctrs") levels(sample.df$Neighborhood) #current levels of Neighborhood unique(binned.df$Neighborhood) #new levels of Neighborhood
Depending on what is the most useful information to the user, tree.bins() can return either the recategorized data.frame or a list comprised of lookup tables. The lookup tables contain the old to new value mappings generated by tree.bins().
The "new.fctrs" returns the recategorized data.frame
head(binned.df)
The "lkup.list" returns a list of the lookup tables
lookup.list <- tree.bins(data = sample.df, y = SalePrice, bin.nm = "bin#.", control = rpart.control(cp = .01), return = "lkup.list") head(lookup.list[[1]])
Using tree.bins() the user will be able to recategorize factor class variables of one particular data.frame. Let’s assume, that down the road, they obtain a similar dataset that contains the same old categorical convention. In this case, a user may want to recategorize this new data.frame by the same lookup tables that were generated from the first data.frame. In this case, being able to bin other data.frames with the same lookup table would be quite useful. The example below takes in a subset of the AmesSubset data and returns a data.frame recategorized by the lookup list generated from the tree.bins() function.
oth.binned.df <- bin.oth(list = lookup.list, data = sample.df) head(oth.binned.df)
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.