knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
Categorical features often are a key part of many modern datasets in different industries. When it comes to performing supervised learning on these datasets, tree-based models theoretically (and in practice with certain software, like h2o) handle raw categorical features, but most other models do not. Many of the popular tree-based packages even require all numeric predictors. There are several other packages that encode categorical features as numeric features, including CatEncoders, dummies, FeatureHashing, fastDummies, h2o, makedummies, recipes, and vtreat. These packages either provide a more limited selection of encoding options and/or are signifantly larger in scope and much more heavyweight. cattonum
aims to provide a one-stop-shop for categorical encodings. Nothing more, nothing less.
We'll demonstrate how to use cattonum
by predicting flight delays (dep_delay
) in the the nycflights13::flights
dataset using random forests built with ranger
.
library(nycflights13) library(ranger) library(cattonum) suppressPackageStartupMessages(library(dplyr)) set.seed(4444) data(flights) str(flights)
There are a lot of flights here and we don't want our model training to take forever, so let's only take a subset of these observations. To simplify our analysis, we'll analyze only the three airlines with the most flights, consider only flights that were delayed in taking off, and remove some features.
airlines_to_keep <- flights %>% count(carrier) %>% top_n(3, n) %>% pull(carrier) flights <- flights %>% filter(carrier %in% airlines_to_keep, dep_delay > 0) %>% select(-c(year, dep_time, sched_dep_time, arr_time, sched_arr_time, arr_delay, flight, tailnum, time_hour))
In order to get more out of our time features, we'll do a quick transformation using the technique described here. Also, month
and day
are currently integers, but they really are categorical, so we now turn them into characters (or factors, it doesn't matter for cattonum
). For simplicity and to maintain focus on cattonum
, simply drop observations with missing values.
tot_mins <- 24 * 60 flights <- flights %>% mutate(min_of_day = 60 * hour + minute, cos_min_of_day = cos(2 * pi * min_of_day / tot_mins), sin_min_of_day = cos(2 * pi * min_of_day / tot_mins)) %>% select(-c(min_of_day, hour, minute)) %>% mutate(month = as.character(month), day = as.character(day)) %>% filter(complete.cases(.)) str(flights)
Now we turn to encoding our categorical features. Consider a comparison between label encoding, mean encoding, and a mix of frequency, label, and mean encoding. First, note a few key facts about the catto_*
functions.
train
, ...
, response
, and test
. train
holds the training data, ...
is where the columns to be encoded are specified (if not supplied, all character and factor columns are encoded), response
is the name of the response column for catto_loo
and catto_mean
, and test
(if supplied) holds the test data.data.frame
or a tibble
, whichever was passed.test
is not supplied, the functions return a data.frame
or tibble
, as described above. Otherwise, they return a length-two list holding the relevant encoded datasets with names train
and test
.dplyr
-style pipelines using %>%
from magrittr
.data.frame
or tibble
, and features can be specified in many different ways like in dplyr
. For example, the following are all equivalent for a data frame named dat
with columns x1
and x2
.catto_label(dat) catto_label(dat, x1, x2) catto_label(dat, c(x1, x2)) catto_label(dat, c("x1", "x2")) catto_label(dat, one_of(c("x1", "x2"))) # one_of is exported by dplyr catto_label(dat, one_of("x1", "x2"))
Here we make the encoded datasets.
label_encoded <- flights %>% catto_label() str(label_encoded) mean_encoded <- flights %>% catto_mean(response = dep_delay) str(mean_encoded) mix_encoded <- flights %>% catto_freq(dest) %>% catto_label(origin) %>% catto_mean(response = dep_delay) str(mix_encoded)
Now we can finally build the models. We define a short function get_oob_error
that builds an untuned random forest and returns the out-of-bag error.
encodings <- list(label = label_encoded, mean = mean_encoded, mix = mix_encoded) get_oob_error <- function(dat) { rf <- ranger(data = as.data.frame(dat), # ranger can't handle tibbles num.trees = 100, dependent.variable.name = "dep_delay") rf$prediction.error } lapply(encodings, get_oob_error)
Mean encoding gives us the lowest OOB error, followed by the mixed encodings and label encoding. This modeling setup of simply looking at OOB score on untuned random forests of 100 trees is not really a fair comparison, but it demonstrates the basic features of cattonum
.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.