```r
set.seed(43170)
knitr::opts_chunk$set(echo = TRUE, results = "hold", collapse = TRUE, comment = "# >")
options(tibble.print_min = 5, tibble.print_max = 5)
```
If your healthcareai models are running out of memory or are very slow, you might be thinking that something is wrong with the package or that you need a bigger computer. While that can be frustrating, there are several tricks that you can use to speed things up and save memory. Read on to see some options for using healthcareai with a large data set.
If you have a dataset with more than 20k rows or 50 columns, you might run into performance issues. The size of your data and the type of model training you use influence how long it will take. Some easy steps to reduce training time:

- Categorical columns with many unique values get expanded into many columns by `prep_data`. Use `get_best_levels` or `prep_data`'s `collapse_rare_factors` to limit them.
- `prep_data` transforms date columns into many columns by default. Use `convert_dates = FALSE` to prevent that.
- Use `flash_models` instead of `machine_learn` to train fewer models during development work.
- Use `machine_learn` or `tune_models` with faster settings. Pick 1 model, and use a tune depth of 5, as in the sketch below.

The plot below provides a rough idea of how long you can expect model training to take for various sizes of datasets, models, and tuning settings. The largest dataset there, approximately 200k rows x 100 columns, requires about 30 minutes to train all three models if tuning isn't performed, and about six hours to tune all three models.
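A minimal sketch of those development settings, using the `d` and `arr_delay` names this vignette uses later (substitute your own data and outcome):

```r
library(healthcareai)

# Quick development run: one algorithm and a shallower hyperparameter search
m_dev <- machine_learn(d, outcome = arr_delay, models = "rf", tune_depth = 5)

# Or skip tuning entirely while iterating on features
d_prepped <- prep_data(d, outcome = arr_delay)
m_flash <- flash_models(d_prepped, outcome = arr_delay, models = "rf")
```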
In general, glm tunes very efficiently; however, due to its linear constraints, it may not always provide a performant model. xgb is fast if tuning isn't performed, but it is a complex model for which tuning can be both important and computationally intensive. When you decide to use xgb, we recommend a final round of model tuning with `tune_depth` turned up at least several times higher than the default value of 10. Tuning time increases linearly with `tune_depth`, so you can expect turning it up from 10 to 30 to approximately triple model training time. Both xgb and rf can be quite expensive to tune; this is a result of healthcareai exploring some computationally expensive regions of their hyperparameter space. It may be more efficient to examine the results of an initial random search with `plot(models)` and then tune over the most promising region in hyperparameter space by passing a `hyperparameters` data frame to `tune_models`. You can see how the default hyperparameter search spaces are defined in `healthcareai:::get_random_hyperparameters`.
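For example, a focused follow-up search might look like the sketch below. The hyperparameter names follow caret's ranger engine for random forest (`mtry`, `splitrule`, `min.node.size`), and the list-of-data-frames format is an assumption; check `?tune_models` for the exact format it expects and `healthcareai:::get_random_hyperparameters` for the default ranges.

```r
# A sketch: zero in on a promising random forest region found via plot(models).
# d_clean is assumed to be a data frame already prepared with prep_data().
rf_grid <- expand.grid(
  mtry = c(10, 20, 30),
  splitrule = "extratrees",
  min.node.size = c(1, 5, 10),
  stringsAsFactors = FALSE
)
m <- tune_models(d_clean, outcome = arr_delay, models = "rf",
                 hyperparameters = list(rf = rf_grid))
```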
If you want to squeeze every ounce of performance from one of these models, we suggest iteratively zeroing in on the region of hyperparameter space that optimizes performance.
If you are running out of memory, it's probably because your data is too large for the operations that you've asked R to do.
In general, machine learning can ignore useless columns, but with large data sets it's better to use fewer, more-predictive columns to save computational cost. Remember, machine learning should be iterative: add columns and see if they help, remove them and see if performance suffers, and try different transformations.
This document walks through some steps to help you build a better model in less time.
Start by loading up the flights dataset.
```r
library(tidyverse)
library(nycflights13)
library(healthcareai)

d <- nycflights13::flights
d
```
You can get the size of a dataset with `dim`, `nrow`, `ncol`, and `object.size`.
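For example:

```r
dim(d)   # rows and columns at once
nrow(d)  # number of rows
ncol(d)  # number of columns
format(object.size(d), units = "MB")  # approximate memory footprint
```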
40.6 MB really doesn't seem like that much. Your computer probably has at least 8 GB of RAM. But you are going to prepare that data for machine learning and then train several models to see which one fits the data best. Larger data sets take more memory and time to process. Despite this data's modest size, training a model on it would take a long time.
```r
os <- object.size(d)
print(paste("Data is", round(as.numeric(os) / 1000000, 1), "mb."))
```
In general, 336k rows and 19 columns is large but should be workable. Keep in mind that the data changes as you manipulate it, though. For example, if we `prep_data`, the size of the data changes. This is because `prep_data` transforms, adds, and removes columns to prepare your data for machine learning. It will only modify columns, never rows.
Prepping this data increased the number of columns from 19 to 155. The size went up an order of magnitude, to 408 MB. The categorical columns, those made up of characters or factors, are the reason for the extra columns and thus the size change. The amount of memory your data takes up is proportional to the number of cells in the table, rows times columns: 336,776 rows x 155 columns x 8 bytes per numeric value comes to roughly 418 MB, right in line with the measured size.
R works with data "in-memory." When the size of the data loaded in the session exceeds available memory, R starts using the hard disk as memory, which is about an order of magnitude slower. Check out the Grid View of the Environment tab in RStudio. You can sort by object size to see if you're lugging around a bunch of big items and what's taking up space. You can remove items from the active environment with `rm(object_name)`; this doesn't touch anything on disk, but you'll have to re-load or re-create the object to get it back in R. More detail on memory usage can be found in Hadley Wickham's book, "Advanced R."
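For example, a quick way to audit your session's memory from the console, using only base R:

```r
# Size in bytes of each object in the global environment, largest first
obj_sizes <- sapply(ls(), function(nm) object.size(get(nm)))
sort(obj_sizes, decreasing = TRUE)

# Remove an object you no longer need (old_data is a hypothetical name),
# then ask R to reclaim the freed memory
rm(old_data)
gc()
```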
```r
library(healthcareai)

d <- d %>%
  mutate(arr_delay = case_when(arr_delay > 15 ~ "Y", TRUE ~ "N"),
         flight_id = 1:nrow(d)) %>%
  select(-dep_time, -arr_time, -dep_delay, hour, minute)

d_clean <- d %>%
  prep_data(outcome = arr_delay, tailnum,
            collapse_rare_factors = FALSE, add_levels = FALSE)

dim(d_clean)
os <- object.size(d_clean)
print(paste("Prepped data is", round(as.numeric(os) / 1000000, 1), "mb."))
```
Categorical columns must be multiplied into dummy columns for machine learning. For example, `carrier` has 16 unique values, and those will be turned into 16 unique columns. These dummy columns are made up of 1s and 0s. There is one dummy column for each unique value in the original column except one; the missing value can be inferred from a 0 in the rest of the columns. For example, if your gender column contained `Male` and `Female`, you would get 1 dummy column, `Gender_Male`, where males are `1` and females are `0`. If you wanted that dummy column to instead be `Gender_Female`, you can use the `prep_data` argument `ref_levels`, like this: `ref_levels = c(gender = "Female")`. `prep_data` then adds a column to collect missing values in the original column.
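Outside of `prep_data`, base R's `model.matrix` illustrates the same reference-level idea:

```r
# "Female" is the first factor level alphabetically, so it becomes the
# reference; one genderMale column of 1s and 0s encodes both categories
gender <- factor(c("Male", "Female", "Female", "Male"))
model.matrix(~ gender)
```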
Be careful with categorical columns, as they can blow up your data set. If you had a column of DRG codes (or `tailnum`, which was removed above) with hundreds of unique values, your data would grow by hundreds of columns!
print(paste("Carrier has", unique(d$carrier) %>% length(), "unique values.")) d_clean %>% select(starts_with("carrier"))
One of the best ways to reduce training time and memory requirements is to limit the number of columns in the data. Here are some easy ways to do it.
Since character variables with lots of unique values, or high-cardinality variables, cause very wide datasets, the best way to limit width is to understand which columns those are and whether or not you need them in your data. How many unique values are in a character column, like `dest`?
```r
d %>%
  summarize_if(~ is.character(.x) | is.factor(.x), n_distinct)
```
What about how those values correlate with the outcome? In this case, we'll visualize `dest`, looking at the proportion of each destination's flights that belong to the delayed group.
```r
d %>%
  ggplot(aes(x = reorder(dest, arr_delay, function(x) -sum(x == "Y") / length(x)),
             fill = arr_delay)) +
  geom_bar(stat = "count", position = "fill") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = .5)) +
  xlab("Destination") +
  ylab("Proportion Delayed")
```
But really, you want to know what the distribution of all your variables looks like. You should profile your data using the `DataExplorer` package. It gives you all sorts of good info about your data and generates an HTML report that you can refer back to later to know if your data might have changed. Check out an example here.
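A minimal sketch of that workflow; `create_report` is `DataExplorer`'s one-call profiler, and the `y` argument names the outcome column:

```r
library(DataExplorer)

# Profiles every column and writes an HTML report to the working directory
create_report(d, y = "arr_delay")
```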
The simplest option is to remove categorical or date columns with more than, say, 50 categories. You might throw out some information, but you can always add it back in later if you need it. You'll notice that's what I did above by ignoring `tailnum`, which has 4000 categories and likely doesn't contain much useful information.

We saw in the plot that `dest` definitely has useful information, as some destinations are rarely delayed while others are routinely delayed. 105 categories is simply too many to just include it, though. Here, the `collapse_rare_factors` argument to `prep_data` can help by lumping some of the rare categories into an "other" category. This example puts any category that contains less than 2% of the total data into "other"; `dest` is shrunk to the 16 most common destinations.
```r
d_clean2 <- d %>%
  mutate(hour_of_day = lubridate::hour(time_hour)) %>%
  prep_data(outcome = arr_delay, tailnum, collapse_rare_factors = 0.02)

d_clean2 %>%
  select(starts_with("dest"))
```
By default, `prep_data` will also transform dates into circular numeric columns that can be used by a machine learning model. The simplest conversion is to use categorical columns that eventually become 0/1 columns for each month and day of week. That method is interpretable and will capture the fact that March is not greater than February. However, it will cause the dataset to grow large. Circular representation allows all the same information to be stored in fewer, numeric columns. It ensures that the switch from December (12) to January (1) is equivalent to changing 1 month. Fewer columns are created, and models will train faster.

Using circular numeric dates, the one date-time column is converted into 7 numeric columns.
```r
library(lubridate)

d_clean2 <- d %>%
  select(-year, -month, -day) %>%
  prep_data(outcome = arr_delay, tailnum,
            collapse_rare_factors = 0.02, convert_dates = TRUE)

d_clean2 %>%
  select(starts_with("time_hour"))
```
Using categorical dates, the one date column is turned into 23 sparse (mostly 0) columns.
```r
d_clean2 <- d %>%
  select(-year, -month, -day) %>%
  prep_data(outcome = arr_delay, tailnum,
            collapse_rare_factors = 0.02, convert_dates = "categories")

d_clean2 %>%
  select(starts_with("time_hour"))
```
Some categories, like `tailnum`, are not handled well by the `collapse_rare_factors` argument. There are too many categories, and they are fairly evenly distributed. They all get moved into category `other`, which doesn't help at all. You could remove them by passing them to the `...` argument of `prep_data`, as I did with `tailnum`.
An alternative to grouping rare categories is to use `add_best_levels` to make columns for the categories (or levels) that are likely to help differentiate the outcome variable. This function can be used for any grouping variable, like zip code.

The following example will identify 20 tail numbers that could be predictive of the outcome. Instead of removing tail number entirely, these are good levels to try adding back onto the data with `bind_cols`; a sketch of doing so follows the chunk below. Similarly, you could select the best `dest` values by using `dest` as the grouping variable.
```r
data <- d %>%
  select(flight_id, arr_delay)
ls <- d %>%
  select(flight_id, tailnum, arr_delay)

d_best_levels <- data %>%
  add_best_levels(longsheet = ls,
                  id = flight_id,
                  groups = tailnum,
                  outcome = arr_delay,
                  n_levels = 20,
                  min_obs = 50,
                  missing_fill = 0)
d_best_levels
```
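A sketch of attaching those indicator columns back onto the full dataset. `bind_cols` works when row order is guaranteed to match; joining on `flight_id` is a safer equivalent:

```r
# Drop the raw tailnum column and attach the 20 best-level indicator columns
d_with_levels <- d %>%
  select(-tailnum) %>%
  left_join(select(d_best_levels, -arr_delay), by = "flight_id")
```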
At some point, adding more rows of data will no longer improve performance. Better data will always win against more of the same bad data.
If you have a large data set, like `flights`, you probably have enough information to do machine learning on a subset and still get good results.
Some common ways to reduce length:
Sample a random subset of data. This keeps half the original data with the same ratio of delayed to not delayed.
```r
stratified_sample_d <- d %>%
  split_train_test(outcome = arr_delay, percent_train = .5)
stratified_sample_d <- stratified_sample_d$train

d %>% count(arr_delay) %>% arrange(desc(n))
stratified_sample_d %>% count(arr_delay) %>% arrange(desc(n))
```
Use only data from the last year or two. Healthcare data often goes back several years, but you might find that data from 5 years ago isn't as predictive as more current data anyway!
```r
# From an integer month column
d_recent <- d %>% filter(month >= 6)

# From a date column
d_recent <- d %>% filter(lubridate::month(time_hour) >= 6)
```
In the case of unbalanced data sets, where there are many more "No" than "Yes" outcomes, throw out some of the "No" rows.
```r
downsampled_d <- caret::downSample(d, as.factor(d$arr_delay)) %>%
  select(-Class, -tailnum)

downsampled_d %>% count(arr_delay) %>% arrange(desc(n))
```
Let's train a model to predict whether a plane arrived more than 15 minutes late. The first thing you might do is use `machine_learn`.
I stopped this code after I saw the warning message.
```r
m <- machine_learn(downsampled_d, outcome = arr_delay)
```
```
Training new data prep recipe
Removing the following 1 near-zero variance column(s). If you don't want to remove them, call prep_data with remove_near_zero_variance = FALSE.
  year
arr_delay looks categorical, so training classification algorithms.
You've chosen to tune 150 models (n_folds = 5 x tune_depth = 10 x length(models) = 3) on a 336,776 row dataset. This may take a while...
Training with cross validation: Random Forest
```
Notice that `machine_learn` prints a message saying it will be training 150 models! Under the hood, this function is doing some seriously rigorous machine learning for you:

- Preparing your data for machine learning with `prep_data`
- Training three different algorithms: random forest, XGBoost, and regularized regression
- Trying 10 hyperparameter combinations for each algorithm
- Evaluating every model with 5-fold cross validation
These steps ensure you get the best-performing model while still being resistant to overfitting. They come at the cost of computational complexity. You can speed things up by using a shorter model list, `models = "rf"`, less hyperparameter tuning, `tune_depth = 3`, and fewer folds, `n_folds = 4`. Just know that you're cutting corners for the sake of time. We recommend doing your development work quickly using `flash_models` but then doing the full tune before saving your final model.
We can use `prep_data` and `flash_models` to get an idea of how long our models will take to train. `flash_models` requires at least 2 folds, meaning 2 models, each trained on half the data. We recommend using at least 4 folds. If 4 models take 6.5 minutes, 150 models will be in the ballpark of 4 hours.
```r
start <- Sys.time()
d_clean <- downsampled_d %>%
  prep_data(outcome = arr_delay,
            collapse_rare_factors = 0.03, convert_dates = FALSE)
m_rf_1 <- flash_models(d_clean, outcome = arr_delay,
                       models = "rf", n_folds = 4)
Sys.time() - start
```
```r
m_rf_1
```
If you don't need the finer control of your data preparation, `machine_learn` with `tune = FALSE` is equivalent to `prep_data %>% flash_models`.
```r
m <- machine_learn(d, outcome = arr_delay, models = "rf", tune = FALSE, n_folds = 4)
```
As different models take different amounts of time, it's good to know what performance looks like with each before choosing one to go forward with. Hyperparameter tuning with GLM is very efficient. However, in some situations it won't be as accurate as other models, and it can be slower on very wide datasets (hundreds of columns).
On our `flights` data, GLM was faster, taking only a quarter of the time RF did. Its performance was not as high, though: RF had an AUPR of 0.72 while GLM's was 0.69. The performance increase could be worth your time, as random forest appears to fit the data better.
```r
start <- Sys.time()
d_clean <- downsampled_d %>%
  prep_data(outcome = arr_delay,
            collapse_rare_factors = 0.03, convert_dates = FALSE)
m_glm_1 <- flash_models(d_clean, outcome = arr_delay,
                        models = "glm", n_folds = 4)
Sys.time() - start
```
```r
m_glm_1
```
Now that you've settled on a model, you could use either `machine_learn` with just that model or `tune_models` with selected tuning options. This gives you a little more rigor than our svelte `flash_models` but won't take 4 hours.
```r
start <- Sys.time()
d_clean <- downsampled_d %>%
  prep_data(outcome = arr_delay,
            collapse_rare_factors = 0.03, convert_dates = FALSE)
m_glm_2 <- tune_models(d = d_clean, outcome = arr_delay,
                       models = "glm", tune_depth = 5)
Sys.time() - start
```
```r
m_glm_2
```
We got different performance than when we used `flash_models`. Why? Hyperparameters are chosen randomly, so we could have gotten lucky before. Using 5-fold cross validation is more rigorous as well: it will uncover any differences there might be in subsets of the data and more closely mimic the production environment.
Before simply getting more memory on your computer, try the above on your data. Hopefully, you end up with a better model in less time!