butcher

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = requireNamespace("parsnip", quietly = TRUE)
)

library(butcher)
library(parsnip)

One of the benefits of working in R is the ease with which you can fit complex models and build challenging data analysis pipelines. Take, for example, the parsnip package: once a few associated libraries are installed, a few lines of code are enough to fit something as sophisticated as a boosted tree:

fitted_model <- boost_tree(mode = "regression") %>%
  fit(mpg ~ ., data = mtcars)

Yet, while this code is compact, the underlying fitted result may not be. Since parsnip works as a wrapper for many modeling packages, its fitted model objects inherit the same properties as those that arise from the original modeling package. A straightforward example is the lm() function from the base stats package. Whether you leverage parsnip or not, you get the same result:

parsnip_lm <- linear_reg() %>% 
  fit(mpg ~ ., data = mtcars) 
parsnip_lm

Using just lm():

old_lm <- lm(mpg ~ ., data = mtcars) 
old_lm

Let's say we take this familiar old_lm approach in building a custom in-house modeling pipeline. Such a pipeline might entail wrapping lm() in another function, but in doing so, we may end up carrying around some unnecessary junk.

in_house_model <- function() {
  some_junk_in_the_environment <- runif(1e6) # we didn't know about
  lm(mpg ~ ., data = mtcars) 
}

The linear model fit produced by our custom modeling pipeline then has a size of:

library(lobstr)
obj_size(in_house_model())

But it is functionally the same as our old_lm, which only takes up:

obj_size(old_lm)
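The same gap can be checked on disk with base R alone. The following is a minimal sketch, assuming the in_house_model() definition from above; the helper rds_size_mb() is a hypothetical name introduced only for illustration, and the exact sizes depend on compression and on your R version.

```r
# Stand-in for the in-house pipeline described above.
in_house_model <- function() {
  some_junk_in_the_environment <- runif(1e6) # roughly 8 MB of doubles
  lm(mpg ~ ., data = mtcars)
}

# Hypothetical helper: write a model to a temporary .rds file and
# report the file size in MB. The temporary file is removed on exit.
rds_size_mb <- function(model) {
  path <- tempfile(fileext = ".rds")
  on.exit(unlink(path))
  saveRDS(model, path)
  file.size(path) / 1e6
}

big_mb <- rds_size_mb(in_house_model())
small_mb <- rds_size_mb(lm(mpg ~ ., data = mtcars))

# Compare the two file sizes side by side.
c(in_house = big_mb, plain_lm = small_mb)
```

The wrapped fit should come out far larger than the plain one, even though both contain the same coefficients.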

Ideally, we want to avoid saving the output of in_house_model() to disk when something like old_lm would take up far less space. But what the heck is going on here? We can examine possible issues with a fitted model object using the butcher package:

big_lm <- in_house_model()
weigh(big_lm, threshold = 0, units = "MB")
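A rough base-R analogue of this per-component view can be sketched with serialize(). This is not how weigh() is implemented; it is only an approximation that assumes the serialized length of each component is a reasonable proxy for its footprint. Serialization follows environment references, so anything captured in an attached environment is counted.

```r
# Stand-in for the in-house pipeline described above.
in_house_model <- function() {
  some_junk_in_the_environment <- runif(1e6) # roughly 8 MB of doubles
  lm(mpg ~ ., data = mtcars)
}

big_lm <- in_house_model()

# Serialized size of each top-level component of the fit, in MB.
component_mb <- vapply(
  big_lm,
  function(component) length(serialize(component, NULL)) / 1e6,
  numeric(1)
)

# Largest components first.
sort(component_mb, decreasing = TRUE)
```

Components that hold a reference to the function's environment should dominate the listing, while numeric pieces such as the coefficients stay tiny.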

The problem here is in the terms component of big_lm. Because of how lm() is implemented in the base stats package (relying on intermediate forms of the data from model.frame() and model.matrix()), the environment in which the linear fit was created is carried along in the model output.

We can see this with the env_print() function from the rlang package:

library(rlang)
env_print(big_lm$terms)
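The captured environment can also be inspected without rlang. The sketch below uses only base R and assumes the in_house_model() definition from above; environment() on a formula-like object returns the environment stored with it.

```r
# Stand-in for the in-house pipeline described above.
in_house_model <- function() {
  some_junk_in_the_environment <- runif(1e6)
  lm(mpg ~ ., data = mtcars)
}

big_lm <- in_house_model()

# The terms object is a formula, so it carries an environment with it.
captured_env <- environment(big_lm$terms)

# Names bound in that environment: the junk vector is still reachable
# through the fitted model, which is why it cannot be garbage collected.
ls(captured_env)

# When lm() is called at top level (as for old_lm), the captured
# environment is the global environment rather than a function's
# private one.
old_lm <- lm(mpg ~ ., data = mtcars)
identical(environment(old_lm$terms), globalenv())
```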

To avoid carrying possible junk around in our production pipeline, whether it is attached to an lm() model or to something more complex, we can leverage axe_env() from the butcher package:

cleaned_lm <- axe_env(big_lm, verbose = TRUE)

Comparing it against our old_lm, we find:

weigh(cleaned_lm, threshold = 0, units = "MB")

The cleaned model now takes up about the same amount of memory as old_lm:

weigh(old_lm, threshold = 0, units = "MB")
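For intuition, the effect of dropping the captured environment can be approximated by hand in base R. This is a sketch under stated assumptions, not butcher's implementation: strip_terms_env() and serialized_mb() are hypothetical helpers, and the base environment is used as the replacement here so that base functions remain visible when predict() rebuilds the model frame.

```r
# Stand-in for the in-house pipeline described above.
in_house_model <- function() {
  some_junk_in_the_environment <- runif(1e6)
  lm(mpg ~ ., data = mtcars)
}

big_lm <- in_house_model()

# Hypothetical helper: point both copies of the terms object (the one
# in fit$terms and the one attached to fit$model) at the base
# environment, so the function's local variables are no longer reachable.
strip_terms_env <- function(fit) {
  attr(fit$terms, ".Environment") <- baseenv()
  model_terms <- attr(fit$model, "terms")
  attr(model_terms, ".Environment") <- baseenv()
  attr(fit$model, "terms") <- model_terms
  fit
}

# Hypothetical helper: serialized size of an object in MB.
serialized_mb <- function(x) length(serialize(x, NULL)) / 1e6

small_lm <- strip_terms_env(big_lm)

# Size before and after stripping the environment.
c(before = serialized_mb(big_lm), after = serialized_mb(small_lm))

# Predictions on new data should be unaffected by the change.
all.equal(predict(big_lm, mtcars), predict(small_lm, mtcars))
```

In practice axe_env() is the safer choice, since it knows where each supported model class keeps its environments.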

Axing the environment, however, is not the only functionality of butcher. The package provides five S3 generics:

axe_call(): Remove the call object.
axe_ctrl(): Remove the controls fixed for training.
axe_data(): Remove the original data.
axe_env(): Replace inherited environments with empty environments.
axe_fitted(): Remove fitted values.

In our case here with lm(), if we are only interested in prediction as the end product of our modeling pipeline, we could free up a lot of memory if we execute all the possible axe functions at once. To do so, we simply run butcher():

butchered_lm <- butcher(big_lm)
predict(butchered_lm, mtcars[, 2:11])

Alternatively, we can pick and choose specific axe functions, removing only those parts of the model object that we are no longer interested in characterizing.

butchered_lm <- big_lm %>%
  axe_env() %>% 
  axe_fitted()
predict(butchered_lm, mtcars[, 2:11])

The butcher package provides tooling to axe parts of the fitted output that are no longer needed, without sacrificing much functionality from the original model object.