knitr::opts_chunk$set(
  collapse = TRUE,
  warning = FALSE,
  comment = "#>",
  fig.align = "center",
  fig.width = 10,
  fig.height = 7,
  out.width = "80%",
  out.height = "80%"
)
options(
  warnPartialMatchArgs = FALSE,
  drake_clean_menu = FALSE,
  drake_make_menu = FALSE,
  htmltools.dir.version = FALSE
)
packages <- c(
  "targets",
  "keras",
  "recipes",
  "rsample",
  "tidyverse",
  "yardstick"
)
purrr::walk(
  packages,
  function(pkg) {
    suppressMessages(suppressWarnings(library(pkg, character.only = TRUE)))
  }
)
Sys.setenv(TAR_SCRIPT_ASK = "false")
tar_destroy()
???
Thank you all for coming, and thank you to R/Pharma for the opportunity to speak today.
I come from the life sciences, and we develop ambitious computational workflows for Statistics and data science. There's a lot of Bayesian analysis, machine learning, simulation, and prediction. Other domains have similar workloads, and we need to think about both efficiency and reproducibility right from the start.
https://openclipart.org/detail/275842/sisyphus-overcoming-silhouette
???
Many of these projects require long runtimes. Methods like Markov chain Monte Carlo and deep neural nets are computationally expensive. It could take hours or even days just to fit a single model. That's fine if you're only going to run the project once, or at regularly scheduled times. But if the code is still under development, it's easy to get trapped in a vicious Sisyphean cycle.
???
A large workflow has a large number of moving parts. We have datasets that we preprocess or simulate, analyses of those datasets, and summaries of the analyses.
???
If you change any one of these parts - whether it's a bugfix, a tweak to a model, or some new data -
???
Then everything that depends on it is no longer valid, and you need to rerun the computation to bring the results back up to date. This is seriously frustrating when you're in development and you're still making a constant stream of changes to code and data in real time. If every change means you need to rerun the project, there's no way the results can keep up...
???
...unless you use a pipeline tool. There are pipeline tools for production which resemble Apache Airflow, and there are pipeline tools for development which resemble GNU Make. Today, I'm going to focus on Make-like tools because those are the ones I think are designed for this part of the process. It's an action-packed space, and there are a lot of great options. But unfortunately, there's not a whole lot for R.
targets

???
That's where targets comes in. targets is a Make-like pipeline tool that is fundamentally designed for R. You can call it from an R session, it supports a clean, idiomatic, function-oriented style of programming, and it helps you store and retrieve your results. Most importantly, it gets you out of the Sisyphean loop of long computation, enhances reproducibility, and takes the frustration out of data science.
What about drake?

drake is still an excellent choice for pipeline management, but it has permanent user-side limitations. targets was created to overcome these limitations and create a smoother user experience.

???
But what about drake? drake already does all these things, and it's still an excellent choice for Make-like pipeline management. But it does have permanent user-side limitations. We've been developing, improving, expanding, and refining drake for several years, and we've reached the point where the most important problems to tackle are exactly the problems we cannot solve in this tool. It's just too big and too set in its ways, and its architecture was originally designed around assumptions that no longer hold up. To overcome these permanent limitations, we need a new tool that borrows from drake's journey and advances the user experience beyond what drake is capable of, and that new tool is targets. targets has stronger guardrails, lighter data management, greater transparency around data and the process of watching for changes, more flexible dynamic branching, better parallel efficiency, and a design that lets us build on top of it more easily. The targets package website has a statement of need describing the changes in more detail.
???
Let's look at an example. We're going to set up a machine learning project to predict customer dropout, or churn, based on the publicly available IBM Watson Telco customer churn dataset. We're going to train a bunch of deep neural nets on the data and pick the one with the best testing accuracy.
background-image: url(./images/not.png)
run_everything.R
R/
├── 01-data.R
├── 02-munge.R
├── 03-model.R
├── 04-results.R
└── 05-plot.R
data/
└── customer_churn.csv
???
Before we get started, let's talk about the implementation strategy. We're going to move away from numbered scripts and R Markdown as a way to manage the computation end to end. It's an okay strategy for small projects, but it falls apart quickly as a project grows.
- Everything that exists is an object.
- Everything that happens is a function call.
John Chambers
add_things <- function(argument1, argument2) {
  argument1 + argument2
}

add_things(1, 2)

add_things(c(3, 4), c(5, 6))
???
Functions scale much better for big stuff. A function is just a reusable set of instructions with multiple inputs and a single return value. Usually those inputs are explicitly defined and easy to create, and usually the function has an informative name. Functions are a fundamental built-in feature of almost every programming language we have, and they are particularly well suited to R, which was designed with formal functional programming principles in mind.
The most obvious use for functions is as a way to avoid copies of repeated code scattered throughout your project. So instead of copying and pasting the same code block everywhere, you just call the function. But functions are not just for code you want to reuse, they're for code you want to understand. Functions are custom shorthand. They make your work easier to read, understand, break down into manageable pieces, document, test, and validate for serious research.
split_data <- function(churn_file) {
  read_csv(churn_file, col_types = cols()) %>%
    initial_split(prop = 0.7)
}

prepare_recipe <- function(churn_data) {
  churn_data %>%
    training() %>%
    recipe(Churn ~ .) %>%
    step_rm(customerID) %>%
    step_naomit(all_outcomes(), all_predictors()) %>%
    step_discretize(tenure, options = list(cuts = 6)) %>%
    step_log(TotalCharges) %>%
    step_mutate(Churn = ifelse(Churn == "Yes", 1, 0)) %>%
    step_dummy(all_nominal(), -all_outcomes()) %>%
    step_center(all_predictors(), -all_outcomes()) %>%
    step_scale(all_predictors(), -all_outcomes()) %>%
    prep()
}
???
Most of your functions revolve around 3 kinds of tasks: preparing datasets, analyzing datasets, and summarizing analyses. These two functions are all about the data. The first one accepts a file path, reads in the data, and splits it into training and test sets. The prepare_recipe() function accepts the return value from split_data() and converts the data into a format that our Keras models can accept in subsequent steps.
define_model <- function(churn_recipe, units1, units2, act1, act2, act3) {
  # ...
}

train_model <- function(churn_recipe, units1, units2, act1, act2, act3) {
  # ...
}

test_accuracy <- function(churn_data, churn_recipe, churn_model) {
  # ...
}

test_model <- function(churn_data, churn_recipe, units1, units2, act1, act2, act3) {
  # ...
}

retrain_run <- function(churn_run, churn_recipe) {
  # ...
}
???
Similarly, you write functions to define, train, and test the Keras models. A rough sketch of how these helpers might fit together is below.
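As a rough sketch (not the exact workshop code), test_model() might simply chain the other helpers together: train a model with one set of hyperparameters, measure test accuracy, and return a one-row summary. The column names and default hyperparameter values below are assumptions.

# Hypothetical sketch of test_model(): train, score, and summarize one run.
test_model <- function(churn_data, churn_recipe, units1 = 16, units2 = 16,
                       act1 = "relu", act2 = "relu", act3 = "sigmoid") {
  churn_model <- train_model(churn_recipe, units1, units2, act1, act2, act3)
  tibble(
    accuracy = test_accuracy(churn_data, churn_recipe, churn_model),
    units1 = units1,
    units2 = units2,
    act1 = act1,
    act2 = act2,
    act3 = act3
  )
}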
_targets.R  # Required top-level configuration file.
R/
└── functions.R  #<<
data/
└── customer_churn.csv
???
You can organize your functions however you want, but it's common practice to put them in scripts inside an R/ folder. Similarly, input datasets can go anywhere, but a data/ folder just helps keep things clean.
And at this point, you have a clean, function-oriented project. Even if you decide not to use targets, this function-oriented style still has a lot of value. But if you are thinking about using targets, converting to functions is most of the work. All that remains is to outline the specific steps of the computation in a formal pipeline, and that's where the _targets.R script comes into play. It always lives at the root of the project, and it's always called "_targets.R".
# _targets.R
library(targets)
source("R/functions.R")
tar_option_set(packages = c("keras", "tidyverse", "rsample", "recipes", "yardstick"))
list(
  tar_target(churn_file, "data/customer_churn.csv", format = "file"),
  tar_target(churn_data, split_data(churn_file)),
  tar_target(churn_recipe, prepare_recipe(churn_data))
)
tar_script({
  library(targets)
  source("R/functions.R")
  tar_option_set(packages = c("keras", "tidyverse", "rsample", "recipes", "yardstick"))
  list(
    tar_target(churn_file, "data/customer_churn.csv", format = "file"),
    tar_target(churn_data, split_data(churn_file)),
    tar_target(churn_recipe, prepare_recipe(churn_data))
  )
})
???
The purpose of _targets.R is to set up the project at a high level. It loads the packages required to define the pipeline, it loads your custom functions and global objects, it sets high-level options such as the packages the targets are going to need, and it defines the pipeline at the very end.
At the bottom of _targets.R, you list out objects called targets. Each target is an individual step in the workflow. It has an informative name like "churn_data" or "churn_recipe", and it has an R command that invokes your custom functions and returns a value.
tar_manifest(fields = c("name", "command"))
???
There are several utility functions that inspect the pipeline for correctness. The tar_manifest() function shows you all the names of the targets and the R commands associated with them.
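Another quick check worth running (assuming the installed version of targets exports it, as current versions do) is tar_validate(), which loads _targets.R in a fresh process and reports problems with the pipeline definition without building anything.

# Validate the pipeline definition without running any targets.
tar_validate()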
tar_visnetwork()
???
It's always good practice to visualize the dependency graph of the plan. targets has functions to do this for you, and it really demystifies how the package works. So here you see the flow of the project from left to right. We reproducibly track an input data file, we load that data to split into training and test, and we prepare that data for the models using a Tidymodels recipe.
But how does targets deduce this flow? How does it know that churn_recipe depends on churn_data? The order you write targets in the pipeline does not matter. targets knows that churn_recipe depends on churn_data because the symbol "churn_data" is mentioned in the command for "churn_recipe" in the pipeline. targets scans your commands and functions without actually running them in order to detect changes and understand dependency relationships. This is called static code analysis.
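Here is a conceptual sketch of what static code analysis means. targets has its own implementation, but codetools::findGlobals() from the codetools package (shipped with base R) illustrates the idea of listing the global symbols an expression mentions without running it.

# List the global symbols mentioned by the command for churn_recipe.
# The result includes "prepare_recipe" and "churn_data", which is how
# the pipeline learns that churn_recipe depends on both of them.
codetools::findGlobals(function() prepare_recipe(churn_data))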
tar_make()
tar_make(callr_function = NULL)
???
To actually run the workflow, we use a function called tar_make(). tar_make() creates a clean new reproducible R process, runs _targets.R to populate the new session and define the pipeline, resolves the dependency graph, runs the correct targets in the correct order from the dependency graph, and writes the return values to storage.
tar_load() and tar_read() get targets from the _targets/ data store.

ncol(training(tar_read(churn_data)))
tar_load(churn_recipe)
ncol(juice(churn_recipe))
???
Afterwards, all the targets are in storage. There's a special key-value store in a hidden _targets/ folder, and targets has functions tar_load() and tar_read() to retrieve data from the store. targets abstracts artifacts as ordinary objects. You don't need to worry about where these files are located, you just need to know the target names. This is the exploratory analysis phase. Always inspect your targets for issues between calls to tar_make().
.large[
1. Add a couple of targets.
2. Run the pipeline with tar_make().
3. Inspect the new targets with tar_load() and tar_read().
4. Repeat often. Not very time-consuming because tar_make() skips up-to-date targets.
]
???
This is part of an iterative process for building up a pipeline. We add a couple of targets, run the pipeline, inspect the new targets, and repeat. We do this early and often. And as you will see, because targets skips up-to-date work, we can iterate frequently while still saving time.
# _targets.R
library(targets)
source("R/functions.R")
tar_option_set(packages = c("keras", "tidyverse", "rsample", "recipes", "yardstick"))
list(
  tar_target(churn_file, "data/customer_churn.csv", format = "file"),
  tar_target(churn_data, split_data(churn_file)),
  tar_target(churn_recipe, prepare_recipe(churn_data)),
  tar_target(run_relu, test_model(act1 = "relu", churn_data, churn_recipe)), #<<
  tar_target(run_sigmoid, test_model(act1 = "sigmoid", churn_data, churn_recipe)) #<<
)
tar_script({
  library(targets)
  source("R/functions.R")
  tar_option_set(packages = c("keras", "tidyverse", "rsample", "recipes", "yardstick"))
  list(
    tar_target(churn_file, "data/customer_churn.csv", format = "file"),
    tar_target(churn_data, split_data(churn_file)),
    tar_target(churn_recipe, prepare_recipe(churn_data)),
    tar_target(run_relu, test_model(act1 = "relu", churn_data, churn_recipe)),
    tar_target(run_sigmoid, test_model(act1 = "sigmoid", churn_data, churn_recipe))
  )
})
???
With the data in place, we are now ready to add some models. We start with two models fit to the same data using different hyperparameters.
tar_outdated()
tar_outdated(reporter = "silent", callr_function = NULL)
???
When we inspect the pipeline, we see that some work needs to be done. But all our data processing steps are already up to date.
tar_visnetwork()
???
We can see this even more clearly in the dependency graph.
tar_make()
tar_make(callr_function = NULL)
???
So when we run the pipeline again, only the models run. The tool skips the data targets because they are already up to date.
tar_read(run_relu)

tar_read(run_sigmoid)
???
And as usual, we read our newest targets and verify there are no obvious issues.
# _targets.R
library(targets)
source("R/functions.R")
tar_option_set(packages = c("keras", "tidyverse", "rsample", "recipes", "yardstick"))
list(
  ...,
  tar_target(run_relu, test_model(act1 = "relu", churn_data, churn_recipe)),
  tar_target(run_sigmoid, test_model(act1 = "sigmoid", churn_data, churn_recipe)),
  tar_target( #<<
    best_run, #<<
    bind_rows(run_relu, run_sigmoid) %>% #<<
      top_n(1, accuracy) %>% #<<
      head(1) #<<
  ), #<<
  tar_target( #<<
    best_model, #<<
    retrain_run(best_run, churn_recipe), #<<
    format = "keras" #<<
  ) #<<
)
tar_script({
  library(targets)
  source("R/functions.R")
  tar_option_set(packages = c("keras", "tidyverse", "rsample", "recipes", "yardstick"))
  list(
    tar_target(churn_file, "data/customer_churn.csv", format = "file"),
    tar_target(churn_data, split_data(churn_file)),
    tar_target(churn_recipe, prepare_recipe(churn_data)),
    tar_target(run_relu, test_model(act1 = "relu", churn_data, churn_recipe)),
    tar_target(run_sigmoid, test_model(act1 = "sigmoid", churn_data, churn_recipe)),
    tar_target(
      best_run,
      bind_rows(run_relu, run_sigmoid) %>%
        top_n(1, accuracy) %>%
        head(1)
    ),
    tar_target(
      best_model,
      retrain_run(best_run, churn_recipe),
      format = "keras"
    )
  )
})
???
And we repeat the process to gradually build up the pipeline. Here, we add new targets to find the model run with the highest accuracy so far and retrain that model to return a fitted model object. For fitted Keras models, we need to write 'format = "keras"' in tar_target() because Keras models cannot be saved with R's ordinary serialization functionality.
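As a rough illustration of why (not part of the pipeline): a fitted Keras model is essentially a handle to Python objects, so base R serialization with saveRDS()/readRDS() would only capture an external pointer that is dead in a fresh session. A Keras-aware storage format relies on Keras's own save/load tools instead; the object and file names below are placeholders.

# Hypothetical sketch, assuming `churn_model` is a fitted Keras model.
# This is the kind of mechanism a Keras-aware format uses under the hood.
keras::save_model_hdf5(churn_model, "churn_model.h5")    # Keras-aware save
churn_model <- keras::load_model_hdf5("churn_model.h5")  # Keras-aware load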
tar_make()
tar_make(callr_function = NULL)
???
As before, only the new targets run because we didn't change any other code or data,
tar_read(best_model)
???
and now one of our targets is actually a trained Keras model.
# _targets.R
library(targets)
source("R/functions.R")
tar_option_set(packages = c("keras", "tidyverse", "rsample", "recipes", "yardstick"))
list(
  ...,
  tar_target(run_relu, test_model(act1 = "relu", churn_data, churn_recipe)),
  tar_target(run_sigmoid, test_model(act1 = "sigmoid", churn_data, churn_recipe)),
  tar_target(run_softmax, test_model(act1 = "softmax", churn_data, churn_recipe)), #<<
  tar_target(
    best_run,
    bind_rows(run_relu, run_sigmoid, run_softmax) %>% #<<
      top_n(1, accuracy) %>%
      head(1)
  ),
  ...
)
tar_script({
  library(targets)
  source("R/functions.R")
  tar_option_set(packages = c("keras", "tidyverse", "rsample", "recipes", "yardstick"))
  list(
    tar_target(churn_file, "data/customer_churn.csv", format = "file"),
    tar_target(churn_data, split_data(churn_file)),
    tar_target(churn_recipe, prepare_recipe(churn_data)),
    tar_target(run_relu, test_model(act1 = "relu", churn_data, churn_recipe)),
    tar_target(run_sigmoid, test_model(act1 = "sigmoid", churn_data, churn_recipe)),
    tar_target(run_softmax, test_model(act1 = "softmax", churn_data, churn_recipe)),
    tar_target(
      best_run,
      bind_rows(run_relu, run_sigmoid, run_softmax) %>%
        top_n(1, accuracy) %>%
        head(1)
    ),
    tar_target(
      best_model,
      retrain_run(best_run, churn_recipe),
      format = "keras"
    )
  )
})
???
If we try another model, we just need to add another target and reference that new target in any downstream targets that need it.
tar_outdated()
tar_outdated(reporter = "silent", callr_function = NULL)
???
Now, not only are the new model and the best_run target outdated, the tool automatically calls everything downstream into question, so best_model is suspect as well.
tar_visnetwork()
???
Again, we see this clearly in the dependency graph. run_softmax is new, best_run is outdated because it uses run_softmax, and best_model is suspect because it depends on best_run.
If the value of best_run stays the same, targets does not bother to retrain the best model.

tar_make()
#> ✓ skip target churn_file
#> ✓ skip target churn_data
#> ✓ skip target churn_recipe
#> ✓ skip target run_relu
#> ✓ skip target run_sigmoid
#> ● run target run_softmax
#> ● run target best_run
#> ✓ skip target best_model
tar_make(callr_function = NULL, reporter = "silent")
???
When we run the pipeline again, of course run_softmax and best_run both execute. But if best_run doesn't change, we skip best_model. targets looks at the actual fingerprint of the data to make decisions, unlike GNU Make, which just uses timestamps.
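If you want to see those fingerprints yourself, the metadata store records a hash for every target. A hedged example, assuming the default local data store and that your version of targets exposes the hash in a field named "data":

# Inspect the stored hash of best_run; if this hash is unchanged after
# tar_make(), downstream targets such as best_model are skipped.
tar_meta(names = best_run, fields = data)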
define_model <- function(churn_recipe, units1, units2, act1, act2, act3) {
  input_shape <- ncol(
    juice(churn_recipe, all_predictors(), composition = "matrix")
  )
  keras_model_sequential() %>%
    layer_dense(
      units = units1,
      kernel_initializer = "uniform",
      activation = act1,
      input_shape = input_shape
    ) %>%
    layer_dropout(rate = 0.2) %>% # previously 0.1 #<<
    layer_dense(
      units = units2,
      kernel_initializer = "uniform",
      activation = act2
    ) %>%
    layer_dropout(rate = 0.1) %>%
    ...
???
If we change a function in a nontrivial way, targets notices. Let's increase the dropout rate in one of the layers in the neural net.
tar_visnetwork()
???
The define_model() function is no longer up to date. That means neither is train_model() because it calls define_model(), and likewise with downstream functions. That means the targets downstream, which include all our model runs, are also invalidated.
tar_make()
#> ✓ skip target churn_file
#> ✓ skip target churn_data
#> ✓ skip target churn_recipe
#> ● run target run_relu
#> ● run target run_sigmoid
#> ● run target run_softmax
#> ● run target best_run
#> ● run target best_model
???
But we can bring everything back up to date without rerunning the data processing.
tar_make()
#> ● run target churn_file
#> ● run target churn_data
#> ● run target churn_recipe
#> ● run target run_relu
#> ● run target run_sigmoid
#> ● run target run_softmax
#> ● run target best_run
#> ● run target best_model
???
But the data processing does rerun if we change our data file. This is because in our upstream target churn_file, we selected 'format = "file"' in the tar_target() function. That tells targets to treat the return value of the command as a vector of file and directory paths, and it watches that data for changes.
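A quick way to see what that means in practice: the stored value of a format = "file" target is just the character vector of paths, and targets hashes the files at those paths to watch for changes.

tar_read(churn_file)
#> [1] "data/customer_churn.csv"
# After any edit to that file, its hash changes, so tar_outdated() would
# list churn_file and everything downstream of it.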
tar_make()
#> ✓ skip target churn_file
#> ✓ skip target churn_data
#> ✓ skip target churn_recipe
#> ✓ skip target run_relu
#> ✓ skip target run_sigmoid
#> ✓ skip target run_softmax
#> ✓ skip target best_run
#> ✓ skip target best_model
#> ✓ Already up to date.
???
At the end of the day, targets can tell you if all your targets are up to date. This is tangible evidence that your output matches the code and data it's supposed to come from. It's evidence that someone else running the same code would get the same results. That's reproducibility. It's certainly not the only form of reproducibility, but it does increase the trust we can place in the conclusions of the project.
tar_outdated()
#> character(0)

tar_visnetwork()
???
And again, utility functions can confirm the status of targets without needing to run tar_make().
Install targets:

install.packages("remotes")
remotes::install_github("ropensci/targets")
???
There are several resources to learn about targets. There's a reference website, an online user manual, and a repository with the example code from today.
targets is nearing the end of its beta phase. I am about to submit it to rOpenSci for peer review, and I hope to have it on CRAN at the end of this year or early next year. Now is a great time for feedback because there is no formal release yet and the interface is not yet set in stone.
Topic | Notebook
---|---
Functions | 1-functions.Rmd
Pipelines | 2-pipelines.Rmd
Changes | 3-changes.Rmd
Debugging | 4-debugging.Rmd
Files | 5-files.Rmd
Branching | 6-branching.Rmd
Challenge | 7-challenge.Rmd
tar_destroy()
unlink("_targets.R")
???
The workshop is publicly available and deployed to RStudio Cloud. Just sign up for a cloud account and log into a free instance of RStudio Server. We will spend our remaining time working through interactive exercises in the R notebooks here, with some breaks for Q&A and live coding demos to go over the solutions. If you missed part of the exercises or just want to go back and study in your own time, the workshop will still be available. I included the link to both the source and the cloud project here.