knitr::opts_chunk$set(
  collapse = TRUE,
  warning = FALSE,
  comment = "#>",
  fig.align = "center",
  fig.width = 10,
  fig.height = 7,
  out.width = "80%",
  out.height = "80%"
)
options(
  warnPartialMatchArgs = FALSE,
  drake_clean_menu = FALSE,
  drake_make_menu = FALSE,
  htmltools.dir.version = FALSE
)
packages <- c(
  "targets",
  "keras",
  "recipes",
  "rsample",
  "tidyverse",
  "yardstick"
)
purrr::walk(
  packages,
  function(pkg) {
    suppressMessages(suppressWarnings(library(pkg, character.only = TRUE)))
  }
)
Sys.setenv(TAR_SCRIPT_ASK = "false")
tar_destroy()

Large statistical computation

???

Thank you all for coming, and thank you to R/Pharma for the opportunity to speak today.

I come from the life sciences, and we develop ambitious computational workflows for statistics and data science. There's a lot of Bayesian analysis, machine learning, simulation, and prediction. Other domains have similar workloads, and we need to think about both efficiency and reproducibility right from the start.


Common features

  1. Heavy use of the R language.
  2. Long runtimes.
  3. Multiple sub-tasks.
  4. Frequent changes to code and data.


https://openclipart.org/detail/275842/sisyphus-overcoming-silhouette

???

Many of these projects require long runtimes. Methods like Markov chain Monte Carlo and deep neural nets are computationally expensive. It could take hours or even days just to fit a single model. That's fine if you're only going to run the project once, or at regularly scheduled times. But if the code is still under development, it's easy to get trapped in a vicious Sisyphean cycle.


Interconnected tasks

???

A large workflow has a large number of moving parts. We have datasets that we preprocess or simulate, analyses of those datasets, and summaries of the analyses.


Changes

???

If you change any one of these parts - whether it's a bugfix, a tweak to a model, or some new data -


Consequences

???

Then everything that depends on it is no longer valid, and you need to rerun the computation to bring the results back up to date. This is seriously frustrating when you're in development and you're still making a constant stream of changes to code and data in real time. If every change means you need to rerun the project, there's no way the results can keep up...


Pipeline tools and workflow managers

???

...unless you use a pipeline tool. There are pipeline tools for production which resemble Apache Airflow, and there are pipeline tools for development which resemble GNU Make. Today, I'm going to focus on Make-like tools because those are the ones I think are designed for this part of the process. It's an action-packed space, and there are a lot of great options. But unfortunately, there's not a whole lot for R.


What distinguishes targets?

???

That's where targets comes in. targets is a Make-like pipeline tool that is fundamentally designed for R. You can call it from an R session, it supports a clean, idiomatic, function-oriented style of programming, and it helps you store and retrieve your results. Most importantly, it gets you out of the Sisyphean loop of long computation, enhances reproducibility, and takes the frustration out of data science.


What about drake?

???

But what about drake? drake already does all these things, and it's still an excellent choice for Make-like pipeline management. But it does have permanent user-side limitations. We've been developing, improving, expanding, and refining drake for several years, and we've reached the point where the most important problems to tackle are exactly the problems we cannot solve in this tool. It's just too big and too set in its ways, and its architecture was originally designed around assumptions that no longer hold up. To overcome these permanent limitations, we need a new tool that borrows from drake's journey and advances the user experience beyond what drake is capable of, and that new tool is targets. targets has stronger guardrails, lighter data management, greater transparency around data and the process of watching for changes, more flexible dynamic branching, better parallel efficiency, and a design that lets us build on top of it more easily. The targets package website has a statement of need describing the changes in more detail.


Example targets workflow

???

Let's look at an example. We're going to set up a machine learning project to predict customer dropout, or churn, based on the publicly available IBM Watson Telco customer churn dataset. We're going to train a bunch of deep neural nets on the data and pick the one with the best testing accuracy.


background-image: url(./images/not.png)

Move away from numbered imperative scripts.

run_everything.R
R/
├── 01-data.R
├── 02-munge.R
├── 03-model.R
├── 04-results.R
└── 05-plot.R
data/
└── customer_churn.csv

???

Before we get started, let's talk about the implementation strategy. We're going to move away from numbered scripts and R Markdown as a way to manage the computation end to end. It's an okay strategy for small projects, but it falls apart quickly as a project grows.


Embrace functions.

  • Everything that exists is an object.
  • Everything that happens is a function call.

John Chambers

add_things <- function(argument1, argument2) {
  argument1 + argument2
}

add_things(1, 2)

add_things(c(3, 4), c(5, 6))

???

Functions scale much better for big stuff. A function is just a reusable set of instructions with multiple inputs and a single return value. Usually those inputs are explicitly defined and easy to create, and usually the function has an informative name. Functions are a fundamental built-in feature of almost every programming language we have, and they are particularly well suited to R, which was designed with formal functional programming principles in mind.

The most obvious use for functions is as a way to avoid copies of repeated code scattered throughout your project. So instead of copying and pasting the same code block everywhere, you just call the function. But functions are not just for code you want to reuse, they're for code you want to understand. Functions are custom shorthand. They make your work easier to read, understand, break down into manageable pieces, document, test, and validate for serious research.


Functions for customer churn

split_data <- function(churn_file) {
  read_csv(churn_file, col_types = cols()) %>%
    initial_split(prop = 0.7)
}

prepare_recipe <- function(churn_data) {
  churn_data %>%
    training() %>%
    recipe(Churn ~ .) %>%
    step_rm(customerID) %>%
    step_naomit(all_outcomes(), all_predictors()) %>%
    step_discretize(tenure, options = list(cuts = 6)) %>%
    step_log(TotalCharges) %>%
    step_mutate(Churn = ifelse(Churn == "Yes", 1, 0)) %>%
    step_dummy(all_nominal(), -all_outcomes()) %>%
    step_center(all_predictors(), -all_outcomes()) %>%
    step_scale(all_predictors(), -all_outcomes()) %>%
    prep()
}

???

Most of your functions revolve around 3 kinds of tasks: preparing datasets, analyzing datasets, and summarizing analyses. These two functions are all about the data. The first one accepts a file path, reads in the data, and splits it into training and test sets. The prepare_recipe() function accepts the return value from split_data() and converts the data into a format that our Keras models can accept in subsequent steps.


Functions for customer churn

define_model <- function(churn_recipe, units1, units2, act1, act2, act3) {
  # ...
}

train_model <- function(churn_recipe, units1, units2, act1, act2, act3) {
  # ...
}

test_accuracy <- function(churn_data, churn_recipe, churn_model) {
  # ...
}

test_model <- function(churn_data, churn_recipe, units1, units2, act1, act2, act3) {
  # ...
}

retrain_run <- function(churn_run, churn_recipe) {
  # ...
}

???

Similarly, you define functions to define, train, and test the Keras models.
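
The bodies are elided on the slides, but to make the data flow concrete, here is a minimal sketch of what test_model() might look like. It assumes train_model() returns a fitted Keras model, test_accuracy() returns a single number, and the hyperparameter defaults shown here are hypothetical. The key point is that each model run returns a one-row tibble, so downstream targets can bind the runs together and rank them by accuracy.

test_model <- function(churn_data, churn_recipe, units1 = 16, units2 = 16,
                       act1 = "relu", act2 = "relu", act3 = "sigmoid") {
  # Hypothetical composition of the helpers above: fit the model, score it,
  # and return the hyperparameters alongside the test accuracy.
  churn_model <- train_model(churn_recipe, units1, units2, act1, act2, act3)
  tibble(
    accuracy = test_accuracy(churn_data, churn_recipe, churn_model),
    units1 = units1, units2 = units2,
    act1 = act1, act2 = act2, act3 = act3
  )
}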


Typical project structure

_targets.R # Required top-level configuration file.
R/
└── functions.R #<<
data/
└── customer_churn.csv

???

You can organize your functions however you want, but it's common practice to put them in scripts inside an R/ folder. Similarly, input datasets can go anywhere, but a data/ folder just helps keep things clean.

And at this point, you have a good, clean, function-oriented project. Even if you decide not to use targets, this function-oriented style still has a lot of value. However, if you are thinking about using targets, then converting to functions is almost all of the work. All you need now is to outline the specific steps of the computation in a formal pipeline, and that's where the _targets.R script comes into play. It always lives at the root of the project, and it's always called "_targets.R".


Build up your workflow as a list of targets.

# _targets.R
library(targets)
source("R/functions.R")
tar_option_set(packages = c("keras", "tidyverse", "rsample", "recipes", "yardstick"))
list(
  tar_target(churn_file, "data/customer_churn.csv", format = "file"),
  tar_target(churn_data, split_data(churn_file)),
  tar_target(churn_recipe, prepare_recipe(churn_data))
)
tar_script({
library(targets)
source("R/functions.R")
tar_option_set(packages = c("keras", "tidyverse", "rsample", "recipes", "yardstick"))
list(
  tar_target(churn_file, "data/customer_churn.csv", format = "file"),
  tar_target(churn_data, split_data(churn_file)),
  tar_target(churn_recipe, prepare_recipe(churn_data))
)
})

???

The purpose of _targets.R is to set up the project at a high level. It loads the packages required to define the pipeline, it loads your custom functions and global objects, it sets high-level options such as the packages the targets are going to need, and it defines the pipeline at the very end.

At the bottom of _targets.R, you list out objects called targets. Each target is an individual step in the workflow. It has an informative name like "churn_data" or "churn_recipe", and it has an R command that invokes your custom functions and returns a value.


The pipeline is a collection of skippable targets.

tar_manifest(fields = c("name", "command"))

???

There are several utility functions that inspect the pipeline for correctness. The tar_manifest() function shows you all the names of the targets and the R commands associated with them.
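
Another lightweight check is tar_validate(), which looks for common problems in the pipeline definition and stays quiet if everything is well formed.

tar_validate()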


targets understands code and data dependencies.

tar_visnetwork()

???

It's always good practice to visualize the dependency graph of the pipeline. targets has functions to do this for you, and it really demystifies how the package works. So here you see the flow of the project from left to right. We reproducibly track an input data file, we load that data to split into training and test sets, and we prepare that data for the models using a Tidymodels recipe.

But how does targets deduce this flow? How does it know that churn_recipe depends on churn_data? The order you write targets in the pipeline does not matter. targets knows that churn_recipe depends on churn_data because the symbol "churn_data" is mentioned in the command for "churn_recipe" in the pipeline. targets scans your commands and functions without actually running them in order to look for changes and understand dependency relationships. This is called static code analysis.
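
To see this static analysis in action, tar_deps() reports the symbols that targets detects in an expression without evaluating it. For example (output summarized as a comment rather than shown verbatim):

tar_deps(prepare_recipe(churn_data))
# a character vector containing "prepare_recipe" and "churn_data"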


Build your first targets.

tar_make()
tar_make(callr_function = NULL)

???

To actually run the workflow, we use a function called tar_make(). tar_make() creates a clean new reproducible R process, runs _targets.R to populate the new session and define the pipeline, resolves the dependency graph, runs the correct targets in the correct order from the dependency graph, and writes the return values to storage. (The callr_function = NULL variant runs the pipeline in the current R session instead of a fresh external process, which is convenient for debugging.)


Check the targets for problems

ncol(training(tar_read(churn_data)))

tar_load(churn_recipe)
ncol(juice(churn_recipe))

???

Afterwards, all the targets are in storage. There's a special key-value store in a hidden _targets/ folder, and targets has functions tar_load() and tar_read() to retrieve data from the store. targets abstracts artifacts as ordinary objects. You don't need to worry about where these files are located, you just need to know the target names. This is the exploratory analysis phase. Always inspect your targets for issues between calls to tar_make().
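
The difference between the two retrieval functions is small but worth spelling out: tar_read() returns the stored value so you can assign or pipe it yourself, while tar_load() assigns the object into your environment under the target's own name.

churn_data <- tar_read(churn_data) # explicit assignment
tar_load(churn_recipe)             # creates `churn_recipe` as a side effect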


Build up the pipeline gradually.

  1. Add a couple targets.
  2. Run the pipeline with tar_make().
  3. Inspect the new targets with tar_load() and tar_read().
  4. Repeat often. Not very time-consuming because tar_make() skips up-to-date targets.

???

This is part of an iterative process for building up a pipeline. We add a couple targets, run the pipeline, inspect the new targets, and repeat. We do this early and often. And as you will see, targets lets you iterate frequently while still saving time, because tar_make() skips targets that are already up to date.


Add some models.

# _targets.R
library(targets)
source("R/functions.R")
tar_option_set(packages = c("keras", "tidyverse", "rsample", "recipes", "yardstick"))
list(
  tar_target(churn_file, "data/customer_churn.csv", format = "file"),
  tar_target(churn_data, split_data(churn_file)),
  tar_target(churn_recipe, prepare_recipe(churn_data)),
  tar_target(run_relu, test_model(act1 = "relu", churn_data, churn_recipe)), #<<
  tar_target(run_sigmoid, test_model(act1 = "sigmoid", churn_data, churn_recipe)) #<<
)
tar_script({
library(targets)
source("R/functions.R")
tar_option_set(packages = c("keras", "tidyverse", "rsample", "recipes", "yardstick"))
list(
  tar_target(churn_file, "data/customer_churn.csv", format = "file"),
  tar_target(churn_data, split_data(churn_file)),
  tar_target(churn_recipe, prepare_recipe(churn_data)),
  tar_target(run_relu, test_model(act1 = "relu", churn_data, churn_recipe)),
  tar_target(run_sigmoid, test_model(act1 = "sigmoid", churn_data, churn_recipe))
)
})

???

With the data in place, we are now ready to add some models. We start with two models fit to the same data using different hyperparameters.


Previous work is still up to date.

tar_outdated()
tar_outdated(reporter = "silent", callr_function = NULL)

???

When we inspect the pipeline, we see that some work needs to be done. But all our data processing steps are already up to date.


Previous work is still up to date.

tar_visnetwork()

???

We can see this even more clearly in the dependency graph.


Up-to-date targets are skipped.

tar_make()
tar_make(callr_function = NULL)

???

So when we run the pipeline again, only the models run. The tool skips the data targets because they are already up to date.


Inspect the newest targets.

tar_read(run_relu)

tar_read(run_sigmoid)

???

And as usual, we read our newest targets and verify there are no obvious issues.


Find the best model

# _targets.R
library(targets)
source("R/functions.R")
tar_option_set(packages = c("keras", "tidyverse", "rsample", "recipes", "yardstick"))
list(
  ...,
  tar_target(run_relu, test_model(act1 = "relu", churn_data, churn_recipe)),
  tar_target(run_sigmoid, test_model(act1 = "sigmoid", churn_data, churn_recipe)),
  tar_target( #<<
    best_run, #<<
    bind_rows(run_relu, run_sigmoid) %>% #<<
      top_n(1, accuracy) %>% #<<
      head(1) #<<
  ), #<<
  tar_target( #<<
    best_model, #<<
    retrain_run(best_run, churn_recipe), #<<
    format = "keras" #<<
  ) #<<
)
tar_script({
library(targets)
source("R/functions.R")
tar_option_set(packages = c("keras", "tidyverse", "rsample", "recipes", "yardstick"))
list(
  tar_target(churn_file, "data/customer_churn.csv", format = "file"),
  tar_target(churn_data, split_data(churn_file)),
  tar_target(churn_recipe, prepare_recipe(churn_data)),
  tar_target(run_relu, test_model(act1 = "relu", churn_data, churn_recipe)),
  tar_target(run_sigmoid, test_model(act1 = "sigmoid", churn_data, churn_recipe)),
  tar_target(
    best_run,
    bind_rows(run_relu, run_sigmoid) %>%
      top_n(1, accuracy) %>%
      head(1)
  ),
  tar_target(
    best_model,
    retrain_run(best_run, churn_recipe),
    format = "keras"
  )
)
})

???

And we repeat the process to gradually build up the pipeline. Here, we add new targets to find the model run with the highest accuracy so far and retrain that model to return a fitted model object. For fitted Keras models, we need to write 'format = "keras"' in tar_target() because Keras models cannot be saved with R's ordinary serialization functionality.
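
The body of retrain_run() is also elided on the slides. A plausible sketch, assuming each model run is a one-row tibble that records its hyperparameters, simply re-fits those hyperparameters and returns the Keras model object itself, which is exactly why this target needs format = "keras".

retrain_run <- function(churn_run, churn_recipe) {
  # Hypothetical: re-fit the winning hyperparameters and return the fitted
  # Keras model rather than an accuracy summary.
  train_model(
    churn_recipe,
    units1 = churn_run$units1, units2 = churn_run$units2,
    act1 = churn_run$act1, act2 = churn_run$act2, act3 = churn_run$act3
  )
}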


Find the best model

tar_make()
tar_make(callr_function = NULL)

???

As before, only the new targets run because we didn't change any other code or data,


Find the best model

tar_read(best_model)

???

and now one of our targets is actually a trained Keras model.


Try another model.

# _targets.R
library(targets)
source("R/functions.R")
tar_option_set(packages = c("keras", "tidyverse", "rsample", "recipes", "yardstick"))
list(
  ...,
  tar_target(run_relu, test_model(act1 = "relu", churn_data, churn_recipe)),
  tar_target(run_sigmoid, test_model(act1 = "sigmoid", churn_data, churn_recipe)),
  tar_target(run_softmax, test_model(act1 = "softmax", churn_data, churn_recipe)), #<<
  tar_target(
    best_run,
    bind_rows(run_relu, run_sigmoid, run_softmax) %>% #<<
      top_n(1, accuracy) %>%
      head(1)
  ),
  ...
)
tar_script({
library(targets)
source("R/functions.R")
tar_option_set(packages = c("keras", "tidyverse", "rsample", "recipes", "yardstick"))
list(
  tar_target(churn_file, "data/customer_churn.csv", format = "file"),
  tar_target(churn_data, split_data(churn_file)),
  tar_target(churn_recipe, prepare_recipe(churn_data)),
  tar_target(run_relu, test_model(act1 = "relu", churn_data, churn_recipe)),
  tar_target(run_sigmoid, test_model(act1 = "sigmoid", churn_data, churn_recipe)),
  tar_target(run_softmax, test_model(act1 = "softmax", churn_data, churn_recipe)),
  tar_target(
    best_run,
    bind_rows(run_relu, run_sigmoid, run_softmax) %>%
      top_n(1, accuracy) %>%
      head(1)
  ),
  tar_target(
    best_model,
    retrain_run(best_run, churn_recipe),
    format = "keras"
  )
)
})

???

If we try another model, we just need to add another target and reference that new target in any downstream targets that need it.


What gets done stays done.

tar_outdated()
tar_outdated(reporter = "silent", callr_function = NULL)

???

Now, not only are the new model and the best_run target outdated, but the tool also calls everything downstream into question, so best_model is suspect as well.


What gets done stays done.

tar_visnetwork()

???

Again, we see this clearly in the dependency graph. run_softmax is new, best_run is outdated because it uses run_softmax, and best_model is suspect because it depends on best_run.


New best model?

tar_make()
#> ✓ skip target churn_file
#> ✓ skip target churn_data
#> ✓ skip target churn_recipe
#> ✓ skip target run_relu
#> ✓ skip target run_sigmoid
#> ● run target run_softmax
#> ● run target best_run
#> ✓ skip target best_model
tar_make(callr_function = NULL, reporter = "silent")

???

When we run the pipeline again, of course run_softmax and best_run both execute. But if best_run doesn't change, we skip best_model. targets looks at the actual fingerprint of the data to make decisions, unlike GNU Make, which just uses timestamps.
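
If you are curious about those fingerprints, they live in the pipeline's metadata. tar_meta() exposes them, and the hash columns are what targets compares against the current code and data when deciding whether to skip a target.

tar_meta(fields = c("command", "data"))
# one row per target, with hashes of the command and of the stored value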


What if we need to change a function?

define_model <- function(churn_recipe, units1, units2, act1, act2, act3) {
  input_shape <- ncol(
    juice(churn_recipe, all_predictors(), composition = "matrix")
  )
  keras_model_sequential() %>%
    layer_dense(
      units = units1,
      kernel_initializer = "uniform",
      activation = act1,
      input_shape = input_shape
    ) %>%
    layer_dropout(rate = 0.2) %>% # previously 0.1 #<<
    layer_dense(
      units = units2,
      kernel_initializer = "uniform",
      activation = act2
    ) %>%
    layer_dropout(rate = 0.1) %>%
    ...

???

If we change a function in a nontrivial way, targets notices. Let's increase the dropout rate in one of the layers in the neural net.


Show invalidated functions and targets.

tar_visnetwork()

???

The define_model() function is no longer up to date. That means train_model() is no longer up to date either, because it calls define_model(), and likewise for the functions downstream of it. So the targets downstream, which include all our model runs, are also invalidated.


Rerun invalidated targets.

tar_make()
#> ✓ skip target churn_file
#> ✓ skip target churn_data
#> ✓ skip target churn_recipe
#> ● run target run_relu
#> ● run target run_sigmoid
#> ● run target run_softmax
#> ● run target best_run
#> ● run target best_model

???

But we can bring everything back up to date without rerunning the data processing.


Similar story if the data file changes.

tar_make()
#> ● run target churn_file
#> ● run target churn_data
#> ● run target churn_recipe
#> ● run target run_relu
#> ● run target run_sigmoid
#> ● run target run_softmax
#> ● run target best_run
#> ● run target best_model

???

But the data processing does rerun if we change our data file. This is because in our upstream target churn_file, we selected 'format = "file"' in the tar_target() function. That tells targets to treat the return value of the command as a vector of file and directory paths, and it watches that data for changes.


Evidence of reproducibility

tar_make()
#> ✓ skip target churn_file
#> ✓ skip target churn_data
#> ✓ skip target churn_recipe
#> ✓ skip target run_relu
#> ✓ skip target run_sigmoid
#> ✓ skip target run_softmax
#> ✓ skip target best_run
#> ✓ skip target best_model
#> ✓ Already up to date.

???

At the end of the day, targets can tell you if all your targets are up to date. This is tangible evidence that your output matches the code and data it's supposed to come from. It's evidence that someone else running the same code would get the same results. That's reproducibility. It's certainly not the only form of reproducibility, but it does increase the trust we can place in the conclusions of the project.


Evidence of reproducibility

tar_outdated()
#> character(0)

tar_visnetwork()

???

And again, utility functions can confirm the status of targets without needing to run tar_make().


Resources

install.packages("remotes")
remotes::install_github("ropensci/targets")

???

There are several resources to learn about targets. There's a reference website, an online user manual, and a repository with the example code from today.

targets is nearing the end of its beta phase. I am about to submit it to rOpenSci for peer review, and I hope to have it on CRAN at the end of this year or early next year. Now is a great time for feedback because there is no formal release yet and the interface is not yet set in stone.


The tutorial

  1. Sign up for a free account at https://rstudio.cloud.
  2. Log into https://rstudio.cloud/project/1699460.
  3. Work through the R notebooks in order.
  4. Optional: revisit the material again later at https://github.com/wlandau/targets-tutorial.

Topic | Notebook
---|---
Functions | 1-functions.Rmd
Pipelines | 2-pipelines.Rmd
Changes | 3-changes.Rmd
Debugging | 4-debugging.Rmd
Files | 5-files.Rmd
Branching | 6-branching.Rmd
Challenge | 7-challenge.Rmd

tar_destroy()
unlink("_targets.R")

???

The workshop is publicly available and deployed to RStudio Cloud. Just sign up for a cloud account and log into a free instance of RStudio Server. We will spend our remaining time working through interactive exercises in the R notebooks here, with some breaks for Q&A and live coding demos to go over the solutions. If you missed part of the exercises or just want to go back and study in your own time, the workshop will still be available. I included the link to both the source and the cloud project here.

