About

targets can reproducibly watch input files, output files, and literate programming documents. This chapter explores how to configure a target to automatically run when a file changes.

Setup

Run tar_destroy() to remove the targets from the previous chapter.

library(targets)
tar_destroy()

Load this chapter's quiz questions. Try not to peek in advance.

source("R/quiz.R")
source("5-files/answers.R")

Run the following command to write the required _targets.R script to the your working directory.

file.copy("5-files/initial_targets.R", "_targets.R", overwrite = TRUE)

Open _targets.R for editing

tar_edit()

Review: input data files

As we saw in 3-changes.Rmd, targets can watch files like data/churn.csv for changes. Let's review how this works. First, run the full pipeline from start to finish.

tar_make()

Verify that all targets are now up to date.

tar_make()

Now, remove the last line of data in the file.

library(tidyverse)
"data/churn.csv" %>%
  read_csv(col_types = cols()) %>%
  head(n = nrow(.) - 1) %>%
  write_csv("data/churn.csv")

When we call tar_make() again, all targets should rerun because they all depend on the upstream file target churn_file.

tar_make()

How do we configure churn_file and downstream targets to rerun when data/churn.csv changes?

A. The target's return value needs to be a character vector of file and directory paths. These paths get resolved at runtime, so we do not need to know them before we call tar_make(). B. In tar_target(), set the format argument equal to "file". That way, tar_make() knows the return value of churn_file is a bunch of file paths that need to be watched. C. Targets directly downstream need to mention the symbol churn_file (as opposed to the literal path "data/churn.csv") so tar_make() can discover the correct dependency relationships among targets. Always check with tar_visnetwork() to verify that your targets are connected properly in the dependency graph. D. All the above.

answer4_review("E")

Hint:

tar_read(churn_file)

Output files

In targets, we configure output files the exact same way. The only difference between input and output files is that output files are created when the target runs.

As an example, open _targets.R and create a new file targets churn_cor, which saves a CSV file and a plot of correlations (the correlation of each covariate with customer churn in the preprocessed testing data). Functions compute_cor() and plot_cor() in 5-files/functions.R do most of the work.

Open _targets.R for editing.

tar_edit()

Enter the new target below into the target list in _targets.R.

tar_target(
  churn_cor, {
    cor <- compute_cor(churn_recipe)
    plot <- plot_cor(cor)
    write_csv(cor, "cor.csv")
    ggsave(plot = plot, filename = "cor.png", width = 8, height = 8)
    # The return value must be a vector of paths to the files we write:
    paste0("cor.", c("csv", "png"))
  },
  format = "file" # Tells targets to track the return value (path) as a file.
)

Run the pipeline. Since all previous targets are up to date, only churn_cor should run.

tar_make()

What is the return value of the new churn_cor target?

tar_read(churn_cor)

A. c("cor.csv", "cor.png"), a vector of paths to the files produced by the target. B. A ggplot object with the correlation plot. C. A data frame of correlations. D. A function called churn_cor().

answer4_return("E")

Take a look at the new output file cor.png. You should see a plot of each variable's correlation with customer churn. Also glance at cor.csv, the output dataset with the correlations.

read_csv("cor.csv", col_types = cols())

Now, delete one of the output files and rerun the pipeline.

tmp <- file.remove("cor.png")
tar_make()

What happened? Why?

A. All targets reran because we changed a file. B. No target reran because targets does not track the deleted file. C. Because churn_cor is a correctly configured file target, tar_make() noticed when cor.png changed and automatically reran the target in order to repair the file. D. churn_cor because targets always treats character strings as file names.

answer4_delete("E")

Whenever a single target like churn_cor tracks multiple files or directories, tar_make() treats all those files as a single unit. The whole target invalidates when one of the files changes, and downstream targets must accept all the files together. When we come to the chapter on branching, we will use the special tar_files() function from the tarchetypes package to branch over the available files.

Literate programming

Literate programming is the practice of writing code and explanatory prose in the same source file. This R Markdown document is an example. All this time, we have been using literate programming on top of targets. But now, we will explore literate programming within a target.

R Markdown on its own

Let's pull an example R Markdown file.

tmp <- file.copy("5-files/results.Rmd", "results.Rmd", overwrite = TRUE)

Open results.Rmd for editing.

library(usethis)
edit_file("results.Rmd", open = TRUE)

Notice the calls to tar_load() and tar_read() in active the code chunks. results.Rmd. On its own, this report leverages the results of previous targets.

# Copy and paste the following directly into the R console.
# This code chunk will not render results.Rmd properly
# if called inside the R notebook.
library(rmarkdown)
render("results.Rmd")
browseURL("results.html")

R Markdown inside a target

We can put results.Rmd in a target so it automatically re-renders when its dependencies change. Open _targets.R for editing.

tar_edit()

Write library(tarchetypes) at the very top, and write tar_render(report_step, "results.Rmd") as a target in the pipeline.

# Do not run here.
library(tarchetypes)
# Existing setup code goes here.
list(
  # Existing calls to tar_target() stay here.
  tar_render(report_step, "results.Rmd")
)

tar_render() analyzes report.Rmd and constructs a target that depends on the report's tar_read()/tar_load() dependencies. To see this for yourself, take a look at the dependency graph.

tar_visnetwork()

Which targets does report_step depend on? Why?

A. None. No upstream targets are mentioned in the target's command. B. run_relu because the report calls tar_read(run_relu) in an active code chunk. C. run_sigmoid because the report calls tar_load(run_sigmoid) in an active code chunk. D. run_relu and run_sigmoid because the report calls tar_read(run_relu) and tar_load(run_sigmoid) in active code chunks.

answer4_deps1("E")

Add a dependency

At the bottom of the report.Rmd, add an active code chunk to print out the best model (tar_read(best_model)). Then, look at the graph again.

tar_visnetwork()

What changed? Why?

A. report_step is disconnected from the other nodes in the graph because we edited the report. B. The graph did not change because we did not run the report yet. C. report_step now depends on best_model in addition to run_relu and run_sigmoid. All three are mentioned in active code chunks with tar_load() and tar_read(). D. report_step is only connected to best_model now because you just added it as a dependency.

answer4_deps2("E")

Run the report

Run the whole pipeline. The newly added report_step target should run.

tar_make()

Verify that all targets are up to date now.

tar_make() # See also tar_outdated()

View results.html. You should see a print-out of the best model at the bottom.

browseURL("results.html")

Remove the output file

Remove the output HTML file.

unlink("results.html")

Then rerun the pipeline.

tar_make()

What happened? Why?

A. All targets reran because the file system change. B. The report_step target reran because tar_render() reproducibly tracks the output files of rmarkdown::render() and helps tar_make() respond to changes in results.html. C. The report_step target reran because results.Rmd changed. D. No targets reran because output files from R Markdown reports are not reproducibly tracked.

answer4_html("E")

Change the R Markdown source

Add some prose anywhere in the body of results.Rmd.

library(usethis)
edit_file("results.Rmd", open = TRUE)
# Add comments in the report to explain the results.

Now, rerun the pipeline.

tar_make()

What happened? Why?

A. Only report_step reran because the R Markdown source file changed and all the other targets stayed up to date. B. Only report_step reran because R Markdown reports always rerun. C. All targets reran because results.Rmd is a dependency of the whole pipeline. D. No targets reran because the pipeline does not track changes to the R Markdown source.

answer4_rmd("E")

Change an upstream dependency

Remove another line of data/churn.csv.

"data/churn.csv" %>%
  read_csv(col_types = cols()) %>%
  head(n = nrow(.) - 1) %>%
  write_csv("data/churn.csv")

Now, rerun the pipeline.

tar_make()

Did report_step rerun? Why?

A. No, because data/churn.csv is not a dependency of report_step. B. No, because the report.Rmd does not read data/churn.csv. C. Yes. The R Markdown report is downstream of a target that depends on data/churn.csv, and the change in the data file caused a chain reaction that changed the model outputs and thus invalidated report_step. D. Yes. If a single target imports a data file, all targets in the pipeline rerun.

answer4_rmd_data("E")

An aside: parameterized R Markdown

tarchetypes::tar_render() supports parameterized R Markdown, and parameters can be values of upstream targets in the pipeline. In the following pipeline:

list(
  tar_target(data, data.frame(x = seq_len(26), y = letters))
  tar_render(report, "report.Rmd", params = list(your_param = data))
)

the report target will run:

rmarkdown::render("report.Rmd", params = list(your_param = your_target))

where report.Rmd has the following YAML front matter:

---
title: report
output_format: html_document
params:
  your_param: "default value"
---

and the following code chunk:

print(params$your_param)

See these examples for a demonstration.



kenshuri/targets-tuto documentation built on April 19, 2021, 9:58 a.m.