make | R Documentation |
This is the central, most important function
of the drake package. It runs all the steps of your
workflow in the correct order, skipping any work
that is already up to date. Because of how make()
tracks global functions and objects as dependencies of targets,
please restart your R session so the pipeline runs
in a clean reproducible environment.
make(
  plan,
  targets = NULL,
  envir = parent.frame(),
  verbose = 1L,
  hook = NULL,
  cache = drake::drake_cache(),
  fetch_cache = NULL,
  parallelism = "loop",
  jobs = 1L,
  jobs_preprocess = 1L,
  packages = rev(.packages()),
  lib_loc = NULL,
  prework = character(0),
  prepend = NULL,
  command = NULL,
  args = NULL,
  recipe_command = NULL,
  log_progress = TRUE,
  skip_targets = FALSE,
  timeout = NULL,
  cpu = Inf,
  elapsed = Inf,
  retries = 0,
  force = FALSE,
  graph = NULL,
  trigger = drake::trigger(),
  skip_imports = FALSE,
  skip_safety_checks = FALSE,
  config = NULL,
  lazy_load = "eager",
  session_info = NULL,
  cache_log_file = NULL,
  seed = NULL,
  caching = "main",
  keep_going = FALSE,
  session = NULL,
  pruning_strategy = NULL,
  makefile_path = NULL,
  console_log_file = NULL,
  ensure_workers = NULL,
  garbage_collection = FALSE,
  template = list(),
  sleep = function(i) 0.01,
  hasty_build = NULL,
  memory_strategy = "speed",
  layout = NULL,
  spec = NULL,
  lock_envir = NULL,
  history = TRUE,
  recover = FALSE,
  recoverable = TRUE,
  curl_handles = list(),
  max_expand = NULL,
  log_build_times = TRUE,
  format = NULL,
  lock_cache = TRUE,
  log_make = NULL,
  log_worker = FALSE
)
plan |
Workflow plan data frame.
A workflow plan data frame is a data frame
with a |
targets |
Character vector, names of targets to build. Dependencies are built too. You may supply static and/or whole dynamic targets, but no sub-targets. |
envir |
Environment to use. Defaults to the current
workspace, so you should not need to worry about this
most of the time. A deep copy of |
verbose |
Integer, control printing to the console/terminal.
|
hook |
Deprecated. |
cache |
drake cache as created by |
fetch_cache |
Deprecated. |
parallelism |
Character scalar, type of parallelism to use.
For detailed explanations, see
You could also supply your own scheduler function
if you want to experiment or aggressively optimize.
The function should take a single
|
jobs |
Maximum number of parallel workers for processing the targets.
You can experiment with |
jobs_preprocess |
Number of parallel jobs for processing the imports and doing other preprocessing tasks. |
packages |
Character vector packages to load, in the order
they should be loaded. Defaults to |
lib_loc |
Character vector, optional.
Same as in |
prework |
Expression (language object), list of expressions,
or character vector.
Code to run right before targets build.
Called only once if |
prepend |
Deprecated. |
command |
Deprecated. |
args |
Deprecated. |
recipe_command |
Deprecated. |
log_progress |
Logical, whether to log the progress
of individual targets as they are being built. Progress logging
creates extra files in the cache (usually the |
skip_targets |
Logical, whether to skip building the targets
in |
timeout |
|
cpu |
Same as the |
elapsed |
Same as the |
retries |
Number of retries to execute if the target fails.
Assign target-level retries with an optional |
force |
Logical. If |
graph |
Deprecated. |
trigger |
Name of the trigger to apply to all targets.
Ignored if |
skip_imports |
Logical, whether to totally neglect to
process the imports and jump straight to the targets. This can be useful
if your imports are massive and you just want to test your project,
but it is bad practice for reproducible data analysis.
This argument is overridden if you supply your own |
skip_safety_checks |
Logical, whether to skip the safety checks on your workflow. Use at your own peril. |
config |
Deprecated. |
lazy_load |
An old feature, currently being questioned.
For the current recommendations on memory management, see
If |
session_info |
Logical, whether to save the |
cache_log_file |
Name of the CSV cache log file to write.
If |
seed |
Integer, the root pseudo-random number generator
seed to use for your project.
In To ensure reproducibility across different R sessions,
On the first call to |
caching |
Character string, either
|
keep_going |
Logical, whether to still keep running |
session |
Deprecated. Has no effect now. |
pruning_strategy |
Deprecated. See |
makefile_path |
Deprecated. |
console_log_file |
Deprecated in favor of |
ensure_workers |
Deprecated. |
garbage_collection |
Logical, whether to call |
template |
A named list of values to fill in the |
sleep |
Optional function on a single numeric argument To conserve memory, For parallel processing, The To sleep for the same amount of time between checks,
you might supply something like |
hasty_build |
Deprecated. |
memory_strategy |
Character scalar, name of the
strategy
For even more direct
control over which targets |
layout |
Deprecated. |
spec |
Deprecated. |
lock_envir |
Deprecated in |
history |
Logical, whether to record the build history
of your targets. You can also supply a
|
recover |
Logical, whether to activate automated data recovery.
The default is
How it works: if
If both conditions are met,
Functions |
recoverable |
Logical, whether to make target values recoverable
with |
curl_handles |
A named list of curl handles. Each value is an
object from
|
max_expand |
Positive integer, optional.
|
log_build_times |
Logical, whether to record build_times for targets.
Mac users may notice a 20% speedup in |
format |
Character, an optional custom storage format for targets
without an explicit |
lock_cache |
Logical, whether to lock the cache before running |
log_make |
Optional character scalar of a file name or
connection object (such as |
log_worker |
Logical, same as the |
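The sleep argument listed above defaults to function(i) 0.01, a constant wait of 0.01 seconds. The following is a minimal sketch of an alternative that waits longer as its argument grows. It assumes that i increases with each consecutive idle check (the argument description above is incomplete, so treat this as an assumption), and the name sleep_backoff and its constants are hypothetical, not part of drake.

```r
# Constant wait, identical to the default shown in the usage section.
sleep_constant <- function(i) 0.01

# Hypothetical back-off: double the wait with each increment of i,
# starting at 0.01 seconds and never exceeding 2 seconds.
# ASSUMPTION: i counts consecutive idle checks.
sleep_backoff <- function(i) {
  min(2, 0.01 * 2^(i - 1))
}

# Inspect the waits produced for the first ten values of i.
waits <- vapply(1:10, sleep_backoff, numeric(1))
print(waits)

# Possible use, where plan is a placeholder for your own workflow plan:
# make(plan, sleep = sleep_backoff)
```

Capping the wait keeps long workflows responsive while still reducing how often idle workers are polled.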
nothing
In interactive sessions, consider r_make(), r_outdated(), etc. rather than make(), outdated(), etc. The r_*() drake functions are more reproducible when the session is interactive.
If you do run make() interactively, please restart your R session beforehand so your functions and global objects get loaded into a clean reproducible environment. This prevents targets from getting invalidated unexpectedly.
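A minimal sketch of the r_make() approach follows. It assumes, as described in drake's documentation for r_make(), that the function runs a configuration script (by default _drake.R) in a fresh R process and that the script ends with a call to drake_config(). The temporary project folder and the plan contents are hypothetical and exist only to keep the sketch self-contained.

```r
# Hypothetical project folder; a temporary directory keeps the sketch tidy.
project <- file.path(tempdir(), "drake_r_make_demo")
dir.create(project, showWarnings = FALSE)
script <- file.path(project, "_drake.R")

# Write the configuration script that the fresh R process will run.
# ASSUMPTION: the last expression must be a call to drake_config().
writeLines(c(
  "library(drake)",
  "plan <- drake_plan(",
  "  x = datasets::mtcars$mpg,",
  "  y = mean(x)",
  ")",
  "drake_config(plan)"
), script)

# Run the workflow only if drake and callr are installed.
if (requireNamespace("drake", quietly = TRUE) &&
    requireNamespace("callr", quietly = TRUE)) {
  old <- setwd(project)
  tryCatch({
    print(drake::r_outdated(source = script)) # targets that need building
    drake::r_make(source = script)            # build them in a clean session
  }, finally = setwd(old))
}
```

Because the packages, functions, and plan are loaded by the script in a new process, objects lingering in the interactive session cannot affect which targets are considered outdated.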
A serious drake workflow should be consistent and reliable, ideally with the help of a main R script. This script should begin in a fresh R session, load your packages and functions in a dependable manner, and then run make(). Example: https://github.com/wlandau/drake-examples/tree/main/gsp. Batch mode, especially within a container, is particularly helpful.
Interactive R sessions are still useful, but they easily grow stale. Targets can falsely invalidate if you accidentally change a function or data object in your environment.
It is possible to construct a workflow that tries to invalidate itself. Example:
plan <- drake_plan(
  x = {
    data(mtcars)
    mtcars$mpg
  },
  y = mean(x)
)
Here, because data() loads mtcars into the global environment, the very act of building x changes the dependencies of x. In other words, without safeguards, x would not be up to date at the end of make(plan). Please try to avoid workflows that modify the global environment. Functions such as data() belong in your setup scripts prior to make(), not in any functions or commands that get called during make() itself.
For each target that is still problematic (e.g. https://github.com/rstudio/gt/issues/297) you can safely run the command in its own special callr::r() process. Example: https://github.com/rstudio/gt/issues/297#issuecomment-497778735.
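The side effect described above can be observed with base R alone. The sketch below points data() at a scratch environment (a stand-in for the global environment, chosen only so the demonstration leaves no trace) and then shows one possible side-effect-free rewrite of the plan that references the dataset through its namespace. The rewrite is a suggestion under those assumptions, not the only remedy.

```r
# data() assigns an object into an environment as a side effect.
# A scratch environment stands in for the global environment here.
scratch <- new.env()
before <- ls(scratch)
data("mtcars", package = "datasets", envir = scratch)
created <- setdiff(ls(scratch), before)
print(created) # names of the objects that data() created

# Side-effect-free alternative: read the dataset through its namespace.
mpg <- datasets::mtcars$mpg

# The same idea applied to the plan, if drake is installed.
if (requireNamespace("drake", quietly = TRUE)) {
  plan <- drake::drake_plan(
    x = datasets::mtcars$mpg, # no call to data(), so no new global object
    y = mean(x)
  )
  print(plan)
}
```

With this form, building x does not create or modify any object that x itself depends on.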
When make() runs, it locks the cache so other processes cannot modify it. Same goes for outdated(), vis_drake_graph(), and similar functions when make_imports = TRUE. This is a safety measure to prevent simultaneous processes from corrupting the cache. If you get an error saying that the cache is locked, either set make_imports = FALSE or manually force unlock it with drake_cache()$unlock().
drake_plan(), drake_config(), vis_drake_graph(), outdated()
## Not run:
isolate_example("Quarantine side effects.", {
  if (suppressWarnings(require("knitr"))) {
    load_mtcars_example() # Get the code with drake_example("mtcars").
    config <- drake_config(my_plan)
    outdated(my_plan) # Which targets need to be (re)built?
    make(my_plan) # Build what needs to be built.
    outdated(my_plan) # Everything is up to date.
    # Change one of your imported function dependencies.
    reg2 = function(d) {
      d$x3 = d$x^3
      lm(y ~ x3, data = d)
    }
    outdated(my_plan) # Some targets depend on reg2().
    make(my_plan) # Rebuild just the outdated targets.
    outdated(my_plan) # Everything is up to date again.
    if (requireNamespace("visNetwork", quietly = TRUE)) {
      vis_drake_graph(my_plan) # See how they fit in an interactive graph.
      make(my_plan, cache_log_file = TRUE) # Write a CSV log file this time.
      vis_drake_graph(my_plan) # The colors changed in the graph.
      # Run targets in parallel:
      # options(clustermq.scheduler = "multicore") # nolint
      # make(my_plan, parallelism = "clustermq", jobs = 2) # nolint
    }
    clean() # Start from scratch next time around.
  }
  # Dynamic branching
  # Get the mean mpg for each cyl in the mtcars dataset.
  plan <- drake_plan(
    raw = mtcars,
    group_index = raw$cyl,
    munged = target(raw[, c("mpg", "cyl")], dynamic = map(raw)),
    mean_mpg_by_cyl = target(
      data.frame(mpg = mean(munged$mpg), cyl = munged$cyl[1]),
      dynamic = group(munged, .by = group_index)
    )
  )
  make(plan)
  readd(mean_mpg_by_cyl)
})
## End(Not run)