remakeGenerator
remakeGenerator will be the successor to workflowHelper. remakeGenerator is internally cleaner, more flexible, and more extensible than workflowHelper, and it is better suited to adapt to future updates to remake. remakeGenerator is tested and available for use.
workflowHelper
This package helps to analyze multiple datasets in multiple ways. You set up your workflow with one call to plan_workflow() and another to make.
Thanks to remake, whenever you change your code, your next job will only recompute the affected tasks. This minimizes headaches when your workflow is under heavy development and unexpected changes happen frequently.
workflowHelper will arrange the commands you supply in a workflow and manage your output.
Before using this package, you should first learn about remake. GNU make is recommended but not strictly necessary.
Ensure that R and GNU make are installed, as well as the dependencies listed in the DESCRIPTION file. Open an R session and run
library(devtools)
install_github("wlandau/workflowHelper")
Alternatively, you can build the package from the source and install it by hand. First, ensure that git is installed. Next, open a command line program such as Terminal and enter the following commands.
git clone git@github.com:wlandau/workflowHelper.git
R CMD build workflowHelper
R CMD INSTALL ...
where ... is replaced by the name of the tarball produced by R CMD build.
Rtools
The example and tests sometimes use system("make") and similar commands, so if you are using Windows, you will need to install Rtools.
You can run this example from start to finish with the run_example_workflowHelper() function. Alternatively, you can set up earlier stages with write_example_workflowHelper() or setup_example_workflowHelper() and then run the output manually with remake::make() or make. Then, optionally, use the clean_example_workflowHelper() function to remove all the files generated by run_example_workflowHelper(). The details of the example are below.
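Suppose I want to generate several datasets, analyze each dataset in multiple ways, summarize each analysis, and then produce aggregate output, plots, and reports.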
I keep the functions to generate data, analyze data, etc. in code.R, and the script to organize and set up the workflow is workflow.R. There are also knitr reports latex.Rnw and markdown.Rmd. You can generate these files with the write_example_workflowHelper() function. Typically, in your own workflows, you will write these files by hand.
workflow.R
First, I list the R scripts containing my code and the package dependencies.
library(workflowHelper)
sources = strings(code.R)
packages = strings(MASS)
# packages = strings(MASS, rmarkdown, tools) # Uncomment before building pdf/html
The strings function converts R expressions into character strings, so I could have simply written sources = "code.R".
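To make the idea concrete, here is a minimal sketch of how a function like strings could capture unevaluated expressions and turn them into text. The name strings_sketch is hypothetical, and this is not the package's actual implementation; it only uses base R.

```r
# Hypothetical stand-in for strings(): capture the unevaluated arguments
# and deparse each one into a character string.
strings_sketch <- function(...) {
  exprs <- as.list(substitute(list(...)))[-1]
  vapply(
    exprs,
    function(e) paste(deparse(e), collapse = ""),
    character(1),
    USE.NAMES = FALSE
  )
}

sources <- strings_sketch(code.R)
packages <- strings_sketch(MASS, rmarkdown, tools)
print(sources)
print(packages)
```

Because the arguments are never evaluated, objects named code.R or MASS do not need to exist in the session.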
Next, I list the commands to generate the datasets.
datasets = commands(
normal16 = normal_dataset(n = 16),
poisson32 = poisson_dataset(n = 32),
poisson64 = poisson_dataset(n = 64)
)
Be sure to give a unique name to each command (for example, poisson_dataset(n = 32) has the unique name poisson32). The commands function checks for names and returns a named character vector, so I could have simply written datasets = c(normal16 = "normal_dataset(n = 16)", poisson32 = "poisson_dataset(n = 32)", poisson64 = "poisson_dataset(n = 64)"). To generate 4 replicates of each kind of dataset, write datasets = reps(datasets, 4).
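The behavior described above (check the names, return a named character vector) can be sketched in base R as follows. The function commands_sketch is a hypothetical illustration, not the code inside workflowHelper, and the exact checks the real commands function performs are an assumption.

```r
# Hypothetical stand-in for commands(): require unique, non-empty names
# and return the unevaluated commands as a named character vector.
commands_sketch <- function(...) {
  exprs <- as.list(substitute(list(...)))[-1]
  nms <- names(exprs)
  if (is.null(nms) || any(nms == "") || anyDuplicated(nms) > 0) {
    stop("every command needs a unique name")
  }
  vapply(exprs, function(e) paste(deparse(e), collapse = " "), character(1))
}

datasets <- commands_sketch(
  normal16 = normal_dataset(n = 16),
  poisson32 = poisson_dataset(n = 32),
  poisson64 = poisson_dataset(n = 64)
)
print(datasets)
```

The functions normal_dataset and poisson_dataset are never called here; only their text is stored, which is what lets the workflow be planned before anything is computed.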
Similarly, I specify the commands to analyze each dataset.
analyses = commands(
linear = linear_analysis(..dataset..),
quadratic = quadratic_analysis(..dataset..)
)
The ..dataset.. wildcard stands for the current dataset being analyzed, which in this case is an object returned by normal_dataset or poisson_dataset. Wildcards are case-insensitive, so ..DATASET.. and ..dAtAsEt.. will also work.
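As a rough sketch of what wildcard expansion amounts to, the following base R code substitutes each dataset name for ..dataset.. in each analysis command, ignoring case. The helper expand_dataset is hypothetical, and the dataset_analysis naming scheme is inferred from the target names listed later by recallable(); the package's internals may differ.

```r
# Hypothetical helper: expand the ..dataset.. wildcard for every dataset.
expand_dataset <- function(analyses, dataset_names) {
  out <- character(0)
  for (d in dataset_names) {
    expanded <- gsub("\\.\\.dataset\\.\\.", d, analyses, ignore.case = TRUE)
    names(expanded) <- paste(d, names(analyses), sep = "_")
    out <- c(out, expanded)
  }
  out
}

analyses <- c(
  linear = "linear_analysis(..dataset..)",
  quadratic = "quadratic_analysis(..DATASET..)"
)
expanded <- expand_dataset(analyses, c("normal16", "poisson32", "poisson64"))
print(expanded)
```

Three datasets and two analyses yield six expanded commands, one per dataset and analysis pair.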
For summaries of the analyses, there is an additional ..analysis.. wildcard that stands for the current object returned by linear_analysis or quadratic_analysis. Like ..dataset.., ..analysis.. is case-insensitive, so ..ANALYSIS.. will also work.
summaries = commands(
mse = mse_summary(..dataset.., ..analysis..),
coef = coefficients_summary(..analysis..)
)
Next, I specify how to produce general output from the summaries, etc. Since coef.csv has a file extension, it will automatically be treated as a file target.
output = commands(
coef_table = do.call(I("rbind"), coef),
coef.csv = write.csv(coef_table, target_name),
mse_vector = unlist(mse)
)
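The distinction between file targets and object targets above depends only on the target name. A minimal sketch of such a check, assuming the rule is simply "has a file extension" (the precise rule remake applies may be different), can use tools::file_ext from base R. The helper is_file_target is hypothetical.

```r
# Hypothetical helper: treat a target as a file if its name has an extension.
is_file_target <- function(target_names) {
  tools::file_ext(target_names) != ""
}

targets <- c("coef_table", "coef.csv", "mse_vector", "mse.pdf")
file_flags <- is_file_target(targets)
print(data.frame(target = targets, is_file = file_flags))
```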
Now, we're ready to specify plots. (Here, the plot: TRUE line is automatically added to remake.yml.)
plots = commands(
mse.pdf = hist(mse_vector, col = I("black"))
)
Finally, we can generate some reports.
reports = commands(
markdown.md = list("poisson32", "coef_table", "coef.csv"), # dependencies
latex.tex = TRUE # no dependencies here
# markdown.html = render("markdown.md", quiet = TRUE, clean = FALSE),
# latex.pdf = texi2pdf("latex.tex", clean = FALSE)
)
Since markdown.md has a .md extension, remake will automatically look for markdown.Rmd and knit it to markdown.md with the knitr package. Similarly, remake will try to build latex.tex from latex.Rnw. In each case, the command is replaced with a character vector or list of character strings denoting the dependencies of the report. These could be external files or cached intermediate remake objects such as datasets or analyses. In the latter case, objects are automatically exported for use inside R code chunks, as described in the remake documentation.
If you want to render markdown.md to markdown.html, be sure to include rmarkdown in your packages. Similarly, to compile latex.tex to latex.pdf, include the tools package. I commented out the lines to build markdown.html and latex.pdf in order to increase portability, but you may uncomment them if your copy of R is connected to copies of LaTeX and Pandoc.
Optionally, I can prepend some lines to the overarching Makefile for the workflow.
begin = c("# This is my Makefile", "# Variables...")
The stages and elements of my workflow are now planned. To put them all together, I use plan_workflow, which calls parallelRemake::write_makefile().
plan_workflow(sources, packages, datasets, analyses, summaries, output, begin)
Optionally, I can pass additional arguments to remake::make using the remake_args argument to plan_workflow. For example, plan_workflow(..., remake_args = list(verbose = FALSE)) is equivalent to remake::make(..., verbose = FALSE) for each target. I cannot set target_names or remake_file this way. Also, if I want to suppress the writing of the Makefile, I can call plan_workflow(..., makefile = NULL).
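The forwarding described above can be pictured with do.call. The sketch below uses a stand-in function, fake_make, instead of remake::make, and a hypothetical helper, forward_args; only the argument names target_names, remake_file, and verbose come from the text, and none of this is the package's actual code.

```r
# Stand-in for remake::make(); the body is purely illustrative.
fake_make <- function(target_names, verbose = TRUE) {
  paste0("make(", target_names, ", verbose = ", verbose, ")")
}

# Hypothetical helper: forward user-supplied arguments for one target,
# refusing the arguments that the workflow has to control itself.
forward_args <- function(target, remake_args = list()) {
  reserved <- intersect(names(remake_args), c("target_names", "remake_file"))
  if (length(reserved) > 0) {
    stop("cannot set: ", paste(reserved, collapse = ", "))
  }
  do.call(fake_make, c(list(target_names = target), remake_args))
}

forwarded <- forward_args("poisson32", remake_args = list(verbose = FALSE))
print(forwarded)
```

Reserving target_names and remake_file reflects the restriction stated above: those values are determined per target by the generated Makefile.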
After running the workflow.R script above, I have a remake YAML file (remake.yml) in my current working directory. To run the whole workflow in an R session with no parallel computing, simply open an R session and enter the following.
library(remake)
make(remake_file = "remake.yml")
Thanks to remake, if I change functions in code.R and then run make again, only the outdated parts of the workflow will be rebuilt.
Running workflow.R also produces a Makefile in the current working directory. Using this master Makefile and a command line program, I have several options for running the workflow with parallel computing. Here are some examples.
make runs the full workflow, only building results that are out of date or missing.
make -j <n> is the same as above with the workflow distributed over <n> parallel processes. Similarly, you can append -j <n> to any of the commands below to activate parallelism.
make datasets just makes the datasets.
make analyses just runs the analyses of all the datasets after ensuring that the datasets are up to date.
make summaries computes individual summaries of each analysis of each dataset.
make aggregates aggregates the summaries together.
make output makes the final output of the workflow after ensuring all the previous results are up to date.
make clean removes the files generated by make. If some of your files are produced by side effects, make clean might not remove them. In that case, updates to dependencies may not trigger the desired rebuilds, so you should read the next section.
make reset runs make clean and then removes the Makefile and all its constituent YAML files.
Intermediate objects such as datasets, analyses, and summaries are maintained in remake's hidden storr cache. At any point in the workflow, you can reload them using recall and check the available ones using recallable. Let's go back to the example. First, I check to see the names of the objects I can reload.
> recallable()
[1] "coef" "coef_table"
[3] "mse" "mse_vector"
[5] "normal16" "normal16_linear"
[7] "normal16_linear_coef" "normal16_linear_mse"
[9] "normal16_quadratic" "normal16_quadratic_coef"
[11] "normal16_quadratic_mse" "poisson32"
[13] "poisson32_linear" "poisson32_linear_coef"
[15] "poisson32_linear_mse" "poisson32_quadratic"
[17] "poisson32_quadratic_coef" "poisson32_quadratic_mse"
[19] "poisson64" "poisson64_linear"
[21] "poisson64_linear_coef" "poisson64_linear_mse"
[23] "poisson64_quadratic" "poisson64_quadratic_coef"
[25] "poisson64_quadratic_mse"
>
Then if I want to load mse, the list of summaries generated by mse_summary in code.R, I simply use recall.
> recall("mse")
$normal16_linear
[1] 0.6394384
$normal16_quadratic
[1] 0.6394384
$poisson32_linear
[1] 4.991832
$poisson32_quadratic
[1] 4.991832
$poisson64_linear
[1] 3.613922
$poisson64_quadratic
[1] 3.613922
>
Important: do not manually access the files inside .remake/objects for serious jobs. Changes via functions like recall() and recallable() are not tracked and thus not reproducible.
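For quick inspection, a recalled object behaves like any other R object. The sketch below rebuilds the mse list by hand from the values printed above, so it runs without a remake cache, and then flattens it the same way the mse_vector = unlist(mse) output command does. In a real session, the first assignment would presumably be mse <- recall("mse").

```r
# Values copied from the recall("mse") output shown above.
mse <- list(
  normal16_linear = 0.6394384,
  normal16_quadratic = 0.6394384,
  poisson32_linear = 4.991832,
  poisson32_quadratic = 4.991832,
  poisson64_linear = 3.613922,
  poisson64_quadratic = 3.613922
)

# Flatten the list into a named numeric vector, as the output stage does.
mse_vector <- unlist(mse)
print(mse_vector)

# Name of the dataset and analysis pair with the largest mean squared error.
worst <- names(mse_vector)[which.max(mse_vector)]
print(worst)
```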
If you want to run make -j to distribute tasks over multiple nodes of a Slurm cluster, refer to the Makefile in this post and write
write_makefile(...,
begin = c(
"SHELL=srun",
".SHELLFLAGS= <ARGS> bash -c"))
in an R session, where <ARGS> stands for additional arguments to srun. Then, once the Makefile is generated, you can run the workflow with
nohup make -j [N] &
in the command line, where [N] is the number of simultaneous tasks.
For other task managers such as PBS, such an approach may not be possible. Regardless of the system, be sure that all nodes point to the same working directory so that they share the same .remake storr cache.
You may want to use the downsize package within your custom R source code. That way, you can run a quick scaled-down version of your workflow for debugging and testing before you run the full workload. In the example, just include downsize in packages inside workflow.R and replace the top few lines of code.R with the following.
library(downsize)
scale_down()
normal_dataset = function(n = 16){
ds(data.frame(x = rnorm(n, 1), y = rnorm(n, 5)), nrow = 4)
}
poisson_dataset = function(n = 16){
ds(data.frame(x = rpois(n, 1), y = rpois(n, 5)), nrow = 4)
}
The call scale_down() sets the downsize option to TRUE, which is a signal to the ds function. The command ds(A, ...) says "Downsize A to a smaller object when getOption("downsize") is TRUE". For the full scaled-up workflow, just delete the first two lines or replace scale_down() with scale_up(). Unfortunately, remake does not rebuild things when options are changed, so you'll have to run make clean whenever you change the downsize option.
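The option-driven switch can be illustrated without the downsize package. The function ds_sketch below is a hypothetical, much simplified stand-in for ds that only handles the nrow case for data frames; the real ds function supports more than this.

```r
# Hypothetical stand-in for ds(): keep only the first nrow rows
# when the "downsize" option is TRUE, otherwise return x unchanged.
ds_sketch <- function(x, nrow) {
  if (isTRUE(getOption("downsize"))) utils::head(x, nrow) else x
}

full <- data.frame(x = rnorm(16, 1), y = rnorm(16, 5))

old <- options(downsize = TRUE)    # roughly what scale_down() is described as doing
small <- ds_sketch(full, nrow = 4)

options(downsize = FALSE)          # roughly what scale_up() is described as doing
same <- ds_sketch(full, nrow = 4)

options(old)                       # restore the previous setting
print(c(downsized = nrow(small), full_size = nrow(same)))
```

Since the switch lives in a global option rather than in the code or its inputs, a dependency tracker that hashes functions and data has no way to notice the change, which is consistent with the need to run make clean after toggling it.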