In dd-harp/rampdata: Finds data and places data for the RAMP project

Introduction

This notebook shows how to use the rampdata package in an application. Once the package is initialized, it adds several features to a main function:

Logging, with warnings and errors saved to a separate file when running on the cluster.
Provenance for each file, saved beside that file, and provenance for the main function, saved in the provenance directory beside the data.
Paths specified relative to a data directory that is configured for each user and machine. This lets you run the same code on the cluster or at home.
Versioned input and output files, where a workflow configuration file says which version each one uses.

When you read this, please consider first not the syntax but whether this would fit the flow of how you work. We can make the syntax easier and more automatic later, once we are sure the design doesn't preclude the way you work.

Main

The main function

We put code into packages, and one or more functions in that package are top-level functions. A researcher or a workflow tool will call those functions directly. These are main functions, and the rampdata package helps write them so that they run better on the cluster.

A main function can be called in three ways: interactively at the R prompt, from the command line as R -e "function()", or from the command line as a script. In the last case, you write a script, such as

gisdemog::create_uganda_cover()

I would save that single line into a file uganda_cover.R and run it from the command line with Rscript uganda_cover.R --shapefile uganda.shp.
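To make the Rscript invocation concrete, here is a minimal sketch of how a script sees its arguments. Under Rscript, commandArgs(trailingOnly = TRUE) returns only what follows the script name. The helper flag_value is hypothetical and is not part of rampdata; the package's own parser appears later in this notebook.

```r
# Hypothetical helper (not part of rampdata): return the value that
# follows a named flag, or NULL if the flag is absent or has no value.
flag_value <- function(args, flag) {
  idx <- match(flag, args)
  if (is.na(idx) || idx == length(args)) {
    return(NULL)
  }
  args[[idx + 1]]
}

# Under `Rscript uganda_cover.R --shapefile uganda.shp`, this vector is
# what commandArgs(trailingOnly = TRUE) would return.
example_args <- c("--shapefile", "uganda.shp")
shapefile <- flag_value(example_args, "--shapefile")
print(shapefile)
```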

The usual layout of a main function

A main function usually separates sanity-checking tasks and input/output tasks from the work of the function. We try to write the main function so that it can be called as a script, meaning from R --no-save -e "package::runit()", or as a function from the R prompt. We can write the function and then wrap it to be called as a script in these two steps, where the second one processes command-line arguments.

create_uganda_cover <- function(shapefile, landscape) {
  inputs <- read_input_files(shapefile, landscape)
  outputs <- calculate_uganda_cover(inputs$shapefile, inputs$landscape)
  write_outputs(outputs$cover)
}
# Wrapper for scripts: parse the command-line arguments, then call the function.
create_uganda_cover_main <- function() {
  do.call(create_uganda_cover, cover_parse_args(commandArgs(trailingOnly = TRUE)))
}

The first function is ready to call from the R command line. The second function is ready to call from the Bash command line on the cluster.
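The wrapper relies on do.call, which matches the names of a list to the arguments of a function. The sketch below uses a stand-in function, run_cover, and a hand-written list in place of parsed arguments; both are assumptions for illustration. It shows why the parser needs to return a list whose names match the main function's signature.

```r
# Stand-in for a main function; it only reports what it was given.
run_cover <- function(shapefile, landscape) {
  paste("cover from", shapefile, "and", landscape)
}

# Stand-in for the list an argument parser would return.
parsed <- list(landscape = "landscape.tif", shapefile = "uganda.shp")

# do.call matches list names to argument names, so order does not matter.
result <- do.call(run_cover, parsed)
print(result)
```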

A main function as a script

Sometimes we write scripts. They aren't different from main functions in any way that's important for the rampdata additions.

args <- cover_parse_args(commandArgs(trailingOnly = TRUE))
inputs <- read_input_files(args$shapefile, args$landscape)
outputs <- calculate_uganda_cover(inputs$shapefile, inputs$landscape)
write_outputs(outputs$cover)

An updated version of the main function

Most of the new work in the main function will be done while parsing arguments. The one important piece to add is a tryCatch around the work of the main function. The goal is to clear the slate for provenance and ensure we save it at the end.

create_uganda_cover <- function(shapefile, landscape) {
  # Start from an empty provenance record.
  rampdata::clear.provenance()
  tryCatch({
    inputs <- read_input_files(shapefile, landscape)
    outputs <- calculate_uganda_cover(inputs$shapefile, inputs$landscape)
    write_outputs(outputs$cover)
  }, finally = {
    # Runs whether or not the work above failed.
    rampdata::write.meta.data(
      rampdata::ramp_prov_path("package", "create_uganda_cover"))
  })
}
create_uganda_cover_main <- function() {
  do.call(create_uganda_cover, cover_parse_args(commandArgs(trailingOnly = TRUE)))
}

The provenance recording isn't nested. Clearing it will delete everything this package remembers.
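The sketch below illustrates both points with stand-ins written in base R: a finally clause runs even when the work fails, and a single shared record is wiped if a nested main function clears it. The names state, clear_records, add_record, inner_main, and outer_main are hypothetical and only mimic the behavior described here; they are not the rampdata implementation.

```r
# Stand-in for the package's single, shared provenance record.
state <- new.env()
state$records <- character(0)
clear_records <- function() state$records <- character(0)
add_record <- function(x) state$records <- c(state$records, x)

# 1. `finally` runs even when the work raises an error.
failing_main <- function() {
  clear_records()
  tryCatch({
    add_record("input read")
    stop("calculation failed")
  }, finally = {
    add_record("provenance written")
  })
}
outcome <- try(failing_main(), silent = TRUE)
after_failure <- state$records

# 2. Records are not nested: an inner clear erases the outer records.
inner_main <- function() {
  clear_records()
  add_record("inner input")
}
outer_main <- function() {
  clear_records()
  add_record("outer input")
  inner_main()
  state$records
}
after_nesting <- outer_main()
```

After the failing call, after_failure still holds both entries, so the provenance step ran despite the error. After the nested call, after_nesting holds only the inner entry, which is why a main function that clears provenance should not be called from inside another one.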

Argument parsing and initialization

When we parse arguments to the code, we need to initialize two things:

Logging - Tell it whether we are on the cluster or on a local machine. When on the cluster, warnings, errors, and fatal messages will be saved to a separate file. We can also change the logging level for the whole application or for particular packages or R scripts.
Workflow versions - This is a file with lists of versions and the file paths they affect.

The location of the data directory is initialized as needed from a configuration file in the home directory.

.cover.doc <- 'Create Uganda Coverage

Usage:
  create_uganda_coverage <workflow> [-v] [-q] [--local]

Options:
  -v       Set logging to debug level.
  -q       Set logging to warn level.
  --local  Working on a local machine, not the cluster.
'
cover_parse_args <- function(args = NULL) {
  if (is.null(args)) {
    args <- commandArgs(TRUE)
  }
  parsed <- docopt::docopt(.cover.doc, args = args, version = "1.0")

  calling_function <- deparse(sys.call(sys.parent(n = 1)))
  set_logging_from_args(
    utils::packageName(), calling_function, parsed$v, parsed$q, parsed$local
    )
  # The workflow path comes from the parsed options, not the raw argument vector.
  initialize_workflow(parsed$workflow)
}
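As a rough illustration of how the -v and -q flags could map to a logging level, here is a minimal sketch in base R. The function choose_log_level, the "info" default, and the rule that -v wins when both flags are given are all assumptions for illustration; the notebook does not show how set_logging_from_args resolves them.

```r
# Hypothetical helper: pick a logging level from the two flags.
# Assumption: -v takes precedence over -q, and "info" is the default.
choose_log_level <- function(verbose = FALSE, quiet = FALSE) {
  if (isTRUE(verbose)) {
    "debug"
  } else if (isTRUE(quiet)) {
    "warn"
  } else {
    "info"
  }
}

level <- choose_log_level(verbose = TRUE, quiet = FALSE)
print(level)
```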

File provenance along the way

Each time we read or write a file, we record its provenance. There is an incantation when we read. We can shorten these lines of code if we determine they do what we want.

pop_rp <- workflow_path("admin1pop")
file_rp <- add_path(pop_rp, file = "pop2000.csv")
dt <- data.table::fread(as.path(file_rp))
prov.input.file(as.path(file_rp), "admin1pop")

When we write, we add a step to save the current provenance of how the file was made. If we write a file called output.csv, then we should write its provenance beside it as output.toml.

pop_rp <- workflow_path("admin1pop")
file_rp <- add_path(pop_rp, file = "pop2000.csv")
# fwrite takes the data first, then the destination path.
data.table::fwrite(dt, as.path(file_rp))
prov.output.file(as.path(file_rp), "admin1pop")
write.meta.data(
  as.path(add_path(file_rp, file = "pop2000.toml")))
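The naming rule for the provenance file (output.csv gets output.toml beside it) can be derived from the data path rather than typed by hand. The helper provenance_sidecar below is hypothetical and uses only base R and the tools package; it works on plain character paths, not on the package's own path objects.

```r
# Hypothetical helper: swap a data file's extension for ".toml" so the
# provenance file sits beside the data file with a matching name.
provenance_sidecar <- function(data_path) {
  paste0(tools::file_path_sans_ext(data_path), ".toml")
}

sidecar <- provenance_sidecar(file.path("admin1pop", "pop2000.csv"))
print(sidecar)
```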