In aappling-usgs/scipiper: Support functions for ushering data through a scientific workflow

Shared Cache

Alison Appling, July 31, 2018
Jake Zwart, February 4, 2019


Terms and concept

Terms

Data file: files used as input for or generated as output by computing (e.g. streamgage data, geospatial data, site metadata, modeling results)
Indicator file: git-versioned files that promise the analyst that the data file exists remotely
Build status file: git-versioned files that represent the collective project build status (e.g. which targets have been built and/or are up-to-date)
Target: points or steps in the analysis that can be files or R objects. Many of the targets in a shared cache project will be indicator files.
YAML file: human- and machine-readable file used to lay out the project workflow, the packages needed, and target dependencies. YAML stands for "YAML Ain't Markup Language", a name that marks its purpose as data-oriented rather than document markup.

Concept

A shared cache (the "sc" in scmake) is a cloud data storage location where raw, intermediate, and/or final data products from an analysis project are contributed by and accessible to multiple analysts. Not all scipiper projects will use a shared cache.

Data files only need to be local when the analyst is computing with them. Indicator files (.ind) represent the remote shared cache among project participants. This allows analyst #1 to compute steps A and B (e.g. streamflow data pull [step A] and aggregation [step B]) and upload the output from both steps to the project's shared cache. Analyst #2 can then use the output from step B without redoing the computing performed by analyst #1.

Workflow dependencies are connected via the indicator files. Recipes (e.g. R functions) that push data files to the shared cache create indicator files, and other recipes can use these indicator files to pull down data files from the shared cache with scipiper functions such as gd_get() (if the data are not already available locally).
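
The push-and-pull pattern described above can be sketched in base R. This is a minimal illustration, not scipiper's implementation: a temporary folder stands in for the cloud cache, and mock_put(), mock_get(), and ind_to_data() are hypothetical helpers. The assumption that an indicator file stores a hash of the uploaded data is made for illustration only.

```r
# Hypothetical stand-ins for a shared cache; not scipiper's implementation.
cache_dir <- file.path(tempdir(), "shared_cache")  # plays the role of the cloud folder
work_dir  <- file.path(tempdir(), "project")       # plays the role of a local checkout
dir.create(cache_dir, showWarnings = FALSE)
dir.create(work_dir, showWarnings = FALSE)

# Map an indicator file name to its data file name by dropping the .ind suffix
ind_to_data <- function(ind_file) sub("\\.ind$", "", ind_file)

# "Push": copy the data file to the cache, then write an indicator file
# that records the hash of what was uploaded
mock_put <- function(ind_file, data_file) {
  file.copy(data_file, file.path(cache_dir, basename(data_file)), overwrite = TRUE)
  writeLines(paste("hash:", unname(tools::md5sum(data_file))), ind_file)
  invisible(ind_file)
}

# "Pull": copy the data file from the cache only if it is not already local
mock_get <- function(ind_file) {
  data_file <- ind_to_data(ind_file)
  if (!file.exists(data_file)) {
    file.copy(file.path(cache_dir, basename(data_file)), data_file)
  }
  data_file
}

# Analyst 1 builds a data file and shares it
ind_file <- file.path(work_dir, "compiled_data.rds.ind")
saveRDS(data.frame(site_no = c("a", "b"), value = 1:2), ind_to_data(ind_file))
mock_put(ind_file, ind_to_data(ind_file))

# Analyst 2 has only the indicator file (simulated by deleting the local data file)
unlink(ind_to_data(ind_file))
shared <- readRDS(mock_get(ind_file))
```

In this sketch the indicator file is the only thing analyst 2 needs under version control; the data file is fetched on demand.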

Build status files tell remake whether the indicator files and their dependencies are up-to-date. Targets that are not saved as indicator or data files (e.g. R objects) will not have a build status file and will be executed every time they are called.
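
One way to picture an up-to-date check is to compare a recorded hash against the current one. The sketch below assumes, for illustration only, that a build status record stores the hash of a dependency at build time; record_status() and is_current() are hypothetical and do not reproduce remake's or scipiper's actual logic.

```r
# Hypothetical build-status check; not remake's or scipiper's actual logic.
dep_file <- tempfile(fileext = ".rds.ind")
writeLines("hash: abc123", dep_file)

# Record the dependency's hash at build time (stands in for a build status file)
record_status <- function(dep) {
  list(dep = dep, hash = unname(tools::md5sum(dep)))
}

# A target is current if its dependency still exists and has the recorded hash
is_current <- function(status) {
  file.exists(status$dep) &&
    identical(unname(tools::md5sum(status$dep)), status$hash)
}

status <- record_status(dep_file)
before <- is_current(status)          # dependency unchanged since the build

writeLines("hash: def456", dep_file)  # a collaborator rebuilt the upstream data
after <- is_current(status)           # dependency changed, so a rebuild is due
```

A target with no such record, such as an R object in this vignette's description, has nothing to compare against and is simply rebuilt.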

Example of an .ind file dependency: the function select_sites() (R code below the YAML) takes the indicator file 1_data/out/compiled_data.rds.ind as a dependency and pulls down the data file (compiled_data.rds) from the shared cache, here by calling sc_retrieve() on that indicator file.

target_default: 1_data

sources:
  - 1_data/src/gather_data.R
  - 1_data/src/select_sites.R

targets:

  1_data:
    depends:
      - 1_data/out/compiled_data.rds.ind
      - 1_data/out/selected_sites.rds.ind

  1_data/out/compiled_data.rds.ind:
    command: gather_and_share_stream_data(
      ind_file = target_name,
      state = I("WI"),
      gd_config = 'lib/cfg/gd_config.yml')

  1_data/out/selected_sites.rds.ind:
    command: select_sites(
      ind_file = target_name,
      input_ind_file = '1_data/out/compiled_data.rds.ind',
      gd_config = 'lib/cfg/gd_config.yml')

Functions used in the above code snippet:

 
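library(scipiper)  # as_data_file(), gd_put(), sc_retrieve()
library(dplyr)     # filter(), select(), and the %>% pipe

# Pull daily water temperature data for a state, save it locally,
# and push it to the shared cache
gather_and_share_stream_data <- function(ind_file, state, gd_config){

  # parameter code 00010 = water temperature; 'dv' = daily values service
  temp <- dataRetrieval::readNWISdata(stateCd = state, parameterCd = '00010', service = 'dv')

  data_file <- as_data_file(ind_file) # convert indicator file to data file format (drops .ind suffix)
  saveRDS(temp, data_file)
  # upload the data file and write the indicator file
  gd_put(remote_ind = ind_file, local_source = data_file, config_file = gd_config)
}

# Subset the compiled data and push the result to the shared cache
select_sites <- function(ind_file, input_ind_file, gd_config){

  # get the data file named by the input indicator file, then read it
  temp <- readRDS(sc_retrieve(input_ind_file))

  temp_sites <- temp %>%
    dplyr::filter(dateTime > as.POSIXct('2012-01-01')) %>%
    dplyr::select(site_no, dateTime)

  data_file <- as_data_file(ind_file)
  saveRDS(temp_sites, data_file)
  gd_put(remote_ind = ind_file, local_source = data_file, config_file = gd_config)
}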

Guidelines

Projects using a shared cache should follow these guidelines:

Use indicator files (usually with an .ind suffix) to represent most or all of the chain of connected targets within your remake files. Each .ind file should be one of two products of a recipe (e.g. an R function call); the other product is a data file, created locally and/or in the shared cache. Only the indicator files are declared to remake, while the data files remain hidden. The indicator files thus represent the data files to remake indirectly, which makes remake compatible with the shared cache. Targets that probably don't need indicator files are configuration files and targets that are quick to produce (e.g. < 5 seconds, or quicker than downloading from the shared cache).
Always build targets using scipiper::scmake() rather than remake::make(). Though the functions are outwardly very similar, scmake() maintains an extra layer of metadata that allows multiple users to share a single project build status (e.g., "file x.rds.ind is up to date; file y.rds.ind is out of date"). In a shared-cache project, you should not even need to load the remake package directly.
Generally avoid using R objects as shared cache targets. If you must use them, usually for convenience or conciseness of the workflow plan, recognize that R objects must be built by every analyst. So if a target takes a non-trivial length of time to build, or if it depends on large volumes of data as input, that target should usually be a file rather than an R object.
Git-commit all indicator files and build status files (with the occasional exception of indicator files within a task plan; those require additional thought). Git-ignore all data files unless they are small enough to store in git/GitHub, such as small, text-based, and typically hand-curated data files (e.g. a data file that matches NHD lake IDs to collaborator-provided data files). Files that are necessary for basic pipeline functions should also be committed (e.g. gd_config.yml).
To force a rebuild, either use the force=TRUE argument to scmake() or use scdel() to delete indicator files. There's seldom any benefit to deleting data files (by any method); usually deleting the indicator files is plenty. When deleting indicator files, force=TRUE or scdel() are preferable to directly deleting the .ind files because if only the .ind files are deleted, the scipiper database may fail to update properly when the .ind files are rebuilt.

How many targets should I use per data file?

3 target method:

create a data file
push a data file and create an indicator file
retrieve the data file

Advantages

Data and indicator file creation is verbose and clear
Helps with fault tolerance by splitting tasks into multiple targets
This method may be necessary when using standard functions for data creation or retrieval

targets:

  1_data/tmp/nitrate_data_pull.rds.ind:
    command: gather_stream_data(
      file = target_name,
      siteNumber = I('01118500'),
      parameterCd = I('00630'),
      startDate = I('1980-01-01'),
      endDate = I('2016-01-01'))

  1_data/out/nitrate_data_pull.rds.ind:
    command: gd_put(
      remote_ind = target_name,
      local_source = '1_data/tmp/nitrate_data_pull.rds.ind',
      config_file = 'lib/cfg/gd_config.yml')

  1_data/out/nitrate_data_pull.rds:
    command: gd_get('1_data/out/nitrate_data_pull.rds.ind', config_file = 'lib/cfg/gd_config.yml')
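
The fault-tolerance advantage can be illustrated with a small base R sketch. The functions create_data() and push_data() are hypothetical stand-ins for the first two targets, and the failed upload is simulated; nothing here calls scipiper or Google Drive.

```r
# Hypothetical illustration; these function names are not part of scipiper.
n_pulls <- 0
data_file <- tempfile(fileext = ".rds")

# Step 1: expensive data creation, run only if the data file is missing
create_data <- function(path) {
  if (!file.exists(path)) {
    n_pulls <<- n_pulls + 1
    saveRDS(data.frame(x = 1:3), path)
  }
  path
}

# Step 2: a push that fails on the first attempt (e.g. a dropped connection)
attempts <- 0
push_data <- function(path) {
  attempts <<- attempts + 1
  if (attempts == 1) stop("upload failed")
  paste0(path, ".ind")  # name of the indicator file a real push would write
}

# First build: creation succeeds, push fails
create_data(data_file)
first <- tryCatch(push_data(data_file), error = function(e) NA_character_)

# Second build: only the push is repeated; the existing data file is reused
create_data(data_file)
second <- push_data(data_file)
```

Because creation and upload are separate steps, the failed upload costs only a retry of the upload, not a second data pull.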

2 target method:

push a data file and create an indicator file
retrieve the data file

Advantages

More concise code and fewer indicator and build status files, thereby reducing the size of the repository

targets:

  1_data/out/compiled_data.rds.ind:
    command: gather_and_push_stream_data(
      ind_file = target_name,
      siteNumber = I('01118500'),
      parameterCd = I('00630'),
      startDate = I('1980-01-01'),
      endDate = I('2016-01-01'), 
      gd_config = 'lib/cfg/gd_config.yml')

  1_data/out/compiled_data.rds:
    command: gd_get('1_data/out/compiled_data.rds.ind', config_file = 'lib/cfg/gd_config.yml')

Pros and Cons

Advantages of a shared cache:

Not every analyst needs to build every target, saving on total processing time.
Targets that can only be built on specific operating systems (e.g., Mac) or in specific computing environments (e.g., a cluster) can still be accessible to all analysts for further analysis.
Intermediate and final products can be immediately visible to anyone who has access to the shared cache, whether they are contributing to the analysis or simply inspecting/using the output.

Disadvantages of a shared cache (as currently implemented):

In a fast-paced collaborative development environment (e.g., a 'sprint'), it is challenging to keep the shared cache (the data) synchronized with the git repository (the metadata: indicator and build status files). Asynchrony is not a deal-breaker but does lead to more rebuilding than would be required for a slower-paced project. See the Common Pitfalls and Solutions vignette.
Though we've done much to ensure this doesn't happen, it's conceivable that metadata will become corrupt relative to the data. Some monitoring and very occasional full rebuilding is recommended when practical.
Old files no longer referenced by the code can accumulate on the shared cache unless manually deleted. Though these will not interfere with ongoing analysis, they can take up storage space unnecessarily.

aappling-usgs/scipiper documentation built on Aug. 1, 2020, 3:11 p.m.
