Shared Cache

Alison Appling, July 31, 2018
Jake Zwart, February 4, 2019


output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Shared cache}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}


knitr::opts_chunk$set(echo = TRUE, cache=FALSE, collapse=TRUE)

Terms and concept

Terms

Concept

A shared cache (the "sc" in scmake) is a cloud data storage location where multiple analysts contribute and access the raw, intermediate, and/or final data products of an analysis project. Not all scipiper projects will use a shared cache.

A data file only needs to be local while an analyst is computing with it. Indicator files (.ind) represent the contents of the remote shared cache to all project participants. For example, analyst #1 can compute steps A and B (e.g., a streamflow data pull [step A] and aggregation [step B]) and upload the output of both steps to the project's shared cache; analyst #2 can then use the output from step B without repeating the computation that analyst #1 performed.

Workflow dependencies are connected via the indicator files. Recipes (e.g., R functions) that push data files to the shared cache create indicator files, and other recipes can use these indicator files to pull data files down from the shared cache with scipiper functions such as gd_get() (if the data are not already available locally).
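The push and pull pattern described above can be sketched with base R alone. This is a minimal illustration, not scipiper's implementation: a local directory stands in for the cloud cache, mock_put() and mock_get() are hypothetical helpers (they are not gd_put() or gd_get()), and the one-line indicator format is an assumption made for the example.

```r
# A local folder stands in for the remote shared cache (assumption for illustration).
cache_dir <- file.path(tempdir(), "mock_cache")
work_dir <- file.path(tempdir(), "mock_project")
dir.create(cache_dir, showWarnings = FALSE)
dir.create(work_dir, showWarnings = FALSE)

# Hypothetical helper: copy a data file to the cache and write an indicator
# file that records the data file's MD5 hash.
mock_put <- function(ind_file, data_file) {
  file.copy(data_file, file.path(cache_dir, basename(data_file)), overwrite = TRUE)
  writeLines(paste0("hash: ", unname(tools::md5sum(data_file))), ind_file)
  invisible(ind_file)
}

# Hypothetical helper: pull the data file from the cache only if it is
# not already available locally.
mock_get <- function(ind_file) {
  data_file <- sub("\\.ind$", "", ind_file)
  if (!file.exists(data_file)) {
    file.copy(file.path(cache_dir, basename(data_file)), data_file)
  }
  data_file
}

# Analyst 1 builds the data file and pushes it.
data_file <- file.path(work_dir, "compiled_data.rds")
ind_file <- paste0(data_file, ".ind")
saveRDS(data.frame(site_no = c("A", "B"), value = c(1.5, 2.5)), data_file)
mock_put(ind_file, data_file)

# Analyst 2 has only the indicator file; the data file is fetched on demand.
file.remove(data_file)
pulled <- readRDS(mock_get(ind_file))
```

In this sketch the indicator file is the only thing analyst 2 needs to hold locally until the data are actually read, which is the role the .ind files play in the workflow described above.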

Build files tell remake whether the indicator files and their dependencies are up to date. Targets that are not saved as indicator or data files (e.g., R objects) do not have a build status file and are executed every time they are called.
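One way to picture an up-to-date check is as a hash comparison. The sketch below is a simplification under stated assumptions: is_current() is a hypothetical helper, and the build status files that remake and scipiper actually maintain may record more than a single hash (for example, information about dependencies).

```r
# Hypothetical check: a target is treated as current when the file's present
# hash equals the hash recorded at the last build.
is_current <- function(target_file, recorded_hash) {
  file.exists(target_file) &&
    identical(unname(tools::md5sum(target_file)), recorded_hash)
}

target_file <- tempfile(fileext = ".ind")
writeLines("hash: abc123", target_file)
recorded_hash <- unname(tools::md5sum(target_file))  # stored at build time

current_before <- is_current(target_file, recorded_hash)

# Changing the indicator file (for example, after new data are pushed)
# makes the recorded hash stale, so the target would need rebuilding.
writeLines("hash: def456", target_file)
current_after <- is_current(target_file, recorded_hash)
```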

The following example shows an .ind file dependency. The function select_sites() (defined in the code snippet after the YAML) takes the indicator file 1_data/out/compiled_data.rds.ind as a dependency and uses it to pull the data file (compiled_data.rds) down from the shared cache.

target_default: 1_data

sources:
  - 1_data/src/gather_data.R
  - 1_data/src/select_sites.R

targets:

  1_data:
    depends:
      - 1_data/out/compiled_data.rds.ind
      - 1_data/out/selected_sites.rds.ind

  1_data/out/compiled_data.rds.ind:
    command: gather_and_share_stream_data(
      ind_file = target_name,
      state = I("WI"),
      gd_config = 'lib/cfg/gd_config.yml')

  1_data/out/selected_sites.rds.ind:
    command: select_sites(
      ind_file = target_name,
      input_ind_file = '1_data/out/compiled_data.rds.ind',
      gd_config = 'lib/cfg/gd_config.yml')

Functions used in the above code snippet:

 
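# Pull daily water temperature data for one state, save it locally,
# and push it to the shared cache
gather_and_share_stream_data <- function(ind_file, state, gd_config){

  # parameter code '00010' is water temperature; service 'dv' requests daily values
  temp <- dataRetrieval::readNWISdata(stateCd = state, parameterCd = '00010', service = 'dv')

  data_file <- as_data_file(ind_file) # convert indicator file to data file format (drops .ind suffix)
  saveRDS(temp, data_file)
  # upload the data file to the shared cache and write the indicator file
  gd_put(remote_ind = ind_file, local_source = data_file, config_file = gd_config)
}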

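# Subset the compiled data and push the result to the shared cache
select_sites <- function(ind_file, input_ind_file, gd_config){

  # retrieve the data file behind the input indicator file, then read it
  temp <- readRDS(sc_retrieve(input_ind_file, 'remake.yml')) # default remake_file is getters.yml as of 10/10/20

  # keep observations after 2012-01-01 and only the site and date columns
  # (%>% requires dplyr or magrittr to be loaded)
  temp_sites <- temp %>%
    dplyr::filter(
      dateTime > as.POSIXct('2012-01-01')) %>%
    dplyr::select(site_no, dateTime)

  data_file <- as_data_file(ind_file)
  saveRDS(temp_sites, data_file)
  # upload the subset and write its indicator file
  gd_put(remote_ind = ind_file, local_source = data_file, config_file = gd_config)
}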
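Both functions rely on converting an indicator file name to its data file name by dropping the .ind suffix. A minimal base R sketch of that idea follows; strip_ind_suffix() is a hypothetical stand-in written for illustration, and scipiper's as_data_file() may accept other arguments or handle other extensions.

```r
# Hypothetical stand-in for the suffix-dropping step: remove a trailing
# ".ind" and fail early if the input is not an indicator file name.
strip_ind_suffix <- function(ind_file) {
  stopifnot(all(grepl("\\.ind$", ind_file)))
  sub("\\.ind$", "", ind_file)
}

data_path <- strip_ind_suffix("1_data/out/compiled_data.rds.ind")
# data_path is "1_data/out/compiled_data.rds"
```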

Guidelines

Projects using a shared cache should follow these guidelines:

How many targets should I use per data file?

3 target method:
  1. create a data file
  2. push a data file and create an indicator file
  3. retrieve the data file
Advantages
targets:

  1_data/tmp/nitrate_data_pull.rds.ind:
    command: gather_stream_data(
      file = target_name,
      siteNumber = I('01118500'),
      parameterCd = I('00630'),
      startDate = I('1980-01-01'),
      endDate = I('2016-01-01'))

  1_data/out/nitrate_data_pull.rds.ind:
    command: gd_put(
      remote_ind = target_name,
      local_source = '1_data/tmp/nitrate_data_pull.rds.ind',
      config_file = 'lib/cfg/gd_config.yml')

  1_data/out/nitrate_data_pull.rds:
    command: gd_get('1_data/out/nitrate_data_pull.rds.ind', config_file = 'lib/cfg/gd_config.yml')

2 target method:
  1. push a data file and create an indicator file
  2. retrieve the data file
Advantages
targets:

  1_data/out/compiled_data.rds.ind:
    command: gather_and_push_stream_data(
      ind_file = target_name,
      siteNumber = I('01118500'),
      parameterCd = I('00630'),
      startDate = I('1980-01-01'),
      endDate = I('2016-01-01'), 
      gd_config = 'lib/cfg/gd_config.yml')

  1_data/out/compiled_data.rds:
    command: gd_get('1_data/out/compiled_data.rds.ind', config_file = 'lib/cfg/gd_config.yml')
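The difference between the two methods comes down to whether the recipe that creates the data file also pushes it. The following base R sketch contrasts the two shapes under stated assumptions: mock_push(), create_data(), and create_and_push() are hypothetical names, and mock_push() only writes an indicator file locally rather than uploading anything.

```r
# Hypothetical stand-in for a push function such as gd_put(): it only writes
# an indicator file locally and uploads nothing.
mock_push <- function(ind_file, data_file) {
  writeLines(paste0("hash: ", unname(tools::md5sum(data_file))), ind_file)
  invisible(ind_file)
}

# 3-target style, first target: the recipe only creates the data file.
create_data <- function(file, values) {
  saveRDS(values, file)
  invisible(file)
}

# 2-target style, first target: the recipe creates the data file and pushes it.
create_and_push <- function(ind_file, values, push = mock_push) {
  data_file <- sub("\\.ind$", "", ind_file)
  saveRDS(values, data_file)
  push(ind_file, data_file)
}

out_dir <- file.path(tempdir(), "targets_demo")
dir.create(out_dir, showWarnings = FALSE)

# Three targets: create the file, then push it as a separate step.
tmp_file <- file.path(out_dir, "three_target.rds")
create_data(tmp_file, 1:3)
mock_push(paste0(tmp_file, ".ind"), tmp_file)

# Two targets: a single call covers creation and push.
create_and_push(file.path(out_dir, "two_target.rds.ind"), 1:3)

ind_files <- list.files(out_dir, pattern = "\\.ind$")
```

Either way, one indicator file per data file results; the methods differ in how many remake targets are needed to get there.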

Pros and Cons

Advantages of a shared cache:

Disadvantages of a shared cache (as currently implemented):



USGS-R/scipiper documentation built on May 25, 2023, 8:47 a.m.