Alison Appling, July 31, 2018
Jake Zwart, February 4, 2019
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Shared cache}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
knitr::opts_chunk$set(echo = TRUE, cache=FALSE, collapse=TRUE)
A shared cache (the "sc" in scmake) is a cloud data storage location where raw, intermediate, and/or final data products from an analysis project are contributed by, and accessible to, multiple analysts. Not all scipiper projects will use a shared cache.
Data files only need to be local when the analyst is computing with the data file. Otherwise, indicator files (.ind) represent the data files held in the remote shared cache among project participants. This allows analyst #1 to compute steps A and B (e.g. a streamflow data pull [step A] and aggregation [step B]) and upload the output from both steps to the project's shared cache; analyst #2 can then use the output from step B without redoing the computing performed by analyst #1.
Workflow dependencies are connected via the .ind files, and scipiper functions such as gd_get() pull down a data file when it is not already available locally.
Example of an .ind file dependency: the function select_sites() (defined in the code snippet further below) takes 1_data/out/compiled_data.rds.ind as a dependency and uses a scipiper retrieval function (gd_get() or, as in the snippet, sc_retrieve()) to pull down the data file (compiled_data.rds) from the shared cache.
target_default: 1_data

sources:
  - 1_data/src/gather_data.R
  - 1_data/src/select_sites.R

targets:

  1_data:
    depends:
      - 1_data/out/compiled_data.rds.ind
      - 1_data/out/selected_sites.rds.ind

  1_data/out/compiled_data.rds.ind:
    command: gather_and_share_stream_data(
      ind_file = target_name,
      state = I("WI"),
      gd_config = 'lib/cfg/gd_config.yml')

  1_data/out/selected_sites.rds.ind:
    command: select_sites(
      ind_file = target_name,
      input_ind_file = '1_data/out/compiled_data.rds.ind',
      gd_config = 'lib/cfg/gd_config.yml')
Functions used in the above code snippet:
# These functions assume dataRetrieval, dplyr (for %>%), and scipiper are loaded,
# e.g. via the packages field of remake.yml.

gather_and_share_stream_data <- function(ind_file, state, gd_config) {
  # pull daily values ('dv') of water temperature (parameter code 00010)
  temp <- readNWISdata(stateCd = state, parameterCd = '00010', service = 'dv')
  # convert indicator file to data file format (drops .ind suffix)
  data_file <- as_data_file(ind_file)
  saveRDS(temp, data_file)
  # upload the data file to the shared cache and write the indicator file
  gd_put(remote_ind = ind_file, local_source = data_file, config_file = gd_config)
}

select_sites <- function(ind_file, input_ind_file, gd_config) {
  # retrieve the upstream data file via its indicator file, then read it
  temp <- readRDS(sc_retrieve(input_ind_file))
  temp_sites <- temp %>%
    dplyr::filter(dateTime > as.POSIXct('2012-01-01')) %>%
    dplyr::select(site_no, dateTime)
  data_file <- as_data_file(ind_file)
  saveRDS(temp_sites, data_file)
  gd_put(remote_ind = ind_file, local_source = data_file, config_file = gd_config)
}
Use indicator files (.ind suffix) to represent most or all of the chain of connected targets. Each .ind file should be one of two products of a recipe (e.g. R function call), where the other product is the creation of a data file, either locally and/or in the shared cache. The .ind files are visible to remake, while the data files remain hidden. This allows remake to track the workflow through the .ind files alone, thereby enabling compatibility between remake and the shared cache.
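As a rough illustration of a recipe that produces both a data file and its indicator file, here is a minimal base-R sketch. The helpers drop_ind_suffix() and write_data_and_indicator() are hypothetical stand-ins so the example runs without scipiper or a cloud connection; in a real project as_data_file() and gd_put() play these roles. The "hash:" line is only an assumption about what an indicator might hold, and the actual scipiper .ind format may differ.

```r
# Hypothetical helper: derive the data file name by dropping the .ind suffix.
# (scipiper provides as_data_file() for this; this stand-in keeps the sketch
# runnable without scipiper installed.)
drop_ind_suffix <- function(ind_file) {
  sub("\\.ind$", "", ind_file)
}

# Hypothetical recipe: writes the data file AND its indicator file.
write_data_and_indicator <- function(ind_file, data) {
  data_file <- drop_ind_suffix(ind_file)
  saveRDS(data, data_file)
  # Assumption: the indicator records a hash of the data file
  hash <- unname(tools::md5sum(data_file))
  writeLines(paste0("hash: ", hash), ind_file)
  invisible(ind_file)
}

# Usage: both products land side by side in a temporary directory
out_dir <- tempfile("shared_cache_sketch_")
dir.create(out_dir)
ind_file <- file.path(out_dir, "compiled_data.rds.ind")
write_data_and_indicator(ind_file, data.frame(site_no = "01118500", value = 1.2))
list.files(out_dir)
```

Only the small text indicator would be tracked by the build tool and by git; the data file itself can stay local or live in the shared cache.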
Always build with scipiper::scmake() rather than remake::make(). Though the functions are outwardly very similar, scmake() maintains an extra layer of metadata that allows multiple users to share a single project. Avoid calling the remake package directly.
Generally avoid using R objects as targets in a project that uses a shared cache.
git commit all .ind files, and git ignore all data files unless they are small enough to store in git/GitHub, such as small, text-based, and typically hand-curated data files (e.g. a data file that matches NHD lake IDs to collaborator-provided data files). Data files that are necessary for basic pipeline functions should also be committed (e.g. gd_config.yml).
To force a rebuild, either use the force=TRUE argument to scmake() or use scdel() to delete the target before rebuilding. Using force=TRUE or scdel() is preferable to directly deleting the .ind files because, if only the .ind files are deleted, the scipiper database may fail to update properly when the .ind files are rebuilt.
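The two ways of forcing a rebuild could be wrapped as in the sketch below. rebuild_target() is a hypothetical convenience function, not part of scipiper, and the argument names passed to scmake() and scdel() (a target name first, then remake_file) reflect an assumed reading of the scipiper API, so check ?scmake and ?scdel before relying on them. Defining the function does not require scipiper; calling it does, from within a scipiper project.

```r
# Hypothetical wrapper around the two rebuild options.
# Assumes scipiper is installed and the working directory is a scipiper project.
rebuild_target <- function(target, remake_file = "remake.yml", use_scdel = FALSE) {
  if (use_scdel) {
    # Option 2: delete the target through scipiper, then build as usual
    scipiper::scdel(target, remake_file = remake_file)
    scipiper::scmake(target, remake_file = remake_file)
  } else {
    # Option 1: rebuild in place with force = TRUE
    scipiper::scmake(target, remake_file = remake_file, force = TRUE)
  }
}

# Example call (not run here because it needs a real project):
# rebuild_target("1_data/out/compiled_data.rds.ind")
```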
targets:

  1_data/tmp/nitrate_data_pull.rds.ind:
    command: gather_stream_data(
      file = target_name,
      siteNumber = I('01118500'),
      parameterCd = I('00630'),
      startDate = I('1980-01-01'),
      endDate = I('2016-01-01'))

  1_data/out/nitrate_data_pull.rds.ind:
    command: gd_put(
      remote_ind = target_name,
      local_source = '1_data/tmp/nitrate_data_pull.rds.ind',
      config_file = 'lib/cfg/gd_config.yml')

  1_data/out/nitrate_data_pull.rds:
    command: gd_get(
      '1_data/out/nitrate_data_pull.rds.ind',
      config_file = 'lib/cfg/gd_config.yml')
targets:

  1_data/out/compiled_data.rds.ind:
    command: gather_and_push_stream_data(
      ind_file = target_name,
      siteNumber = I('01118500'),
      parameterCd = I('00630'),
      startDate = I('1980-01-01'),
      endDate = I('2016-01-01'),
      gd_config = 'lib/cfg/gd_config.yml')

  1_data/out/compiled_data.rds:
    command: gd_get(
      '1_data/out/compiled_data.rds.ind',
      config_file = 'lib/cfg/gd_config.yml')