Alison Appling, July 31, 2018
Jake Zwart, February 4, 2019
A shared cache (the sc of scmake) is a cloud data storage location where raw, intermediate, and/or final data products from an analysis project are contributed to and accessible by multiple analysts. Not all scipiper projects will use a shared cache.
Data files only need to be local when the analyst is computing with the data file. Indicator files (.ind) represent the remote shared cache among project participants. This allows analyst #1 to compute steps A and B (e.g. a streamflow data pull [step A] and aggregation [step B]) and upload the output from both steps to the project's shared cache; analyst #2 can then use the output from step B without redoing the computing performed by analyst #1.
Workflow dependencies are connected via the .ind files; scipiper functions such as gd_get() then pull down the corresponding data file (if data is not already available locally).
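The retrieve-only-when-needed idea can be sketched in base R. This is a minimal illustration and not scipiper's implementation: fetch_if_missing() is a hypothetical helper, and a local folder stands in for the shared cache so that the example runs without Google Drive credentials.

```r
# Hypothetical helper (not part of scipiper): copy a data file from a stand-in
# "shared cache" folder only when it is not already available locally.
fetch_if_missing <- function(ind_file, cache_dir) {
  data_file <- sub("\\.ind$", "", ind_file)  # data file name = indicator name minus .ind
  if (file.exists(data_file)) {
    return(invisible("local"))               # already local, nothing to retrieve
  }
  dir.create(dirname(data_file), recursive = TRUE, showWarnings = FALSE)
  copied <- file.copy(file.path(cache_dir, basename(data_file)), data_file)
  if (!copied) stop("data file not found in the stand-in cache: ", basename(data_file))
  invisible("downloaded")
}

# Set up a stand-in cache (an assumption for this demo) holding one data file.
cache_dir <- file.path(tempdir(), "fake_shared_cache")
dir.create(cache_dir, showWarnings = FALSE)
saveRDS(data.frame(site_no = "01118500"), file.path(cache_dir, "compiled_data.rds"))

# Start from a clean stand-in project folder so the first call has to copy.
unlink(file.path(tempdir(), "project"), recursive = TRUE)
ind_file <- file.path(tempdir(), "project", "1_data", "out", "compiled_data.rds.ind")

first_call  <- fetch_if_missing(ind_file, cache_dir)  # copies the file from the cache
second_call <- fetch_if_missing(ind_file, cache_dir)  # finds the local copy instead
```

The second call returns without touching the cache, which is the behavior the parenthetical above describes: retrieval happens only when the data file is missing locally.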
Example of an .ind file dependency, in which the function select_sites() (see the code snippets below) uses gd_get() and 1_data/out/compiled_data.rds.ind as a dependency to pull down the data file (compiled_data.rds) from the shared cache.
target_default: 1_data

sources:
  - 1_data/src/gather_data.R
  - 1_data/src/select_sites.R

targets:

  1_data:
    depends:
      - 1_data/out/compiled_data.rds.ind
      - 1_data/out/selected_sites.rds.ind

  1_data/out/compiled_data.rds.ind:
    command: gather_and_share_stream_data(
      ind_file = target_name,
      state = I("WI"),
      gd_config = 'lib/cfg/gd_config.yml')

  1_data/out/selected_sites.rds.ind:
    command: select_sites(
      ind_file = target_name,
      input_ind_file = '1_data/out/compiled_data.rds.ind',
      gd_config = 'lib/cfg/gd_config.yml')
Functions used in the above code snippet:
gather_and_share_stream_data <- function(ind_file, state, gd_config) {
  # pull daily-value water temperature data (parameter code 00010) for the state
  temp <- readNWISdata(stateCd = state, parameterCd = '00010', service = 'dv')
  data_file <- as_data_file(ind_file) # convert indicator file to data file format (drops .ind suffix)
  saveRDS(temp, data_file)
  # upload the local data file to the shared cache
  gd_put(remote_ind = ind_file, local_source = data_file, config_file = gd_config)
}

select_sites <- function(ind_file, input_ind_file, gd_config) {
  # retrieve the data file behind the input indicator file, then read it
  temp <- readRDS(sc_retrieve(input_ind_file))
  # keep only observations after 2012-01-01 and the two columns of interest
  temp_sites <- temp %>%
    dplyr::filter(dateTime > as.POSIXct('2012-01-01')) %>%
    dplyr::select(site_no, dateTime)
  data_file <- as_data_file(ind_file)
  saveRDS(temp_sites, data_file)
  gd_put(remote_ind = ind_file, local_source = data_file, config_file = gd_config)
}
Use indicator files (.ind suffix) to represent most or all of the chain of connected targets. Each .ind file should be one of two products of a recipe (e.g. R function call), where the other product is the creation of a data file, either locally and/or in the shared cache. The .ind files are what remake tracks, while the data files remain hidden. This allows the remake dependency chain to be evaluated without every data file being present locally, thereby enabling compatibility between remake and the shared cache.
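The two-products pattern can be sketched in base R. The helper names below are hypothetical, and the single hash line written to the indicator file is an assumption made for illustration, not a description of what scipiper actually stores in its .ind files.

```r
# Hypothetical recipe (not part of scipiper): produces a data file AND a small
# indicator file that records the data file's MD5 hash.
write_with_indicator <- function(ind_file, data) {
  data_file <- sub("\\.ind$", "", ind_file)  # drop the .ind suffix
  dir.create(dirname(data_file), recursive = TRUE, showWarnings = FALSE)
  saveRDS(data, data_file)                   # product 1: the data file
  hash <- unname(tools::md5sum(data_file))
  writeLines(paste0("hash: ", hash), ind_file)  # product 2: the indicator file
  invisible(ind_file)
}

# Hypothetical check: is there a local data file that matches the indicator?
is_local_copy_current <- function(ind_file) {
  data_file <- sub("\\.ind$", "", ind_file)
  if (!file.exists(ind_file) || !file.exists(data_file)) return(FALSE)
  recorded <- sub("^hash: ", "", readLines(ind_file, n = 1))
  identical(recorded, unname(tools::md5sum(data_file)))
}

ind_file <- file.path(tempdir(), "sc_demo", "compiled_data.rds.ind")
write_with_indicator(ind_file, data.frame(site_no = c("A", "B"), value = c(1.5, 2.5)))
is_local_copy_current(ind_file)            # TRUE: both products exist and agree

file.remove(sub("\\.ind$", "", ind_file))  # simulate a data file that is not local
is_local_copy_current(ind_file)            # FALSE: only the indicator file remains
```

Because the indicator file is tiny and text-based, it can stay in the project even when the data file it describes is absent from a given analyst's machine.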
Always build with scipiper::scmake() rather than remake::make(). Though the functions are outwardly very similar, scmake() maintains an extra layer of metadata that allows multiple users to share a single project, which is not possible when using the remake package directly.
Generally avoid using R objects as targets in a shared cache project. git commit all .ind files, and git ignore all data files unless they are small enough to store in git/GitHub, such as small, text-based, and typically hand-curated data files (e.g. a data file that matches NHD lake IDs to collaborator-provided data files). Files that are necessary for basic pipeline functions should also be committed (e.g. gd_config.yml).
To force a rebuild, either use the force=TRUE argument to scmake() or use scdel() to delete the target before rebuilding it. Using force=TRUE or scdel() is preferable to directly deleting the .ind files, because if only the .ind files are deleted, the scipiper database may fail to update properly when the .ind files are rebuilt.
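The two rebuild options might look like the sketch below. It assumes that scipiper is installed, that the working directory is a project root containing remake.yml, and that the first argument of scmake() and scdel() is the target name; the target itself is borrowed from the earlier example. The guard only keeps the snippet from erroring outside such a project.

```r
# Target borrowed from the earlier example; substitute your own target name.
target <- "1_data/out/selected_sites.rds.ind"

# Only attempt the rebuild inside a scipiper project (assumed layout: remake.yml
# in the working directory).
can_run <- requireNamespace("scipiper", quietly = TRUE) && file.exists("remake.yml")

if (can_run) {
  # Option 1: rebuild the target even if it is considered up to date.
  scipiper::scmake(target, force = TRUE)

  # Option 2: delete the target through scipiper, then build it again.
  scipiper::scdel(target)
  scipiper::scmake(target)
} else {
  message("Skipping rebuild: scipiper or remake.yml not available here.")
}
```

In practice only one of the two options is needed; both are shown so they can be compared side by side.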
targets:

  1_data/tmp/nitrate_data_pull.rds.ind:
    command: gather_stream_data(
      file = target_name,
      siteNumber = I('01118500'),
      parameterCd = I('00630'),
      startDate = I('1980-01-01'),
      endDate = I('2016-01-01'))

  1_data/out/nitrate_data_pull.rds.ind:
    command: gd_put(
      remote_ind = target_name,
      local_source = '1_data/tmp/nitrate_data_pull.rds.ind',
      config_file = 'lib/cfg/gd_config.yml')

  1_data/out/nitrate_data_pull.rds:
    command: gd_get(
      '1_data/out/nitrate_data_pull.rds.ind',
      config_file = 'lib/cfg/gd_config.yml')
targets:

  1_data/out/compiled_data.rds.ind:
    command: gather_and_push_stream_data(
      ind_file = target_name,
      siteNumber = I('01118500'),
      parameterCd = I('00630'),
      startDate = I('1980-01-01'),
      endDate = I('2016-01-01'),
      gd_config = 'lib/cfg/gd_config.yml')

  1_data/out/compiled_data.rds:
    command: gd_get(
      '1_data/out/compiled_data.rds.ind',
      config_file = 'lib/cfg/gd_config.yml')