The goal of dvthis is to provide utility functions for DVC pipelines using R scripts.
An additional goal is to document the usual workflows they enable, and provide
a template for projects using DVC and R.
You can install the current development version of dvthis with
remotes::install_github("jcpsantiago/dvthis")
There is no version available on CRAN yet.
You can use DVC by itself by running dvc init within a git repo dir (read their docs here) and then use the utility functions to make your life easier.
Or, you can use dvthis to set up the scaffolding for you.
A project created from the dvthis template will have the following folder structure, with DVC initialised for you (DVC must be installed on your system):
.
├── data # all data that's not a model, metrics or plots goes here
│ ├── intermediate # outputs of each stage to be used in future stages
│ └── raw # original data; should never be overwritten; saved in remote storage with DVC
├── metrics # metrics of interest in JSON; DVC can track these over time
├── models # final output of your pipeline, in case it's a model
├── plots # any plots produced, including CSVs with data for plots (see DVC docs)
├── queries # .sql files or other format so that queries are also tracked
├── R # additional R functions needed for this project and not in a pkg yet
├── reports # more complete reports or model cards
└── stages # scripts for each stage; doesn't need to be only in R!
This structure assumes a DVC pipeline for Machine Learning made out of multiple stages/*.R scripts, which will read queries/*.sql and data/raw/*.csv, pass results between stages as data/intermediate/*.qs, and produce models/*, some metrics/*.json and plots/*.png.
You are free, of course, to use your own naming conventions, stages, etc.
E.g. if your data doesn't come from a database, just delete the queries dir and place your data in data/raw instead. Bam!
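If you would rather build the layout by hand than use the template, the folder structure above can be sketched in base R. This is only an illustration: create_scaffold() is a hypothetical helper, not a dvthis function, and it does not run dvc init for you.

```r
# Hypothetical helper (not part of dvthis): creates the folders listed above.
create_scaffold <- function(root = ".") {
  dirs <- c(
    "data/intermediate", "data/raw", "metrics", "models",
    "plots", "queries", "R", "reports", "stages"
  )
  paths <- file.path(root, dirs)
  for (p in paths) {
    # recursive = TRUE also creates the parent data/ folder
    dir.create(p, recursive = TRUE, showWarnings = FALSE)
  }
  invisible(paths)
}

# Try it in a throwaway directory so nothing in your project is touched
project_root <- file.path(tempdir(), "my_dvc_project")
create_scaffold(project_root)
list.files(project_root)
```

Dropping an entry from dirs (for example "queries") is all it takes to adapt the sketch to your own conventions.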
Since this is an R package, the examples focus on R scripts, but DVC does not care about languages. I have mixed Clojure and R, for example, without ill effects.
Stages should be small and focused, just like you would write your normal R functions.
You can add a new R stage using the add_r_stage function.
For example you could have stages (separate, independent scripts) for:
This way it's possible to experiment and make changes to a smaller amount of code each time. It also enables an interactive workflow, e.g. if you want to experiment with a new transformation:
1. run the read_intermediate_data() lines to load cached data the stage depends on
2. try out the new transformation with mutate()
3. run dvc repro in the terminal to run the pipeline starting at the modified feature transformation script all the way downstream

A stage script could look something like this:
#!/usr/bin/env Rscript
# you may not need command line arguments, but they're helpful in parameterised pipelines
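# command line arguments arrive as character strings, so convert before using it as a count
n_of_dragons <- as.integer(commandArgs(trailingOnly = TRUE)[1])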
# assigning it to this_stage by convention will allow stage_footer() to be called without args
this_stage <- dvthis::stage_header("Choosing dragons")
dvthis::log_stage_step("Loading dragon data")
dragons_raw <- dvthis::read_raw_data("dragons.csv", readr::read_csv)
dvthis::log_stage_step("Loading clean kingdom data")
kingdoms <- dvthis::read_intermediate_result("kingdoms")
dvthis::log_stage_step("Keeping only {n_of_dragons} dragons")
dragons_clean <- head(dragons_raw, n_of_dragons)
dragons_and_kingdoms <- dplyr::inner_join(dragons_clean, kingdoms)
# you don't have to save every single intermediate result, but here I want to
# be extensive for documentation's sake
dvthis::log_stage_step("Saving intermediate dragons_clean")
dvthis::save_intermediate_result(dragons_clean)
dvthis::log_stage_step("Saving intermediate dragons_and_kingdoms")
dvthis::save_intermediate_result(dragons_and_kingdoms)
dvthis::stage_footer()
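To make the caching idea behind the save and read helpers more concrete, here is a minimal base R sketch of an intermediate-result round trip. It is a stand-in under stated assumptions, not how dvthis implements it: save_result() and read_result() are hypothetical names, the data is made up, and the sketch writes .rds files with saveRDS() instead of the .qs files mentioned earlier so that it runs without extra packages.

```r
# Hypothetical stand-ins for the save/read helpers (not the dvthis API).
intermediate_dir <- file.path(tempdir(), "data", "intermediate")
dir.create(intermediate_dir, recursive = TRUE, showWarnings = FALSE)

save_result <- function(x, name = deparse(substitute(x))) {
  # By default the file is named after the object that was passed in
  path <- file.path(intermediate_dir, paste0(name, ".rds"))
  saveRDS(x, path)
  invisible(path)
}

read_result <- function(name) {
  readRDS(file.path(intermediate_dir, paste0(name, ".rds")))
}

# One stage saves its output (made-up example data) ...
dragons_clean <- data.frame(name = c("Smaug", "Toothless"), age = c(171, 20))
save_result(dragons_clean)

# ... and a downstream stage, or an interactive session, loads it again
dragons_cached <- read_result("dragons_clean")
all.equal(dragons_clean, dragons_cached)
```

Naming the file after the object keeps the save and read calls short, which is the same convenience the stage script above relies on when it saves dragons_clean and later reads "kingdoms" by name.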
dvthis also packs RStudio addins with shortcuts to commonly used DVC commands.
I find it useful to bind these to keyboard shortcuts:
Repro will run dvc repro.
Repro until currently open stage will run all upstream stages on which the currently open stage script depends.

Everyone has their preferred way of working, so maybe dvthis is not doing exactly what you need. Let me know! I'll also gladly review any feature or bug PRs :)