knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(archiveR)
One issue that I've often faced in analytics is how to connect output (data, visualizations, and reports) with the code used to generate that output. The team will deliver output to the client/partner and then continue to develop the code base. If the client/partner has a specific question then the team has to back-track to figure out which code was used to create that particular output. A lot of teams archive their output use timestamps, e.g. data_20171019_124590. While this can be a good method for data archiving for some teams, this method doesn't easily allow an analyst to be able to backtrack and see which code created that specific output. This can also create issues if data is created and then used as an input farther downstream because it means that the filename is changing whenever the data are re-processed.
The archiveR package is intended to help teams archive their data and output and connect it to the code that was used to create the output. The code is connected through the output using the hash from the commit. The package builds on functionalities from the git2r package.
First, we create a file structure with Current
and Archive
folders, if those folders don't currently exist. Then we add specified output files to the Current
folder. Then the files from the Current
folder are zipped and the zipped file is named after the hash commit used to generate the output.
I recommend using this Archiving structure whenever you're producing output, but absolutely if you're producing output for a client/partner.
Note that data/output in the Current
folder matches the most recently added zip file in the Archive
folder.
This may seem like overkill when it comes to data and output archiving. A lot of teams, simply add a timestamp to their output. However, this system of archiving data and output solves a lot of pesky problems.
The first problem it solves is that the both the data and the output can now always be traced back to the exact code. So if there are ever any questions about how a dataset was created or which data were used to create output, it's very easy to know.
While timestamping data can be helpful, it's very irritating if those data use downstream. Then the person maintaining the code must remember to change the input filename whenever the data get re-processed. Now, processes which are downstream can simply point at specific files within the Current
folder.
If your team processes data on a semi-regularly basis and frequently goes back and utilizes those data, it may be worthwhile to come up with a system of how to name commits. A team I worked on previously processed data monthly, so it our automatted archiving system, we would name the archives branch_commitHash_message. The message would be something like, "April Monthly" or "May Monthly". That way, if we were looking in the archives, and wanted to use data from May it was very easy to see (without actually going to the codebase) which commit message we would use in the extract process.
If this sounds applicable to your team, add the --message
(-m
) tag in the command line and it will be added to the end of the archive name.
Data/ Processed/ Current/ data.csv summary_data.csv Archive/ master_90r68d.zip Raw/ Deliverables/ Current/ fig1.png fig2.png Archive/ master_90r68d.zip
Sample Code is located ./R/sample_code.R
library(utilsR) library(archiveR) library(dplyr) output_dir <- "./tests/testthat" utilsR::create_dirs(file.path(output_dir, c("Data", "Deliverables"))) data_fls <- file.path(output_dir, "Data", c("data.csv", "data_summary.csv")) plot_fls <- file.path(output_dir, "Deliverables", c("hist_df.png", "hist_df.jpg", "hist_df.tiff")) ## Functions dm <- function(data_fls) { #' Data Management #' @param data_fls the files to archive df <- data.frame(col1 = c(1:4), col2 = c("a", "b", "c", "d")) df$new_col <- ifelse(df$col1 > 1, 1, 0) write.csv(x = df, file = data_fls[1]) return(df) } dm_sum <- function(df, data_fls){ #' Summarize df #' @param df the dataframe #' @param data_fls the names of the data files to write out df_sum <- df %>% dplyr::group_by(new_col) %>% dplyr::summarize(n = dplyr::n()) write.csv(x = df, file = data_fls[2]) return(df_sum) } plots <- function(df, plot_fls) { #' Plot Sample Data #' @param df the dataframe #' @param data_fls the names of the data files to write out png(filename = plot_fls[1]) hist(df$new_col) dev.off() jpeg(filename = plot_fls[2]) hist(df$new_col) dev.off() tiff(filename = plot_fls[3]) hist(df$new_col) dev.off() } ## Run df <- dm(data_fls) df_sum <- dm_sum(df, data_fls) plots(df, plot_fls) archive_etl(file.path(output_dir, "Data"), data_fls) archive_etl(file.path(output_dir, "Deliverables"), plot_fls)
Develop codebase normally. Add and commit code to bitbucket/TFS and utilize pull requests. When it comes time to deliver content to a client/partner then use the archiveR library. At this point the code should be code reviewed and tested and be ready to merge into master.
cd archiveR
Rscript R/sample_code.R
To write out data to the Current folder and Archive the data type Rscript R/sample_code.R -c "commit message" -n "holmesjoli" -p "gitHubPassword" -e "holmesjoli@gmail.com"
Committing the code requires the arguments -n, -p, and -e to be added as well
--commit
(-c
) command triggers the script to add the data/output to the Archive folder--add_branch
(-b
) command is an optional argument. Including this command will add the branch name to the beginning of the archive zip filename. --add_message
(-m
) command is an optional argument. Including this command will add the message to the end of the archive zip filename. To extract data from an archive use the following sample code. Right now the archiveR supports extract of two file types: tab-delimited files and comma-delimited files. The extract a comma delimited file use the extract_csv
function. To extract a tab-delimited file use the extract_tsv
function.
output_dir <- "./tests/testthat/Data" commit <- "bd429f" fl <- "data.csv" df <- archiveR::extract_csv(output_dir, commit, fl)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.