knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(archiveR)

Introduction

One issue that I've often faced in analytics is how to connect output (data, visualizations, and reports) with the code used to generate that output. The team delivers output to the client/partner and then continues to develop the code base. If the client/partner later has a specific question, the team has to backtrack to figure out which code was used to create that particular output. A lot of teams archive their output using timestamps, e.g. data_20171019_124590. While this can be a good method of data archiving for some teams, it doesn't easily allow an analyst to backtrack and see which code created that specific output. It can also create issues if the data are used as an input farther downstream, because the filename changes whenever the data are re-processed.

The archiveR package is intended to help teams archive their data and output and connect them to the code that was used to create them. The output is connected to the code through the commit hash. The package builds on functionality from the git2r package.

First, we create a file structure with Current and Archive folders, if those folders don't already exist. Then we add the specified output files to the Current folder. Finally, the files in the Current folder are zipped, and the zip file is named after the hash of the commit used to generate the output.
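The three steps above can be sketched in base R. This is only an illustration of the idea, not archiveR's implementation: the helper name archive_sketch() and its arguments are hypothetical, the commit hash is passed in by hand (in practice it would come from the repository, for example via git2r), and the zip step is skipped when no zip program is found on the system.

```r
# Hypothetical helper illustrating the Current/Archive idea; not part of archiveR.
archive_sketch <- function(output_dir, files, commit_hash) {
  current_dir <- file.path(output_dir, "Current")
  archive_dir <- file.path(output_dir, "Archive")

  # 1. Create the folders only if they do not exist yet.
  for (d in c(current_dir, archive_dir)) {
    if (!dir.exists(d)) dir.create(d, recursive = TRUE)
  }

  # 2. Copy the specified output files into Current.
  file.copy(files, current_dir, overwrite = TRUE)

  # 3. Zip the contents of Current and name the zip after the commit hash.
  zip_path <- file.path(normalizePath(archive_dir), paste0(commit_hash, ".zip"))
  zip_bin <- unname(Sys.which("zip"))
  if (nzchar(zip_bin)) {
    old_wd <- setwd(current_dir)          # zip relative paths, not full paths
    on.exit(setwd(old_wd), add = TRUE)
    utils::zip(zipfile = zip_path, files = list.files("."),
               flags = "-r9Xq", zip = zip_bin)
  } else {
    message("No zip program found; skipping the archive step.")
  }

  invisible(zip_path)
}

# Usage with a throwaway directory and a made-up commit hash.
tmp <- file.path(tempdir(), "archive_sketch_demo")
dir.create(tmp, showWarnings = FALSE)
src <- file.path(tmp, "data.csv")
write.csv(data.frame(col1 = 1:4), src, row.names = FALSE)
zip_path <- archive_sketch(file.path(tmp, "Data"), src, "bd429f")
```

Because the zip is built from whatever is in Current at that moment, Current and the newest archive hold the same files.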

I recommend using this archiving structure whenever you're producing output, and especially when you're producing output for a client/partner.

Note that data/output in the Current folder matches the most recently added zip file in the Archive folder.

Use Cases

This may seem like overkill when it comes to data and output archiving. A lot of teams simply add a timestamp to their output. However, this system of archiving data and output solves a lot of pesky problems.

Links code to data and output

The first problem it solves is that both the data and the output can now always be traced back to the exact code. So if there are ever any questions about how a dataset was created or which data were used to create an output, the answer is easy to find.

Gets rid of timestamps

While timestamping data can be helpful, it's very irritating if those data are used downstream, because the person maintaining the code must remember to change the input filename whenever the data get re-processed. With this structure, downstream processes can simply point at specific files within the Current folder.

Tag processes

If your team processes data on a semi-regular basis and frequently goes back to reuse those data, it may be worthwhile to come up with a system for naming commits. A team I worked on previously processed data monthly, so in our automated archiving system we would name the archives branch_commitHash_message. The message would be something like "April Monthly" or "May Monthly". That way, if we were looking in the archives and wanted to use data from May, it was very easy to see (without actually going to the codebase) which commit we would use in the extract process.

If this sounds applicable to your team, add the message flag (-m) on the command line and the commit message will be added to the end of the archive name.
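As a rough sketch of the branch_commitHash_message naming scheme, the archive filename can be assembled from its optional parts. The helper archive_name() is hypothetical, and replacing non-alphanumeric characters in the message with "-" is my own assumption to keep the filename safe; archiveR itself may format the name differently.

```r
# Hypothetical helper; archiveR's real naming code may differ.
archive_name <- function(commit_hash, branch = NULL, message = NULL) {
  # Assumption: runs of characters other than letters and digits in the
  # message are collapsed to "-" so the result is a safe filename.
  if (!is.null(message)) message <- gsub("[^A-Za-z0-9]+", "-", message)
  # c() drops NULL parts, so omitted pieces simply disappear from the name.
  paste0(paste(c(branch, commit_hash, message), collapse = "_"), ".zip")
}

archive_name("bd429f")
#> [1] "bd429f.zip"
archive_name("90r68d", branch = "master")
#> [1] "master_90r68d.zip"
archive_name("bd429f", branch = "master", message = "April Monthly")
#> [1] "master_bd429f_April-Monthly.zip"
```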

Example File Structure

Data/
    Processed/
        Current/
            data.csv
            summary_data.csv
        Archive/
            master_90r68d.zip
    Raw/
Deliverables/
    Current/
        fig1.png
        fig2.png
    Archive/
        master_90r68d.zip

Sample Workflow and Code

Sample Autocommit Code

Sample code is located at ./R/sample_code.R

library(utilsR)
library(archiveR)
library(dplyr)

output_dir <- "./tests/testthat"

utilsR::create_dirs(file.path(output_dir, c("Data", "Deliverables")))
data_fls <- file.path(output_dir, "Data", c("data.csv", "data_summary.csv"))
plot_fls <- file.path(output_dir, "Deliverables", c("hist_df.png", "hist_df.jpg", "hist_df.tiff"))

## Functions

dm <- function(data_fls) {
    #' Data Management
    #' @param data_fls the files to archive


    df <- data.frame(col1 = c(1:4), col2 = c("a", "b", "c", "d"))
    df$new_col <- ifelse(df$col1 > 1, 1, 0)

    write.csv(x = df, file = data_fls[1])

    return(df)
}

dm_sum <- function(df, data_fls){
  #' Summarize df
  #' @param df the dataframe
  #' @param data_fls the names of the data files to write out

  df_sum <- df %>%
    dplyr::group_by(new_col) %>%
    dplyr::summarize(n = dplyr::n())

  # Write the summary (not the full data frame) to the summary file
  write.csv(x = df_sum, file = data_fls[2])

  return(df_sum)
}

plots <- function(df, plot_fls) {
  #' Plot Sample Data
  #' @param df the dataframe
  #' @param plot_fls the names of the plot files to write out

  png(filename = plot_fls[1])
  hist(df$new_col)
  dev.off()

  jpeg(filename = plot_fls[2])
  hist(df$new_col)
  dev.off()

  tiff(filename = plot_fls[3])
  hist(df$new_col)
  dev.off()

}

## Run

df <- dm(data_fls)
df_sum <- dm_sum(df, data_fls)
plots(df, plot_fls)
archive_etl(file.path(output_dir, "Data"), data_fls)
archive_etl(file.path(output_dir, "Deliverables"), plot_fls)

Workflow

Develop the codebase normally. Add and commit code to Bitbucket/TFS and use pull requests. When it comes time to deliver content to a client/partner, use the archiveR library. At this point the code should be reviewed, tested, and ready to merge into master.

  1. Navigate to the repository's root directory, e.g. cd archiveR
  2. To write out data to the Current folder, type Rscript R/sample_code.R
  3. To write out data to the Current folder and archive the data, type Rscript R/sample_code.R -c "commit message" -n "holmesjoli" -p "gitHubPassword" -e "holmesjoli@gmail.com"
  4. Committing the code requires the arguments -n, -p, and -e to be added as well.
  5. The --commit (-c) flag triggers the script to add the data/output to the Archive folder.
  6. The --add_branch (-b) flag is optional. Including it adds the branch name to the beginning of the archive zip filename.
  7. The --add_message (-m) flag is optional. Including it adds the message to the end of the archive zip filename.
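To make the flag behaviour above concrete, here is a minimal base-R sketch of how such arguments could be parsed. It is not the parser archiveR uses: parse_flags() and the option names in its result are hypothetical, and treating -b and -m as on/off switches while -c, -n, -p, and -e each take a value is my reading of the list above.

```r
# Hypothetical parser; the real script may handle its arguments differently.
parse_flags <- function(args) {
  opts <- list(commit = NULL, name = NULL, password = NULL, email = NULL,
               add_branch = FALSE, add_message = FALSE)
  i <- 1
  while (i <= length(args)) {
    a <- args[i]
    if (a %in% c("-b", "--add_branch")) {
      opts$add_branch <- TRUE
    } else if (a %in% c("-m", "--add_message")) {
      opts$add_message <- TRUE
    } else if (a %in% c("-c", "--commit", "-n", "-p", "-e")) {
      # These flags consume the next argument as their value.
      if (i == length(args)) stop("missing value for ", a)
      key <- switch(a, "-c" = , "--commit" = "commit",
                    "-n" = "name", "-p" = "password", "-e" = "email")
      opts[[key]] <- args[i + 1]
      i <- i + 1
    } else {
      stop("unknown argument: ", a)
    }
    i <- i + 1
  }
  # Committing requires the name, password, and email arguments as well.
  creds_missing <- vapply(opts[c("name", "password", "email")], is.null, logical(1))
  if (!is.null(opts$commit) && any(creds_missing)) {
    stop("-c requires -n, -p and -e")
  }
  opts
}

# In a script this would be parse_flags(commandArgs(trailingOnly = TRUE)).
opts <- parse_flags(c("-c", "April Monthly", "-n", "user", "-p", "secret",
                      "-e", "user@example.com", "-b", "-m"))
```

With no arguments at all, opts$commit stays NULL, which corresponds to step 2: the data are written to the Current folder but nothing is archived.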

Extract Data

To extract data from an archive, use the following sample code. Right now archiveR supports extraction of two file types: tab-delimited files and comma-delimited files. To extract a comma-delimited file, use the extract_csv function. To extract a tab-delimited file, use the extract_tsv function.

output_dir <- "./tests/testthat/Data"
commit <- "bd429f"
fl <- "data.csv"

df <- archiveR::extract_csv(output_dir, commit, fl)
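
For readers curious what an extraction like this involves, the following base-R sketch locates the zip whose name contains the commit hash and reads one file from it without unpacking the whole archive. It is an assumption about the general approach, not archiveR's extract_csv(): the helpers find_archive() and read_csv_from_archive() are hypothetical, as is the assumption that the archives live in an Archive folder directly under output_dir.

```r
# Hypothetical helpers; not archiveR's own extract functions.
find_archive <- function(output_dir, commit) {
  archive_dir <- file.path(output_dir, "Archive")
  zips <- list.files(archive_dir, pattern = "\\.zip$", full.names = TRUE)
  # Keep only the archives whose filename contains the commit hash.
  hit <- zips[grepl(commit, basename(zips), fixed = TRUE)]
  if (length(hit) != 1) {
    stop("expected exactly one archive for commit ", commit,
         ", found ", length(hit))
  }
  hit
}

read_csv_from_archive <- function(output_dir, commit, fl) {
  zip_path <- find_archive(output_dir, commit)
  # unz() opens a single file inside a zip archive as a connection;
  # read.csv() opens and closes that connection itself.
  read.csv(unz(zip_path, fl))
}
```

A tab-delimited variant would differ only in calling read.delim() instead of read.csv().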


holmesjoli/archiveR documentation built on May 11, 2019, 3:05 p.m.