datafolder-package: datafolder: List and check contents of data subfolder

Description Overview Functions Example workflow Location

Description

This R package is intended as a light-weight solution to the following problem: You can sync code to a git remote, but do not want to sync the actual data. It creates a list of the contents of a data subfolder, writes this to a csv file which can be synced to git remotes, with functions to check for mismatches between this .csv and the local data folder contents.

Overview

datafolder_update creates a list of the file names and their md5 hashes for all files in a data folder (and all subfolders), and writes this to a .csv file (default is docs/data_folder_content.csv). This csv file can then be synced to git remotes - this way although the data itself is not synced, the data files expected by the code at each commit is documented.

When pulling the remote we can check that the contents of our local repository data folder matches this list. datafolder_check prints any mismatches in some detail - which files are missing, which files appear in data/ but not the csv list, which have been renamed and so on - allowing you to manually copy the data to match what the code expects through some other secure channel. If run with datafolder_check(stop_on_error = FALSE raises a warning when mismatches occur instead of an error.

This doesn't work very well with projects where the data is frequently updated in which case an alternative solution is probably appropriate. These functions assume that your R project is structured with a main project working directory, and all data is located within a specific subfolder inside this project (i.e. data/), and there is another folder for general documentation / configuration (i.e. docs/). For an example structure see http://projecttemplate.net/architecture.html

Functions

datafolder_update: Write docs/data_folder_content.csv listing the files in data/ and their md5 hashes

datafolder_check: Check docs/data_folder_content.csv against the actual data/ files and list any mismatches

Example workflow

Location

https://github.com/bosefalk/datafolder

devtools::install_github("bosefalk/datafolder")


bosefalk/datafolder documentation built on Sept. 11, 2019, 3:52 p.m.