In richfitz/dataverse: Simple Data Versioning

Scope

This package attempts to simultaneously solve a number of problems around small-scale data versioning and distribution:

Giving users access to your data in an easily machine-digestable format.
Hosting and distributing the data somewhere fast and reliable without having to deal with creating websites.
Protecting access to the data to collaborators, especially with the idea of later public release.
Allowing a dataset to be downloaded once and reused for multiple projects on a single computer, without having to deal with pathnames within or between systems.
Versioning the data so that:
- fetching the current version is easy,
- fetching a previous version is easy,
- simultaneously looking at two versions is easy,
- data versions are strongly associated with the code that created them,
- end users do not have to use git,
- large files do not end up clogging up your git repository.
Allows publication of data packages on CRAN without causing problems of large package file downloads.
Provides a common interface for storing and retrieving data that works across diverse underlying data formats (one or many csv files, binary data, phylogenic trees, or a collection of all of these), so long as you have a way of reading the data into R.

The package is designed so be simple to use so that all that can be done in a couple of lines of code, or (for more involved cases) with a package that can be generated automatically.

Background

Data comes in all shapes and sizes, and a one-size fits all solution will not fit everything.

Too small: One-off data sets (e.g. a field experient that will not be updated). Put the data on data dryad, figshare, or wherever you fancy. Stick a fork in it, it's done (though you can use this package you'll likely find it easier not to).
Too big: Massively collaborative datasets with large end users communities, data sets that are so large they require access via APIs, data with access control requiring complex authentication layers, data with complex metadata where access is related to metadata. There are more comprehensive solutions for your data but identifying the correct solution may depend on the data.
Just right: A data set of medium size (say, under 100 MB), that is under moderate levels of change (either stabilising or a "living database" that is continually being updated).

Getting data into R is typically done with a data package. This works well for small data, but CRAN will not generally allow distribution of "large" data sets. The data() loading mechanism of R always seemed a bit of a weird historical quirk in any case; it operates in some additional namespace (package:datasets), works by modifying an environment as a side-effect. Plus if you need to compare two versions of the data you have to do some gynmastics to install two different versions of a package (or create a package with all the different versions of the data in it).

How `datastorr` works

GitHub has a "releases" feature for allowing (potentially large) file uploads. These files can be any format. GitHub releases are build off of git "tags"; they are associated with a specific version. So if you have code that creates or processes a dataset, the dataset will be stored against the code used to create it, which is nice. GitHub releases do not store the file in the repository. This avoids issues with git slowing down on large files, on lengthy clone times, and on distributing and installing your package. This could be an issue if you had 100 versions of a 10 MB dataset; that could be 1GB of data to clone or install. But storing your data against GitHub releases will leave the data in the cloud until it is needed. And the files can be quite large; up to 2GB.

The releases will be numbered. We recommend semantic versioning mostly because it signals some intent about changes to the data (see below). If the data is not going to change, that's not a problem - the version can just be v1.0.0 forever (chances are it will change though!).

We will make the simplifying assumption that your data set will be stored in a single file. In practice this is not a large limitation because that file could be a zip archive. The file can be in any format; csv, rds (R's internal format), a SQLite database. You, however, need to specify or provide a function that will read the data and convert it into an R object. This is most easily done with rds files (R's serialisation format -- though note they say that it is not a great long-term archival format [see ?serialize]).

To orchestrate getting the data from github to R we need to add a little metadata about what the file will be called and how it should be loaded into R. This can be done most simply with a small json file at the root of the repository containing information like:

{
    "filename": "myfile.rds",
    "read": "base::readRDS"
}

Note that the function used here must take a filename as an argument and return an R object. So functions like read.csv, read.table and functions from the rio package may be good here.

Once your git repository is set up, the metadata file added to it, and a release with data has been created, it can be downloaded like:

d <- datastorr::datastorr("richfitz/data")

though, with your username/repo pair instead of richfitz/data.

This function is designed to be fast for users, and so suitable for using in scripts. It uses storr behind the scenes and looks in various places for the data:

In memory; if it has been loaded within this session it is already in memory. Takes on the order of microseconds.
From disk; if the data has ever been loaded datastorr will cache a copy on disk. Takes on the order of milliseconds up to a second, depending on the size of the data.
From GitHub; if the data has never been loaded, it will be downloaded from GitHub, saved to disk, and loaded to memory. This will take several seconds or longer depending on the size of the dataset.

In addition, users can download specific versions of a dataset. This might be to synchronise data versions across different people in a project, to lock a project onto a specific version, etc:

d_old <- datastorr::datastorr("richfitz/data", version="1.0.0")

(The same cascading lookup as above is used.)

Versions can be listed; those stored locally:

datastorr::datastorr_versions("richfitz/data")

or available remotely:

datastorr::datastorr_versions("richfitz/data", local=FALSE)

The versions that have been downloaded (here d and d_old) are just normal R objects Unlike use with data() there's no ambiguity about where they are stored, and modifying one acts like any other object.

Similarly, because these are ordinary R objects you can do things like use daff to compare them

p <- daff::diff_data(d_old, d)
daff::render_diff(p)

The package interface

Alternatively we can create a very small R package that exists at the repo that we store releases against. This package can be autogenerated, and is a useful approach when there is a significant amount of work needed in processing the data, to simplify installation of dependencies used in reading or displaying the data, or to work with the data once it has been downloaded.In our own use, the repository (but not the package) contains code for building the data set (see taxonlookup). The package approach will be described more fully later in the document.

Once your git repsitory is published and your data have been released, downloading it becomes a function within your package. A user would run something like:

d <- mypackage::mydata()

to fetch or load the data.

d <- mypackage::mydata("v1.0.0")

This approach extends to holding multiple versions of the data on a single computer (or in a single R session). This might be useful when the dataset has changed and you want to see what has changed.

d1 <- mypackage::mydata("v1.0.0")
d2 <- mypackage::mydata("v1.1.0")
## ...compare d1 and d2 here...

Worked example

First, you will need a package. Creating packages is not that hard, especialy with tools like devtools and mason. Packages make running R code on other machines much simpler than sourcing in files or copy and paste. Packages are also nice because if your data require specific package to work with (e.g., ape for phylogenetic trees) you can declare them in your DESCRIPTION file and R will ensure that they are installed when your package is installed and loaded when your package is used.

However, you will need to come up with a few details:

a package name
a name for the dataset (if different to the package name)
a licence for your package (code) and data (not code)
ideally, documentation for your end users
the name of the file that you will store with each release

In addition you need to set up a GitHub token so that you can upload files to GitHub from R, or to access your private repositories; see the section on authentication below, or just do nothing as datastorr will prompt you at the appropriate time.

The core code can be autogenerated. For example the package datastorr.example was generated using ``` {r eval=FALSE} datastorr::autogenerate("richfitz/datastorr.example", "readRDS", name="mydata", roxygen=FALSE)

``` {r echo=FALSE, results="asis"}
pkg <- datastorr::autogenerate("richfitz/datastorr.example", "readRDS",
                               name="mydata", roxygen=FALSE)
writeLines(c("```r", pkg,  "```"))

This code can be copied into a file within the package. If you set roxygen=TRUE you'll get roxygen help that devtools::document() will convert into R help files and NAMESPACE declarations.

The package can then be loaded and data accessed with the mydata function.

To make the release:

Increase the version number in your DESCRIPTION file
Your local repo is all committed (no unstaged files etc). This is important if you want to closely associate the release and your data and at the moment datastorr enforces it.
Push your changes to GitHub and install your package
Run yourpackage::yourdata_release("A description here")
Check that it all worked by running yourpackage::yourdata("new version")

(you can get your new version by read.dcf("DESCRIPTION")[, "Version"]).

Access control

Because GitHub offers private repositories, this gives some primitive, but potentailly useful, access control. Because datastorr uses GitHub's authentication, GitHub knows if the user has access to private repositories. Therefore for this to work you will need to authenticate datastorr to work with GitHub.

The simplest way to do this is to let datastorr prompt you when access is required. Or run:

datastorr::datastorr_auth()

to force the authentication process to run (no error and no output indicates success). To force using personal access tokens rather than OAuth, run:

setup_github_token()

which will walk you through the steps of setting a token up.

If you use a personal private repository, then you invite other users to "collaborate" on the repository. Note that this gives the users push access to the repository; the access control is very coarse.

If you have an organisation account you can create groups of users that have read only access to particular repositories, which will likely scale better.

Semantic versioning of data

Some will argue that it is not possible and they are probably right. But you need to go with some versioning system. If the idea of semantically versioning data bothers you, use incrementing integers (v1, v2, v<n>) and read no further!

The idea with semantic versioning is that it formalises what people do already with versioning. We feel this can be applied fairly successfully to data.

Update patch release; small changes, backward compatible.
- adding new rows to the data set (more data)
- error correcting existing data
Update minor version; medium changes, but generally backward compatible.
- new columns
- substantial new data
- new tables
Update major version; large (API) changes, likely to be backward incompatible.
- renaming or deleting columns
- changing variable coding
- deleting large amounts of data

Forks make this a lot more complicated. If two people are working in parallel how do they decide what version number to use? However, with our solution, the datasets are still sensibly named; we have:

user1/dataset@v1.2.3
user2/dataset@v1.3.5

It's just not possible to know from the outside exactly what differs between the datasets but they are at least distinctly named (and you could download both of them). When the fork is resolved and user2 merges back into user1 the two researchers can discuss what version number they would want to use. Like resolving merge conflicts, we see this as a social problem, not a technological one and the soltuion will be social.

Beyond GitHub

Apart from the ease of use, mindshare and the explicit association between data and code, there is no strong reason to use GitHub here. Certainly Bitbucket provides all the same functionality that is required to generalise our approach to work there. And self hosting would work too, with more effort. Over time we may develop support for alternative storage providers.

At the same time, the fast and generally reliable webserver, the access controls and the nice API make it a great first place to try this proof of concept.

How it actually works

GitHub has an API that lets you programmatically query the state of basically everything on GitHub, as well create things. So the interaction with the website is straightforward; getting lists of releases for a repository, filenames associated with releases, etc.

With this information, datastorr uses a storr_external object and stores data with versions as keys. If a version is not found it is downloaded (using the information from GitHub) and read into R using the read function. A copy of this R-readable verison is saved to disk.

In order to save and load data repeatedly, especially across different projects on the same computer, datastorr uses the rappdirs package to find the "Right Place" to store "application data". This varies by system and is documented in the ?rappdirs::user_data_dir help page. Using this directory means there is little chance of accidently commiting large data sets into the repository (which might be a problem if storing the data in a subdirectory of the project).

richfitz/dataverse documentation built on July 9, 2021, 12:08 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com