This package attempts to simultaneously solve a number of problems around small-scale data versioning and distribution:
Giving users access to your data in an easily machine-digestible format.
Hosting and distributing the data somewhere fast and reliable without having to deal with creating websites.
Restricting access to the data to collaborators only, especially ahead of an eventual public release.
Allowing a dataset to be downloaded once and reused for multiple projects on a single computer, without having to deal with pathnames within or between systems.
Versioning the data so that analyses can be locked to a specific version for reproducibility while the dataset continues to be updated.
Allowing publication of data packages on CRAN without the problems caused by large package downloads.
Providing a common interface for storing and retrieving data that works across diverse underlying data formats (one or many csv files, binary data, phylogenetic trees, or a collection of all of these), so long as you have a way of reading the data into R.
The package is designed to be simple to use: everything can be done in a couple of lines of code or, for more involved cases, with a package that can be generated automatically.
Data comes in all shapes and sizes, and a one-size-fits-all solution will not fit everything.
Too small: One-off data sets (e.g. a field experiment that will not be updated). Put the data on Dryad, figshare, or wherever you fancy. Stick a fork in it, it's done (you can use this package, but you'll likely find it easier not to).
Too big: Massively collaborative datasets with large end-user communities, data sets so large they require access via APIs, data with access control requiring complex authentication layers, or data with complex metadata where access depends on that metadata. More comprehensive solutions exist for such data, though which one is appropriate will depend on the data itself.
Just right: A data set of medium size (say, under 100 MB), that is under moderate levels of change (either stabilising or a "living database" that is continually being updated).
Getting data into R is typically done with a data package. This works well for small data, but CRAN will not generally allow distribution of "large" data sets. The data() loading mechanism of R always seemed a bit of a weird historical quirk in any case; it operates in an additional namespace (package:datasets) and works by modifying an environment as a side effect. Plus, if you need to compare two versions of the data, you have to do some gymnastics to install two different versions of a package (or create a package with all the different versions of the data in it).
How datastorr works

GitHub has a "releases" feature for allowing (potentially large) file uploads; the files can be in any format. GitHub releases are built off of git "tags", so they are associated with a specific version. If you have code that creates or processes a dataset, the dataset will be stored against the code used to create it, which is nice. GitHub releases do not store the file in the repository itself, which avoids issues with git slowing down on large files, lengthy clone times, and distributing and installing your package: if you had 100 versions of a 10 MB dataset, that could be 1 GB of data to clone or install. Storing your data against GitHub releases instead leaves the data in the cloud until it is needed. The files can also be quite large, up to 2 GB each.
The releases will be numbered. We recommend semantic versioning, mostly because it signals some intent about changes to the data (see below). If the data is not going to change, that's not a problem: the version can just be v1.0.0 forever (chances are it will change, though!).
We will make the simplifying assumption that your data set will be stored in a single file. In practice this is not a large limitation because that file could be a zip archive. The file can be in any format: csv, rds (R's internal serialisation format), a SQLite database, and so on. You do, however, need to specify or provide a function that will read the data and convert it into an R object. This is most easily done with rds files (though note that the R documentation warns it is not a great long-term archival format; see ?serialize).
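For example, a minimal sketch of preparing such a file (the toy data and the filename myfile.rds are placeholders, chosen to match the metadata example below):

```r
## A toy dataset standing in for the real one; write it to the single rds file
## that will be uploaded as the release asset.
mydata <- data.frame(site = letters[1:5], abundance = c(3, 7, 1, 9, 4))
saveRDS(mydata, "myfile.rds")
```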
To orchestrate getting the data from GitHub to R we need to add a little metadata about what the file will be called and how it should be loaded into R. This can be done most simply with a small json file at the root of the repository containing information like:
{ "filename": "myfile.rds", "read": "base::readRDS" }
Note that the function used here must take a filename as an argument and return an R object, so functions like read.csv, read.table, and functions from the rio package may be good here.
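If your data do not fit an off-the-shelf reader, any function with this signature will do. As a sketch (the function name and archive layout here are invented for illustration), a reader for a zip archive of csv files might look like:

```r
## Illustrative only: read a zip archive of csv files and return a named list
## of data.frames. All that is required is a function that takes a single
## filename and returns an R object.
read_csv_zip <- function(filename) {
  exdir <- tempfile()
  dir.create(exdir)
  on.exit(unlink(exdir, recursive = TRUE))
  paths <- utils::unzip(filename, exdir = exdir)
  ret <- lapply(paths, utils::read.csv, stringsAsFactors = FALSE)
  names(ret) <- basename(paths)
  ret
}
```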
Once your git repository is set up, the metadata file added to it, and a release with data created, the data can be downloaded like so:
d <- datastorr::datastorr("richfitz/data")
though with your username/repo pair instead of richfitz/data.
This function is designed to be fast for users, and so suitable for use in scripts. It uses storr behind the scenes and looks in several places for the data (see the sketch after this list):
In memory; if it has been loaded within this session it is already in memory. Takes on the order of microseconds.
From disk; if the data has ever been loaded datastorr will cache a copy on disk. Takes on the order of milliseconds up to a second, depending on the size of the data.
From GitHub; if the data has never been loaded, it will be downloaded from GitHub, saved to disk, and loaded to memory. This will take several seconds or longer depending on the size of the dataset.
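As a rough illustration of this cascade, using the example repository from above (timings are indicative only):

```r
## First call: fetched from GitHub (or from the on-disk cache if a previous
## session already downloaded this version), then cached locally.
system.time(d <- datastorr::datastorr("richfitz/data"))
## Second call in the same session: served from memory, effectively instant.
system.time(d <- datastorr::datastorr("richfitz/data"))
```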
In addition, users can download specific versions of a dataset. This might be to synchronise data versions across different people in a project, to lock a project onto a specific version, etc:
d_old <- datastorr::datastorr("richfitz/data", version="1.0.0")
(The same cascading lookup as above is used.)
Versions can be listed; those stored locally:
datastorr::datastorr_versions("richfitz/data")
or available remotely:
datastorr::datastorr_versions("richfitz/data", local=FALSE)
The versions that have been downloaded (here d and d_old) are just normal R objects. Unlike with data(), there is no ambiguity about where they are stored, and modifying one behaves like modifying any other object.
Similarly, because these are ordinary R objects, you can do things like use daff to compare them:
p <- daff::diff_data(d_old, d)
daff::render_diff(p)
Alternatively, we can create a very small R package that lives in the repository we store releases against. This package can be autogenerated, and is a useful approach when significant work is needed to process the data, to simplify installation of dependencies used in reading or displaying the data, or to work with the data once it has been downloaded. In our own use, the repository (but not the package) contains code for building the data set (see taxonlookup). The package approach is described more fully later in the document.
Once your git repository is published and your data have been released, downloading the data becomes a function call within your package. A user would run something like:
d <- mypackage::mydata()
to fetch or load the data. A specific version can be requested in the same way:
d <- mypackage::mydata("v1.0.0")
This approach extends to holding multiple versions of the data on a single computer (or in a single R session). This might be useful when the dataset has been updated and you want to see what has changed.
d1 <- mypackage::mydata("v1.0.0")
d2 <- mypackage::mydata("v1.1.0")
## ...compare d1 and d2 here...
First, you will need a package. Creating packages is not that hard, especially with tools like devtools and mason. Packages make running R code on other machines much simpler than sourcing files or copying and pasting. Packages are also nice because if your data require a specific package to work with (e.g., ape for phylogenetic trees), you can declare it in your DESCRIPTION file and R will ensure that it is installed when your package is installed and loaded when your package is used.
However, you will need to come up with a few details: the GitHub repository (in username/repo form) that releases will be stored against, the function used to read the data file, and the name of the data access function in your package.
In addition you need to set up a GitHub token so that you can upload files to GitHub from R, or to access your private repositories; see the section on authentication below, or just do nothing as datastorr will prompt you at the appropriate time.
The core code can be autogenerated. For example, the package datastorr.example was generated using:

```r
datastorr::autogenerate("richfitz/datastorr.example", "readRDS",
                        name = "mydata", roxygen = FALSE)
```

This returns the generated code as a character vector.
This code can be copied into a file within the package. If you set roxygen=TRUE you'll get roxygen help that devtools::document() will convert into R help files and NAMESPACE declarations.
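For example, a small sketch of writing the generated code into the package (the target filename R/datastorr.R is only a suggestion):

```r
## Generate the interface code and write it into the package's R/ directory;
## with roxygen = TRUE, run devtools::document() afterwards to build the help
## files and NAMESPACE entries.
pkg <- datastorr::autogenerate("richfitz/datastorr.example", "readRDS",
                               name = "mydata", roxygen = TRUE)
writeLines(pkg, "R/datastorr.R")
```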
The package can then be loaded and the data accessed with the mydata function.
To make the release (a sketch of the full sequence follows this list):
Increase the version number in your DESCRIPTION file.
Make sure your local repository is fully committed (no unstaged files, etc.). This is important if you want to closely associate the release with your data, and at the moment datastorr enforces it.
Push your changes to GitHub and install your package
Run yourpackage::yourdata_release("A description here")
Check that it all worked by running yourpackage::yourdata("new version") (you can get your new version with read.dcf("DESCRIPTION")[, "Version"]).
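Putting the last steps together, a hedged sketch of the release workflow (the package name mypackage and the functions mydata/mydata_release are placeholders matching the autogenerated example above):

```r
## After bumping the version in DESCRIPTION, committing, pushing, and
## reinstalling the package:
mypackage::mydata_release("Describe what changed in this version")
## Read back the version just released and check it can be fetched:
v <- read.dcf("DESCRIPTION")[, "Version"]
d <- mypackage::mydata(v)
```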
Because GitHub offers private repositories, this gives some primitive, but potentially useful, access control. Because datastorr uses GitHub's authentication, GitHub knows whether a user has access to private repositories. For this to work you will need to authenticate datastorr with GitHub.
The simplest way to do this is to let datastorr prompt you when access is required. Or run:
datastorr::datastorr_auth()
to force the authentication process to run (no error and no output indicates success). To force using personal access tokens rather than OAuth, run:
setup_github_token()
which will walk you through the steps of setting a token up.
If you use a personal private repository, you can invite other users to "collaborate" on the repository. Note that this gives those users push access to the repository; the access control is very coarse.
If you have an organisation account you can create groups of users that have read only access to particular repositories, which will likely scale better.
Some will argue that semantically versioning data is not really possible, and they are probably right. But you need to go with some versioning system. If the idea of semantically versioning data bothers you, use incrementing integers (v1, v2, v<n>) and read no further!
The idea with semantic versioning is that it formalises what people do already with versioning. We feel this can be applied fairly successfully to data.
Update the patch version for small changes that are backward compatible.
Update the minor version for medium changes that are generally backward compatible.
Update the major version for large (API-level) changes that are likely to be backward incompatible.
Forks make this a lot more complicated. If two people are working in parallel how do they decide what version number to use? However, with our solution, the datasets are still sensibly named; we have:
user1/dataset@v1.2.3
user2/dataset@v1.3.5
It's just not possible to know from the outside exactly what differs between the datasets, but they are at least distinctly named (and you could download both of them). When the fork is resolved and user2 merges back into user1, the two researchers can discuss what version number they want to use. Like resolving merge conflicts, we see this as a social problem, not a technological one, and the solution will be social.
Apart from the ease of use, mindshare, and the explicit association between data and code, there is no strong reason to use GitHub here. Bitbucket certainly provides all the functionality required to generalise our approach to work there, and self-hosting would work too, with more effort. Over time we may develop support for alternative storage providers. At the same time, GitHub's fast and generally reliable servers, access controls, and pleasant API make it a great first place to try this proof of concept.
GitHub has an API that lets you programmatically query the state of basically everything on GitHub, as well as create things. So the interaction with the website is straightforward: getting lists of releases for a repository, the filenames associated with releases, and so on.
With this information, datastorr uses a storr_external object and stores data with versions as keys. If a version is not found, it is downloaded (using the information from GitHub) and read into R using the read function. A copy of this R-readable version is saved to disk.
In order to save and load data repeatedly, especially across different projects on the same computer, datastorr uses the rappdirs package to find the "Right Place" to store "application data". This varies by system and is documented in the ?rappdirs::user_data_dir help page. Using this directory means there is little chance of accidentally committing large data sets into the repository (which might be a problem if the data were stored in a subdirectory of the project).
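To see where this directory is on your system (an illustration only; the exact subdirectory datastorr uses may differ):

```r
## The per-user application data directory that rappdirs resolves on this
## platform; datastorr keeps its cache under a directory like this.
rappdirs::user_data_dir("datastorr")
```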