## ---
## title: "datastorr"
## author: "Rich FitzJohn"
## date: "`r Sys.Date()`"
## output: rmarkdown::html_vignette
## vignette: >
## %\VignetteIndexEntry{datastorr}
## %\VignetteEngine{knitr::rmarkdown}
## %\VignetteEncoding{UTF-8}
## ---
## ## Scope
## This package attempts to simultaneously solve a number of problems
## around small-scale data versioning and distribution:
##
## * Giving users access to your data in an easily machine-digestible
## format.
##
## * Hosting and distributing the data somewhere fast and reliable
## without having to deal with creating websites.
##
## * Restricting access to the data to collaborators, especially with
## the idea of a later public release.
##
## * Allowing a dataset to be downloaded once and reused for multiple
## projects on a single computer, without having to deal with
## pathnames within or between systems.
##
## * Versioning the data so that:
## * fetching the current version is easy,
## * fetching a previous version is easy,
## * simultaneously looking at two versions is easy,
## * data versions are strongly associated with the code that created them,
## * end users do not have to use git,
## * large files do not end up clogging up your git repository.
##
## * Allowing publication of data packages on CRAN without causing
## problems with large package file downloads.
##
## * Providing a common interface for storing and retrieving data that
## works across diverse underlying data formats (one or many csv
## files, binary data, phylogenetic trees, or a collection of all of
## these), so long as you have a way of reading the data into R.
##
## The package is designed to be simple to use: all of this can be
## done in a couple of lines of code, or (for more involved cases)
## with a package that can be generated automatically.
## ## Background
## Data comes in all shapes and sizes, and a one-size-fits-all
## solution will not fit everything.
## * Too small: One-off data sets (e.g. a field experiment that will
## not be updated). Put the data on Data Dryad, figshare, or
## wherever you fancy. Stick a fork in it, it's done (though you
## can use this package, you'll likely find it easier not to).
##
## * Too big: Massively collaborative datasets with large end users
## communities, data sets that are so large they require access via
## APIs, data with access control requiring complex authentication
## layers, data with complex metadata where access is related to
## metadata. More comprehensive solutions exist for such data, but
## which one is right will depend on the data itself.
##
## * Just right: A data set of medium size (say, under 100 MB), that
## is under moderate levels of change (either stabilising or a
## "living database" that is continually being updated).
## Getting data into R is typically done with a [data
## package](https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Data-in-packages).
## This works well for small data, but CRAN will [not generally
## allow](https://cran.r-project.org/web/packages/policies.html)
## distribution of "large" data sets. The `data()` loading mechanism
## of R always seemed a bit of a weird historical quirk in any case;
## it operates in some additional namespace (`package:datasets`),
## works by modifying an environment as a side-effect. Plus if you
## need to compare two versions of the data you have to do some
## gynmastics to install two different versions of a package (or
## create a package with all the different versions of the data in
## it).
## ## How `datastorr` works
## GitHub has a "releases" feature for allowing (potentially large)
## file uploads. These files can be in any format. GitHub releases
## are built on git "tags"; they are *associated with a specific
## version*. So if you have code that creates or processes a dataset,
## the dataset will be stored against the code used to create it,
## which is nice. GitHub releases *do not store the file in the
## repository*. This avoids git slowing down on large files, lengthy
## clone times, and bloated package distribution and installation.
## This could be an issue if you had 100 versions of a
## 10 MB dataset; that could be 1GB of data to clone or install. But
## storing your data against GitHub releases will leave the data in
## the cloud until it is needed. And the files can be quite large;
## [up to
## 2GB](https://help.github.com/articles/distributing-large-binaries).
## The releases will be numbered. We recommend [semantic
## versioning](http://semver.org) mostly because it signals some
## intent about changes to the data (see below). If the data is not
## going to change, that's not a problem - the version can just be
## `v1.0.0` forever (chances are it will change though!).
## We will make the simplifying assumption that your data set will be
## stored in a single file. In practice this is not a large
## limitation because that file could be a zip archive. The file can
## be in any format: csv, rds (R's internal format), a SQLite
## database, and so on. You do, however, need to specify or provide a
## function that will read the data and convert it into an R object.
## This is most easily done with `rds` files (R's serialisation
## format -- though note that the documentation warns it is not a
## great long-term archival format; see `?serialize`).
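## As a concrete sketch of this round trip (the temporary file here is
## a throwaway example, not a name datastorr itself uses):
##
## ```r
## ## Serialise an object to an rds file and read it back; `readRDS`
## ## plays exactly the role of the "read" function described above.
## path <- tempfile(fileext = ".rds")
## saveRDS(mtcars, path)
## d <- readRDS(path)
## stopifnot(identical(d, mtcars))
## ```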
## To orchestrate getting the data from GitHub to R we need to add a
## little metadata about what the file will be called and how it
## should be loaded into R. This can be done most simply with a small
## [json](https://en.wikipedia.org/wiki/JSON) file at the root of the
## repository containing information like:
##
## ```json
## {
## "filename": "myfile.rds",
## "read": "base::readRDS"
## }
## ```
##
## Note that the function used here must take a filename as an
## argument and return an R object. So functions like `read.csv` and
## `read.table`, or functions from the
## [`rio`](https://github.com/leeper/rio) package, may work well here.
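## If one call is not enough -- say, for a zip archive containing a
## single csv file -- you can wrap the logic in a function of your
## own. A minimal sketch, where the inner filename `data.csv` is a
## made-up example:
##
## ```r
## ## A read function must take a filename and return an R object.
## read_zipped_csv <- function(filename) {
##   tmp <- tempfile()
##   dir.create(tmp)
##   on.exit(unlink(tmp, recursive = TRUE))
##   utils::unzip(filename, exdir = tmp)
##   read.csv(file.path(tmp, "data.csv"), stringsAsFactors = FALSE)
## }
## ```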
## Once your git repository is set up, the metadata file has been
## added to it, and a release with data has been created, the data can
## be downloaded like so:
d <- datastorr::datastorr("richfitz/data")
## using your own `username/repo` pair instead of `richfitz/data`.
## This function is designed to be *fast* for users, and so suitable
## for using in scripts. It uses
## [`storr`](https://github.com/richfitz/storr) behind the scenes and
## looks in various places for the data:
##
## 1. In memory; if it has been loaded within this session it is
## already in memory. Takes on the order of microseconds.
##
## 2. From disk; if the data has _ever_ been loaded datastorr will
## cache a copy on disk. Takes on the order of milliseconds up to a
## second, depending on the size of the data.
##
## 3. From GitHub; if the data has never been loaded, it will be
## downloaded from GitHub, saved to disk, and loaded to memory. This
## will take several seconds or longer depending on the size of the
## dataset.
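## In stripped-down form the cascade looks something like this (a
## sketch only: datastorr's real implementation uses storr, and
## `fetch` here is an imaginary stand-in for the GitHub download):
##
## ```r
## cache <- new.env(parent = emptyenv())   # tier 1: this session's memory
## get_data <- function(version, dir, fetch) {
##   if (exists(version, envir = cache, inherits = FALSE)) {
##     return(get(version, envir = cache)) # memory hit: microseconds
##   }
##   file <- file.path(dir, paste0(version, ".rds"))
##   if (!file.exists(file)) {
##     fetch(file)                         # tier 3: fetch from GitHub
##   }
##   d <- readRDS(file)                    # tier 2: disk cache
##   assign(version, d, envir = cache)
##   d
## }
## ```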
## In addition, users can download specific versions of a dataset.
## This might be to synchronise data versions across different people
## in a project, to lock a project onto a specific version, etc:
d_old <- datastorr::datastorr("richfitz/data", version="1.0.0")
## (The same cascading lookup as above is used.)
## Versions can be listed; those stored locally:
datastorr::datastorr_versions("richfitz/data")
## or available remotely:
datastorr::datastorr_versions("richfitz/data", local=FALSE)
## The versions that have been downloaded (here `d` and `d_old`) are
## just normal R objects. Unlike with `data()`, there is no ambiguity
## about where they are stored, and modifying one acts like modifying
## any other object.
## Similarly, because these are ordinary R objects, you can use
## [`daff`](https://github.com/edwindj/daff) to compare them:
##
## ```r
## p <- daff::diff_data(d_old, d)
## daff::render_diff(p)
## ```
## ## The package interface
## Alternatively, we can create a very small R package that lives in
## the repository we store releases against. This package can be
## autogenerated, and is a useful approach when there is a significant
## amount of work needed in processing the data, to simplify
## installation of dependencies used in reading or displaying the
## data, or to work with the data once it has been downloaded. In our
## own use, the repository (but not the package) contains code for
## _building_ the data set (see
## [taxonlookup](https://github.com/traitecoevo/taxonlookup)). The
## package approach is described more fully later in the document.
## Once your git repository is published and your data have been
## released, downloading the data becomes a function call within your
## package. A user would run something like:
##
## ```{r,eval=FALSE}
## d <- mypackage::mydata()
## ```
##
## to fetch or load the data; a specific version can be requested with:
##
## ```{r,eval=FALSE}
## d <- mypackage::mydata("v1.0.0")
## ```
##
## This approach extends to holding multiple versions of the data on a
## single computer (or in a single R session). This might be useful
## when the dataset has changed and you want to see what has changed.
##
## ```{r,eval=FALSE}
## d1 <- mypackage::mydata("v1.0.0")
## d2 <- mypackage::mydata("v1.1.0")
## ## ...compare d1 and d2 here...
## ```
## ## Worked example
## For a csv file you'd provide a function such as `read.csv`, perhaps
## with a few arguments set (e.g., `stringsAsFactors=FALSE`). For an
## rds file you could just use `readRDS`. And for a zip file you would
## need a more involved function.
## First, you will need a package. Creating packages is not that
## hard, especially with tools like
## [devtools](https://github.com/hadley/devtools) and
## [mason](https://github.com/gaborcsardi/mason). Packages make
## running R code on other machines much simpler than sourcing in
## files or copy and paste. Packages are also nice because if your
## data require a specific package to work with (e.g., `ape` for
## phylogenetic trees) you can declare them in your `DESCRIPTION` file
## and R will ensure that they are installed when your package is
## installed and loaded when your package is used.
##
## However, you will need to come up with a few details:
##
## * a package name
## * a name for the dataset (if different to the package name)
## * a _licence_ for your package (the code) and one for your data
## (which is not code)
## * ideally, documentation for your end users
## * the name of the file that you will store with each release
## In addition you need to set up a GitHub token so that you can
## upload files to GitHub from R, or to access your private
## repositories; see the section on authentication below, or just do
## nothing as datastorr will prompt you at the appropriate time.
## The core code can be autogenerated. For example, the package
## [datastorr.example](https://github.com/richfitz/datastorr.example)
## was generated using:
##+ eval=FALSE
datastorr::autogenerate("richfitz/datastorr.example", "readRDS",
                        name = "mydata", roxygen = FALSE)
##+ echo=FALSE, results="asis"
pkg <- datastorr::autogenerate("richfitz/datastorr.example", "readRDS",
                               name = "mydata", roxygen = FALSE)
writeLines(c("```r", pkg, "```"))
## This code can be copied into a file within the package. If you set
## `roxygen=TRUE` you'll get roxygen help that `devtools::document()`
## will convert into R help files and `NAMESPACE` declarations.
## The package can then be loaded and data accessed with the `mydata`
## function.
## To make the release:
## 1. Increase the version number in your `DESCRIPTION` file
##
## 2. Make sure your local repo is fully committed (no unstaged files,
## etc.). This is important if you want to closely associate the
## release with your data, and at the moment datastorr enforces it.
##
## 3. Push your changes to GitHub and install your package
##
## 4. Run `yourpackage::yourdata_release("A description here")`
##
## 5. Check that it all worked by running `yourpackage::yourdata("new version")`
##
## (You can get your new version with `read.dcf("DESCRIPTION")[,
## "Version"]`.)
## ## Access control
## Because GitHub offers private repositories, this gives some
## primitive, but potentially useful, access control. Because
## datastorr uses GitHub's authentication, GitHub knows whether the
## user has access to private repositories. For this to work you
## will need to authorise datastorr to talk to GitHub.
## The simplest way to do this is to let datastorr prompt you when
## access is required. Alternatively, run:
##
## ```r
## datastorr::datastorr_auth()
## ```
##
## to force the authentication process to run (no error and no output
## indicates success). To force using personal access tokens rather
## than OAuth, run:
##
## ```r
## setup_github_token()
## ```
##
## which will walk you through the steps of setting a token up.
## If you use a personal private repository, you can invite other
## users to "collaborate" on the repository. Note that this gives
## those users push access to the repository; the access control is
## very coarse.
##
## If you have an organisation account you can create groups of users
## that have read only access to particular repositories, which will
## likely scale better.
## ## Semantic versioning of data
## Some will argue that semantically versioning data is not possible,
## and they are probably right. But you need to go with some
## versioning system, and if the idea of semantically versioning data
## bothers you, use incrementing integers (`v1`, `v2`, `v<n>`) and
## read no further!
## The idea with semantic versioning is that it formalises what people
## already do with version numbers. We feel this can be applied fairly
## successfully to data.
## * **Update patch release**: small changes, backward compatible.
##   * adding new rows to the data set (more data)
##   * error correcting existing data
##
## * **Update minor version**: medium changes, but generally backward
##   compatible.
##   * new columns
##   * substantial new data
##   * new tables
##
## * **Update major version**: large (API) changes, likely to be
##   backward incompatible.
##   * renaming or deleting columns
##   * changing variable coding
##   * deleting large amounts of data
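## Whichever scheme you pick, base R can compare dotted version
## numbers component-wise, which is handy when checking whether a
## cached copy is current (a leading "v" must be stripped first):
##
## ```r
## ## Components are compared numerically, so 1.10.0 sorts after 1.9.0.
## stopifnot(numeric_version("1.10.0") > numeric_version("1.9.0"))
## ## Tags like "v1.2.3" need the "v" removed before parsing:
## numeric_version(sub("^v", "", "v1.2.3"))
## ```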
## Forks make this a lot more complicated. If two people are working
## in parallel, how do they decide what version number to use?
## However, with our solution, the datasets are still sensibly named;
## we have:
##
## * `user1/dataset@v1.2.3`
## * `user2/dataset@v1.3.5`
##
## It's just not possible to know from the outside exactly what
## differs between the datasets, but they are at least distinctly
## named (and you could download both of them). When the fork is
## resolved and `user2` merges back into `user1`, the two researchers
## can discuss what version number they want to use. Like resolving
## merge conflicts, we see this as a _social_ problem, not a
## _technological_ one, and the solution will be social.
## ## Beyond GitHub
## Apart from the ease of use, mindshare and the explicit association
## between data and code, there is no strong reason to use GitHub
## here. Certainly Bitbucket provides all the same functionality that
## is required to generalise our approach to work there. And
## self-hosting would work too, with more effort. Over time we may
## develop support for alternative storage providers.
## At the same time, GitHub's fast and generally reliable webserver,
## its access controls, and its nice API make it a great first place
## to try this proof of concept.
## ## How it _actually_ works
## GitHub has an API that lets you programmatically query the state of
## basically everything on GitHub, as well as *create* things. So the
## interaction with the website is straightforward; getting lists of
## releases for a repository, filenames associated with releases, etc.
## With this information, `datastorr` uses a
## [`storr_external`](https://richfitz.github.io/storr/vignettes/external.html)
## object and stores data with versions as keys. If a version is not
## found it is downloaded (using the information from GitHub) and read
## into R using the `read` function. A copy of this R-readable
## version is saved to disk.
## In order to save and load data repeatedly, especially across
## different projects on the same computer, `datastorr` uses the
## `rappdirs` package to find the "Right Place" to store "application
## data". This varies by system and is documented in the
## `?rappdirs::user_data_dir` help page. Using this directory means
## there is little chance of accidentally committing large data sets
## into the repository (which might be a problem if storing the data
## in a subdirectory of the project).
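## For example (the exact path varies by operating system, and this
## assumes the rappdirs package is installed):
##
## ```r
## rappdirs::user_data_dir("datastorr")
## ## e.g. "~/.local/share/datastorr" on Linux
## ```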