The rationale for lightweight versioned data is laid out in this paper. The simple goal is to provide versioned data to the world in a way that allows coded access to all versions, new and old. Moreover, this project aims to provide the tools to do this for free, without overly onerous ongoing financial or time commitments, while maintaining curation of the underlying data.
This set of instructions relies on a basic knowledge of git and GitHub. If you're a bit rusty on this, see here for a general introduction. This tutorial frequently uses tools for setting up R packages. For an excellent general introduction to the topic, see Hadley Wickham's website/book.
1. Start from a copy of the `versioned_data_template` repository.
2. Store your data in a `.csv` file. (If your data is too complex for a csv file, this is still possible, see below. The data may be, but does not have to be, pushed to the cloud repository.)
3. Install `devtools` if you don't have it already.
4. Install `datastorr`, which manages the interface between your computer and GitHub behind the scenes. For more on datastorr functionality see this repo. To install, in R run: `devtools::install_github("ropenscilabs/datastorr")`
5. In the `dataset_access.R` file, rename the main function called `dataset_access_function` to something specific to your dataset.
6. In the `dataset_access.R` file, find the `dataset_info` function and change 1) the name of the repository to your repository name and 2) the name of the file to reflect the name of the file that contains your data.
7. If your data is not in a `csv` file, you will have to write an input function that loads your data into R. Write this input function, include it in the `dataset_access.R` file, and replace `read_csv` with a call to your input function so that your dataset reads nicely into R in a way that's convenient for your users.
8. Update the documentation for `dataset_access_function`, which you renamed above. This will show up as the R help file for users once they download and install your package.
9. Run `devtools::document()`.
10. Run `devtools::load_all()`.
11. Run `<your_package_name>:::dataset_release("<description>")`, where `<description>` is a description of what changed in the package. This should push version 0.0.1 to a GitHub release.
12. Test the release by calling `dataset_access_function`. The data should download from GitHub and load nicely into R.
13. Mint a DOI for the release. The specifics of this depend on which DOI minter you use. We have used both Zenodo and figshare. Each source has its own short tutorials for setting this up. The Zenodo/GitHub tutorial is here. All of the points made in the tutorial apply equally to code and to data.
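For the case where the data is not a csv file, the input function can be quite small. The following is a minimal sketch, assuming a semicolon-delimited text file with `#` comment lines; the function name `read_my_data`, the file layout, and the column names are all hypothetical and only illustrate the kind of function that would take the place of `read_csv` in `dataset_access.R`.

```r
# Hypothetical input function: takes a file path, returns a data frame.
# Assumes a semicolon-delimited file whose comment lines start with "#".
read_my_data <- function(filename) {
  dat <- utils::read.table(filename,
                           header = TRUE,
                           sep = ";",
                           comment.char = "#",
                           na.strings = c("", "NA"),
                           stringsAsFactors = FALSE)
  # Normalise column names so users get a predictable interface.
  names(dat) <- tolower(names(dat))
  dat
}

# Try it on a small made-up file before wiring it into the package.
tmp <- tempfile(fileext = ".txt")
writeLines(c("# made-up example file",
             "Species;Height_m",
             "Acer rubrum;25.1",
             "Quercus alba;"),
           tmp)
example_data <- read_my_data(tmp)
str(example_data)
```

Keeping the reader as a plain function of a single file path makes it easy to test on a local copy of the data before any release is made.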
That's it. You now have a package that is set up for distributing stable versioned data to the world.
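As a rough illustration of what the finished package exposes, the access functions are thin wrappers around datastorr. The sketch below is written from our reading of the datastorr documentation rather than copied from the template, so treat the exact arguments as assumptions and check them against the template's `dataset_access.R`. The repository name, the file name `data.csv`, and the `my_dataset*` function names are placeholders, and `utils::read.csv` stands in for whichever reader you use.

```r
# Placeholder repository and file names; replace with your own.
my_dataset_info <- function(path = NULL) {
  datastorr::github_release_info("your_username/your_repository",
                                 filename = "data.csv",
                                 read = utils::read.csv,
                                 path = path)
}

# Fetch a given version of the data (NULL means the most recent one).
my_dataset <- function(version = NULL, path = NULL) {
  datastorr::github_release_get(my_dataset_info(path), version)
}

# List versions: local = TRUE for those already downloaded,
# local = FALSE to ask GitHub which releases exist.
my_dataset_versions <- function(local = TRUE, path = NULL) {
  datastorr::github_release_versions(my_dataset_info(path), local)
}

# Typical use once a release exists (needs datastorr and network access):
# my_dataset_versions(local = FALSE)
# old <- my_dataset("0.0.1")
# new <- my_dataset()
```

Because every version stays reachable by its version string, an analysis script can pin the exact release it was written against.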
We recommend suggesting that users flag issues using the "issue tracker" functionality of GitHub. This will allow specific questions to be asked, discussed, and resolved. In some cases these queries may lead to improvements of the underlying dataset; in that case, it makes sense to release a new version of the dataset. Note: if you find an issue with this tutorial, please raise an issue on this repository!
When your dataset improves via error fixes or data addition, and you're happy with the changes, there are a few simple steps to bump the dataset into the future.
1. Update the `DESCRIPTION` file to increase the version number. Semantic versioning is one way to manage these changes.
2. Commit `DESCRIPTION` and push to GitHub.
3. Run `<your_package_name>:::dataset_release("<description>")`, where `"<description>"` is a brief description of the improvements to the dataset.
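Editing the version number by hand works fine, but the bump can also be scripted. The following is a minimal sketch using only base R, assuming a three-part `major.minor.patch` version string; `bump_version` and `bump_description` are hypothetical helpers, not part of datastorr or devtools. Note that `write.dcf` may re-wrap long fields, so review the diff before committing.

```r
# Increment one component of a "major.minor.patch" version string
# and reset the components to its right to zero.
bump_version <- function(version, level = c("patch", "minor", "major")) {
  level <- match.arg(level)
  parts <- as.integer(strsplit(version, ".", fixed = TRUE)[[1]])
  stopifnot(length(parts) == 3, !anyNA(parts))
  idx <- match(level, c("major", "minor", "patch"))
  parts[idx] <- parts[idx] + 1L
  if (idx < 3) parts[(idx + 1):3] <- 0L
  paste(parts, collapse = ".")
}

# Rewrite the Version field of a DESCRIPTION file in place.
bump_description <- function(path = "DESCRIPTION", level = "patch") {
  desc <- read.dcf(path)
  new_version <- bump_version(unname(desc[1, "Version"]), level)
  desc[1, "Version"] <- new_version
  write.dcf(desc, file = path)
  invisible(new_version)
}

# Demonstration on a throwaway DESCRIPTION file.
tmp <- tempfile()
writeLines(c("Package: mydata", "Version: 0.0.1"), tmp)
bump_description(tmp, "minor")
bumped <- unname(read.dcf(tmp)[1, "Version"])
print(bumped)
```

Under semantic versioning, one reasonable convention for data is a patch bump for error fixes, a minor bump for added records, and a major bump for changes to the structure of the dataset; this mapping is a suggestion, not something the tools enforce.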
Happy data versioning!