Introduction

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
Sys.unsetenv("CONTENTID_REGISTRIES")
Sys.setenv("CONTENTID_HOME"= tempdir())

library(contentid)

Reproducible Data Access

library(readr)
library(pins)
library(contentid)
library(stringr)

Barriers to data access

Traditional ways of working with data -- as files on a file system -- limit the reproducibility of code to local compute environments. A typical R analysis file will load one or many data files from the local disk with code like this:

delta_catch <- readr::read_csv('/Users/jkresearcher/Projects/2018/Delta_Analysis/delta_catch.csv')
delta_taxa <- readr::read_csv('../../Delta_2021/delta_taxa.csv')
delta_effort <- readr::read_csv('delta_effort.csv')
delta_sites <- readr::read_csv('data/delta_sites.csv')

Which of those file paths are the most portable? Which will run unmodified both on the original computer they were written on and on colleagues' computers? In reality, none of them: each requires that a specific data file be present at a specific location for the code to work, and those assumptions are rarely met and hard to maintain. Hardcoded paths like these are often scattered throughout the scripts that researchers write, and tend to surface as surprises when the code is run elsewhere.

The Web partly solves this problem, because it allows code to access data that is located somewhere on the Internet with a web URI. For example, loading data from a web site can be much more portable than loading the equivalent data from a local computer.

delta_sites_edi <- 'https://portal.edirepository.org/nis/dataviewer?packageid=edi.233.2&entityid=6a82451e84be1fe82c9821f30ffc2d7d'
delta_sites <- readr::read_csv(delta_sites_edi, show_col_types = FALSE)
head(delta_sites)

In theory, that code will work from anyone's computer with an internet connection. But code that downloads the data every time it is run is not particularly efficient, and will be prohibitive for all but the smallest datasets. A simple solution is to cache a local copy of the dataset and only retrieve the original from the web when no local copy exists. In this way, the data are downloaded the first time the code is run, and the local copy is used from then on. This can be accomplished with some simple conditional logic in R, and the pattern has been packaged up by the pins package.
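
A minimal sketch of that manual caching pattern, assuming a hypothetical cache path data_cache/delta_sites.csv (delta_sites_edi is the URL defined above):

# Download only if we don't already have a local copy (hypothetical cache path)
local_copy <- "data_cache/delta_sites.csv"
if (!file.exists(local_copy)) {
  dir.create(dirname(local_copy), showWarnings = FALSE, recursive = TRUE)
  download.file(delta_sites_edi, local_copy, mode = "wb")
}
delta_sites <- readr::read_csv(local_copy, show_col_types = FALSE)

With pins, the same pattern reduces to a single call: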

delta_sites_edi <- pins::pin('https://portal.edirepository.org/nis/dataviewer?packageid=edi.233.2&entityid=6a82451e84be1fe82c9821f30ffc2d7d')
delta_sites <- readr::read_csv(delta_sites_edi, show_col_types = FALSE)
head(delta_sites)

You'll note that this code takes longer the first time it is run, when the data file is downloaded; subsequent runs use the cached copy. While this works well over the short term, abundant evidence shows that web URIs have a short lifespan: most are defunct within a few years (e.g., see McCown et al. 2005). Only the most carefully curated web sites maintain the viability of their links for longer, and maintaining them over decades requires a focus on archival principles and dedicated staff to ensure that files, and the URLs at which they are published, remain accessible. This is precisely the role of archival data repositories like the Arctic Data Center, the KNB Data Repository, and the Environmental Data Initiative (EDI).

Finally, no discussion of data access and persistence would be complete without mentioning Digital Object Identifiers (DOIs). DOIs have become the dominant means of creating persistent links to academic articles, publications, and datasets. As authority-based identifiers, they work because an authority assigns a DOI name to a published work and then ensures that the DOI name always redirects to the current web location of the resource. This is a lot of work, and there are no guarantees that the authorities will keep the links up to date. Journals, societies, and data repositories actively maintain the redirection between a DOI such as doi:10.6073/pasta/b0b15aef7f3b52d2c5adc10004c05a6f and its current location on the EDI Repository. DOIs are commonly assigned to published datasets, and include the bibliographic metadata needed to properly cite and access the dataset.

The challenge with DOIs as they are typically implemented is that they are usually assigned to a Dataset, which is a collection of digital objects that are composed to form the whole Dataset and that can be accessed individually or through an API. Typically, the metadata attached to DOIs does not include an enumeration of those digital objects or a clear mechanism to get to the actual data -- rather, the DOI redirects to a dataset landing page that provides a human readable summary of the dataset, and often various types of links to find and eventually download the data. Despite advances in metadata interoperability from DCAT and schema.org/Dataset, there is currently no reliable way to universally go from a known DOI for a dataset to the list of current locations of all of the digital objects that compose that dataset. And yet, this is exactly what we need for portable and persistent data access. In addition, we frequently work with data that doesn't have a DOI yet as we are creating derived data products for analysis locally before they are published. Overall, DOIs are a great approach to uniquely citing a dataset, but they do not provide a way for code to download specific, versioned digital objects from a dataset in a portable way that is persistent over many years.

Thus, we want data access to be portable, persistent, reproducible, and traceable.

A powerful approach to solving these problems is to use content-based identifiers rather than authority-based identifiers like DOIs. A content-based identifier, or contentid for short, can be calculated from the content of a data file itself, and is unique (within constraints) to that content. This is accomplished using a "hash" function, which calculates a relatively short, fixed-length, and effectively unique value for any given input. Hash functions form the basis of modern cryptography and secure messaging, so there are many tools available for conveniently hashing data. In our use case, we can use commonly available cryptographic hash functions (such as SHA-256 and SHA-1) to calculate a unique identifier for any given file. This identifier can be calculated by anyone with a copy of the file, and can be registered as metadata in repositories that hold those files.
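
For example, the content_id() function in contentid computes such an identifier for a local file or URL. Here we apply it to a small sample data file that ships with the package (the same Vostok ice core file used later in this document):

# Compute the content identifier (a SHA-256 hash, by default) of a local file
vostok_co2 <- system.file("extdata", "vostok.icecore.co2", package = "contentid")
contentid::content_id(vostok_co2)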

Once we have a content identifier for an object, we can cache the file locally (just as we did with pins), and we can query repositories to see if they hold a copy of that file. Unlike authority-based identifiers, anyone who possesses a copy of a specific version of a data file can calculate its content identifier, which lets us build systems to find and access those data files across the repository landscape, and indeed across any web-accessible location. This has all of the power of the caching and pinning of web resources demonstrated above, with the added advantage that all holders of the content will use an identical identifier, avoiding broken links. And because content identifiers can be computed locally before files are published on the web, we can use them in our scripts for data files that have yet to be published, knowing that the scripts will work for others once the files appear in a repository.

Persistent and portable data access for improving reproducibility

We illustrate this with the following IEP dataset that is stored on EDI:

Interagency Ecological Program (IEP), B. Schreier, B. Davis, and N. Ikemiyagi. 2019. Interagency Ecological Program: Fish catch and water quality data from the Sacramento River floodplain and tidal slough, collected by the Yolo Bypass Fish Monitoring Program, 1998-2018. ver 2. Environmental Data Initiative. https://doi.org/10.6073/pasta/b0b15aef7f3b52d2c5adc10004c05a6f (Accessed 2021-10-30).

You can view this IEP dataset on DataONE, and it is also visible from the EDI dataset landing page.

It contains several data files, each of which is at a specific web URI, including:

delta_catch_url <- "https://cn.dataone.org/cn/v2/resolve/https%3A%2F%2Fpasta.lternet.edu%2Fpackage%2Fdata%2Feml%2Fedi%2F233%2F2%2F015e494911cf35c90089ced5a3127334"
delta_taxa_url <- "https://cn.dataone.org/cn/v2/resolve/https%3A%2F%2Fpasta.lternet.edu%2Fpackage%2Fdata%2Feml%2Fedi%2F233%2F2%2F0532048e856d4bd07deea11583b893dd"
delta_effort_url <- "https://cn.dataone.org/cn/v2/resolve/https%3A%2F%2Fpasta.lternet.edu%2Fpackage%2Fdata%2Feml%2Fedi%2F233%2F2%2Face1ef25f940866865d24109b7250955"
delta_sites_url <- "https://cn.dataone.org/cn/v2/resolve/https%3A%2F%2Fpasta.lternet.edu%2Fpackage%2Fdata%2Feml%2Fedi%2F233%2F2%2F6a82451e84be1fe82c9821f30ffc2d7d"
delta_catch_edi <- 'https://portal.edirepository.org/nis/dataviewer?packageid=edi.233.2&entityid=015e494911cf35c90089ced5a3127334'
delta_taxa_edi <- 'https://portal.edirepository.org/nis/dataviewer?packageid=edi.233.2&entityid=0532048e856d4bd07deea11583b893dd'
delta_effort_edi <- 'https://portal.edirepository.org/nis/dataviewer?packageid=edi.233.2&entityid=ace1ef25f940866865d24109b7250955'
delta_sites_edi <- 'https://portal.edirepository.org/nis/dataviewer?packageid=edi.233.2&entityid=6a82451e84be1fe82c9821f30ffc2d7d'

Storing a content identifier from a URI

We use the contentid package for portable access to data. First, given a web URI, store the content in your local registry to cache it on your machine. The contentid::store() function retrieves the data from the URL, calculates a hash value for the content, and stores both in a local registry on your machine. This is very similar to pins::pin(), but it uses the content identifier, rather than the URL, to point to the data.

delta_catch_id <- store(delta_catch_url)
delta_taxa_id <- store(delta_taxa_url)
delta_effort_id <- store(delta_effort_url)
delta_sites_id <- store(delta_sites_url)

print(c(delta_catch_id=delta_catch_id, 
        delta_taxa_id=delta_taxa_id,
        delta_effort_id=delta_effort_id, 
        delta_sites_id=delta_sites_id))

Accessing a data file via a URL or file path introduces a degree of ambiguity from a reproducibility and provenance perspective. Is the intent of the code to read in the "latest version" of whatever content is found at that address, or has the code been written with the assumption that it receives exactly the same input every time? Content-based identifiers let code be explicit on this point. Instead of writing an ambiguous URL or file path into the script, we can embed the content hash itself, using variable assignment to give the opaque hash a more convenient alias:

# SHA-1 identifiers for the same content are equally valid:
# delta_catch_id <- 'hash://sha1/317d7f840e598f5f3be732ab0e04f00a8051c6d0'
# delta_taxa_id <- 'hash://sha1/1bf0da8443e5bf8c9e7d16bf715a33129c9ff169'
# Here we use the SHA-256 identifiers returned by store() above:
delta_catch_id <- 'hash://sha256/e0dc10d7f36cfc5ac147956abb91f24f2b2df9f914a004bbf1e85e7d9cf52f41'
delta_taxa_id <- 'hash://sha256/1473de800f3c5577da077507fb006be816a9194ddd417b1b98836be92eaea49d'
delta_effort_id <- 'hash://sha256/f2433efab802f55fa28c4aab628f3d529f4fdaf530bbc5c3a67ab92b5e8f71b2'
delta_sites_id <- 'hash://sha256/e25498ffc0208c3ae0e31a23204b856a9309f32ced2c87c8abcdd6f5cef55a9b'

Loading data from a content identifier

Once you have the content identifier for a data file of interest (e.g., delta_catch_id in this case), you can call contentid::resolve() to find the locations where that data is stored. Because you have already stored it locally, it returns the path to the file in your local store, which you can then use to load the data into a data frame or process it as needed.

delta_catch_file <- contentid::resolve(delta_catch_id, store = TRUE)
delta_catch <- readr::read_csv(delta_catch_file, show_col_types=FALSE)
head(delta_catch)

# And two more examples
delta_taxa_file <- contentid::resolve(delta_taxa_id, store = TRUE)
delta_taxa <- readr::read_csv(delta_taxa_file, show_col_types=FALSE)

delta_sites_file <- contentid::resolve(delta_sites_id, store = TRUE)
delta_sites <- readr::read_csv(delta_sites_file, show_col_types = FALSE)

This approach is portable, because anyone can run the code without having the data beforehand: resolve(id) will store the data locally if a copy is not already present in the local cache. This works by consulting a number of well-known registries to discover the locations of the files, including DataONE, Hash Archive, Zenodo, and Software Heritage.

This approach is persistent, because it pulls data from these persistent archives and can take advantage of archive redundancy. For example, here is the list of locations that can currently be used to retrieve this data file:

contentid::query_sources(delta_catch_id, cols=c("identifier", "source", "date", "status", "sha1", "sha256"))

# [BUG FILED](https://github.com/cboettig/contentid/issues/81): query_sources() should not return an error on inaccessible repositories -- it should skip them and produce a warning, so that the local registry still works when disconnected from the internet.

This approach is reproducible, as the exact version of the data will be used every time (even if someone changes the data at the original web URI, which would require a new content identifier).

This approach is traceable, because the code references the specific data used via its content identifier, and the only way to change which data are used is to change the identifier being referenced to that of a new version.

Storing and using local data identifiers

Because not all data are already published, it is also helpful to begin working with content identifiers before the data are made public on the web. This is easily accomplished by storing a file in the local registry and then using its content identifier during analysis.

# Store a local file
vostok_co2 <- system.file("extdata", "vostok.icecore.co2", package = "contentid")
id <- store(vostok_co2)
vostok <- retrieve(id)
co2 <- read.table(vostok, col.names = c("depth", "age_ice", "age_air", "co2"), skip = 21)
head(co2)

Later, when the data file is published to a DataONE repository, or registered in Hash Archive, the script will work for other people trying to access it via contentid::resolve().

contentid with metadata ecosystems

There are many tasks that contentid does not solve, but it is designed to integrate easily into existing solutions.

Here we outline an example that illustrates one such solution using the schema.org/Dataset metadata model. While schema.org is a flexible and widely recognized metadata format, many other metadata formats are used for a wide variety of purposes. The same algorithmic steps we accomplish here with schema.org markup could be done with many other standards, including the metadata formats of major data archives like the DataONE network, or just as easily with a customized internal format.

rfishbase is an R package which accesses regularly (semi-annually) released snapshots of tables from the FishBase.org database.
This R-based software needs a mechanism to ask for the "latest version" of a given table, say, the species table, as well as older versions. These tables are serialized as .parquet files (or, previously, as compressed tab-separated-value files) and posted to a downloadable location. Previous versions of the package downloaded this data directly from a given URL, such as: https://github.com/ropensci/rfishbase/releases/download/fb-19.04/fb.2fspecies.tsv.bz2

The package relied on the URL path to identify both the data version (19.04, i.e. April 2019) and the table name (species). To select the 'latest' version, the package had to rely on the GitHub API to report the most recent available "tag". This approach is exposed to all the usual challenges of URL-based access, and forgoes the benefits of content-based access, such as caching results to avoid re-downloading the same bytes and the robustness of resolving the content from multiple provider URLs. However, at first glance this appears incompatible with the approach of contentid, since the code must be able to access the latest version.

To address these challenges, we introduce a metadata layer, using the schema.org standard to describe a "DataDownload" object (see https://schema.org/DataDownload) for each table in the database. An excerpt of this markup looks like this:

{
  "@context": "http://schema.org/",
  "@graph": [
    {
      "id": "urn:uuid:6e6c12fb-3b2f-4a2b-a81d-cf662b5ae321",
      "type": "Dataset",
      "issued": "2021-06-01",
      "license": "https://creativecommons.org/licenses/by-nc/3.0/",
      "name": "Fishbase Database Snapshot: A Parquet serialization",
      "version": "21.06"
      "creator": {
        "type": "Organization",
        "name": "FishBase.org"
      },
      "version": "21.06",
      "description": "Database snapshot prepared by rOpenSci courtesy of Fishbase.org",
      "distribution": [
      {
        "id": "hash://sha256/11284f8036fdb3599ebeb503c6e32dab6642ffbc1f5be1083c1590eb962a188b",
        "type": "DataDownload",
        "contentSize": 4954978,
        "dateCreated": "2021-11-16",
        "description": "output data",
        "encodingFormat": "application/vnd.apache.parquet",
        "name": "species.parquet"
      },
  ...

From this JSON-LD file [@jsonld], we can learn the content identifier of a table with the name "species.parquet" created on 2021-11-16.
Other versions of this table with the same name, but a different creation date and (potentially) different content identifier, can be found in other blocks within the metadata record. This relatively minimal record provides only some core metadata, such as the encodingFormat and contentSize, though it would be straightforward to include additional fields reflecting other potentially relevant information, such as authorship, citation, and data provenance. Some of this is already visible in the parent Dataset object. For the described use case, however, this record has everything the software needs to determine the content identifier for any requested "version" of the table. Once the content identifier is determined, contentid::resolve() can resolve the identifier to the actual data, wherever it may be found: in this case, on the user's hard drive (if previously downloaded), at the GitHub URL, or in the Software Heritage snapshot.

The rfishbase package accesses this metadata record by first attempting to read the most recent version of the JSON-LD file from a URL, falling back on a local copy of the metadata file if an internet connection is unavailable. Unlike the earlier versions of rfishbase, this avoids dependence on the GitHub API to determine the latest version, supports local caching automatically, and ensures the data file has not been corrupted.
The metadata record provides a generic, portable, machine-readable description of all tables that can be accessed by the package, along with a natural mechanism for including additional information about each table, if desired.

The use of such a metadata record in the package solves the three concerns initially highlighted above: (1) the metadata provides a well-defined link between the specific content and the citation that can be read by both humans and machines; (2) the reference to a specific table is no longer opaque: in this case, a file name, species.parquet, is used as a human-friendly alias, and the metadata record could of course provide richer information, such as the description, encodingFormat, or other recognized fields of schema.org/DataDownload, to make an even more transparent reference; (3) we are not restricted to accessing a specific version, but can always query for the "latest" version of the content, as defined by this metadata record.

At the same time, we emphasize that this approach is merely facilitated by contentid.
None of the mechanics of contentid assume that metadata will be represented in the schema.org format, or in any particular format. contentid leaves metadata out of scope not because metadata is unimportant, but because there are simply too many existing metadata formats serving many different use cases. Users are encouraged to select the metadata structure that works best for them and use it in concert with contentid.

Registries

contentid is built around the concept of "registries".
Registries provide a mechanism to resolve a content-based identifier to a location (a URL or local path) at which the desired content can be accessed. Some registries are also storage providers for the content in question, while others are simply a directory lookup service. contentid distinguishes between "local" registries, implemented by the package itself on the local disk and not requiring any internet connection, and "remote" registries, which access a third-party service through internet requests.

At the time of writing, contentid supports three flavors of local registries: a content-addressed store, a tsv-backed registry, and a Lightning Memory-Mapped Database (LMDB) registry. Additionally, contentid currently supports four remote registries: the permanent data archives DataONE, Software Heritage, and Zenodo, as well as Hash-Archive.org, which registers but does not store content.
Most contentid functions, such as resolve, automatically select a collection of default registries, as returned by the default_registries() function:

default_registries()

The defaults include two local registries: a tsv-backed registry (indicated by a file path ending in registry.tsv) and a content-addressed registry (indicated by a filepath to a directory). Only the LMDB registry is not activated by default, as the R package providing the driver for the LMDB connection, thor [@thor], is only an optional dependency. The defaults can be configured using a comma-separated list in the environmental variable CONTENTID_REGISTRIES, or by passing a vector of desired registries to the registries argument of most functions. Note that it is possible to set multiple local registries of the same type (e.g. two tsv-backed registries) simultaneously, if desired. Not all registries support all the same features, in particular, not all support the same hash algorithms, as discussed below.
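
As a quick sketch, the environment variable can be set from within R for the current session. Here we combine the project-local registry.tsv file created later in this document with the local content store returned by content_dir(); unsetting the variable restores the defaults:

# Use only a project-local tsv registry plus the local content store
Sys.setenv(CONTENTID_REGISTRIES = paste("registry.tsv", content_dir(), sep = ","))
default_registries()

# Restore the package defaults
Sys.unsetenv("CONTENTID_REGISTRIES")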

Local registries

The content-addressed store is configured by providing an absolute file path to any writable directory on the local machine. The default is given by the value of tools::R_user_dir("contentid"), or may be set using the environmental variable CONTENTID_HOME. This registry is also a store, meaning that it holds the data itself and not just an address for the data. Files can be added to this registry using the store() function, or by setting store=TRUE in calls to resolve(). This is the simplest form of registry possible: files are copied to the selected directory and given names based on their SHA-256 hash. To improve filesystem performance, files are placed in subdirectories based on the first and second pairs of characters in the (hexadecimal) SHA-256 hash, i.e. a file with identifier hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37 will be stored in the nested sub-directory 94/12/ with the name 9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37. Knowing only the hash, it is trivial to construct the path, a task performed by contentid::retrieve() and sketched below. This registry is checked before all other registries when calling resolve(), allowing the function to return as quickly as possible when the data is already stored there.
(This is also the only registry that does not force a checksum calculation to resolve, making it very fast even on large objects.)
Note that this registry is specific to SHA-256.
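
A minimal sketch of that path construction (retrieve() is the supported interface; the exact layout of the store root may differ slightly from this hand-built path):

# Reconstruct the expected storage path for a given identifier by hand
id <- "hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37"
hash <- sub("^hash://sha256/", "", id)
file.path(content_dir(), substr(hash, 1, 2), substr(hash, 3, 4), hash)

# The equivalent, supported lookup (the content was added to the store earlier in this document)
retrieve(id)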

Managing content storage: Storing data by content address ensures that our files are not accidentally over-written by other content, and provides the most minimal and fastest data registry and store. However, because hashes are opaque, it is easy to lose track of what objects are in the content store without a metadata record. Meanwhile, because every unique version of a file is preserved, it is also possible to fill up available storage very quickly. Therefore, users are encouraged to treat the local content store as a temporary cache that works in conjunction with data accessible through another registry type, rather than as the sole long-term storage location for data. To manage disk use, it may be helpful to purge this storage location from time to time when it grows too large. The helper utility purge_cache() does exactly that, removing the oldest files in the content store until the cache is reduced to within a specified threshold size, or removing any files over a maximum age.
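
For example (a minimal sketch, calling the helper with its defaults; see ?purge_cache for the threshold-size and maximum-age arguments described above):

# Trim the local content store, dropping the oldest cached files first
purge_cache()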

TSV-backed registry. The tsv-backed registry uses a simple tab-separated-value list to match URLs or filepaths to a content identifier. By default, this registry computes and stores a sha256 checksum, which can be configured using the environmental variable "CONTENTID_ALGOS" (which also recognizes md5, sha1, sha384 and sha512 checksums). Entries can be added to the tsv registry using the register() function.
By default, register() will also attempt to register URLs with any hash-archive type registries, discussed below. Here we enforce a tsv-only registry, specifying the creation of a tsv file in the current working directory rather than the default location.

id <- register("https://knb.ecoinformatics.org/knb/d1/mn/v2/object/ess-dive-457358fdc81d3a5-20180726T203952542",
              "registry.tsv")

Registering returns a content identifier. We can look up all registered sources for this identifier in the registry we just created:

sources(id, "registry.tsv")

Note that this registry is not also a store -- the tsv file preserves only the URL or path at which the source was found, not the data itself. If the file is moved to another local path or URL, resolve() will detect the failure and try the next available source, if any.
It is possible to register a local filepath instead of a URL, though these sources will not be portable. Extracting sources from a tsv-backed registry is quite fast owing to the performance of vroom.
However, users maintaining local registries with millions of entries will benefit from the much greater performance of LMDB, provided by the thor package.

LMDB-backed registry

The LMDB-backed registry is the only one not activated by default. It is a special-case registry that can be useful for users seeking to manage a huge number of individual objects by contentid, typically without calls to remote services. As such, it will typically be used alone, rather than simultaneously with (generally much slower) calls to other registries.

# LMDB is a database handle, not a path
db <- default_lmdb()
id <- register("https://knb.ecoinformatics.org/knb/d1/mn/v2/object/ess-dive-457358fdc81d3a5-20180726T203952542",
               registries = db)
sources(id, db)

All local registries are locally persistent, meaning that information stored there should persist to disk over multiple R sessions, but is accessible only on the local hard disk of the machine creating the registry. It is possible to share local registries with other users or machines by simply copying over the .tsv file or the contents of the content-store directory.

Remote registries

Software Heritage

DataONE

Zenodo

Hash-Archive

Content-based identifier types and formats

Hashes are computed using openssl [@openssl]. Contrary to conventional wisdom, less complex hashes such as MD5 are not necessarily faster to compute than cryptographically sound hashes such as SHA-256, especially on larger objects, owing to hardware acceleration for the universally important SHA family of hashes, which is built into most modern chips and expressly supported by recent versions of the OpenSSL library.
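
A quick illustrative sketch using openssl directly (timings are machine-dependent, and the sample file below is small, so differences will be negligible; the pattern matters more than the numbers):

# Stream a file through two hash functions and compare elapsed time
library(openssl)
path <- system.file("extdata", "vostok.icecore.co2", package = "contentid")
system.time(sha256(file(path, "rb")))
system.time(md5(file(path, "rb")))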

Discuss choice of hash algorithm, what is supported where

Discuss alternative serializations (named info etc), and (partial) support for recognizing them.

Comparison to other approaches

Many other widely recognized software tools and techniques also rely on content-based identifiers, including, but not limited to, IPFS, DAT, git, torrents, and blockchains. ...


What is a content-based identifier?

Content-based vs location-based

We typically think of data files through a location-based paradigm. A file location may be specified as a Uniform Resource Locator (URL), such as https://example.com/data.csv, or as a local relative or absolute file path, such as path/to/data.csv or /home/user/path/to/data.csv. Note that the file name, data.csv, can be considered an example of a relative file path. DOIs and other persistent identifiers (e.g. EZIDs, UUIDs) are typically used in a location-based manner as well: a central service "resolves" a DOI to a specific URL of a specific data repository. The use of the redirect makes it possible for that URL to be updated later, ameliorating the issue of link rot [@fenner], but the identity of the file remains specified by its location in a particular data repository or archive. By contrast, content-based identifiers refer to files by their cryptographic checksum, or hash. A data file has the same checksum regardless of its location: one of the primary uses of checksums has been to ensure that a file has not been altered during transfer from one location to another. Content-addressed systems such as git and Dropbox store and retrieve files based on such checksums. Because every committed change to a file has a unique hash, this approach is particularly compelling for version control (git). Because identical files have the same hash, this approach is also a natural choice when de-duplication is a priority (Dropbox). In both systems, the user is presented with a location-based interface, allowing a user to rely on location-based intuition for accessing files while simultaneously being able to work with the same content across multiple locations or devices. Content-addressed systems are also a key component of distributed file sharing such as torrents [@torrent]. The success of such platforms has led to various initiatives to provide "git for data" [@IPFS; @dat]. This paper is not a proposal for building such a platform; rather, it seeks to examine design principles which allow any such approach to be efficient, interoperable, and compatible with existing data archiving systems. We illustrate how a content-based identifier approach can be incorporated into daily scripts in place of local paths or URLs, providing robust and reliable access to data using an easy-to-generate persistent identifier that can follow a data product from the moment it exists in digital form through preliminary distribution to publication in a data archive.

From @hash-uri

The hash URI scheme follows RFC 3986.

 hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37?type=text/plain#top
 \__/   \____/ \______________________________________________________________/ \_____________/ \_/
  |       |               |                                                            |         |
scheme algorithm         hash                                                        query    fragment

Scheme: The fixed string "hash".

Algorithm: A hash algorithm name, in the form of a pseudo-DNS name. Acceptable characters are alpha-numerics, . (period) and - (hyphen). Dotted segments must have at least one character and cannot begin or end with hyphens.

Hash: The hash of the resource, using the given algorithm. Currently must be encoded in hexadecimal (base-64 support is planned). Can be truncated to any non-zero number of characters, although this may lead to ambiguity in resolvers. Hex encoding is case-insensitive but other encodings may be case-sensitive.

Query: Query parameters, which are interpreted by the resolver. Since a hash URI can be resolved by different systems, query parameters must be semi-standardized.

Fragment: Indicates a sub-resource, if any.

$$\underbrace{\texttt{hash://}}_{\textrm{scheme}}\underbrace{\texttt{sha256/}}_{\textrm{algorithm}}\underbrace{\texttt{9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37}}_{\textrm{hash}}\underbrace{\texttt{?type=text/plain}}_{\textrm{query (optional metadata)}}$$

Why use content-based identifiers

DOIs do not refer to specific content, which makes them difficult to use directly in scripts and software packages. In general, a DOI merely needs to redirect to the landing page of a persistent archive. Consider the DOI for the NASA MODIS satellite data product "Thermal Anomalies and Fire Daily (MOD14A1) Version 6", which includes hundreds of thousands of individual data files updated daily and distributed through a web interface or FTP server. The DOI is sufficient for a suitably knowledgeable human being to successfully locate and download the data, but not well suited for use in computer code, which must know precisely which individual data files to download and process. Even most DOIs that resolve to permanent and immutable data objects cannot reliably be resolved by computer code to find the download URLs for the actual content. DOIs are principally designed for humans, not computers. How then can we reliably reference and retrieve specific archival data in scientific scripts and software packages?

A central premise of any digital data preservation strategy is the LOCKSS principle: "Lots of copies keep stuff safe." For example, DOI-granting academic journals typically[^1] participate in partnerships such as CLOCKSS (Controlled LOCKSS, @CLOCKSS) in which members duplicate or mirror content from other participants. The DataONE repository network takes a similar approach, in which member data repositories mirror some content from the other repositories. These approaches rely on a centralized service or coordinating node that can resolve a request for particular content to the appropriate location, which still creates a single point of failure. Content-based identifiers allow a similar approach to distributed storage through a more fully de-centralized approach. Research data files are already frequently found at multiple locations: a local hard drive, a university server, a cloud storage provider, a GitHub repository, or a formal data archive. Any user can construct a look-up table of content-based identifiers and the sources (URLs or local file paths) at which the corresponding content has been found. Because we can always compare the identifier with the checksum of the content currently found at a given location, we have cryptographic certainty that it is the desired content. We refer to such a look-up table as a content "registry," which may also list other relevant information about the source, such as the date at which the content was found there. Any of these locations may be subject to link rot, in which content changes or moves. In such cases, those sources will no longer be able to produce content with a hash matching the requested identifier, signalling that we will have to try an alternative source from the registry. By itself, this approach does not guarantee long-term redundant storage: the registry only points to other storage systems. This contrasts with the standard approach of scientific repositories, which only issue permanent object identifiers for content in their own storage system. Decoupling the role of "resolving" an identifier to content from that of "storing" the content provides additional flexibility that can be very powerful. Because most data repositories already compute and store checksums for the data files they contain (a necessary element of ensuring archival data is not corrupted), data repositories are natural registries of their own content, able to map identifiers to download URLs. A decentralized registry approach immediately allows us to extend this strategy to accommodate data that is not (or not yet) in a permanent data archive, and also provides a mechanism to refer to data in multiple locations without a central coordinating node. Because the content identifier ensures the content is perfectly identical, we are also free to choose the most convenient source, such as a locally-hosted file, rather than relying on a download from a trusted repository to ensure authenticity.

Using a content-based identifier ensures that data is not accidentally changed or altered. Local paths can easily be overwritten, and files at a given URL updated. Sometimes this is desirable, but when reproducibility is required, referencing data by content identifier avoids this risk. While chance collisions with MD5 or SHA-1 are extremely unlikely, it is possible, at least in principle, to generate collisions in which altered content has the same MD5 or SHA-1 sum as the desired content [@collisions]. By contrast, SHA-256 hashes are considered cryptographically secure at this time, giving a robust assurance that the data has not been altered. Traditionally, users are encouraged to manually verify the checksum of any data downloaded from a DOI or URL to ensure that the data has not been altered (either maliciously or due to packet loss) during transmission. In practice this extra step may be uncommon, as it requires additional effort and many data repositories do not clearly display the checksum in the first place. Using a content-based identifier allows us to build such verification into the download process by default. For example, the resolve() function automatically verifies that the downloaded object matches the checksum specified in the identifier.


Generating a DOI can only be done with the help of an internationally recognized DOI-granting repository. The user or the repository must typically pay a nominal fee associated with minting the DOI, and must register with the central authority (DataCite, for data DOIs) the minimal metadata required to generate a citation to the deposited object(s) in exchange for the DOI. Users or software interacting with the repository thus need to authenticate a user. In contrast, open-source content hash algorithms are free and widely available on almost all computing platforms. Algorithms such as MD5, SHA-1, and SHA-256 are some of the most widely used, studied, and implemented on the planet, and thus the least likely to be lost to future generations. As evidence of this centrality, the SHA-2 family of checksums, including SHA-256, is so central to modern computing that major chipmakers now build support for it directly into chip hardware, supported by assembly-level code in major open-source implementations such as openssl [@openssl]. Consequently, on recent processors the SHA-256 checksum is often significantly faster to compute than the less secure MD5 and SHA-1 algorithms, especially on larger files. This should alleviate the primary objection most data archives have to adopting the more secure SHA-256 in place of MD5 or SHA-1. Because minting a DOI for data requires communication with a registered repository, it typically requires online access. In contrast, a content-based identifier can be minted offline.

A major advantage of content-based identifiers in scripts is the ability to avoid re-downloading data that has already been downloaded locally. This efficiency is particularly important to debugging scripts or automating workflows, where the same code may be run repeatedly. By maintaining a local content-addressed storage cache, software can determine if the requested identifier already exists locally before attempting to resolve the identifier to a registered data repository or URL. This allows us to ensure that our code can run successfully whether or not the data has been downloaded.

co2_file <- resolve("hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37", 
                    store = TRUE)

A Uniform Resource Identifier (URI) is the gold standard for metadata objects to refer to the data they document. The Resource Description Framework (RDF) requires URI-based identifiers. DOIs are an example of such a URI, but DOIs typically resolve to HTML landing pages and thus do not identify the various data objects precisely. A URL is also a URI, but may not always resolve to the same content, and the URL may not be known at the time the metadata is generated. Universal Unique Identifiers (UUIDs) are another common choice, with an efficient standard algorithm. Unfortunately, this approach can generate different identifiers for the same data object, which can be particularly confusing when the metadata and UUID are being generated directly by scripts. Metadata systems also often refer to objects using identifiers that are not globally unique, such as filenames or id numbers.

While scientific papers typically print their DOIs directly in the article PDF, it is usually impossible to know the identifier of a data file from inspecting the data itself. Instead, the identifier is typically recorded in a separate metadata file, making it possible for the identifier information to become separated from the data file. Using content hashes, the content becomes its own identifier. As long as we have the data file, we can determine the identifier by re-computing the appropriate checksum. This also means that there is never any need to 'pre-register' or 'reserve' the identifier: the moment the data exists in a digital serialization, it has a unique identifier we can use to refer to it.

A better way to reference specific content is to use the content's cryptographic hash, such as its SHA-256 checksum. Checksums such as MD5, SHA-1, and SHA-256 are commonly used in data repositories to ensure data integrity -- that a file has not been corrupted due to slow degradation of the hardware on which it is stored, or that bits have not been lost or altered during a file download. Such checksums are among the most widely used algorithms in computing: universally recognized, widely implemented, and efficient. Because SHA-256 is also cryptographically secure, it provides strong assurance that the referenced content has not been altered.

Many scientific data repositories already store and list checksum information, and even support searches for objects by their checksums. This allows us to use checksums as natural identifiers for objects in these data repositories.

Checksums have many advantages over alternative identifiers for individual data objects: they are (1) secure, (2) sticky, (3) portable, (4) rot-resistant, and (5) cheap, and they (6) facilitate caching downloads and (7) facilitate caching workflows.

R users frequently write scripts which must load data from an external file -- a step which increases friction in reuse and creates a common failure point in the reproducibility of the analysis later on. Reading a file directly from a URL is often preferable, since we don't have to worry about distributing the data separately ourselves. For example, an analysis might read in the famous CO2 ice core data directly from the ORNL repository:

co2 <- read.table("https://knb.ecoinformatics.org/knb/d1/mn/v2/object/ess-dive-457358fdc81d3a5-20180726T203952542", 
                  col.names = c("depth", "age_ice", "age_air", "co2"), skip = 21)

However, we know that data hosted at a given URL could change or disappear, and not all the data we want to work with is available at a URL to begin with. Digital Object Identifiers (DOIs) were created to deal with these problems of 'link rot'. Unfortunately, there is no straightforward and general way to read data directly from a DOI (which almost always resolves to a human-readable webpage rather than the data itself); DOIs often apply to collections of files rather than the individual source we want to read in our script; and we must frequently work with data that does not (yet) have a DOI. Registering a DOI for a dataset has gotten easier through repositories with simple APIs like Zenodo and figshare, but this is still an involved process and still leaves us without a mechanism to directly access the data.

contentid offers a complementary approach to addressing this challenge, which will work with data that has (or will later receive) a DOI, but also with arbitrary URLs or with local files. The basic idea is quite similar to referencing data by DOI: we first "register" an identifier, and then we use that identifier to retrieve the data in our scripts:

register("https://knb.ecoinformatics.org/knb/d1/mn/v2/object/ess-dive-457358fdc81d3a5-20180726T203952542")

Registering the data returns an identifier that we can resolve in our scripts to later read in the file:

co2_file <- resolve("hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37")
co2_b <- read.table(co2_file, 
                    col.names = c("depth", "age_ice", "age_air", "co2"), skip = 21)

Note that we have manually embedded the identifier in our script, rather than automatically passing the identifier returned by register() directly to resolve(). The register() command only needs to be run once, and thus doesn't need to be embedded in our script (though it is harmless to include it, as it will always return the same identifier unless the data file itself changes).

We can confirm this is the same data:

identical(co2, co2_b)

How this works

As the identifier (hash://sha256/...) itself suggests, this is merely the SHA-256 hash of the requested file. This means that unless the data at that URL changes, we will always get that same identifier back when we register that file. If we have a copy of that data someplace else, we can verify it is indeed precisely the same data. For instance, contentid includes a copy of this file as well. Registering the local copy verifies that it indeed has the same hash:

co2_file_c <- system.file("extdata", "vostok.icecore.co2", package = "contentid")
register(co2_file_c)

We have now registered the same content at two locations: a URL and a local file path. resolve() will use this registry information to access the requested content. resolve() will choose a local path first, allowing us to avoid re-downloading any content we already have. resolve() will verify the content of any local file or file downloaded from a URL matches the requested content hash before returning the path. If the file has been altered in any way, the hash will no longer match and resolve() will try the next source.

We can get a better sense of this process by querying for all available sources for our requested content:

df <- query_sources("hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37")
df
kableExtra::kable(df, "latex")

Note that query_sources() has found more locations than we have registered above. This is because in addition to maintaining a local registry of sources, contentid registers online sources with the Hash Archive, https://hash-archive.org. (The Hash Archive doesn't store content, but only a list of public links at which content matching the hash has been seen.) query_sources() has also checked for this content on the Software Heritage Archive, whose periodic crawls of all public content on GitHub have also picked up a copy of this exact file. Each URL is listed with the date at which it was last seen; repeated calls to register() will update this date, or lead to a source being deprecated for this content if the content it serves no longer matches the requested hash. We can view the history of all registrations of a given source using query_history().
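
For instance, a small sketch querying the registration history of the URL registered earlier in this document:

# Inspect every registration event recorded for one known source
history <- query_history("https://knb.ecoinformatics.org/knb/d1/mn/v2/object/ess-dive-457358fdc81d3a5-20180726T203952542")
head(history)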

This approach can also be used with local or unpublished data. register()ing a local file only creates an entry in contentid's local registry, so this does not provide a backup copy of the data or a mechanism to distribute it to collaborators. But it does provide a check that the data has not accidentally changed on our disk. If we move the data, or eventually publish it, we have only to register these new locations; we never need to update a script that accesses the data through calls like read.table(resolve("hash://sha256/xxx...")) rather than through local file names.

If we prefer to keep a local copy of a specific dataset around, (e.g. for data that is used frequently or used across multiple projects), we can instruct resolve() to store a persistent copy in contentid's local storage:

co2_file <- resolve("hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37", 
                    store = TRUE)

Any future calls to resolve() with this hash on this machine will then always be able to load the content from the local store. This provides a convenient way to cache downloads for future use. Because the local store is based on the content identifier, repeatedly storing the same content will have no effect, and we cannot easily overwrite or accidentally delete this content.

register() and resolve() provide a low-friction mechanism to create a permanent identifier for external files and then resolve that identifier to an appropriate source. This can be useful in scripts that are frequently re-run as a way of caching the download step, and simultaneously helps ensure the script is more reproducible. While this approach is not fail-proof (since all registered locations could fail to produce the content), if all else fails our script itself still contains a cryptographic fingerprint of the data we could use to verify if a given file was really the one used.

Acknowledgements

contentid is influenced by design and implementation of https://hash-archive.org, and can interface with the https://hash-archive.org API and mimic that functionality locally. contentid also draws inspiration from Preston, a biodiversity dataset tracker, and Elton, a command-line tool to update/clone, review and index existing species interaction datasets.

Use cases:

Citations: Citations are typically aggregated to the level of 'data packages' which may contain many objects (the DOIs used for products such as MODIS or NEON are an extreme example). Later releases may contain files that are carried over from previous releases unchanged. Data may also be re-used and re-published unchanged in later data packages in the process of analyzing previously released data files.

Do I cite the most recent record (latest version), the series identifier for the version (if it exists), the oldest record (the original), or the most authoritative one? Under current practice, researchers will no doubt already cite the "wrong" version, for example by failing to notice that the same data product appeared as part of an earlier record before being republished as part of a newer one. When we rely on citations alone to understand data provenance, such cases are difficult to diagnose. One of the powerful ideas of content identifiers is having a clear vocabulary to distinguish between content and concept. Given a list of data products containing the content, a citation to the content identifier, and an agreed-upon procedure for determining the authoritative citation, any software responsible for adding up citation metrics has all the information it needs to resolve a citation to the correct authority.

Workflows: The use of content-based identifiers helps in large workflows in several ways. As discussed above, it (1) guarantees that the workflow runs with identical data each time and (2) accelerates performance by avoiding re-downloading content each time a workflow is run. This approach also matters in more complex workflows, where we can avoid re-computing expensive operations when the relevant parts of a data analysis are unchanged. One example is a forecasting workflow using NEON data: NEON filenames frequently change without any change to the underlying data, and a content-based system can avoid re-running parts of an analysis that would be re-executed under a location-based protocol [@neonstore].

Limitations: dynamic data and databases. Data stored in dynamic structures and extracted on the fly can be difficult to handle in a content-based framework. However, such formats are also not considered best practice for long-term archiving, as changes in software can render the data inaccessible. Nor are such approaches required to achieve scale: GBIF archives one large CSV, while the NEON and MODIS examples instead rely on thousands of individual files, which simplifies data transfer.
Modern, high-performance formats like parquet are designed explicitly to take advantage of distributed file storage rather than a single contiguous database file. Such modular approaches facilitate provenance tracing based on content hashes.

Schemes:

Downsides to hash URIs:

Downsides of alternatives:

contentid understands other schemes and can translate between them in most cases.

contentid:::as_hashuri("ni:///sha256;lBIyWDHasiruvdZ0tutTumt73QS7maTbsh3f9kYofjc")
x <- resolve("ni:///sha256;lBIyWDHasiruvdZ0tutTumt73QS7maTbsh3f9kYofjc", registries = content_dir())
content_id(x)

Registries in contentid

The contentid package can generate and maintain a simple registry of sources for content using a local plain-text file in tab-separated-values (tsv) format, or a local Lightning Memory-Mapped Database (LMDB). The latter offers the best performance for large registries (containing millions or more identifiers). These local registries can themselves be archived and distributed. contentid uses an extensible model which allows it to access an arbitrary number of separate tsv files and/or LMDB databases.
A more efficient mechanism for making such registries available to others is provided by https://hash-archive.org, a project of the Internet Archive. Hash Archive is a small server platform which can take either a URL or a content-based identifier as input.
Given a URL, Hash Archive streams the data, computing the MD5, SHA-1, SHA-256, SHA-384, and SHA-512 checksums and storing this information in its local registry, along with the corresponding URL, timestamp, and file size information.
Given a content identifier as input, Hash Archive returns a list of URLs which have previously been registered as matching that hash. contentid also treats several major data repositories as implicit registries of their own content, including Zenodo, the DataONE Repository Network (with over forty member repositories, including Dryad, CDIAC, and EFI), and the Software Heritage Project.
Unfortunately, data repositories differ in their choice(s) of checksum. For example, at this time, Software Heritage uses SHA-256, Zenodo uses MD5, and DataONE member repositories either choose or allow individual researchers depositing data to select their checksum algorithm.

df <- query_sources("hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37", cols = c("source", "date"))
df
kableExtra::kable(df, "latex")
Sys.unsetenv("CONTENTID_HOME")

