knitr::opts_chunk$set(message=FALSE, warning=FALSE)

Forecast Identifiers

This document proposes the use of content-based identifiers for publishing products associated with automated, iterative forecasts. Iterative forecasting will frequently involve automatically running code which ingests public data products and generates output forecast products, along with associated metadata. Consequently, the forecasts produced may depend on the code and software which defines the forecast algorithm, as well as the input data used. Researchers must be able to uniquely identify and access each forecast generated by running the algorithm, as well as the associated input data files and code.

Proposed approach

We propose that forecast products be identified by their SHA-256 checksum in the Hash-URI format:

hash://sha256/<HASH>

Note that this is an un-salted hash, containing no additional metadata beyond the pure file hash (in contrast to other content-based storage systems such as dat or IPFS). Consequently, the URI tells us everything we need to know to generate the hash (i.e. that the algorithm used is sha256). For example, we can create an identifier for the csv serialization of the popular example dataset mtcars [@mtcars] in R [@R] as follows:

readr::write_csv(mtcars, "mtcars.csv")
hash <- openssl::sha256(file("mtcars.csv"))
paste0("hash://sha256/", hash)

Here we have used the openssl package's implementation of the sha256 algorithm, which wraps a fast and widely used C library [@openssl]. Many other implementations are readily available (e.g. the digest package in R [@digest], or sha256sum [@gnucoreutils]), and will produce the identical hash.
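
Other implementations yield the same hexadecimal digest. For instance, with the digest package (assuming mtcars.csv has been written as above), or with sha256sum mtcars.csv at the command line:

digest::digest("mtcars.csv", algo = "sha256", file = TRUE)  ## file = TRUE hashes the file contents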

Rationale

  1. Any copy of the same file, the same bits and bytes, will result in the same identifier. Using a strong cryptographic hash such as sha256 ensures that even a malicious actor cannot change the file in question without altering the identifier as well.
  2. This means that content-based identifiers are location agnostic. The content has the same identifier, regardless of whether it exists in a permanent archival repository or only on a single user's laptop. This contrasts with location-based identifiers such as URLs. Digital Object Identifiers (DOIs) are also location-based identifiers, because the DOI can resolve only to the specific repository hosting the content.
  3. Content-based identifiers are ideal for distributed storage because they can be resolved to multiple locations. A DOI or other location-based identifier can be resolved to only a single location (URL) at a time (e.g. in the case of DOIs, by the https://doi.org resolution service), even though most robust archival storage requires that backup copies of content be stored in other archives (e.g. the DataONE network for data, or the LOCKSS or CLOCKSS networks used by scientific publishers). The Hash Archive, https://hash-archive.org, provides a service similar to the https://doi.org resolver for content-based identifiers, but instead returns all registered locations. This same property of content-based identifiers underlies other distributed storage algorithms such as "torrents." Note that this is not a replacement for archival storage repositories: ideally at least one registered location corresponds to an archival repository for any data that needs to be permanently archived.
  4. Content-based identifiers work well with data that is stored locally, stored on a non-archival (public or private) access point such as a GitHub repository or S3 bucket, or stored in any permanent scientific data archive (or all of the above simultaneously).
  5. These identifiers are easy to generate in scripted workflows. By contrast, identifiers issued by a repository, such as a DOI or other unique identifiers (e.g. from the Open Science Framework, OSF), require an available network connection, authentication with a specific remote provider (e.g. authentication tokens in the script which must be kept secure), and tolerance of the latency involved in such communication.[^1]
  6. These identifiers do not change if a script is re-run and produces the identical results and outputs. This is not true of scripts which automatically register identifiers with remote services or with other unique identifier algorithms that can be run locally, such as UUIDs [@uuid]. UUIDs include information such as a timestamp which ensures that the identifier is different every time it is generated. Iterative forecasts, by contrast, may be run and re-run many times to test code, verify reproducibility, or as part of a continuous integration (CI) framework.
  7. Content-based identifiers unambiguously identify specific content. Other identifiers such as DOIs may refer to a specific file, a collection of files, or even (as in the case of major remote sensing products) a general notion of a 'product' which contains thousands of component files which are continuously added and updated. Conversely, a single file can be identified by multiple different identifiers, even multiple DOIs. However, this is also a limitation of content-based identifiers: researchers often need identifiers to represent abstract concepts such as a "series identifier" which corresponds to the whole series of iterative forecasts, regardless of how many versions it contains. For abstract concepts, other identifiers are necessary.
  8. Content-based identifiers are also easy to resolve in scripted workflows, because they can only resolve directly to their content. A DOI typically resolves to an HTML landing page, which provides a human-readable description of where to download the data, but lacks a consistent machine-readable mechanism. Programmatic access typically relies on an API that is specific to the repository.
  9. Content-based identifiers facilitate local caching of data files, which avoids repeated downloads in an automated workflow. A script can easily confirm (with cryptographic certainty) that the desired content has already been downloaded locally, and then read the local copy instead of re-downloading from an authoritative location (see the sketch following this list).
  10. Content-based identifiers cannot become 'unstuck' from the content they identify. Typically, identifiers are stored separately in metadata records, which map the identifier to a particular location (e.g. a relative path in a directory, or a location in a permanent archive). Consequently, it can be difficult to confirm that a given file corresponds to the desired identifier. In contrast, as long as you have the data file, you can always calculate the sha256 identifier for it.
  11. Other identifiers frequently face a 'chicken-and-egg' problem in automated workflows. This problem usually arises from attempts to address the previous problem of 'unstuck' identifiers: it is common practice to embed the identifier into the product itself (for example, most journal articles display their DOI on the first page, and many data packages include metadata files which state the identifier). This requires a two-stage workflow in which the script must first 'pre-register' or 'reserve' an identifier, and then embed that identifier in the data file prior to uploading the data to a repository. Additional logic is required to either reserve a new identifier if the output product has changed, or avoid doing so if it has not.
  12. Content-based identifiers permit a phased approach that is ideal for developing and testing a workflow before it is ready to be put into production. Few data repositories offer "testing" servers (the DataONE API is a notable exception) where a workflow that registers and uploads data can be run many times in testing without filling a permanent archive with a lot of junk. A script using content-based identifiers can generate, register, and resolve such identifiers locally without ever making them public. When the researchers are satisfied with the script running locally, they can place a copy of the same data at any public location (university server, AWS S3 bucket, GitHub) and register that location, enabling collaborators to also resolve the files. When a workflow is finally deemed ready to begin generating permanent archives, the script need only be extended to upload the data and register the location of the permanent archive.
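
For instance, here is a minimal sketch of the caching logic described in item 9, using base R with the openssl and readr packages; the file name and identifier are hypothetical placeholders:

id         <- "hash://sha256/<HASH>"   ## identifier recorded in the workflow
local_copy <- "cached_data.csv"        ## hypothetical local cache location
if (file.exists(local_copy) &&
    paste0("hash://sha256/", openssl::sha256(file(local_copy))) == id) {
  data <- readr::read_csv(local_copy)  ## verified local copy, no download needed
} else {
  ## otherwise download from a registered location and verify its hash
}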

[^1]: Only for very large files do cryptographically strong algorithms such as sha256 require non-negligible computational effort (e.g. the hash of a 10 GB file takes less than a minute on a laptop), and even then this represents a small fraction of the computational effort required for the actual analysis of the file.

Example workflows

To facilitate the use of content-based identifiers, we provide a simple R package implementation, contentid. To illustrate a trivial forecasting workflow, we will begin with a table of Carabid beetle species richness derived at biweekly sampling intervals for each site in the National Ecological Observatory Network, NEON [@carabid]. We resolve the species richness data using its content id:

Sys.setenv("CONTENTID_REGISTRIES"="https://hash-archive.carlboettiger.info")
library(contentid)
richness <- readr::read_csv(resolve("hash://sha256/280700dbc825b9e87fe9e079172d70342e142913d8fb38bbe520e4b94bf11548"))

For illustrative purposes, let us make a baseline probabilistic forecast using the historical mean and standard deviation as our prediction for the monthly species richness that will be observed at each site in 2021:

library(dplyr)
library(tidyr)
richness_forecast <- richness %>% 
  group_by(month, siteID) %>%
  summarize(mean = mean(n, na.rm = TRUE),
            sd = sd(n, na.rm = TRUE)) %>% 
  mutate(sd = replace_na(sd, mean(sd, na.rm=TRUE))) %>% 
  mutate(year = 2021)


readr::write_csv(richness_forecast, "richness_forecast.csv")

We can then compute the content identifier for our forecast using the function content_id():

content_id("richness_forecast.csv")

In order to resolve this identifier, we must first register it. Note that our call to register() also returns the file's content identifier, so we don't need to call content_id() separately:

id <- register(fs::path_abs("richness_forecast.csv"), registries = "local.tsv")
id

We can now resolve this id:

resolve(id, registries = "local.tsv")

Because we registered only a local path to the file, this simply returns the local path. This is still sufficient to use within a script:

forecast <- readr::read_csv(resolve(id, registries = "local.tsv"))

Eventually we may want to make this data file available at some public URL, such as GitHub or an S3 bucket, to share with colleagues or other computational resources before we are ready to publish it. To illustrate this, I've placed a copy in an S3 bucket on my MinIO server. We can go ahead and register this new public URL:

register("https://minio.carlboettiger.info/shared-data/richness_forecast.csv")

Note this again returns the same identifier, which has been freshly calculated from the file. resolve() will work as before, and will still return our local path as long as that file exists and matches the identifier. But if we delete the file, or worse, accidentally overwrite it with some other data, resolve() will detect that the identifier does not match and fall back on the registered URL:

readr::write_csv(iris, "richness_forecast.csv") ## whoopsies!
resolve(id)

Note that this time resolve() has not returned the local file richness_forecast.csv, but instead the path to a temporary file. Internally, resolve() has first confirmed that while the local path richness_forecast.csv still exists, its hash no longer matches the requested id. Fortunately, because we also registered a URL for this identifier, resolve() has fallen back on that alternative source, downloaded the file at that URL to the temporary directory, and then computed the content id of the downloaded file to confirm it still matches the requested identifier. This all happens behind the scenes, such that our workflow,

forecast <- readr::read_csv(resolve(id))

continues to work unchanged, despite the local copy being corrupted and the data now coming from the remote URL. In similar fashion, once our data is finally uploaded to a permanent data archive, we can add this more permanent location to the registry, much as we added the less persistent URL of the local server.
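
For instance, if the forecast were eventually deposited in an archive exposing a direct download URL, we could register that location as well (the URL below is a hypothetical placeholder, not a real archive location):

register("https://data.example-archive.org/richness_forecast.csv")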

Key advantages

As we have just seen, using the pattern read_csv(resolve(id)) instead of the more common pattern read_csv("mtcars.csv") has numerous advantages:

  1. resolve() will automatically verify that the file read in matches the cryptographic hash, ensuring integrity and reproducibility.
  2. resolve() will prefer local files when available, avoiding repeated downloads when a script is frequently re-run.
  3. Once a public URL has been registered, a script using resolve() will be more portable than one which assumes a local file is available at a specific path.
  4. By registering multiple URLs, resolve() can become more robust to link-rot [@Elliott2020].
  5. Because this approach embeds the cryptographic signature of the data into our code, the approach degrades gracefully: even if no registered copy can be resolved, the identifier in the script still documents exactly which data were used.

It is also worth noting that the strategy outlined here can easily be applied independent of the contentid R package, in different computer languages and scripts. These benefits follow immediately from using content-based hashes as object identifiers. The approach taken in contentid is based on previous implementations, including Hash Archive (written in C) and preston (Java) [@preston].

Technical notes:

  - The Hash URI format uses hexadecimal encoding of the hash, a 64-character lower-case alphanumeric string. Alternative content-based identifier formats recognized by Hash Archive, including the named information (ni) and subresource integrity formats, use base-64 encoding. While these are shorter (43 characters), they are case-sensitive and include additional characters such as / which can lead to confusion or errors.
  - While the hash URI format is not a W3C recognized format or namespace, we have found this format to be more intuitive and practical than the alternatives.
  - Because hashes encode the most significant characters first, it is often possible to omit many of the trailing characters and still successfully resolve the identifier uniquely. Of course, using fewer characters increases the chance of a collision. For example:

content_id( resolve("hash://sha256/280700dbc825b9") )

Publishing Identifiers

Sharing registries

Allowing any user to resolve identifiers to URLs requires a shared public registry, analogous to the DOI redirect service, https://doi.org. By default, contentid registers URLs with https://hash-archive.org, as well as maintaining a local copy in a persistent tsv file. https://hash-archive.org was created by Ben Trask in association with ArchiveLabs, which is affiliated with the Internet Archive. Hash Archive is open source (MIT licensed) software that can easily be deployed independently, e.g. https://hash-archive.carlboettiger.info. However, many scientific data repositories already support queries by content hash, which means we can retrieve objects by content identifier from persistent archives without relying on this specific software.
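
For example, a single register() call can record a location in both a local registry file and a public Hash Archive instance; this sketch assumes the registries argument accepts a vector of registries, mirroring the single-registry calls shown earlier:

register("https://minio.carlboettiger.info/shared-data/richness_forecast.csv",
         registries = c("https://hash-archive.carlboettiger.info", "local.tsv"))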

Software Heritage

For example, the Software Heritage Project [@softwareheritage] periodically archives the content of all public repositories on GitHub (and elsewhere, including the packages in the Comprehensive R Archive Network, CRAN), and also allows us to query for any object in its archive using the SHA-256 signature. We can query the Software Heritage index to see if anyone has already written the popular example mtcars data to a csv and uploaded it to a public GitHub repository or other location indexed by Software Heritage:

query <- sources_swh("hash://sha256/c802190c43e02246da9c6c9c3f13a58f076cc6b77922f4d9766a3c6bdb1b52bd")
url <- query$source[[1]]

Indeed it has! While some data products will be too large to make available through GitHub or BitBucket repositories, it is worth noting that users who deposit data in those locations can trigger Software Heritage to generate a persistent snapshot of all the content, which can then be queried in this way, using the store_swh() function from contentid, or the Software Heritage API or web interface.

DataONE Network

The DataONE API also allows us to query for any object in its system by content hash (checksum), but unlike the Software Heritage Archive, many objects have only a SHA1 or MD5 sum recorded. This is not an obstacle for new uploads, which can easily opt into using sha256. Even more conveniently, the DataONE API allows us to specify our own identifiers (provided they don't conflict with anything already in the DataONE registry). This allows us to upload and download data to DataONE repositories such as the KNB using content-based identifiers, like so:

library(dataone)
library(datapack)
library(mime)

## Use the DataONE staging environment when a test token is set; otherwise
## use the production KNB member node.
dataone_node <- function(){
  if(!is.null(getOption("dataone_test_token")))
    return( dataone::D1Client("STAGING2", "urn:node:mnTestKNB") )
  dataone::D1Client("PROD", "urn:node:KNB")
}

## Upload a file to DataONE, using its content id as the object identifier and
## recording the matching SHA-256 checksum in the object's system metadata.
publish_dataone <- function(file){
  id <- as.character(contentid::content_id(file))
  d1c <- dataone_node()
  d1Object <- new("DataObject", id, format = mime::guess_type(file), filename = file)
  ## strip the "hash://sha256/" prefix to obtain the bare checksum
  d1Object@sysmeta@checksum <- gsub("^hash://\\w+/", "", id)
  d1Object@sysmeta@checksumAlgorithm <- "SHA-256"
  dataone::uploadDataObject(d1c, d1Object, public = TRUE)
  id
}

Having defined our helper function, we must also create an account / log in to the DataONE portal (https://search.dataone.org for the production system, or https://search-stage-2.test.dataone.org/ for the testing system) and copy over our credential token from the user settings. Note that these tokens expire every 18 hours. Then we can use this helper to publish any CSV file to DataONE:

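## First supply a DataONE authentication token copied from the portal's user
## settings; the helper above checks getOption("dataone_test_token").
## The value below is a placeholder, not a real token:
options(dataone_test_token = "<paste your token here>")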
readr::write_csv(richness_forecast, "richness_forecast.csv")
publish_dataone("richness_forecast.csv")

Similarly, we can define a function to resolve our object from the DataONE archive using the content-based identifier:

resolve_dataone <- function(id){
  d1c <- dataone_node()
  ## the DataONE resolve endpoint returns the location(s) of the object with this identifier
  paste0(d1c@cn@baseURL, "/v2/resolve/", utils::URLencode(id, TRUE))
}
url <- resolve_dataone("hash://sha256/c802190c43e02246da9c6c9c3f13a58f076cc6b77922f4d9766a3c6bdb1b52bd")

(Note that this example is run against the testing server, and so the uploaded data will not be accessible on the production node.)
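
Assuming the object is publicly readable, the resolved URL can then be read directly, just as with resolve():

forecast <- readr::read_csv(url)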

Metadata and FAIR publishing

These examples illustrate only how identifiers can be registered and resolved. Published data ought to meet the FAIR principles: Findable, Accessible, Interoperable, and Reusable. Registering an identifier in this way only makes the data accessible. Using a recognized, open, standard data format such as a csv serialization promotes interoperability. To be findable and reusable, however, requires that appropriate metadata accompany the data. Such metadata files can refer to the content they describe by using the content identifiers proposed here. For iterative forecasting of ecologically relevant data, we recommend the EFI Standards extension of the Ecological Metadata Language (EML), https://github.com/eco4cast/EFIstandards.

References


