How rfair works: methodology and architecture

knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
library(rfair)

This vignette describes what rfair measures and how, in enough detail to interpret and reproduce its scores. For a quick tour see vignette("rfair"); for the reuse/sensitivity extensions see vignette("beyond-fuji").

1. Background: FAIR, the FAIRsFAIR metrics, and F-UJI

The FAIR principles (Wilkinson et al. 2016) state that research data should be Findable, Accessible, Interoperable, and Reusable. They are aspirational; to assess a real data object you need measurable indicators.

The FAIRsFAIR project turned the principles into a concrete, testable metric set, and the F-UJI tool (Devaraju & Huber, PANGAEA) implemented an automated assessment service for them. F-UJI is a Python web service: you send it a persistent identifier (PID) and it returns per-metric scores.

rfair is a native R reimplementation of the F-UJI metrics (metric set v0.8). It performs the whole assessment in R, with no external server, so assessments are scriptable, reproducible, and embeddable in R pipelines. The original rfair package (v1) was only an HTTP client for an F-UJI server; this version (v2) is the engine itself.

2. The assessment pipeline

A single call to assess_fair() runs this pipeline:

identifier
   │  id_parse()            scheme detection + normalization + resolver URL
   ▼
resolution                  content-negotiated GET, follow redirects -> landing page
   │  resolve_landing_page()
   ▼
harvesting                  a sequence of collectors, in priority order:
   │   collect_html_meta()      embedded JSON-LD (schema.org), Dublin Core,
   │                            OpenGraph, Highwire meta tags
   │   collect_signposting()    HTTP Link header + <link rel> typed links
   │   collect_datacite()       DataCite JSON via content negotiation
   │   collect_xml()            DataCite XML, Dublin Core, MODS, EML, ISO19139
   │   collect_rdf()            JSON-LD (native) and Turtle/RDF-XML (via rdflib)
   │   collect_github()         GitHub repository + codemeta.json + CITATION.cff
   │   harvest_data()           HEAD on data links for MIME type and size
   ▼
mapping + merging           each source is mapped to one reference schema and
   │  merge_metadata()         merged (first-non-empty for scalars; union for
   │                           lists; longer-but-similar replacement)
   ▼
evaluation                  one evaluator per metric inspects the merged metadata
   │  run_evaluators()         and the resolved identifier, scoring each test
   ▼
scoring                     per-test scores -> per-metric -> F/A/I/R -> overall
   │  get_assessment_summary()
   ▼
fair_assessment             tidy S3 object (print / summary / as.data.frame /
                            as_fuji_json / as_rdf)
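
As one concrete illustration of the harvesting stage, the sketch below parses an HTTP Link header into typed links, the kind of input a Signposting collector works from. It is a simplified base-R example, not the code behind collect_signposting(): the function name parse_link_header() and the example.org URLs are hypothetical, and quoted parameter values containing commas or semicolons are not handled.

```r
# Hypothetical helper (not part of rfair): split a Link header into one row
# per typed link, keeping the target URL plus the rel and type parameters.
parse_link_header <- function(header) {
  # Split on commas that are followed by a new "<target>" entry
  entries <- strsplit(header, ",[[:space:]]*(?=<)", perl = TRUE)[[1]]
  rows <- lapply(entries, function(entry) {
    target <- sub("^[[:space:]]*<([^>]*)>.*$", "\\1", entry)
    # Extract one parameter value, with or without surrounding quotes
    param <- function(name) {
      pattern <- paste0(";[[:space:]]*", name, "=\"?([^\";]*)\"?")
      m <- regmatches(entry, regexec(pattern, entry))[[1]]
      if (length(m) == 2) m[2] else NA_character_
    }
    data.frame(target = target, rel = param("rel"), type = param("type"),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Made-up header with a metadata link and a data link
header <- paste0(
  '<https://example.org/meta.jsonld>; rel="describedby"; type="application/ld+json", ',
  '<https://example.org/data.csv>; rel="item"; type="text/csv"'
)
links <- parse_link_header(header)
links
```

In this sketch, the row with rel "describedby" points at a metadata representation and the row with rel "item" at a data file.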

Identifier handling

id_parse() recognizes DOIs, Handles, ARKs, URNs, UUIDs, identifiers.org PIDs, w3id, and plain URLs, normalizes them, and constructs a resolver URL. Persistence is inferred from the scheme.

id_parse("https://doi.org/10.5281/zenodo.8347772")[c("preferred_schema", "is_persistent", "identifier_url")]
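
To show what scheme detection and normalization can look like, here is a minimal base-R sketch that recognizes only DOIs and plain URLs. The helper detect_scheme() and its return fields are hypothetical and much simpler than id_parse(); the regular expressions are illustrative assumptions, not the patterns rfair uses.

```r
# Hypothetical helper (not rfair's id_parse()): classify an identifier by
# scheme and, for DOIs, build a resolver URL.
detect_scheme <- function(id) {
  id <- trimws(id)
  # Strip a doi.org resolver prefix or a "doi:" label, if present
  bare <- sub("^(https?://(dx\\.)?doi\\.org/|doi:)", "", id, ignore.case = TRUE)
  if (grepl("^10\\.[0-9]{4,9}/[^[:space:]]+$", bare)) {
    return(list(scheme = "doi", is_persistent = TRUE,
                url = paste0("https://doi.org/", bare)))
  }
  if (grepl("^https?://", id, ignore.case = TRUE)) {
    # A plain URL is treated as non-persistent in this sketch
    return(list(scheme = "url", is_persistent = FALSE, url = id))
  }
  list(scheme = "unknown", is_persistent = FALSE, url = NA_character_)
}

doi <- detect_scheme("doi:10.5281/zenodo.8347772")
doi$url
```

Checking for the DOI pattern before the generic URL pattern matters here: a doi.org link is also a valid URL, so the more specific scheme has to be tested first.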

Harvesting and content negotiation

Different repositories expose metadata in different ways. rfair asks for several representations of the same object via HTTP content negotiation (the Accept header) and scrapes the landing page, then merges everything into a single reference schema (~30 elements: creator, title, publisher, publication_date, license, access_level, object_content_identifier, related_resources, ...). When two sources disagree, scalars keep the first non-empty value (replaced only by a longer, sufficiently-similar string), and list-valued elements are unioned.
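
The merge rule described above can be sketched in base R as follows. This is an illustration under stated assumptions, not the code of merge_metadata(): the helper names, the similarity measure (one minus the normalized Levenshtein distance from utils::adist()), the 0.8 threshold, and the choice of list-valued fields are all hypothetical.

```r
# Hypothetical similarity in [0, 1]: 1 - normalized edit distance
string_similarity <- function(a, b) {
  1 - as.numeric(utils::adist(a, b)) / max(nchar(a), nchar(b))
}

is_empty <- function(x) is.null(x) || length(x) == 0 || all(is.na(x) | x == "")

# Scalars: keep the first non-empty value; replace it only when the
# candidate is longer and sufficiently similar (threshold is an assumption)
merge_scalar <- function(current, candidate, min_similarity = 0.8) {
  if (is_empty(candidate)) return(current)
  if (is_empty(current)) return(candidate)
  longer <- nchar(candidate) > nchar(current)
  if (longer && string_similarity(current, candidate) >= min_similarity) {
    candidate
  } else {
    current
  }
}

# Merge two records; fields named in list_fields are unioned
merge_records <- function(a, b, list_fields = c("creator", "keywords")) {
  out <- a
  for (name in union(names(a), names(b))) {
    out[[name]] <- if (name %in% list_fields) {
      union(a[[name]], b[[name]])
    } else {
      merge_scalar(a[[name]], b[[name]])
    }
  }
  out
}

# Made-up records from two sources, in priority order
first  <- list(title = "Ocean temperature data", publisher = "",
               creator = "A. Author")
second <- list(title = "Ocean temperature dataset", publisher = "Zenodo",
               creator = c("A. Author", "B. Author"), license = "CC-BY-4.0")
merged <- merge_records(first, second)
str(merged)
```

In this example the second title replaces the first because it is longer and differs by only three characters, while an unrelated longer title would be ignored.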

The metric model

Metrics are data-driven: their definitions, tests, scores, and maturity levels come from the bundled FAIRsFAIR YAML, not from hard-coded R logic.

rfair_metric_versions()      # bundled metric versions
# v0.8 has 17 metrics across F/A/I/R (one row each):
nrow(as.data.frame(assess_fair("https://doi.org/10.5281/zenodo.8347772", resolve = FALSE)))

Each metric has one or more tests. A test contributes a score and a maturity level (a CMMI level 0–3: incomplete, initial, moderate, advanced) when it passes. Metrics use one of two scoring mechanisms:

The criterium engine (criterium_engine.R) builds each metric's result from the YAML and lets evaluators mark tests passed; as_fuji_json() then emits a payload matching the upstream F-UJI FAIRResults schema.

3. What each FAIR category measures (v0.8)

| | metric | what rfair checks |
|---|---|---|
| F | F1-01MD | identifier follows a unique scheme (URI/URN/UUID/HASH/PID) |
| | F1-02MD | identifier is persistent and registered (resolves) |
| | F2-01M | core descriptive metadata present (creator, title, id, date, publisher, type, summary, keywords) |
| | F3-01M | metadata links to the downloadable data content |
| | F4-01M | metadata offered in a search-engine-ingestible way (embedded JSON-LD / meta tags) |
| A | A1-01M | access level / rights are stated in metadata |
| | A1-02MD | metadata and data are retrievable via their identifiers |
| | A1.1-01MD | identifiers use a standardized communication protocol (http/https/ftp) |
| | A1.2-01MD | the protocol supports authentication where needed |
| I | I1-01M | metadata uses a formal, machine-readable representation (JSON-LD/RDF/XML) |
| | I2-01M | metadata uses terms from registered semantic vocabularies |
| | I3-01M | qualified references to related entities (with relation types) |
| R | R1-01M | metadata describes the data content (type, format/size) |
| | R1.1-01M | a machine-readable license is present and SPDX/CC-recognized |
| | R1.2-01M | provenance information (creators, dates, contributors) |
| | R1.3-01M | a community-/discipline-endorsed metadata standard is used |
| | R1.3-02D | data is in a recommended (scientific/open/long-term) file format |

The score for a category is the points earned divided by the points available, each summed across the category's metrics. The overall FAIR score is the same ratio taken across all 17 metrics, and the overall maturity is the (clamped) mean of the per-category maturities.
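
A small worked example of this aggregation, in base R, may help. All numbers below are made up and only a few metrics are shown; the data frame layout is hypothetical rather than the structure rfair returns. Rounding the mean maturity and clamping it to the 0 to 3 range is an assumption about what "clamped" means here.

```r
# Illustrative per-metric results (made-up values, not real rfair output)
results <- data.frame(
  metric   = c("F1-01MD", "F2-01M", "A1-01M", "I1-01M", "R1.1-01M"),
  category = c("F", "F", "A", "I", "R"),
  earned   = c(1, 1.5, 1, 2, 1),
  total    = c(1, 2, 1, 2, 2)
)

# Category score: earned points over available points, each summed per category
by_cat <- aggregate(cbind(earned, total) ~ category, data = results, FUN = sum)
by_cat$percent <- 100 * by_cat$earned / by_cat$total

# Overall score: the same ratio across every metric
overall_percent <- 100 * sum(results$earned) / sum(results$total)

# Overall maturity from assumed per-category maturities (0 to 3);
# rounding and the clamp bounds are assumptions of this sketch
category_maturity <- c(F = 2, A = 3, I = 3, R = 1)
overall_maturity <- min(3, max(0, round(mean(category_maturity))))

by_cat
overall_percent
overall_maturity
```

Note that the overall score is a ratio of sums, not a mean of the category percentages, so categories with more available points weigh more.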

# the canonical principle definitions these metrics map to
fair_principles("I")[, c("id", "definition")]

4. Software FAIR (FRSM)

For software objects, rfair also bundles the FRSM (FAIR for Research Software) metric set; select it with metric_version = "0.7_software". The GitHub harvester inspects the repository file tree for signals (a license file, tests, CI workflows, dependency manifests, a registry DOI, a release version, contributors) and the 17 FRSM evaluators score from them. FRSM scoring is heuristic and not yet validated against an upstream software-FAIR reference.
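
The idea of reading signals off a repository file tree can be sketched with a few regular expressions over a vector of paths. The function detect_signals() and its patterns are hypothetical and cover only the signals that are visible from file names (license file, tests, CI workflows, dependency manifests); signals such as a registry DOI, a release version, or contributors would need repository metadata and are left out.

```r
# Hypothetical signal detection over a repository file listing;
# the patterns are illustrative, not the ones rfair uses.
detect_signals <- function(paths) {
  has <- function(pattern) any(grepl(pattern, paths, ignore.case = TRUE))
  c(
    license  = has("^(LICEN[CS]E|COPYING)(\\.[a-z]+)?$"),
    tests    = has("^tests?/"),
    ci       = has("^\\.github/workflows/.+\\.ya?ml$"),
    manifest = has("^(DESCRIPTION|requirements\\.txt|pyproject\\.toml|package\\.json)$")
  )
}

# Made-up file tree of an R package repository
paths <- c("DESCRIPTION", "LICENSE.md", "R/assess.R",
           "tests/testthat/test-assess.R", ".github/workflows/check.yaml")
signals <- detect_signals(paths)
signals
```

File-name heuristics like these are cheap but coarse, which fits the caveat above that FRSM scoring is heuristic.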

5. Fidelity to F-UJI

Because rfair reimplements an existing scoring engine, it includes a non-CRAN conformance harness. tests/conformance/run.R runs identifiers through both rfair and a locally run, version-matched F-UJI server and compares per-metric earned scores. A manual run on 2026-06-16 against F-UJI 4.0.0 (metrics v0.8) measured 94.1% agreement on a Zenodo DOI (16/17 metrics exact) and 85.3% across PANGAEA and Dryad; the consistent divergence was the data file-format metric (F-UJI uses Tika content detection where rfair uses an HTTP HEAD). This reference-server comparison is not yet reproduced by CI. A separate harness (tests/conformance/parity.R) compares the R engine with the browser TypeScript engine on registry-derivable metrics; it requires the webapp branch to be checked out alongside the package.
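
The agreement figure can be read as the share of metrics whose earned scores match exactly between the two engines. The sketch below reproduces that arithmetic with made-up scores and placeholder metric names; how tests/conformance/run.R actually computes and reports the comparison is an assumption here.

```r
# Hypothetical per-metric earned scores from two engines (made-up values)
metrics <- sprintf("M%02d", 1:17)
rfair_scores <- setNames(rep(1, 17), metrics)
fuji_scores  <- rfair_scores
fuji_scores["M17"] <- 0  # one metric diverges in this example

# Exact agreement per metric, aligned by metric name
exact <- rfair_scores[metrics] == fuji_scores[metrics]
agreement <- 100 * mean(exact)

round(agreement, 1)   # 16 of 17 exact matches
names(exact)[!exact]  # which metrics diverge
```

With 16 of 17 metrics matching, the share is 16/17, which rounds to the 94.1% reported for the Zenodo DOI.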

6. Beyond F-UJI

rfair adds checks that automated FAIR tools usually miss, motivated by peer review of a COVID-19 FAIR study: license reusability (not just presence) with the (Re)usable Data Project taxonomy, controlled-access/sensitive-data flagging, identifier hygiene, and the FAIR-TLC (Traceable, Licensed, Connected) extension. See vignette("beyond-fuji").

7. Limitations

References




rfair documentation built on July 1, 2026, 5:07 p.m.