Motivation: Make Tidy Datasets Easier to Release, Exchange and Reuse

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

The aim of the 'dataset' package is to make tidy datasets easier to release, exchange and reuse. It organizes and formats data frame 'R' objects into well-referenced, well-described, interoperable datasets that are ready for release and reuse. It applies a subjective, R-language compatible interpretation of the W3C Data Cube Vocabulary based on the statistical SDMX data cube model^[RDF Data Cube Vocabulary, W3C Recommendation 16 January 2014 https://www.w3.org/TR/vocab-data-cube/, Introduction to SDMX data modeling https://www.unescap.org/sites/default/files/Session_4_SDMX_Data_Modeling_%20Intro_UNSD_WS_National_SDG_10-13Sep2019.pdf]. We enrich these statistical data exchange definitions with similar definitions from information science and from the exchange of scientific resources on open science repositories, applying the Dublin Core and DataCite metadata specifications to R attributes to improve the findability, accessibility, interoperability and reusability of the datasets.

We organize and format standard tabular R objects (data.frame, data.table, tibble, or even well-structured lists like JSON and JSON-LD) so that they become highly interoperable and can be placed into relational databases, semantic web applications, archives and repositories. This puts the FAIR principles into practice in an R workflow: it makes these tabular data objects more findable, more accessible, interoperable and reusable.

Statement of need

Open science repositories and analyst computers are full of datasets that have no provenance, structural or referential metadata. We believe that metadata should be machine-recorded whenever possible, and should not be detached from the R object.

There are several R packages with goals or functionality overlapping those of dataset, but they follow a different philosophy. When data is exported to files, the metadata should be written at the time of export, but no sooner, and preferably into the same file that contains the data.

The dataspice package aims to create well-defined and referenced datasets, but follows a different schema and a different publication strategy. The dataset package follows the more restrictive W3C/SDMX "DataSet" definition within the datacube model, which is better suited to synchronizing with statistical data sources. Unlike dataset, dataspice relies on manual metadata entry from CSV files. (See the documentation of the dataspice package.)

The dataset package aims for a higher level of reproducibility and does not detach the metadata from the R object's attributes (it is intended to be used by other reproducible research packages that will directly record provenance and other transactional metadata into the attributes). We aim to bind dataspice and dataset together by creating export functions to CSV files that contain the same metadata that dataspice records. Generally, dataspice seems better suited to raw, observational data, while dataset suits statistically processed data.

The intended use of dataset is to correctly record referential, structural and provenance metadata retrieved by various reproducible science packages that interact with statistical data (such as the rOpenGov packages eurostat and iotables, or the OECD package).

Neither dataset nor dataspice is well suited to documenting social sciences survey data, which are usually held in datasets. Our aim is to connect dataset, declared and DDIwR to create such datasets with DDI codebook metadata. They will create a stable new foundation for the retroharmonize package to create new, well-documented and harmonized statistical datasets from the observational datasets of social sciences surveys.

The zen4R package provides reproducible export functionality to the zenodo open science repository. Interacting with zen4R may be intimidating for the casual R user because it uses R6 classes. Our aim is to provide an export function that completely wraps the workings of zen4R when releasing the dataset.

In our experience, the tidy data standards make reuse more efficient by eliminating unnecessary data processing steps before analysis or placement in a relational database; applying the DataSet definition and the datacube model together with the information science metadata standards makes reuse more efficient when exchanging and combining the data with other data in different datasets.

Another use case we have in mind is releasing the data on a knowledge graph, or in an API that uses a well-defined schema. This is the primary motivation for choosing the datacube model, because it is used by most statistical APIs. The datacube model allows easy "slicing", or dimension reduction, which lends itself to easy modelling in RDF(S). (See later, and see the more comprehensive From dataset To RDF vignette, which in turn heavily borrows from A tidyverse lover’s intro to RDF by Carl Boettiger.)

Development dilemma

  1. The current development dilemma of dataset is whether dataset() should be developed into an S3 class, and if so, whether it should be inherited from data.frame or tibble, keeping in mind the desire to work well with data.table objects, too.

  2. If not, then the dataset() functions should clearly be seen as helper functions that augment data.frame, tibble and data.table objects for easier use in data publication, release, and placement in databases and on knowledge graphs.

The dataset constructor

In the first step, we organize tidy datasets according to the data cube model originally developed by the SDMX for exchanging statistical data, by clearly stating which columns are to be used for grouping, filtering and for calculations. In this step, we clearly state which tidy columns are to be used as “measurements”, “dimensions” and “attributes” of the dataset. This step will greatly simplify how we obtain tidy subsets from a dataset, or how we reduce the dimensions of a dataset to a triad or a quad to be placed on a knowledge graph.

In the second step, we add metadata as attributes to the R object (regardless of whether it is a data.frame or an inherited tibble/data.table object), which makes finding, understanding and exchanging the data (including refreshing data from the source), as well as its placement in a database or on a knowledge graph, easier. This step will make the dataset easier to understand both for human analysts and for computers, and will greatly increase its reuse potential and the ability to review and replicate it in reproducible research.

A mapping of R objects into these models has numerous advantages: it makes data importing easier and less error-prone; it leaves plenty of room for documentation automation; and it makes the publication of results from R following the FAIR principles much easier.

Data semantics

The tidy data principles are seemingly very simple: each variable is a column, each observation is a row, and each observational unit forms its own table. However, the steps of tidying up a messy dataset require practice; like solving a Rubik's cube, they take a long time to learn to do quickly.

The tidy data principles frame Codd's relational algebra, particularly the third normal form (3NF), in statistical language, and make the storage and reuse of data in relational databases optional. They suggest a format for tabular data and data semantics that makes a very important, and usually the most time- and resource-consuming, part of data analysis far more efficient, because the analyst does not have to reinvent the wheel all the time. The data cube model can be seen as an extension of the tidy model that makes the storage, reuse, and exchange of data among different databases, or their placement on a knowledge graph, optimal.

The wording of the tidy data principles is not sufficiently exact for the data cube model. The 3NF form stipulates that the attributes (e.g. table columns) are functionally dependent solely on the primary key, which means that we must have a primary key that is locally unique in the table. The primary key must be a unique combination of the dimension and attribute columns, including the row identifier (rowid), if it exists. This requires a lot more than the uniqueness of the measurement unit.

Typical dimensions are the reference time period and the reference geographical area. Consider a dataset that contains population growth numbers for a county of a country, collected quarterly in the 2010-2015 period and annually in the 2016-2020 period, with the county borders changing in 2014, and the unit of measure changing from the number of people in thousands to the number of people in millions. The measurement of the reference time period and the aggregation area has changed, too.

library(dataset)
my_dataset <- dataset(
  x = data.frame(
    time = rep(c(2019:2022), 4),
    geo = c(rep("NL", 8), rep("BE", 8)),
    sex = c(rep("F", 4), rep("M", 4), rep("F", 4), rep("M", 4)),
    value = c(1, 3, 2, 4, 2, 3, 1, 5, NA_real_, 4, 3, 2, 1, NA_real_, 2, 5),
    unit = rep("NR", 8),
    freq = rep("A", 8)),
  Dimensions = c("time", "geo", "sex"),
  Measures = "value",
  Attributes = c("unit", "freq"),
  sdmx_attributes = c("sex", "time", "freq"),
  Title = "Example dataset",
  Creator = person("Jane", "Doe")
)

my_dataset <- dataset_local_id(my_dataset)
my_dataset
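
As a quick illustration of the local primary key requirement discussed above, the combination of the dimension and attribute columns should uniquely identify each observation. The following check is a minimal sketch in base R, assuming the dataset object can be coerced back to a plain data.frame:

# combination of dimension and attribute columns
key_cols <- c("time", "geo", "sex", "unit", "freq")
# TRUE when this combination is a locally unique primary key
anyDuplicated(as.data.frame(my_dataset)[, key_cols]) == 0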

The data cube model

Tidy data is designed to be easily handled in relational databases, and it is an ideal format for analysis in a closed system, for example, when all your data for the analysis is stored in a well-designed relational database. These semantics are, however, insufficient for exchanging data across several data sources, a problem that statistical agencies and designers of hyperlinked, web-based products have been working on for a long time.

The dataset package is inspired by the way statisticians have long been organizing data for publications: the data cube model. The data cube model has additional, non-normative prescriptions for the structure of the (tidy) dataset: columns must be grouped into measurements, dimensions, and attributes. The dimensions are non-measured variables, such as reference geographical areas and reference time periods, that are usually used to filter out subsets (slices) from a datacube. Attributes provide information about the measurements, such as the unit of measure or the estimation method, and can usually be seen as metadata for modeling purposes: they contain observation (measurement) level metadata that is also often used for filtering (selecting actual, imputed or missing cases, for example). Measurements are the observations that we actually want to analyze.

The data cube model has long been the preferred format for organizing data for analytical use by SDMX. Statistical agencies exchange data with each other and publish data in datasets that conform to the data cube model. By organizing your R tabular data like statisticians do, you make your work easier to synchronize with statistical data sources: you can download or exchange data in this format. Following the data cube model helps you understand a dataset more easily and bring your data to a tidy, analysis-ready form in one or two steps: usually using attributes for subsetting, dimensions for grouping, and measurements to carry out arithmetic calculations like computing group averages.
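
As an illustration of these roles, the following sketch uses only base R and assumes the my_dataset object created earlier behaves like a regular data.frame: attributes select comparable observations, dimensions group them, and the measure is aggregated.

# keep observations with a comparable unit and frequency (attribute columns)
annual_numbers <- subset(as.data.frame(my_dataset), unit == "NR" & freq == "A")
# group by dimension columns and average the measure
aggregate(value ~ geo + sex, data = annual_numbers, FUN = mean, na.rm = TRUE)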

Interoperability

While tidying aims at preventing the reinvention of the wheel and the repeated work of bringing data from a data store, such as a database, into a form that an analyst can work with, reproducible research has further aims that add to the quality and efficiency of the analysis: it aims to make the tidy datasets and the research output easier to reuse, and easier to review and replicate for quality control and peer review. Data scientists know that however simple the tidy data rules are, it takes a long time to develop a feel for easily putting any messy, stored data into a tidy format. Putting the data into a format that is easy to review, and which offers easy replication, is even harder. As the famous article by Monya Baker, 1,500 scientists lift the lid on reproducibility, shows, the vast majority of scientists have experienced the problem that seemingly replicable, published scientific data cannot be replicated, and more worryingly, the majority of them expressed doubt about their ability to replicate their own findings. Datasets that are not easily replicated are also not easily used as building blocks for further joined, more composite data.

The lack of interoperability is often connected to the barrier of understanding a dataset. Having a simple term in the column heading, and knowing that observations are in rows and variables are in columns, is not always sufficient to understand the meaning of the dataset. The data cube model, which explicitly states the dimensions and attributes of the measurements, greatly increases understandability.

“Data is only potential information, raw and unprocessed, prior to anyone actually being informed by it.” ---Jeffrey Pomerantz, Metadata

Understanding data (stored in a tabular or other format) requires metadata, information by which the data can be represented in a simpler form. The simplest and most useful metadata for this purpose is a title or a label in the RDF model, a single, human readable sentence that describes the data.

The data cube model describes attributes, which can be seen as metadata about the observations, placed into the tabular format itself, and dataset-level metadata, such as the title. What is missing from a tidy dataset stored as a data.frame, tibble or data.table is this metadata.

Attributes

In R, the attributes of an object are designed to record metadata about the object; if it is tabular data, then about the dataset itself. For a tidy dataset, at least the names of the columns, the names of the rows, and the class of the object are recorded. R places very few restrictions on how metadata can be added to the attribute fields of an object, or on adding new fields; you can add far more metadata than the size of the actual data frame if you like.
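
A small base R illustration of how attributes already store metadata for any data frame, and how further fields can be added freely (the "Title" field here is an arbitrary, user-defined name chosen for this example):

df <- data.frame(a = 1:3, b = c("x", "y", "z"))
attributes(df)                                # names, class and row.names are recorded by default
attr(df, "Title") <- "A tiny example table"   # an arbitrary, user-defined metadata field
attributes(df)$Title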

The dataset package defines a pseudo-class that enriches data frames with standardized metadata used in the RDF and SDMX datacube model and in the Dublin Core and DataCite library metadata standards, which make datasets easier to find and attribute, and more interoperable and reusable, by providing more information about their contents.

The dataset class (see dataset()) is a pseudo-class. Although it is defined as a class inherited from the base R data.frame S3 class, it does not in fact alter the way you interact with the data, and for analytical purposes the object remains a data.frame. It is also a pseudo-class because the functions built into the dataset constructor work just as well on tibble or data.table objects, because they only touch the metadata attributes of the R object.

The dublincore() function and its helper functions add metadata as attributes to any R object. The Dublin Core Elements are originally library standards that facilitate finding resources (files) in libraries. The RDF datacube definition, which aims to enable machines to connect datasets automatically, refers to a part of the Dublin Core definition; however, it is good practice to add all Dublin Core elements to a dataset.

The datacite() function and its helper functions add mandatory and optional metadata from the DataCite standard, which is often preferred by open science data repositories aimed at the public storage of datasets. The DataCite standard is newer than Dublin Core and was specifically designed to facilitate finding, accessing, and reusing data. It greatly overlaps with Dublin Core but contains further information that is particularly useful for datasets (for example, the estimated size of the dataset object in memory or in data storage).

Both the Dublin Core Metadata Elements definition 1.1 and the DataCite Metadata Schema 4.4 refer to standard information elements (properties), for example, the ISO standard for describing the size of an object on a computer, or the ISO standards for describing the reference time or reference geographical area of the data collection. Some of the helper functions behind dublincore_add() and datacite_add() already validate the correct formatting of this information; more will be added in later versions.

iris_datacite  <- datacite_add(
  x = dataset(iris, Measures = c(1:4), Attributes = "Species"),
  Title = "Iris Dataset",
  Creator = person("Anderson", "Edgar", role = "aut"),
  Publisher = "American Iris Society",
  Identifier = "https://doi.org/10.1111/j.1469-1809.1936.tb02137.x",
  PublicationYear = 1935,
  Description = "This famous (Fisher's or Anderson's) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.",
  Language = "en")

In R, objects can have arbitrary attributes. For example, a data.frame has a class attribute that tells functions to treat the object as a data.frame. Under the hood, you still keep your data frame object of choice: the good old base R data.frame, or the more modern data.table or tibble.
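
A quick check, as a minimal sketch assuming the iris_datacite object created above, that the enriched object still behaves like a regular data frame for analysis:

inherits(iris_datacite, "data.frame")   # the object is still a data.frame
summary(iris_datacite$Sepal.Length)     # ordinary column operations keep working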

We add descriptive metadata conforming to the Dublin Core and DataCite standards as data frame attributes, because they must clearly describe the dataset. The dataset-level attributes do not interfere with the tidy data concept, because the tidy data concept relates to the contents of the data frame.

datacite(iris_datacite)

Linked Data: From dataset To RDF

The third normal form (3NF), or a more precise form of tidiness, only assumes that each observation has a primary key that is locally unique, i.e. within the dataset; this is insufficient. For updating a dataset, i.e. joining new observations or new columns, we need to work with globally unique observation identifiers, or URIs. (Otherwise, we must map the various locally unique keys to each other, i.e. we need to know the various database schemas.) The idea of linked data, or RDF, is to reduce the data to triples, and to move all metadata, including referential or structural metadata, into these triples.

my_dataset_uri <- dataset_uri(my_dataset, 
                              prefix = "https://example.org/my_iris", 
                              keep_local_id = FALSE)
my_dataset_uri
nq_file <- file.path(tempdir(), "triple_file.nq")
my_triple <- subset(my_dataset_uri, select = c("URI", "value", "unit"))
library(rdflib)
rdf <- rdf()

for (i in seq_len(nrow(my_triple))) {
  rdf_add(rdf = rdf, 
          subject = "", 
          predicate = my_triple$URI[i], 
          object = my_triple$value[i])
}

rdf_serialize(rdf, doc = nq_file)
rdf_parse(nq_file)

For a more comprehensive treatment, see the From dataset To RDF vignette, which in turn heavily borrows from A tidyverse lover’s intro to RDF by Carl Boettiger.

Planned functionality

The dataset package aims to make tidy datasets easier to release, exchange and reuse. The motivation for the ecosystem of the dataset, statcodelists, and dataobservatory packages is to provide a comprehensive toolkit that supports many reproducible research tasks.

The dataset package already has an offspring, the surveydataset package, which aims to connect the dataset functionality with some elements of DDI, a metadata standard for working with social sciences survey data, and to connect the dataset package with the retroharmonize package that harmonizes data from different surveys.

Our real goal is to facilitate efficient reproducible research workflows that flawlessly perform tasks that most R users, even scientific researchers, are unfamiliar with, or, if they are familiar with them, find boring, because they will usually not get credit for them as analysts or researchers.

The datacube model is used by the W3C consortium for describing datasets: tabular data resources that can be easily connected via the internet infrastructure. Because the SDMX standards pre-date the definition of RDF, it took a long time to sufficiently harmonize RDF and SDMX into a standard for statistical data that can be exchanged easily by humans and computers alike, which makes it ideal for reproducible research.

We also aim to replace the survey class in the retroharmonize survey harmonization package with an inherited dataset class that is optimized to contain social sciences survey data.

The connected statcodelists package facilitates further reproducibility with standardized, natural-language-independent codelists for categorical variables. Such categorical variables are correctly interpreted in a wide array of statistical applications and can be easily joined with data from many countries irrespective of the primary sources' language.

See: Linked SDMX Data

References

Baker, Monya. 2016. “1,500 Scientists Lift the Lid on Reproducibility.” Nature 533 (7604): 452–54. https://doi.org/10.1038/533452a.

Capadisli, Sarven, Sören Auer, and Axel-Cyrille Ngonga Ngomo. 2015. “Linked SDMX Data: Path to High Fidelity Statistical Linked Data.” Semantic Web 6 (2): 105–12. https://doi.org/10.3233/SW-130123.

Dublin Core Metadata Initiative. 2020. “DCMI Metadata Terms.” http://dublincore.org/specifications/dublin-core/dcmi-terms/2020-01-20/.

DataCite Metadata Working Group. 2021. “DataCite Metadata Schema Documentation for the Publication and Citation of Research Data and Other Research Outputs. Version 4.4.” DataCite e.V. https://doi.org/10.14454/3w3z-sa82.

Pomerantz, Jeffrey. 2015. Metadata. The MIT Press Essential Knowledge Series. Cambridge, MA, USA: MIT Press.

Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10): 1–23. https://doi.org/10.18637/jss.v059.i10.


