The dataset S3 Class
In dataset: Create Data Frames that are Easier to Exchange and Reuse

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(dataset)

In the R language, datasets are usually contained in a data.frame() object, or in one of their modernized versions. For example, tibble::tibble() or data.table::data.table() are inherited from the base data.frame().

The base data.frame() constructor, like most base R types, is very flexible. It allows the use of any kind of metadata attached to the object.

foo <- data.frame( x = c(1,2), y = c(3,4))
attr(foo, "Title") <- "My Foo Object"
attributes(foo)

For reproducible research, publication, or linking resources on the web, the standardization of metadata is critically important. The aim dataset() class is a modernized data.frame that has standardized attributes.

dataset_title(iris_dataset)
publisher(iris_dataset)

library(here)
library(knitr)

According to the RDF Data Cube Vocabulary DataSet is a collection of statistical data that corresponds to a defined structure. The data in a data set can be roughly described as belonging to one of the following kinds:

Observations: these are the measured values, and the cells of a data frame object in R.
Organizational structure: To locate an observation within the hypercube, one has at least to know the value of each dimension at which the observation is located, so these values must be specified for each observation. Datasets can have additional organizational structure in the form of slices as described in section 7.2.
Structural metadata: Metadata to interpret the data. What is the unit of measurement? Is it a normal value or a series break? Is the value measured or estimated? These metadata are provided as attributes and can be attached to individual observations, or to higher levels.
Reference metadata: Metadata that describes the dataset as a whole, such as categorization of the dataset, its publisher, or an endpoint where it can be accessed.

| Information | dataset | | :--: | ----| | dimensions | first column section of the dataset | | measurements | second column section of the dataset | | attributes | third column section of the dataset | | reference | attributes of the R object |

rd_e_gerdtot <- eurostat::get_eurostat('rd_e_gerdtot')
head(rd_e_gerdtot)

Dimensions

Dimensions are usually needed in data analysis because they are used to subsetting (slicing) the dataset. They contain information about the reference time period and geographical area.

In a dataset that has homogeneous dimensions (all data relate to the year 2022 and the area of the United States), you could move the dimensions into the attributes of the R object, or simply omit them. However, dimensions are critically important for filtering out the observations (measurements) that you want to work with or to correctly join (integrate) datasets. If you want to create a composite indicator from two datasets that are related to the United States and the year 2022, you do not want to match measurements about 2021 or Canada accidentally.

dimensions(rd_e_gerdtot) <- c("geo", "time", "sectperf")

Measurements

The measurements are the actual observed values. In a long-form tidy dataset you usually have only one: 'value'

measures(rd_e_gerdtot) <- "value"

Attributes

Attributes are similar to dimensions, but they can be fully static and constant in a dataset. You may have measurements for the same reference area and time available in both kilograms and tons in the same dataset, in which case you will likely use filter the correct unit of measure when you do analytical work or join (integrate) the data.

If your measurement unit is always millimeters (like in the iris dataset), it is tempting to treat this as a dataset-wide constant (and therefore move it to the attributes of the data frame R object), but we do not recommend this approach. Imagine that you want to join this dataset with some other data that is measured in centimeters or inches, or a dataset that has values in both millimeters and centimeters. To correctly match your data you will be filtering on attributes, too.

Attributes that may vary across observations (rows) should remain in the dataset in the datacube model. To avoid confusion with the base R attributes() function, we named the function that sets the attributes within a dataset to attributes_measures().

attributes_measures(rd_e_gerdtot) <- "unit"
datacite(rd_e_gerdtot)

Reference and FAIR metadata

Our dataset R package aims to increase the Findability, Accessibility, Interoperability, and Reuse of digital assets, particularly datacubes and datasets used in statistics and data analysis. The FAIR principles "…emphasize machine-actionability (i.e., the capacity of computational systems to find, access, interoperate, and reuse data with none or minimal human intervention) because humans increasingly rely on computational support to deal with data as a result of the increase in volume, complexity, and creation speed of data."

This is the role of the reference metadata in the RDF Data Cube Vocabulary and the SDMX data cube model. We generally keep the reference metadata as attributes() of the R object, because they do not relate to the rows (observations) of the data, but the entire set of data. However, omitting all reference metadata from the columns is not a good practice if you want your data to be used in a knowledge graph (or the semantic web, or the Web 3.0.)

toc <- eurostat::get_eurostat_toc()
rd_e_gerdtot_reference <- toc[which(toc$code == "rd_e_gerdtot"),]

datacite_add(rd_e_gerdtot, 
             Title = 'GERD by sector of performance', 
             Creator = person("Daniel", "Antal"), 
             Identifier = 'eurostat_rd_e_gerdtot', 
             Publisher = 'Eurostat', 
             PublicationYear = substr(rd_e_gerdtot_reference$`last update of data`, 7,11), 
             Subject = subject_create("Reserach", 
               subjectScheme = "LC Subject Headings", 
               schemeURI = "http://id.loc.gov/authorities/subjects", 
               valueURI = "http://id.loc.gov/authorities/subjects/sh85113021"), 
             Language = "English")

datacite(rd_e_gerdtot)

Following the datacube model, our datasets are data frames with clearly defined dimensions (time, geo, sex), measurements (value), and attributes (unit, freq, status). In this example, all dimensions and values are following the SDMX attribute definition, i.e. they have a standardized, natural language independent codelist. (To use these codelists, use the statcodelist data package.)

include_graphics(here("vignettes", "dataset_structure.png"))

Row identifiers and dimensions reduction

R objects inherited from the base data.frame() have row (observation) identifiers as row.names() attributes. This works well if you work with a single data frame, but this approach is not sufficient to identify observations if you work with several data frame, and you want to organize them into a database, or join them into new tables, or you want to make them available on a knowledge graph.

When joining data tables or working in a relational database, you need unique identifiers for each unique observation unit in your system. If you want to broaden the usability of your data to the entire semantic web, and use it as linked data, you need a truly unique identifier (URI) for each observation.

We recommend the use of an explicit row identifier. The popular modern R data frames, tibble::tibble() and data.table::data.table() use row identifiers.

One of the advantages of using an explicit row identifier is that it can form the root for minting a URI for the entire dataset by collapsing all dimensions and attributes into a concatenated string starting with the row identifier. This will make your dataset ready to be used in triplets, a strict, tidy, three-column long-form dataset used in linked open data applications. As mentioned earlier, in homogeneous (or homogeneously subsetted) datasets, you could move the dimensions and the attributes out from the data frame cells into the descriptive attributes. However, if you want to work with linked data, you must have all structural information present in the data cells, because this makes it possible that different data publisher's data can be linked together without having a utopistic, global database map.

In the following example, we concatenate the rowid, and the time, geo and sex dimensions into a single URI. We can do this because in a well-organized dataset the combination of dimensions is unique (otherwise, we would be just simply duplicating an observation.) However, adding the attributes to the URI would be superfluous because their combination is not unique in the observations.

include_graphics(here("vignettes", "RDF_chart_1.png"))

The From dataset To RDF vignette article shows you how to organize your data into strict, tidy, three-column triples that can be serialized into RDF data.