The dataset S3 Class

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(dataset)

In R, datasets are usually stored in a data.frame() object or in one of its modernized versions. For example, tibble::tibble() and data.table::data.table() both inherit from the base data.frame().

The base data.frame() constructor, like most base R types, is very flexible. It allows the use of any kind of metadata attached to the object.

foo <- data.frame( x = c(1,2), y = c(3,4))
attr(foo, "Title") <- "My Foo Object"
attributes(foo)
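This flexibility cuts both ways. As a minimal sketch in base R (the objects `foo_a` and `foo_b` and their attribute names are made up for illustration), two authors can record the same kind of metadata under different names, so generic code cannot rely on finding it:

```r
# Two hypothetical data frames whose authors chose different attribute names
foo_a <- data.frame(x = c(1, 2), y = c(3, 4))
attr(foo_a, "Title") <- "My Foo Object"

foo_b <- data.frame(x = c(5, 6), y = c(7, 8))
attr(foo_b, "title") <- "Another Foo Object"

# A generic lookup for "Title" succeeds for one object and fails for the other
title_a <- attr(foo_a, "Title", exact = TRUE)
title_b <- attr(foo_b, "Title", exact = TRUE)

title_a           # the title is found
is.null(title_b)  # TRUE: the metadata exists, but under a different name
```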

For reproducible research, publication, or linking resources on the web, standardized metadata is critically important. The dataset() class aims to be a modernized data.frame with standardized attributes.

dataset_title(iris_dataset)
publisher(iris_dataset)
library(here)
library(knitr)

According to the RDF Data Cube Vocabulary, a DataSet is a collection of statistical data that corresponds to a defined structure. The data in a data set can be roughly described as belonging to one of the following kinds:

| Information | dataset |
| :--: | ---- |
| dimensions | first column section of the dataset |
| measurements | second column section of the dataset |
| attributes | third column section of the dataset |
| reference | attributes of the R object |

rd_e_gerdtot <- eurostat::get_eurostat('rd_e_gerdtot')
head(rd_e_gerdtot)

Dimensions

Dimensions are usually needed in data analysis because they are used for subsetting (slicing) the dataset. They contain information about the reference time period and geographical area.

In a dataset that has homogeneous dimensions (all data relate to the year 2022 and the area of the United States), you could move the dimensions into the attributes of the R object, or simply omit them. However, dimensions are critically important for filtering out the observations (measurements) that you want to work with or to correctly join (integrate) datasets. If you want to create a composite indicator from two datasets that are related to the United States and the year 2022, you do not want to match measurements about 2021 or Canada accidentally.

dimensions(rd_e_gerdtot) <- c("geo", "time", "sectperf")
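The point about joining can be illustrated with a small base R sketch. The two data frames and all their values are made up for illustration, and `merge()` stands in here for whatever join function you prefer; the idea is only that joining on every dimension keeps 2021 or Canadian observations from being matched to United States data for 2022.

```r
# Hypothetical toy data with made-up values
gdp <- data.frame(
  geo  = c("US", "US", "CA"),
  time = c(2021, 2022, 2022),
  gdp  = c(100, 110, 50)
)
rd_spending <- data.frame(
  geo  = c("US", "CA"),
  time = c(2022, 2022),
  rd   = c(11, 5)
)

# Joining on both dimensions pairs only observations
# that refer to the same area and the same year
joined <- merge(gdp, rd_spending, by = c("geo", "time"))

# A simple composite indicator computed on correctly matched rows
joined$rd_share <- joined$rd / joined$gdp

# Slice the result with the dimensions: United States, 2022
us_2022 <- joined[joined$geo == "US" & joined$time == 2022, ]
```

The 2021 observation has no counterpart in the second table, so the inner join drops it instead of pairing it with a 2022 value.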

Measurements

The measurements are the actual observed values. In a long-form tidy dataset you usually have only one: 'value'.

measures(rd_e_gerdtot) <- "value"

Attributes

Attributes are similar to dimensions, but they can be fully static and constant in a dataset. You may have measurements for the same reference area and time available in both kilograms and tons in the same dataset, in which case you will likely filter for the correct unit of measure when you do analytical work or join (integrate) the data.

If your measurement unit is always millimeters (like in the iris dataset), it is tempting to treat this as a dataset-wide constant (and therefore move it to the attributes of the data frame R object), but we do not recommend this approach. Imagine that you want to join this dataset with some other data that is measured in centimeters or inches, or a dataset that has values in both millimeters and centimeters. To correctly match your data you will be filtering on attributes, too.

Attributes that may vary across observations (rows) should remain in the dataset in the datacube model. To avoid confusion with the base R attributes() function, we named the function that sets the attributes within a dataset attributes_measures().

attributes_measures(rd_e_gerdtot) <- "unit"
datacite(rd_e_gerdtot)
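To illustrate why the unit should stay in the rows, here is a minimal base R sketch. The data frames, the unit codes `MM` and `CM`, and all values are hypothetical; the sketch only shows that a per-row unit column lets you filter or harmonise observations after combining datasets.

```r
# Hypothetical toy data with made-up values: the unit is stored per row
lengths_mm <- data.frame(
  id    = c("a", "b"),
  value = c(51, 49),
  unit  = c("MM", "MM")
)
lengths_mixed <- data.frame(
  id    = c("c", "d", "d"),
  value = c(5.0, 47, 4.7),
  unit  = c("CM", "MM", "CM")
)

combined <- rbind(lengths_mm, lengths_mixed)

# Because the unit travels with each observation, we can filter on it ...
only_mm <- combined[combined$unit == "MM", ]

# ... or harmonise all values to millimeters before further analysis
to_mm <- c(MM = 1, CM = 10)
combined$value_mm <- combined$value * unname(to_mm[as.character(combined$unit)])
```

Had the unit been stored only as an object-level attribute of `lengths_mm`, it would not have survived `rbind()` in a usable form, and the mixed rows could not be told apart.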

Reference and FAIR metadata

Our dataset R package aims to increase the Findability, Accessibility, Interoperability, and Reuse of digital assets, particularly datacubes and datasets used in statistics and data analysis. The FAIR principles "…emphasize machine-actionability (i.e., the capacity of computational systems to find, access, interoperate, and reuse data with none or minimal human intervention) because humans increasingly rely on computational support to deal with data as a result of the increase in volume, complexity, and creation speed of data."

This is the role of the reference metadata in the RDF Data Cube Vocabulary and the SDMX data cube model. We generally keep the reference metadata as attributes() of the R object, because they do not relate to the rows (observations) of the data, but to the entire dataset. However, omitting all reference metadata from the columns is not a good practice if you want your data to be used in a knowledge graph (or the semantic web, or Web 3.0).

toc <- eurostat::get_eurostat_toc()
rd_e_gerdtot_reference <- toc[which(toc$code == "rd_e_gerdtot"),]

datacite_add(rd_e_gerdtot, 
             Title = 'GERD by sector of performance', 
             Creator = person("Daniel", "Antal"), 
             Identifier = 'eurostat_rd_e_gerdtot', 
             Publisher = 'Eurostat', 
             PublicationYear = substr(rd_e_gerdtot_reference$`last update of data`, 7,11), 
             Subject = subject_create("Research", 
               subjectScheme = "LC Subject Headings", 
               schemeURI = "http://id.loc.gov/authorities/subjects", 
               valueURI = "http://id.loc.gov/authorities/subjects/sh85113021"), 
             Language = "English")
datacite(rd_e_gerdtot)

Following the datacube model, our datasets are data frames with clearly defined dimensions (time, geo, sex), measurements (value), and attributes (unit, freq, status). In this example, all dimensions and values follow the SDMX attribute definition, i.e. they have a standardized, natural-language-independent codelist. (To use these codelists, use the statcodelist data package.)

include_graphics(here("vignettes", "dataset_structure.png"))

Row identifiers and dimensions reduction

R objects inherited from the base data.frame() have row (observation) identifiers stored as the row.names() attribute. This works well if you work with a single data frame, but it is not sufficient to identify observations if you work with several data frames and want to organize them into a database, join them into new tables, or make them available on a knowledge graph.

When joining data tables or working in a relational database, you need unique identifiers for each unique observation unit in your system. If you want to broaden the usability of your data to the entire semantic web, and use it as linked data, you need a truly unique identifier (URI) for each observation.

We recommend the use of an explicit row identifier. The popular modern R data frames, tibble::tibble() and data.table::data.table() use row identifiers.

One of the advantages of using an explicit row identifier is that it can form the root for minting a URI for the entire dataset by collapsing all dimensions and attributes into a concatenated string starting with the row identifier. This will make your dataset ready to be used in triples, a strict, tidy, three-column long-form dataset used in linked open data applications. As mentioned earlier, in homogeneous (or homogeneously subsetted) datasets, you could move the dimensions and the attributes out of the data frame cells into the descriptive attributes. However, if you want to work with linked data, you must have all structural information present in the data cells, because this allows data from different publishers to be linked together without a utopian, global database map.

In the following example, we concatenate the rowid and the time, geo and sex dimensions into a single URI. We can do this because in a well-organized dataset the combination of dimensions is unique (otherwise, we would simply be duplicating an observation). However, adding the attributes to the URI would be superfluous because their combination is not unique in the observations.

include_graphics(here("vignettes", "RDF_chart_1.png"))
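The concatenation described above can be sketched in base R as follows. The toy data, the `obs` object, and the `https://example.com/dataset/` base URI are placeholders chosen for illustration; this is not how the dataset package itself mints identifiers.

```r
# Hypothetical toy dataset with made-up values
obs <- data.frame(
  rowid = c("obs1", "obs2", "obs3"),
  time  = c(2021, 2022, 2022),
  geo   = c("US", "US", "CA"),
  sex   = c("F", "F", "M"),
  value = c(10, 12, 7),
  unit  = c("NR", "NR", "NR")
)

# In a well-organized dataset each combination of dimensions occurs only once
stopifnot(anyDuplicated(obs[, c("time", "geo", "sex")]) == 0)

# Placeholder base URI; replace it with a namespace that you control
base_uri <- "https://example.com/dataset/"

# Collapse the row identifier and the dimensions into one string per observation
obs$uri <- paste0(
  base_uri,
  paste(obs$rowid, obs$time, obs$geo, obs$sex, sep = "_")
)
```

The `unit` attribute is left out of the URI on purpose: it takes the same value in every row here, so it adds nothing to the uniqueness of the identifier.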

The From dataset To RDF vignette article shows you how to organize your data into strict, tidy, three-column triples that can be serialized into RDF data.
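As a rough idea of what such a three-column layout looks like, the following base R sketch reshapes a tiny data frame into subject, predicate, and object columns. The `eg:` identifiers and all values are made up, and the reshaping code is only an illustration under those assumptions, not the procedure used in that vignette.

```r
# Hypothetical toy data: one identifier column and three data columns
obs <- data.frame(
  uri   = c("eg:obs1", "eg:obs2"),
  geo   = c("US", "CA"),
  value = c(10, 7),
  unit  = c("NR", "NR")
)

# Every column except the identifier becomes a predicate
predicates <- setdiff(names(obs), "uri")

# One (subject, predicate, object) row per cell of the original table
triples <- do.call(rbind, lapply(predicates, function(p) {
  data.frame(
    subject   = obs$uri,
    predicate = p,
    object    = as.character(obs[[p]])
  )
}))
```

Each of the two observations contributes one row per predicate, so the six cells of the original table become six triples.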




dataset documentation built on March 31, 2023, 10:24 p.m.