
The dataset R Package


The primary aim of dataset is to create well-referenced, well-described, interoperable datasets from data.frames, tibbles or data.tables that translate well into the W3C DataSet definition within the Data Cube Vocabulary in a reproducible manner. The data cube model itself originated in the Statistical Data and Metadata eXchange (SDMX), and it is almost fully harmonized with the Resource Description Framework (RDF), the standard model for data interchange on the web[^1].

A mapping of R objects into these models has numerous advantages:

  1. It makes data importing easier and less error-prone;
  2. It leaves plenty of room for documentation automation, resulting in far better reusability and reproducibility;
  3. It makes publishing results from R according to the FAIR principles far easier, so the work of the R user becomes more findable, more accessible, more interoperable and more reusable by other users;
  4. It makes placement into relational databases, semantic web applications, archives and repositories possible without time-consuming and costly data wrangling (see From dataset To RDF).

Our package functions work with any structured R object (data.frame, data.table, tibble, or well-structured lists like JSON); however, the best functionality is achieved with the dataset S3 class (see The dataset S3 Class), which inherits from data.frame.
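
To illustrate the general idea of an S3 class that inherits from data.frame and carries descriptive metadata, here is a minimal base-R sketch. The helper `new_described_df()`, the class name `described_df` and the attribute names are hypothetical and are not part of the dataset package; the package's own constructor may store its metadata differently.

```r
# Hypothetical helper, NOT part of the dataset package: wrap a data.frame
# in an S3 subclass and keep descriptive metadata in object attributes.
new_described_df <- function(x, title, measures, attribute_columns = character(0)) {
  stopifnot(
    is.data.frame(x),
    all(c(measures, attribute_columns) %in% names(x))
  )
  attr(x, "Title") <- title
  attr(x, "measures") <- measures
  attr(x, "attribute_columns") <- attribute_columns
  # Prepend the new class so that data.frame methods remain available.
  class(x) <- c("described_df", class(x))
  x
}

iris_described <- new_described_df(
  x = iris,
  title = "Iris Dataset",
  measures = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
  attribute_columns = "Species"
)

inherits(iris_described, "data.frame")  # still usable as a data.frame
attr(iris_described, "Title")           # metadata travels with the object
```

Because the subclass keeps `data.frame` in its class vector, existing base R functions such as `nrow()` or `summary()` continue to work on the object.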

Installation

You can install the development version of dataset from GitHub:

remotes::install_github('dataobservatory-eu/dataset')

or install from CRAN:

install.packages('dataset')

Getting started

The dataset() constructor creates a dataset from a data.frame or similar object.

library(dataset)
#> 
#> Attaching package: 'dataset'
#> The following object is masked from 'package:base':
#> 
#>     as.data.frame
my_iris_dataset <- dataset(
  x = iris, 
  Dimensions = NULL, 
  Measures = c("Sepal.Length", "Sepal.Width",  "Petal.Length", "Petal.Width"), 
  Attributes = "Species", 
  Title = "Iris Dataset", 
  Issued = 1936
)

is.dataset(my_iris_dataset)
#> [1] TRUE

Then you add the metadata:

my_iris_dataset <- dublincore_add(
  x = my_iris_dataset,
  Creator = person("Edgar", "Anderson", role = "aut"),
  Publisher = "American Iris Society",
  Source = "https://doi.org/10.1111/j.1469-1809.1936.tb02137.x",
  Date = 1935,
  Language = "en"
)

print(my_iris_dataset)
#> Iris Dataset by Edgar Anderson
#> Published by American Iris Society
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1           5.1         3.5          1.4         0.2  setosa
#> 2           4.9         3.0          1.4         0.2  setosa
#> 3           4.7         3.2          1.3         0.2  setosa
#> 4           4.6         3.1          1.5         0.2  setosa
#> 5           5.0         3.6          1.4         0.2  setosa
#> 6           5.4         3.9          1.7         0.4  setosa
#> 7           4.6         3.4          1.4         0.3  setosa
#> 8           5.0         3.4          1.5         0.2  setosa
#> 9           4.4         2.9          1.4         0.2  setosa
#> 10          4.9         3.1          1.5         0.1  setosa
#> 
#> ... 140 further observations.
#> Source:https://doi.org/10.1111/j.1469-1809.1936.tb02137.x.
summary(my_iris_dataset)
#> Iris Dataset by Edgar Anderson
#> Published by American Iris Society
#>   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
#>  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
#>  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
#>  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
#>  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
#>  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
#>  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
#>        Species  
#>  setosa    :50  
#>  versicolor:50  
#>  virginica :50  
#>                 
#>                 
#>                 
#> Source:https://doi.org/10.1111/j.1469-1809.1936.tb02137.x.
metadata <- dublincore(x=my_iris_dataset)
#> Title: Iris Dataset 
#> Publiser:  American Iris Society  | Source:  https://doi.org/10.1111/j.1469-1809.1936.tb02137.x  | Date:  1936  | Language:  eng  | Identifier:   | Rights:   | Description:   | 
#> names:  Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species 
#> - dimensions: <none>
#> - measures: Sepal.Length (numeric)  Sepal.Width (numeric)  Petal.Length (numeric)  Petal.Width (numeric)  
#> - attributes: Species (factor)

Note that the metadata object holds more structure than its printed form shows:

str(metadata)
#> List of 11
#>  $ names     : chr [1:5] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" ...
#>  $ dimensions:'data.frame':  0 obs. of  4 variables:
#>   ..$ names      : chr(0) 
#>   ..$ class      : chr(0) 
#>   ..$ isDefinedBy: chr(0) 
#>   ..$ codeList   : chr(0) 
#>  $ measures  :'data.frame':  4 obs. of  4 variables:
#>   ..$ names      : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
#>   ..$ class      : chr [1:4] "numeric" "numeric" "numeric" "numeric"
#>   ..$ isDefinedBy: chr [1:4] "https://purl.org/linked-data/cube" "https://purl.org/linked-data/cube" "https://purl.org/linked-data/cube" "https://purl.org/linked-data/cube"
#>   ..$ codeListe  : chr [1:4] "not yet defined" "not yet defined" "not yet defined" "not yet defined"
#>  $ attributes:'data.frame':  1 obs. of  4 variables:
#>   ..$ names      : chr "Species"
#>   ..$ class      : chr "factor"
#>   ..$ isDefinedBy: chr "https://purl.org/linked-data/cube|https://raw.githubusercontent.com/UKGovLD/publishing-statistical-data/master/"| __truncated__
#>   ..$ codeListe  : chr "not yet defined"
#>  $ Type      :List of 2
#>   ..$ resourceType       : chr "DCMITYPE:Dataset"
#>   ..$ resourceTypeGeneral: chr "Dataset"
#>  $ Title     :List of 1
#>   ..$ Title: chr "Iris Dataset"
#>  $ Date      : num 1936
#>  $ Creator   :Class 'person'  hidden list of 1
#>   ..$ :List of 5
#>   .. ..$ given  : chr "Edgar"
#>   .. ..$ family : chr "Anderson"
#>   .. ..$ role   : chr "aut"
#>   .. ..$ email  : NULL
#>   .. ..$ comment: NULL
#>  $ Source    : chr "https://doi.org/10.1111/j.1469-1809.1936.tb02137.x"
#>  $ Publisher : chr "American Iris Society"
#>  $ Language  : chr "eng"
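
Since the metadata object is a nested list, individual fields can be reached with ordinary `$` indexing. The sketch below builds a hand-made stand-in, `metadata_sketch`, that copies a few fields from the `str()` output above so that it runs without the package; it is an assumption that the real object can be navigated in the same way.

```r
# Hand-built stand-in that mirrors part of the str() output above;
# it is NOT produced by the dataset package.
metadata_sketch <- list(
  names = names(iris),
  measures = data.frame(
    names = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
    class = rep("numeric", 4),
    stringsAsFactors = FALSE
  ),
  attributes = data.frame(
    names = "Species",
    class = "factor",
    stringsAsFactors = FALSE
  ),
  Title = list(Title = "Iris Dataset"),
  Publisher = "American Iris Society",
  Language = "eng"
)

metadata_sketch$measures$names   # names of the measure columns
metadata_sketch$Title$Title      # the title is nested one level down
```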

Development plans

This package is in an early development phase. The current dataset S3 class is inherited from the base R data.frame. Later versions may change to the modern tibble, which carries a larger dependency footprint but is easier to work with. Easy interoperability with the data.table package remains a top development priority.

The datacube model in R

According to the RDF Data Cube Vocabulary, a DataSet is a collection of statistical data that corresponds to a defined structure. The data in a data set can be roughly described as belonging to one of the following kinds:

| Information  | dataset                              |
|:------------:|--------------------------------------|
| dimensions   | first column section of the dataset  |
| measurements | second column section of the dataset |
| attributes   | third column section of the dataset  |
| reference    | attributes of the R object           |
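
The column layout described in the table can be sketched in base R as follows. The column roles are taken from the iris example earlier in this README; the variable names (`shuffled`, `iris_cube`, and so on) are illustrative only, and the package's constructor may arrange columns by other means.

```r
# Start from iris with its columns in an arbitrary order.
shuffled <- iris[, c("Species", "Sepal.Length", "Petal.Length",
                     "Sepal.Width", "Petal.Width")]

# Column roles, as in the iris example of this README.
dimension_cols <- character(0)   # iris has no dimension columns
measure_cols   <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
attribute_cols <- "Species"

# Reorder: dimensions first, then measures, then attributes.
iris_cube <- shuffled[, c(dimension_cols, measure_cols, attribute_cols), drop = FALSE]

# Reference metadata is kept in attributes of the R object.
attr(iris_cube, "Title") <- "Iris Dataset"

names(iris_cube)
```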

Our dataset class follows the organizational model of the datacube, which is used by the Statistical Data and Metadata eXchange (SDMX), and which is also described in a non-normative manner by the RDF Data Cube Vocabulary. While the SDMX standards predate the Resource Description Framework (RDF) for the semantic web, the two are already harmonized to a great degree, which enables users and data publishers to create machine-to-machine connections among statistical data. Our goal is to create a modern data frame object in R with utilities that allow the R user to benefit from synchronizing data with semantic web applications, including statistical resources, libraries, or open science repositories.
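
To give a rough sense of what synchronizing tabular data with RDF-style applications involves, the following base-R sketch reshapes a data.frame into subject, predicate and object statements, one per cell. The function `to_triples()` and the `obs/` and `def/` prefixes are hypothetical; this is not the package's export routine, and a real export would use proper URIs and typed literals.

```r
# Hypothetical helper, NOT part of the dataset package: turn each cell of a
# data.frame into one subject-predicate-object statement.
to_triples <- function(x, subject_prefix = "obs/", predicate_prefix = "def/") {
  stopifnot(is.data.frame(x), nrow(x) > 0)
  subjects <- paste0(subject_prefix, seq_len(nrow(x)))  # one subject per row
  triples <- lapply(names(x), function(col) {
    data.frame(
      subject   = subjects,
      predicate = paste0(predicate_prefix, col),  # one predicate per column
      object    = as.character(x[[col]]),         # cell values as plain strings
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, triples)
}

iris_triples <- to_triples(head(iris, 2))
nrow(iris_triples)   # 2 observations x 5 columns = 10 statements
```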

The vignette The dataset S3 Class explains in more detail our interpretation of the datacube model, and some of the considerations and dilemmas that we face in the further development of this early-stage package.

Our datasets:

Code of Conduct

Please note that the dataset package is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

The rOpenSci Community Contributing Guide, a guide to help people find ways to contribute to rOpenSci, also applies, because dataset is under software review for potential inclusion in rOpenSci.

[^1]: RDF Data Cube Vocabulary, W3C Recommendation, 16 January 2014, https://www.w3.org/TR/vocab-data-cube/; Introduction to SDMX data modeling, https://www.unescap.org/sites/default/files/Session_4_SDMX_Data_Modeling_%20Intro_UNSD_WS_National_SDG_10-13Sep2019.pdf




dataset documentation built on March 31, 2023, 10:24 p.m.