Home

/

GitHub

/

In rOpenGov/openthl: R Tools for THL Open Data API

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(openthl)

About

This article desribes the phase of development at 2020/10. It describes several perspectives of the THL open data API and the openthl package.

For a description of the API, see THL open data API docs.

API queries and general data retrieval

API data types

The API can return the following type of data

JSON (API meta information: hrefs and names of datasets)
JSONP (dimensions of a dataset: data labels and classifications)
JSONSTAT (data)

General API queries

The main API interaction is coded in the file api.R. The function getFromAPI() is a general purpose data retrieval function. It takes as parameter the URL and the 'type' of the data, which is either 'meta' or 'data'. Depending on the type, the parsing of the response is handled by either jsonlite (meta) or rjstat (data) packages. String manipulation is utilised to handle the JSONP case after which jsonlite is used.

The general idea is that this main retrieval function is called by more specialised functions, which further parse the content into useful formats (eg. data.frames, specialised S3 classes).

url <- openthl:::api_data_url(path = "epirapo")
reslist <- openthl:::getFromAPI(url, type = "meta")
reslist

API URL's

There are two public base URLs for the API: beta and prod.

openthl:::url_base(type = "prod") # default
openthl:::url_base("beta")

The URL to the API is given by

url_api(type = "prod") # default

The function api_data_url() builds URLs which can be queried by getFromAPI(). The result is a character vector which has as an attribute the main API URL. It also has the S3 class "api-data-url".

The API can return csv or json type data, but I think it would make sense for the R package to only interact with JSON.

openthl:::api_data_url("epirapo", format = "json")

Relevant source codes:

api.R
urls.R

Development thoughts:

There is currently no clear plan on how to query the beta API.
the scope of the package could maybe be restricted to JSON data and hence all function arguments related to the format may not be needed.

API exploration and user starting point

API terminology

The API terminology includes the following hierarchy

aihealue (subject)
hydra (hydra)
kuutio (cube)

The subject is a bit like a schema. A single subject can include multiple hydras. A single hydra can include multiple cubes. A cube is a dataset with labels according to a single language. I believe that there are always as many cubes in the hydra as there are translations to the dataset.

User interaction with the terminology

thlSubject() lists all cubes belonging to a subject

subject <- "epirapo"
x <- thlSubject(subject)
x

thlDatasets() takes the output of thlSubject() and presents the same information, but parses the hrefs into columns. There is probably no good reason why thlDatasets() could not simply accept the subject character name as input, but currently it only accepts the object returned by thlSubject().

thlDatasets(x)

thlHydra() takes a subject name and a hydra name and returns hrefs to the cubes (datasets) in that hydra. For example, the hydra covid19case is translated into fi, en and sv, so it has 3 cubes.

thlHydra(subject, hydra = "covid19case")

Development thoughts:

It is a bit unclear what should be the starting point for the user. Is it an url to a cube? Is it an URL to a hydra? Is it the name of the subject and hydra?
thlDatasets should probably take as input the subject name (i.e. call thlSubject() internally first)

Main functionality source file location: retrieve.R

Cube dimensions: labels and metadata

Dimensions

A single dimension in the cube is a hierarchical structure with multiple stages. An example is Area with stages hospital district (stage 1) and municipality (stage 2).

thlCube object

the thlCube function now returns an object which includes

url: the url of the cube (fact url), which is also the argument of thlCube
dimensions: A list of data frames which include the complete hierarchy of each of the dimensions in the dataset (including also measures).

library(openthl)
urls <- thlHydra("toitu", "ennakko3")
url <- urls[1, 1] # fi href
cube <- openthl::thlCube(url)
names(cube$dimensions)

Dimensions object

The hierarchical dimension information is presented as a wide format data frame. The prefix in the column names indicates the stage. There is a single row per a unique label in the highest hierarchy level. In the example below the highest hierarchy level is municipality, so there are 311 rows (number of municipalities) in the dimension data.frame.

str(cube$dimensions[[1]]) # first dimension

Dimension retrieval and parsing

The function get_dimensions queries the API for dimension information.

url <- "https://sampo.thl.fi/pivot/prod/en/epirapo/covid19case/fact_epirapo_covid19case.json"
dimensions <- openthl:::get_dimensions(url) # list

The function parse_dimensions() parses all dimensions as a list of data frames (each with S3 class 'hydra_dimension_df')

# parse all dimensions as a list of data frames (each with S3 class 'hydra_dimension_df')
DF <- openthl:::parse_dimensions(dimensions)
names(DF)

parse_dimensions() uses getHierarchy(), which parses a single dimension as a data frame.

# parse a single dimension as a data frame
df <- openthl:::getHierarchy(dimensions$children[[1]], parent_id = dimensions$id[[1]])
str(df)

Data retrieval

Backend

Data retriaval and parsing is largely unimplemented. The general data retriaval function with type 'data' should be tested and parsing implemented.

User interface

Methods could be written which utilise the object returned by openthl::thlCube, which contains the dimension meta informantion, to build queries which

retrieve data
adds labels to the data

For example the following methods could be implemented:

select() (builds a query which chooses dimensions/measures)
filter() (adds a filter to the query)
collect() (retrieves the data)

Some choices need to be made regarding how the user refers to the dimensions and measures, i.e. whether to use ID's or labels. It may make sense to use ID's when select():ing. A stage could be referred to by for example .. This needs some consideration and exploration on what is most straight forward.

rOpenGov/openthl documentation built on Dec. 13, 2020, 5:49 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

rOpenGov/openthl
R Tools for THL Open Data API

In rOpenGov/openthl: R Tools for THL Open Data API

About

API queries and general data retrieval

API data types

General API queries

API URL's

API exploration and user starting point

API terminology

User interaction with the terminology

Cube dimensions: labels and metadata

Dimensions

thlCube object

Dimensions object

Dimension retrieval and parsing

Data retrieval

Backend

User interface

R Package Documentation

Browse R Packages

We want your feedback!

rOpenGov/openthl R Tools for THL Open Data API

In rOpenGov/openthl: R Tools for THL Open Data API

About

API queries and general data retrieval

API data types

General API queries

API URL's

API exploration and user starting point

API terminology

User interaction with the terminology

Cube dimensions: labels and metadata

Dimensions

thlCube object

Dimensions object

Dimension retrieval and parsing

Data retrieval

Backend

User interface

R Package Documentation

Browse R Packages

We want your feedback!

rOpenGov/openthl
R Tools for THL Open Data API