eurostat-package: R Tools for Eurostat open data

eurostat-packageR Documentation

R Tools for Eurostat open data

Description

Tools to download data from the Eurostat database https://ec.europa.eu/eurostat together with search and manipulation utilities.

Details

Package eurostat
Type Package
Version 4.0.0
Date 2014-2023
License BSD_2_clause + file LICENSE
LazyLoad yes

Eurostat

Eurostat website: https://ec.europa.eu/eurostat Eurostat database: https://ec.europa.eu/eurostat/web/main/data/database

Information about the data update schedule from Eurostat: "Eurostat datasets are updated twice a day at 11:00 and 23:00 CET, if newer data is available or for structural changes, for example for the dimensions in the dataset.

The Eurostat database always contains the latest version of the datasets, meaning that there is no versioning or documentation of past versions of the data."

Data source: Eurostat SDMX 2.1 Dissemination API

Data is downloaded from Eurostat SDMX 2.1 API endpoint as compressed TSV files that are transformed into tabular format. See Eurostat documentation for more information: https://wikis.ec.europa.eu/display/EUROSTATHELP/API+SDMX+2.1+-+data+query

The new dissemination API replaces the old bulk download facility that was used by Eurostat before October 2023 and by the eurostat R package versions before 4.0.0. See Eurostat documentation about the transition from Bulk Download to API for more information about the differences between the old bulk download facility and the data provided by the new API connection: https://wikis.ec.europa.eu/display/EUROSTATHELP/Transition+-+from+Eurostat+Bulk+Download+to+API

See especially the document Migrating_to_API_TSV.pdf that describes the changes in TSV file format in new applications.

For more information about SDMX 2.1, see SDMX standards: Section 7: Guidelines for the use of web services, Version 2.1: https://sdmx.org/wp-content/uploads/SDMX_2-1_SECTION_7_WebServicesGuidelines.pdf

Disclaimer: Availability of filtering functionalities

Currently it only possible to download filtered data through API Statistics (JSON API) when using eurostat package, although technically filtering datasets downloaded through the SDMX Dissemination API is also supported by Eurostat. We may support this feature in the future. In the meantime, if you are interested in filtering Dissemination API data queries manually, please consult the following Eurostat documentation: https://wikis.ec.europa.eu/display/EUROSTATHELP/API+SDMX+2.1+-+data+filtering

Data source: Eurostat API Statistics (JSON API)

Data is downloaded from Eurostat API Statistics. See Eurostat documentation for more information about data queries in API Statistics https://wikis.ec.europa.eu/display/EUROSTATHELP/API+Statistics+-+data+query

This replaces the old JSON Web Services that was used by Eurostat before February 2023 and by the eurostat R package versions before 3.7.13. See Eurostat documentation about the migration from JSON web service to API Statistics for more information about the differences between the old and the new service: https://wikis.ec.europa.eu/display/EUROSTATHELP/API+Statistics+-+migrating+from+JSON+web+service+to+API+Statistics

For easily viewing which filtering options are available - in addition to the default ones, time and language - Eurostat Web services Query builder tool may be useful: https://ec.europa.eu/eurostat/web/query-builder

Filtering datasets

When using Eurostat API Statistics (JSON API), datasets can be filtered before they are downloaded and saved in local memory. The general format for filter parameters is ⁠<DIMENSION_CODE>=<VALUE>⁠.

Filter parameters are optional but the used dimension codes must be present in the data product that is being queried. Dimension codes can vary between different data products so it may be useful to examine new datasets in Eurostat data browser beforehand. However, most if not all Eurostat datasets concern European countries and contain information that was gathered at some point in time, so geo and time dimension codes can usually be used.

⁠<DIMENSION_CODE>⁠ and ⁠<VALUE>⁠ are case-insensitive and they can be written in lowercase or uppercase in the query.

Parameters are passed onto the eurostat package functions get_eurostat() and get_eurostat_json() as a list item. If an individual item contains multiple items, as it often can be in the case of geo parameters and other optional items, they must be in the form of a vector: c("FI", "SE"). For examples on how to use these parameters, see function examples below.

Time parameters

time and time_period address the same TIME_PERIOD dimension in the dataset and can be used interchangeably. In the Eurostat documentation it is stated that "Using more than one Time parameter in the same query is not accepted", but practice has shown that actually Eurostat API allows multiple time parameters in the same query. This makes it possible to use R colon operator when writing queries, so time = c(2015:2018) translates to ⁠&time=2015&time=2016&time=2017&time=2018⁠.

The only exception to this is when the queried dataset contains e.g. quarterly data and TIME_PERIOD is saved as 2015-Q1, 2015-Q2 etc. Then it is possible to use time=2015-Q1&time=2015-Q2 style in the query URL, but this makes it unfeasible to use the colon operator and requires a lot of manual typing.

Because of this, it is useful to know about other time parameters as well:

  • untilTimePeriod: return dataset items from the oldest record up until the set time, for example "all data until 2000": untilTimePeriod = 2000

  • sinceTimePeriod: return dataset items starting from set time, for example "all datastarting from 2008": sinceTimePeriod = 2008

  • lastTimePeriod: starting from the most recent time period, how many preceding time periods should be returned? For example 10 most recent observations: lastTimePeriod = 10

Using both untilTimePeriod and sinceTimePeriod parameters in the same query is allowed, making the usage of the R colon operator unnecessary. In the case of quarterly data, using untilTimePeriod and sinceTimePeriod parameters also works, as opposed to the colon operator, so it is generally safer to use them as well.

Other dimensions

In get_eurostat_json() examples nama_10_gdp dataset is filtered with two additional filter parameters:

  • na_item = "B1GQ"

  • unit = "CLV_I10"

Filters like these are most likely unique to the nama_10_gdp dataset (or other datasets within the same domain) and should not be used with others dataset without user discretion. By using label_eurostat() we know that "B1GQ" stands for "Gross domestic product at market prices" and "CLV_I10" means "Chain linked volumes, index 2010=100".

Different dimension codes can be translated to a natural language by using the get_eurostat_dic() function, which returns labels for individual dimension items such as na_item and unit, as opposed to label_eurostat() which does it for whole datasets. For example, the parameter na_item stands for "National accounts indicator (ESA 2010)" and unit stands for "Unit of measure".

Language

All datasets have metadata available in English, French and German. If no parameter is given, the labels are returned in English.

Example:

  • lang = "fr"

More information

For more information about data filtering see Eurostat documentation on API Statistics: https://wikis.ec.europa.eu/display/EUROSTATHELP/API+Statistics+-+data+query#APIStatisticsdataquery-TheparametersdefinedintheRESTrequest

Data source: Eurostat Table of Contents

The Eurostat Table of Contents (TOC) is downloaded from https://ec.europa.eu/eurostat/api/dissemination/catalogue/toc/txt?lang=en (default) or from French or German language variants: https://ec.europa.eu/eurostat/api/dissemination/catalogue/toc/txt?lang=fr https://ec.europa.eu/eurostat/api/dissemination/catalogue/toc/txt?lang=de

See Eurostat documentation on TOC items: https://wikis.ec.europa.eu/display/EUROSTATHELP/API+-+Detailed+guidelines+-+Catalogue+API+-+TOC

Data source: GISCO - General Copyright

"Eurostat's general copyright notice and licence policy is applicable and can be consulted here: https://ec.europa.eu/eurostat/about-us/policies/copyright

Please also be aware of the European Commission's general conditions: https://commission.europa.eu/legal-notice_en

Moreover, there are specific provisions applicable to some of the following datasets available for downloading. The download and usage of these data is subject to their acceptance:

  • Administrative Units / Statistical Units

  • Population distribution / Demography

  • Transport Networks

  • Land Cover

  • Elevation (DEM)"

Of the abovementioned datasets, Administrative Units / Statistical Units is applicable if the user wants to draw maps with borders provided by GISCO / EuroGeographics.

Data source: GISCO - Administrative Units / Statistical Units

The following copyright notice is provided for end user convenience. Please check up-to-date copyright information from the GISCO website: GISCO: Geographical information and maps - Administrative units/statistical units

"In addition to the general copyright and licence policy applicable to the whole Eurostat website, the following specific provisions apply to the datasets you are downloading. The download and usage of these data is subject to the acceptance of the following clauses:

  1. The Commission agrees to grant the non-exclusive and not transferable right to use and process the Eurostat/GISCO geographical data downloaded from this page (the "data").

  2. The permission to use the data is granted on condition that:

    1. the data will not be used for commercial purposes;

    2. the source will be acknowledged. A copyright notice, as specified below, will have to be visible on any printed or electronic publication using the data downloaded from this page."

Copyright notice

When data downloaded from this page is used in any printed or electronic publication, in addition to any other provisions applicable to the whole Eurostat website, data source will have to be acknowledged in the legend of the map and in the introductory page of the publication with the following copyright notice:

EN: © EuroGeographics for the administrative boundaries

FR: © EuroGeographics pour les limites administratives

DE: © EuroGeographics bezüglich der Verwaltungsgrenzen

For publications in languages other than English, French or German, the translation of the copyright notice in the language of the publication shall be used.

If you intend to use the data commercially, please contact EuroGeographics for information regarding their licence agreements."

Eurostat: Copyright notice and free re-use of data

The following copyright notice is provided for end user convenience. Please check up-to-date copyright information from the eurostat website: https://ec.europa.eu/eurostat/about-us/policies/copyright

"(c) European Union, 1995 - today

Eurostat has a policy of encouraging free re-use of its data, both for non-commercial and commercial purposes. All statistical data, metadata, content of web pages or other dissemination tools, official publications and other documents published on its website, with the exceptions listed below, can be reused without any payment or written licence provided that:

  • the source is indicated as Eurostat;

  • when re-use involves modifications to the data or text, this must be stated clearly to the end user of the information."

For exceptions to the abovementioned principles see Eurostat website

Citing Eurostat data

For citing datasets, use get_bibentry() to build a bibliography that is suitable for your reference manager of choice.

When using Eurostat data in other contexts than academic publications that in-text citations or footnotes/endnotes, the following guidelines may be helpful:

  • The origin of the data should always be mentioned as "Source: Eurostat".

  • The online dataset codes(s) should also be provided in order to ensure transparency and facilitate access to the Eurostat data and related methodological information. For example: "Source: Eurostat (online data code: namq_10_gdp)"

  • Online publications (e.g. web pages, PDF) should include a clickable link to the dataset using the bookmark functionality available in the Eurostat data browser.

It should be avoided to associate different entities (e.g. Eurostat, National Statistical Offices, other data providers) to the same dataset or indicator without specifying the role of each of them in the treatment of data.

See also section "Eurostat: Copyright notice and free re-use of data" in get_eurostat() documentation.

Strategies for handling large datasets more efficiently

Most Eurostat datasets are relatively manageable, at least on a machine with 16 GB of RAM. The largest dataset in Eurostat database, at the time of writing this, had 148362539 (148 million) values, which results in an object with 148 million rows in tidy data (long) format. The test machine with 16 GB of RAM was able to handle the second largest dataset in the database with 91 million values (rows).

There are still some methods to make data fetching functions perform faster:

  • turn caching off: get_eurostat(cache = FALSE)

  • turn cache compression off (may result in rather large cache files!): get_eurostat(compress_file = FALSE)

  • if you want faster caching with manageable file sizes, use stringsAsFactors: get_eurostat(cache = TRUE, compress_file = TRUE, stringsAsFactors = TRUE)

  • Use faster data.table functions: get_eurostat(use.data.table = TRUE)

  • Keep column processing to a minimum: get_eurostat(time_format = "raw", type = "code") etc.

  • Read get_eurostat() function documentation carefully so you understand what different arguments do

  • Filter the dataset so that you fetch only the parts you need!

regions functions

For working with sub-national statistics the basic functions of the regions package are imported https://regions.dataobservatory.eu/.

Author(s)

Leo Lahti, Janne Huovari, Markus Kainu, Przemyslaw Biecek

References

See citation("eurostat"):

Kindly cite the eurostat R package as follows:

  Lahti L., Huovari J., Kainu M., and Biecek P. (2017). Retrieval and
  analysis of Eurostat open data with the eurostat package. The R
  Journal 9(1), pp. 385-392. doi: 10.32614/RJ-2017-019

A BibTeX entry for LaTeX users is

  @Article{10.32614/RJ-2017-019,
    title = {Retrieval and Analysis of Eurostat Open Data with the eurostat Package},
    author = {Leo Lahti and Janne Huovari and Markus Kainu and Przemyslaw Biecek},
    journal = {The R Journal},
    volume = {9},
    number = {1},
    pages = {385--392},
    year = {2017},
    doi = {10.32614/RJ-2017-019},
    url = {https://doi.org/10.32614/RJ-2017-019},
  }

  Lahti, L., Huovari J., Kainu M., Biecek P., Hernangomez D., Antal D.,
  and Kantanen P. (2023). eurostat: Tools for Eurostat Open Data
  [Computer software]. R package version 4.0.0.
  https://github.com/rOpenGov/eurostat

A BibTeX entry for LaTeX users is

  @Misc{eurostat,
    title = {eurostat: Tools for Eurostat Open Data},
    author = {Leo Lahti and Janne Huovari and Markus Kainu and Przemyslaw Biecek and Diego Hernangomez and Daniel Antal and Pyry Kantanen},
    url = {https://github.com/rOpenGov/eurostat},
    type = {Computer software},
    year = {2023},
    note = {R package version 4.0.0},
  }

When citing data downloaded from Eurostat, see section "Citing Eurostat data" in get_eurostat() documentation.

See Also

help("regions"), https://regions.dataobservatory.eu/

Examples

library(eurostat)

ropengov/eurostat documentation built on Jan. 19, 2024, 8:10 p.m.