R/get_eurostat.R

#' @title Read Eurostat Data
#' 
#' @description 
#' Download data sets from Eurostat \url{https://ec.europa.eu/eurostat}
#'
#' @param id 
#' A code name for the dataset of interest.
#' See [search_eurostat()] or Details for how to find the code.
#' @param filters 
#' `"none"` (default) to get the whole dataset from the bulk download
#' facility, or a named list of filters to get just part of the table via
#' the JSON API. Names of list objects are Eurostat variable codes and values
#' are vectors of observation codes. If `NULL`, the whole dataset is
#' requested via the JSON API. See Details, and see [get_eurostat_json()]
#' for more on filters and on the limitations per query.
#' @param time_format 
#' a string giving the type of conversion applied to the time
#' column from the Eurostat format. `"date"` (default) converts to
#' a [Date()] with the first date of the period.
#' `"date_last"` converts to a [Date()] with
#' the last date of the period. `"num"` converts to a numeric and `"raw"`
#' does no conversion. See [eurotime2date()] and
#' [eurotime2num()].
#' @param type 
#' A type of variables, "code" (default) or "label".
#' @param select_time 
#' a character vector of time frequency codes, or `NULL` (the default),
#' which suits most datasets as they have just one time
#' frequency. For datasets with multiple time
#' frequencies, select one or more of the desired frequencies with:
#' "Y" (or "A") = annual, "S" = semi-annual / semester, "Q" = quarterly, 
#' "M" = monthly, "W" = weekly. To get all frequencies in the same data 
#' frame, use `time_format = "raw"`.
#' @param cache 
#' a logical whether to do caching. Default is `TRUE`. Affects
#' only queries from the bulk download facility.
#' @param update_cache 
#' a logical whether to update the cache. Can also be set with
#' `options(eurostat_update = TRUE)`.
#' @param cache_dir 
#' a path to a cache directory. A directory given here must already exist.
#' `NULL` (default) creates and uses an
#' 'eurostat' directory in the temporary directory from
#' [tempdir()]. The directory can also be set with
#' [set_eurostat_cache_dir()].
#' @param compress_file 
#' a logical whether to compress the
#' RDS-file in caching. Default is `TRUE`.
#' @param stringsAsFactors 
#' if `FALSE` (the default) the variables are
#' returned as characters. If `TRUE` the variables are converted to 
#' factors in original Eurostat order.
#' @param keepFlags 
#' a logical whether the flags (e.g. "confidential",
#' "provisional") should be kept in a separate column or
#' removed. Default is `FALSE`. For flag values see:
#' <https://ec.europa.eu/eurostat/data/database/information>.
#' A possible non-real zero, "0n", is also indicated in the flags column.
#' Flags are not available from the Eurostat API, so `keepFlags`
#' cannot be used together with `filters`.
#' @param legacy_bulk_download 
#' a logical, whether to use the old Bulk Download facilities (`TRUE`, the
#' default) or the new dissemination API, which serves TSV files (`FALSE`).
#' This is a temporary parameter that will be removed 
#' after the old Bulk Download facilities are decommissioned. Please
#' use caution if you intend to build any automated scripts that use this
#' parameter.
#' @inheritDotParams get_eurostat_json
#' @export
#' @references
#' See `citation("eurostat")`:
#'
#' ```{r, echo=FALSE, comment="#" }
#' citation("eurostat")
#' ```
#' 
#' When citing data, please indicate that the data source is Eurostat. If the
#' re-use of data involves modification to the data or text, state this clearly.
#' For more detailed information and exceptions regarding commercial use,
#' see [Eurostat policy on copyright and free re-use of data](https://ec.europa.eu/eurostat/web/main/about/policies/copyright).
#'
#' @author Przemyslaw Biecek, Leo Lahti, Janne Huovari and Markus Kainu
#' @details Data sets are downloaded from
#' [the Eurostat bulk download facility](https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing) or from The Eurostat Web Services
#' [JSON API](https://ec.europa.eu/eurostat/web/main/data/web-services).
#' If only the table `id` is given, the whole table is downloaded from the
#' bulk download facility. If `filters` are also defined, the JSON API is
#' used.
#'
#' The bulk download facility is the fastest method to download whole datasets.
#' It is also often the only way, as the JSON API has a limit of at most
#' 50 sub-indicators at a time and whole datasets usually exceed that. Also,
#' it seems that multi-frequency datasets can only be retrieved via the
#' bulk download facility, and `select_time` is not available for the
#' JSON API method.
#'
#' If your connection is through a proxy, you probably have to set proxy
#' parameters to use the JSON API, see [get_eurostat_json()].
#'
#' By default datasets from the bulk download facility are cached as they are
#' often rather large. Caching is not (currently) possible for datasets from
#' the JSON API.
#' Cache files are stored in a temporary directory by default or in
#' a named directory (see [set_eurostat_cache_dir()]).
#' The cache can be emptied with [clean_eurostat_cache()].
#'
#' The `id`, a code, for the dataset can be searched with
#' [search_eurostat()] or looked up in the Eurostat database
#' <https://ec.europa.eu/eurostat/data/database>. The Eurostat
#' database gives the code in parentheses after every dataset in the
#' Data Navigation Tree.
#' @return 
#' a tibble. 
#' 
#' One column for each dimension in the data, the time column for the time 
#' dimension and the values column for numerical values. Eurostat data does 
#' not include all missing values, and the treatment of missing values depends 
#' on the source. In the bulk download facility, missing values are dropped if 
#' all dimensions are missing at a particular time. In the JSON API, missing 
#' values are dropped only if all dimensions are missing at all times. The data 
#' from the bulk download facility can be completed, for example, with 
#' [tidyr::complete()].
#' @seealso [search_eurostat()], [label_eurostat()]
#' @examplesIf check_access_to_data()
#' \dontrun{
#' k <- get_eurostat("nama_10_lp_ulc")
#' k <- get_eurostat("nama_10_lp_ulc", time_format = "num")
#' k <- get_eurostat("nama_10_lp_ulc", update_cache = TRUE)
#'
#' k <- get_eurostat("nama_10_lp_ulc",
#'   cache_dir = file.path(tempdir(), "r_cache")
#' )
#' options(eurostat_update = TRUE)
#' k <- get_eurostat("nama_10_lp_ulc")
#' options(eurostat_update = FALSE)
#'
#' set_eurostat_cache_dir(file.path(tempdir(), "r_cache2"))
#' k <- get_eurostat("nama_10_lp_ulc")
#' k <- get_eurostat("nama_10_lp_ulc", cache = FALSE)
#' k <- get_eurostat("avia_gonc", select_time = "Y", cache = FALSE)
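#'
#' # Sketch: how a cache file name is assembled from the arguments, following
#' # the paste0() call in the function body. The argument values below are
#' # examples only; nothing is downloaded or written.
#' id <- "nama_10_lp_ulc"
#' time_format <- "date"
#' type <- "code"
#' select_time <- NULL # zero-length, so it adds nothing to the name
#' cache_name <- paste0(
#'   id, "_", time_format,
#'   "_", type, select_time, "_",
#'   strtrim(FALSE, 1), # stringsAsFactors, shortened to its first letter
#'   strtrim(FALSE, 1), # keepFlags, shortened to its first letter
#'   ".rds"
#' )
#' cache_name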
#'
#' dd <- get_eurostat("nama_10_gdp",
#'   filters = list(
#'     geo = "FI",
#'     na_item = "B1GQ",
#'     unit = "CLV_I10"
#'   )
#' )
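#'
#' # Sketch: which download route a given `filters` value selects, following
#' # the branching in the function body. `backend_for()` is a hypothetical
#' # helper written for this example; it is not part of the package.
#' backend_for <- function(filters = "none") {
#'   if (is.null(filters) || is.list(filters)) {
#'     "JSON API"
#'   } else if (identical(filters, "none")) {
#'     "bulk download"
#'   } else {
#'     stop("`filters` must be 'none', NULL or a named list")
#'   }
#' }
#' backend_for()
#' backend_for(list(geo = "FI"))
#' backend_for(NULL)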
#' 
#' # A dataset with multiple time series in one
#' dd2 <- get_eurostat("AVIA_GOR_ME",
#'   select_time = c("A", "M", "Q"),
#'   time_format = "date_last",
#'   legacy_bulk_download = FALSE
#' )
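#'
#' # Sketch: making dropped missing observations explicit again with
#' # tidyr::complete(). The toy tibble stands in for a bulk download; its
#' # column names follow the return value described above and its numbers
#' # are made up.
#' toy <- tibble::tibble(
#'   geo = c("FI", "FI", "SE"),
#'   time = c(2020, 2021, 2020),
#'   values = c(1.5, 1.7, 2.1)
#' )
#' # Adds the missing SE / 2021 combination with values = NA
#' toy_full <- tidyr::complete(toy, geo, time)
#' toy_full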
#' }
#'
get_eurostat <- function(id, 
                         time_format = "date", 
                         filters = "none",
                         type = "code",
                         select_time = NULL,
                         cache = TRUE, 
                         update_cache = FALSE, 
                         cache_dir = NULL,
                         compress_file = TRUE,
                         stringsAsFactors = FALSE,
                         keepFlags = FALSE,
                         legacy_bulk_download = TRUE,
                         ...) {
  
  # Check if you have access to ec.europa.eu.
  if (!check_access_to_data()) {
    message("You have no access to ec.europa.eu.
      Please check your connection and/or review your proxy settings")
  } else {
    # Warning for flags with filter: anything other than "none" (a list or
    # NULL) goes through the JSON API, which has no flags
    if (keepFlags && !identical(filters, "none")) {
      warning("The keepFlags argument of the get_eurostat function
               can be used only without filters. No Flags returned.")
    }
    
    # No cache for json: only the bulk download (filters = "none") is cached
    if (is.null(filters) || !identical(filters, "none")) {
      cache <- FALSE
    }
    
    if (cache) {
      
      # check option for update
      update_cache <- update_cache || getOption("eurostat_update", FALSE)
      
      # get cache directory
      cache_dir <- eur_helper_cachedir(cache_dir)
      
      # cache filename
      cache_file <- file.path(
        cache_dir,
        paste0(
          id, "_", time_format,
          # collapse so that several frequencies still give one file name
          "_", type, paste0(select_time, collapse = ""), "_",
          strtrim(stringsAsFactors, 1),
          strtrim(keepFlags, 1),
          ".rds"
        )
      )
    }
    
    # if cache = FALSE or update or new: download else read from cache
    if (!cache || update_cache || !file.exists(cache_file)) {
      if (is.null(filters) || is.list(filters)) {
        
        # JSON API Download
        y <- get_eurostat_json(id, filters,
                               type = type,
                               stringsAsFactors = stringsAsFactors, ...
        )
        y$time <- convert_time_col(factor(y$time), time_format = time_format)
        
        # Bulk download
      } else if (filters == "none") {
        
        if (legacy_bulk_download == TRUE) {
          # Download from old bulk download facilities 
          # with old get_eurostat_raw function
          # This if-else construct is temporary until the old Bulk Download is
          # removed from use by Eurostat
          y_raw <- try(get_eurostat_raw(id))
          
          if ("try-error" %in% class(y_raw)) {
            stop(paste("get_eurostat_raw fails with the id", id, "\n"))
          }
          
          # If download from old bulk download facilities was successful
          # Then tidy the dataset with old tidy_eurostat function
          y <- tidy_eurostat(y_raw, 
                             time_format, 
                             select_time,
                             stringsAsFactors = stringsAsFactors,
                             keepFlags = keepFlags
          )
        } else {
          message("Trying to download from the new dissemination API... \n")
          # Download from new dissemination API in TSV file format
          y_raw <- try(get_eurostat_raw2(id))
          if ("try-error" %in% class(y_raw)) {
            stop(paste("get_eurostat_raw2 fails with the id", id))
          }
          # If download from new dissemination API is successful
          # Then tidy the dataset with the new tidy_eurostat2 function
          y <- tidy_eurostat2(y_raw, 
                              time_format, 
                              select_time,
                              stringsAsFactors = stringsAsFactors,
                              keepFlags = keepFlags
          )
        }
        
        if (type == "code") {
          y <- y
        } else if (type == "label" && legacy_bulk_download == TRUE) {
          y <- label_eurostat(y)
        } else if (type == "label" && legacy_bulk_download == FALSE){
          y <- label_eurostat2(y)
        } else if (type == "both") {
          stop("type = \"both\" can be only used with JSON API. Set filters argument")
        } else {
          stop("Invalid type.")
        }
      }
    } else {
      cf <- path.expand(cache_file)
      message(paste("Reading cache file", cf))
      y <- readRDS(cache_file)
      message("Table ", id, " read from cache file: ", cf)
    }
    
    # if update or new: save
    if (cache && (update_cache || !file.exists(cache_file))) {
      saveRDS(y, file = cache_file, compress = compress_file)
      message("Table ", id, " cached at ", path.expand(cache_file))
    }
    
    y
  }
}

eurostat documentation built on March 7, 2023, 5:39 p.m.