knitr::opts_chunk$set(
  collapse = TRUE, eval=FALSE,
  comment = "#>"
)

In this example, we will curate Eurostat indicators. All indicators are national indicators; we do not perform geographical recoding with regional indicators.

library(indicators)
library(eurostat)
library(dplyr)

Curation Process

Eurostat publishes data folders. We select only the indicators that are directly relevant to our observatory, in this case, the music observatory.

# Cloud computing use by individuals [isoc_cicci_use]
isoc_cicci_use <- eurostat::get_eurostat(id = "isoc_cicci_use")

# reuse the downloaded product instead of downloading it a second time
cloud_personal_raw <- isoc_cicci_use %>%
  filter( .data$ind_type %in% c("STUD", "Y16_24", "IND_TOTAL"),
          .data$indic_is %in% c("I_CC", "I_CCS_CC", "I_CC_PAY"),
          .data$unit == "PC_IND" )

# Frequency of practicing of artistic activities by sex, age and educational attainment level [ilc_scp07]
ilc_scp07 <- eurostat::get_eurostat(id = "ilc_scp07")

artistic_activity_raw <- ilc_scp07 %>%
  filter( .data$isced11 == "TOTAL",
          .data$age %in% c("Y_GE16", "Y16-24") )

# Final consumption expenditure of households by consumption purpose (COICOP 3 digit) [nama_10_co3_p3]
nama_10_co3_p3 <- eurostat::get_eurostat(id = "nama_10_co3_p3")

household_consumption_raw <- nama_10_co3_p3 %>%
  filter( .data$unit == "PC_TOT",
          .data$coicop %in% c("CP09", "CP091") )

The preselection is important, because the imputation code contains many inefficiencies and runs many unit tests, which makes it resource-intensive.

Tidy Indicators {#tidy-indicators}

The Eurostat data warehouse contains data in various formats. They are tidy (all observations are in rows and all variables are in columns), but the columns vary from product to product. To place them into our observatory database, we create a canonical form of the indicator with get_eurostat_indicator().
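To see how the column sets vary, you can compare the raw tables downloaded above (a quick check, not part of the curation pipeline):

# the raw products are tidy, but each has its own dimension columns
names(cloud_personal_raw)
names(artistic_activity_raw)
names(household_consumption_raw)

# dimensions present in the cloud computing product but not in the
# household consumption product
setdiff(names(cloud_personal_raw), names(household_consumption_raw))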

A similar function must be written for each major open data source.

The id, for example id = "isoc_cicci_use", is used to identify the relevant Eurostat dictionary to describe and label the data. If preselected_indicators = NULL, the function will try to download the entire id = "isoc_cicci_use" product from Eurostat. Because some Eurostat folders are very large, it is unlikely that all functions will work on entire folders, so some curation is necessary, which we did above.

cloud_indicators <- get_eurostat_indicator(
  preselected_indicators = cloud_personal_raw,
  id = "isoc_cicci_use")

artistic_activity_indicators <- get_eurostat_indicator(
  preselected_indicators = artistic_activity_raw,
  id = "ilc_scp07")

household_consumption_indicators <- get_eurostat_indicator(
  preselected_indicators = household_consumption_raw,
  id = "nama_10_co3_p3")

The function returns a list with four tables.
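You can verify the component names of the returned list:

# the four tables returned by get_eurostat_indicator()
names(artistic_activity_indicators)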

Indicator values

artistic_activity_indicators$indicator %>%
  head() %>%
  select( .data$geo, .data$time, .data$value, .data$estimate, .data$shortcode )

Indicator labelling

The following labels were found in the curated data (we do not reproduce the entire Eurostat dictionary, only the items that we actually use).

artistic_activity_indicators$labelling

Indicator metadata

The metadata table contains much important information about the indicator. Some of this information will later be used to identify which indicators need to be refreshed from the source and reprocessed.

artistic_activity_indicators$metadata %>%
  filter( .data$shortcode == unique(.data$shortcode)[1] ) %>%
  select( all_of(c("shortcode", "actual", "missing",
                   "data_start", "last_update_at_source"))
  )

Imputation

Currently we have some imputation functions that we apply to our example indicators.

This is a very resource-intensive step and may take a long time. The steps are, in order (a toy illustration follows the list):

  1. Approximate missing values within the time series with na_approx();
  2. Fill missing values at the start of the series by next observation carried backward with na_nocb();
  3. Forecast the time series ahead with indicator_forecast();
  4. If the forecast did not work, fall back to last observation carried forward with na_locf().

Backcasting is not yet implemented.
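To illustrate the logic of these steps on a toy numeric vector, here is a sketch using the zoo package; it only mirrors the general idea, not the implementation of the functions listed above.

x <- c(NA, NA, 4, 5, NA, 7, 8, NA)

# 1. linear approximation fills only the interior gap
zoo::na.approx(x, na.rm = FALSE)

# 2. next observation carried backward fills the leading NAs
zoo::na.locf(zoo::na.approx(x, na.rm = FALSE), fromLast = TRUE, na.rm = FALSE)

# 3.-4. a forecast, or as a fallback last observation carried forward,
#       fills the trailing NA
zoo::na.locf(zoo::na.approx(x, na.rm = FALSE), na.rm = FALSE)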

# We will estimate the missing values with various imputation methods.
indicators_to_impute <- cloud_indicators$indicator %>%
    bind_rows ( artistic_activity_indicators$indicator ) %>%
    bind_rows ( household_consumption_indicators$indicator )

# We will update the estimation columns after imputation.
dmo_metadata_to_update <- cloud_indicators$metadata %>%
    bind_rows ( artistic_activity_indicators$metadata )  %>%
    bind_rows ( household_consumption_indicators$metadata )

# We need the labels, too, but we won't do anything with them.
dmo_labelling_table <- cloud_indicators$labelling %>%
    bind_rows ( artistic_activity_indicators$labelling )  %>%
    bind_rows ( household_consumption_indicators$labelling )
set.seed(2001)
dmo_labelling_table %>% 
  sample_n(5)

Then we carry out the imputation. If you have many indicators, it may be safer to process them subgroup by subgroup. Later, when we re-process only what has changed at the source anyway, this may no longer be a problem.

dmo_imputed_indicator <- impute_indicators ( 
  indic = indicators_to_impute
  )

The full description of the indicator table, together with an example, can be found in the small_population_indicator dataset.
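Assuming the dataset is exported by the indicators package loaded above, you can inspect it directly:

# the documented example of the canonical indicator table
str(small_population_indicator)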

Update the metadata

This is not yet unit-tested and does not seem to work properly. Basically, we count the number of approximated, next-observation-carried-backward (nocb), forecasted, and last-observation-carried-forward (locf) estimates.

dmo_updated_metadata <- update_metadata(
  dmo_imputed_indicator, 
  metadata = dmo_metadata_to_update )
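As a manual cross-check of these counts, you can tabulate the estimate column of the imputed indicator table (assuming, as the indicator table above suggests, that this column records how each value was obtained):

# number of actual versus imputed values per indicator
dmo_imputed_indicator %>%
  count( .data$shortcode, .data$estimate )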

The full description of the metadata table, together with an example, can be found in the small_population_metadata dataset.

They should show up here, but it is suspicious that some have gone missing:

set.seed(2021)
dmo_updated_metadata %>%
  sample_n(12) %>%
  select( all_of(c("shortcode", "actual",
                   "missing", "nocb", "locf",
                   "approximate", "forecast"))
  )

Descriptive metadata

artistic_activity_indicators$description

Map to Observatory

At last, we create keywords. The keywords help us place the indicators and their metadata in the long-form documentation. At least four keywords must be used. The first keyword, "music", identifies the music observatory; the second, "economy", the Music economy pillar; "demand" is the top-level division within the pillar, and "pcr" is the second-level division. Any further keywords, if they exist, are added as a concatenated list to the keyword table, separated by __.
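For example, two hypothetical further keywords would be stored as a single string with the __ separator:

# hypothetical extra keywords, concatenated with "__"
paste(c("streaming", "subscription"), collapse = "__")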

dmo_keywords <- add_keywords (
  description_table = artistic_activity_indicators$description,
  keywords = list( "music", "economy", "supply", "potential_supply"), 
  description = NULL # other functionality not yet implemented
  ) %>%
  bind_rows ( 
    add_keywords (
      cloud_indicators$description, 
      list( "music", "economy", "demand", "pcr")) 
    ) %>%
  bind_rows ( add_keywords (
    household_consumption_indicators$description,
    list( "music", "economy", "demand", "general")) 
    )

set.seed(2021) #fixed pseudo-random selection
dmo_keywords %>%
  sample_n(12)

The full description of the descriptive metadata table, together with an example, can be found in the small_population_description dataset.

Create the Database

dmo_path <- ifelse ( 
  dir.exists("data-raw"), 
  yes = file.path("data-raw", "dmo.db"), 
  no = file.path("..", "data-raw", "dmo.db"))

create_database ( 
  indicator_table = dmo_imputed_indicator,
  metadata_table = dmo_updated_metadata,
  labelling_table = dmo_labelling_table,
  description_table = dmo_keywords,
  db_path = dmo_path
  )
disc_con <- RSQLite::dbConnect(RSQLite::SQLite(), dmo_path)
DBI::dbListTables(disc_con)
names(DBI::dbReadTable(disc_con, "description"))
names(DBI::dbReadTable(disc_con, "indicator"))
names(DBI::dbReadTable(disc_con, "labelling"))
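To check that the values arrived in the database, you can query a few rows back and close the connection (a minimal sketch; the column names follow the indicator table shown earlier):

# read back a few rows of the indicator table from the SQLite file
DBI::dbGetQuery(
  disc_con,
  "SELECT shortcode, geo, time, value, estimate FROM indicator LIMIT 5")

# close the connection when done
DBI::dbDisconnect(disc_con)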

