knitr::opts_chunk$set( collapse = TRUE, eval=FALSE, comment = "#>" )
In this example, we will curate Eurostat indicators. All indicators are national
indicators, we are not doing geographical recoding with regional indicators.
library(indicators) require(eurostat) require(dplyr)
Eurostat publishes data folders
. We select only the indicators that are directly relevant to our observatory, in this case, the music observatory.
#cloud computer use isoc_cicci_use <- eurostat::get_eurostat(id = "isoc_cicci_use") cloud_personal_raw <- eurostat::get_eurostat(id = "isoc_cicci_use") %>% filter ( .data$ind_type %in% c("STUD", "Y16_24", "IND_TOTAL"), .data$indic_is %in% c("I_CC", "I_CCS_CC", "I_CC_PAY"), .data$unit %in% c("PC_IND")) # Frequency of practicing of artistic activities by sex, age and educational attainment level[ilc_scp07] ilc_scp07 <- eurostat::get_eurostat(id = "ilc_scp07") artistic_activity_raw <- ilc_scp07 %>% filter ( .data$isced11 == "TOTAL", .data$age %in% c("Y_GE16", "Y16-24")) # Final consumption expenditure of households by consumption purpose (COICOP 3 digit)[nama_10_co3_p3] nama_10_co3_p3 <- eurostat::get_eurostat(id = "nama_10_co3_p3") household_consumption_raw <- nama_10_co3_p3 %>% filter (.data$unit == "PC_TOT", .data$coicop %in% c("CP09", "CP091"))
The preselection is important, because the imputation code contains many inefficiencies, and many unit tests. It is resource-intensive.
The eurostat warehouse folder contains data in various formats. They are tidy (all observations are in rows, and all variables are in columns) but the columns vary product to product. To place it into our observatory database, we create a canonical form of the indicator with get_eurostat_indicator()
.
A similar function must be written to all major open data sources.
The id
, for example, id = "isoc_cicci_use"
is used to identify the relevant Eurostat dictionary to describe and label the data. If the preselected_indicators = NULL
then it will try to download the id = "isoc_cicci_use"
product from Eurostat. Because some Eurostat folders are very huge, it is unlikely that all functions will work on entire folders. So some curation is necessary, which we did above.
cloud_indicators <- get_eurostat_indicator( preselected_indicators = cloud_personal_raw, id = "isoc_cicci_use") artistic_activity_indicators <- get_eurostat_indicator ( artistic_activity_raw, id = "ilc_scp07") household_consumption_indicators <- get_eurostat_indicator ( household_consumption_raw, id = "nama_10_co3_p3")
The function returns a list with four tables.
artistic_activity_indicators$indicator %>% head() %>% select ( .data$geo, .data$time, .data$value, .data$estimate, .data$shortcode )
The following labels were found in the curated data (we do not reproduce the entire Eurostat dictionary, only the items that we actually use.)
artistic_activity_indicators$labelling
The metadata table contains many important information about the indicator. Some of this data will be later used to identify which indicators need to be refreshed from source, and reprocessed.
artistic_activity_indicators$metadata %>% filter (.data$shortcode == unique(.data$shortcode[1])) %>% select ( all_of(c("shortcode", "actual", "missing", "data_start", "last_update_at_source")) )
Currently we have some imputation functions that we apply on our example indicators.
This is a very resource intensive step, may take long.
na_approx()
;na_nocb()
;indicator_forecast()
;na_locf()
.The backcasting is not yet implemented.
# We will estimate the missing values with variuos imputation methods. indicators_to_impute <- cloud_indicators$indicator %>% bind_rows ( artistic_activity_indicators$indicator ) %>% bind_rows ( household_consumption_indicators$indicator) # We will updated the estimation columns after imputation. dmo_metadata_to_update <- cloud_indicators$metadata %>% bind_rows ( artistic_activity_indicators$metadata ) %>% bind_rows ( household_consumption_indicators$metadta) # We need the labels, too, but we won't do anything with them. dmo_labelling_table <- cloud_indicators$labelling %>% bind_rows ( artistic_activity_indicators$labelling ) %>% bind_rows ( household_consumption_indicators$labelling )
set.seed(2001) dmo_labelling_table %>% sample_n(5)
And then carry out the imputation. In case you have many indicators, maybe it is safer to do them subgroup by subgroup. Later, when we will anyway re-process what is changed at the source, this may not be a problem.
dmo_imputed_indicator <- impute_indicators ( indic = indicators_to_impute )
The full description of the indicator table with an example is the small_population_indicator dataset.
This is not yet unit-tested, and does not seem to work properly. Basically we count the number of approximated
, next observation carry forward
, forecasted
, and last observation carry forward
estimates.
dmo_updated_metadata <- update_metadata( dmo_imputed_indicator, metadata = dmo_metadata_to_update )
The full description of the metadata table with an example is the small_population_metadata dataset.
They should show up here, but it suspicious that some are gone missing:
set.seed(2021) dmo_updated_metadata %>% sample_n (12) %>% select (all_of (c("shortcode", "actual", "missing", "nocb", "locf", "approximate", "forecast")) )
artistic_activity_indicators$description
At last, we create keywords. The keywords help us placing the indicators and their metadata in the long-form documentation. At least four keywords must be used. The first keyword, "music", identifies the music observatory, the "economy" the Music economy pillar, "Demand" is the first top-level division in the pillar, and "PCR"is the second. Any further keywords, if they exist, are added as a concatenated list to the keyword table, divided by __
.
dmo_keywords <- add_keywords ( description_table = artistic_activity_indicators$description, keywords = list( "music", "economy", "supply", "potential_supply"), description = NULL # other functionality not yet implemented ) %>% bind_rows ( add_keywords ( cloud_indicators$description, list( "music", "economy", "demand", "pcr")) ) %>% bind_rows ( add_keywords ( household_consumption_indicators$description, list( "music", "economy", "demand", "general")) ) set.seed(2021) #fixed pseudo-random selection dmo_keywords %>% sample_n(12)
The full description of the descriptive metadata table with an example is the small_population_description dataset.
dmo_path <- ifelse ( dir.exists("data-raw"), yes = file.path("data-raw", "dmo.db"), no = file.path("..", "data-raw", "dmo.db")) create_database ( indicator_table = dmo_imputed_indicator, metadata_table= dmo_updated_metadata, labelling_table = dmo_labelling_table, description_table = dmo_keywords, db_path = dmo_path )
disc_con <- RSQLite::dbConnect(RSQLite::SQLite(), dmo_path ) DBI::dbListTables(disc_con)
names(DBI::dbReadTable(disc_con, "description"))
names(DBI::dbReadTable(disc_con, "indicator"))
names(DBI::dbReadTable(disc_con, "labelling"))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.