library(rtauargus) library(dplyr)
Before using the functions that produce safe tables, it is necessary to analyze the tables that need to be protected. Indeed, the disseminated tables differ from the tables to be protected. This is explained in detail in Chapter 3 of the Manuel de protection des tableaux. This phase of analyzing common margins between different tables can be complex. Therefore, functions have been developed in rtauargus
to facilitate this work.
Metadata must adhere to a specific format. Each line represents a table, and for each table, the following must be specified:
table_name
: the name of the table ;field
: the field (often, geographical area, studied population, year of data collection) ;hrc_field
: whether there is a hierarchical link between this field and another field. If there is no hierarchical link, this field should be NA
. If a hierarchical link exists, a name must be given to this link and the same name must be written in the other tables ;indicator
: the table's indicator name. This is the name of the variable that is counted in each cell of the table. For example, if my table presents the turnover of companies, my variable will likely be "Turnover". It is best to use the variable name as described in the data;hrc_indicator
: whether there is a hierarchical link between this indicator and another indicator. If there is no hierarchical link, this field should be NA. If a hierarchical link exists, a name must be given to this link and the same name must be written in the other tables;spanning_1
: the name of the first cross-tabulation (spanning) variable. Additional columns are added for each cross-tabulation variable, following the naming convention: spanning_1
, spanning_2
, spanning_3
, etc ;hrc_spanning_1
: whether there is a hierarchical link between this cross-tabulation variable and another cross-tabulation variable. If there is no hierarchical link, this field should be NA. If a hierarchical link exists, a name must be given to this link and the same name must be written in the other tables. Additional columns are added for each cross-tabulation variable, following the naming convention: hrc_spanning_1
, hrc_spanning_2
, hrc_spanning_3
, etc. Note: there must be as many spanning_X
columns as hrc_spanning_X
columns.When the tables to be processed are Eurostat tables, the metadata file can be automatically created from the template of the published cells. Indeed, for Eurostat publications, National Statistical Institutes (NSIs) must provide a template that groups the published cells.
To do this, it is essential to understand each column of the template. Determine for each one whether it is the response variable/indicator, a cross-tabulation variable, or a field, in order to specify this in the arguments of the function \ref{sec:args_template}. It is particularly important to verify that the template adheres to the following: one cross-tabulation variable = one column in the template, and thus one supertotal. Indeed, if a column groups multiple cross-tabulation variables, then as many columns as cross-tabulation variables must be created.
analyse_metadata
args(analyse_metadata)
## function (df_metadata, verbose = FALSE) ## NULL
The function arguments are as follows:
df_metadata
: a dataframe structured as specified in section \ref{sec:crea_meta}.verbose
: a variable of type logical
. If TRUE
he function returns a list with the different steps of the analysis; if FALSE
, it returns only the final dataframe with the cluster indicator.format_template
\label{sec:args_template}args(format_template)
## function (data, indicator_column, spanning_var_tot, field_columns) ## NULL
Les arguments de la fonction sont les suivants :
data
: dataframe du template Eurostat contenant toutes les cellules publiées.indicator_column
: nom de la colonne dans laquelle se trouvent les indicateurs.spanning_var_tot
: liste nommée des variables de croisements et de leurs totaux.field_columns
: vecteur de toutes les colonnes représentant des champs (ex : millésime).In this example, we start with a metadata file presenting 12 tables to be published on the turnover of pizza and salad sales.
str(metadata_pizza_lettuce)
## 'data.frame': 12 obs. of 9 variables: ## $ table_name : chr "T1" "T2" "T3" "T4" ... ## $ field : chr "france_entreprises_2023" "france_entreprises_2023" "france_entreprises_2023" "france_entreprises_2023" ... ## $ hrc_field : logi NA NA NA NA NA NA ... ## $ indicator : chr "to_pizza" "to_pizza" "to_pizza" "to_pizza" ... ## $ hrc_indicator : chr NA NA NA NA ... ## $ spanning_1 : chr "nuts2" "nuts3" "a10" "a10" ... ## $ hrc_spanning_1: chr "hrc_nuts" "hrc_nuts" "hrc_naf" "hrc_naf" ... ## $ spanning_2 : chr "size" "size" "nuts2" "nuts3" ... ## $ hrc_spanning_2: chr NA NA "hrc_nuts" "hrc_nuts" ...
Example code:
library(rtauargus) data(metadata_pizza_lettuce) # Analyse complète, avec les étapes detailed_analysis <- analyse_metadata(metadata_pizza_lettuce, verbose = TRUE) # Output simplifié, uniquement le dataframe avec l'indicatrice de cluster cluster_id_dataframe <- analyse_metadata(metadata_pizza_lettuce, verbose = FALSE)
The output is a dataframe showing how to process the tables for applying the secret.
cluster_id_dataframe
## # A tibble: 4 x 10 ## # Groups: table_name [4] ## cluster table_name field indicator spanning_1 spanning_2 spanning_3 hrc_spanning_1 hrc_spanning_2 hrc_spanning_3 ## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> ## 1 france_entrep~ T10.T12.T8 fran~ LETTUCE HRC_NAF cj HRC_LETTU~ hrc_naf <NA> hrc_lettuce ## 2 france_entrep~ T11.T7.T9 fran~ LETTUCE HRC_NAF size HRC_LETTU~ hrc_naf <NA> hrc_lettuce ## 3 france_entrep~ T1.T2 fran~ to_pizza HRC_NUTS size <NA> hrc_nuts <NA> <NA> ## 4 france_entrep~ T3.T4.T5.~ fran~ to_pizza HRC_NAF HRC_NUTS <NA> hrc_naf hrc_nuts <NA>
For the 12 tables to be published, it is sufficient to protect 4 tables. These tables are distributed across two different clusters. Therefore, tab_multi_manager()
needs to be called twice.
A new spanning variable appeared HRC_LETTUCE^h
. This variable takes 3 values batavia
, arugula
and lettuce
the margin. The indicator LETTUCE
here refers to the turnover (to) of salad sales. For example, the table T10.T12.T8
presents the revenue from salad sales based on the activity and category of the company, as well as the type of salad.
To obtain more information about the metadata analysis, the verbose = TRUE
argument can be specified. This way, the object returned by the function will be a list of the different steps of the analysis.
names(detailed_analysis)
## [1] "identify_hrc" "info_var" "split_in_clusters" "create_edges" "grp_tab_names" ## [6] "grp_tab_in_clusters" "tab_to_treat" "df_tab_to_treat"
One finds the dataframe with the cluster indicator df_tab_to_treat
. The result is the same but in list format: each element of the list is an independent cluster tab_to_treat
. Additionally, the 6 steps of the analysis are included.
The format_template()
function allows creating the metadata to be used as input for analyse_metadata()
from the Eurostat template of the published cells.
data(enterprise_template) str(enterprise_template)
## 'data.frame': 3168 obs. of 5 variables: ## $ TIME_PERIOD: int 2022 2022 2022 2022 2022 2022 2022 2022 2022 2022 ... ## $ INDICATOR : chr "SAL" "SAL" "SAL" "SAL" ... ## $ ACTIVITY : chr "B" "B" "B" "B" ... ## $ NUMBER_EMPL: chr "E0" "E1T4" "E5T9" "EGE10" ... ## $ LEGAL_FORM : chr "_T" "_T" "_T" "_T" ...
template_formatted <- format_template( data = enterprise_template, indicator_column = "INDICATOR", spanning_var_tot = list( ACTIVITY = "BTSXO_S94", NUMBER_EMPL = "_T", LEGAL_FORM = "_T"), field_columns = c("TIME_PERIOD") ) template_formatted$metadata
## table_name field indicator spanning_1 spanning_2 hrc_spanning_1 hrc_spanning_2 ## 1 table_2021_SAL_DTH_1 2021 SAL_DTH ACTIVITY LEGAL_FORM hrc_activity_131 hrc_legal_form_3 ## 2 table_2021_SAL_DTH_2 2021 SAL_DTH ACTIVITY NUMBER_EMPL hrc_activity_131 hrc_number_empl_4 ## 3 table_2022_SAL_1 2022 SAL ACTIVITY LEGAL_FORM hrc_activity_131 hrc_legal_form_3 ## 4 table_2022_SAL_2 2022 SAL ACTIVITY NUMBER_EMPL hrc_activity_131 hrc_number_empl_4 ## 5 table_2022_SAL_DTH_1 2022 SAL_DTH ACTIVITY LEGAL_FORM hrc_activity_131 hrc_legal_form_3 ## 6 table_2022_SAL_DTH_2 2022 SAL_DTH ACTIVITY NUMBER_EMPL hrc_activity_131 hrc_number_empl_4
There are 6 tables to be published.
Remark: format_template()
does not construct the hrc_indicator
column because the information contained in the template does not allow it to be determined. Therefore, one must create it based on the information available about the indicators.
Here, there is no hierarchical link between SAL
(employees of active companies) and SAL_DTH
(employees in deaths).
# cas où il n'y a aucune hiérarchie sur les indicateurs metadata_template <- template_formatted$metadata %>% mutate(hrc_indicator = NA) %>% select(table_name,field,indicator,hrc_indicator, everything()) metadata_template
## table_name field indicator hrc_indicator spanning_1 spanning_2 hrc_spanning_1 hrc_spanning_2 ## 1 table_2021_SAL_DTH_1 2021 SAL_DTH NA ACTIVITY LEGAL_FORM hrc_activity_131 hrc_legal_form_3 ## 2 table_2021_SAL_DTH_2 2021 SAL_DTH NA ACTIVITY NUMBER_EMPL hrc_activity_131 hrc_number_empl_4 ## 3 table_2022_SAL_1 2022 SAL NA ACTIVITY LEGAL_FORM hrc_activity_131 hrc_legal_form_3 ## 4 table_2022_SAL_2 2022 SAL NA ACTIVITY NUMBER_EMPL hrc_activity_131 hrc_number_empl_4 ## 5 table_2022_SAL_DTH_1 2022 SAL_DTH NA ACTIVITY LEGAL_FORM hrc_activity_131 hrc_legal_form_3 ## 6 table_2022_SAL_DTH_2 2022 SAL_DTH NA ACTIVITY NUMBER_EMPL hrc_activity_131 hrc_number_empl_4
Next, this dataframe is used as input for the analysis function.
# Analyse complète, avec les étapes detailed_analysis <- analyse_metadata(metadata_template, verbose = TRUE) # Output simplifié, uniquement le dataframe avec l'indicatrice de cluster cluster_id_dataframe <- analyse_metadata(metadata_template, verbose = FALSE) # visualisation du résultat de l'analyse cluster_id_dataframe
## # A tibble: 6 x 8 ## # Groups: table_name [6] ## cluster table_name field indicator spanning_1 spanning_2 hrc_spanning_1 hrc_spanning_2 ## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> ## 1 2021.SAL_DTH table_2021_SAL_DTH_1 2021 SAL_DTH HRC_ACTIVITY_131 HRC_LEGAL_FORM_3 hrc_activity_131 hrc_legal_form_3 ## 2 2021.SAL_DTH table_2021_SAL_DTH_2 2021 SAL_DTH HRC_ACTIVITY_131 HRC_NUMBER_EMPL_4 hrc_activity_131 hrc_number_empl_4 ## 3 2022.SAL table_2022_SAL_1 2022 SAL HRC_ACTIVITY_131 HRC_LEGAL_FORM_3 hrc_activity_131 hrc_legal_form_3 ## 4 2022.SAL table_2022_SAL_2 2022 SAL HRC_ACTIVITY_131 HRC_NUMBER_EMPL_4 hrc_activity_131 hrc_number_empl_4 ## 5 2022.SAL_DTH table_2022_SAL_DTH_1 2022 SAL_DTH HRC_ACTIVITY_131 HRC_LEGAL_FORM_3 hrc_activity_131 hrc_legal_form_3 ## 6 2022.SAL_DTH table_2022_SAL_DTH_2 2022 SAL_DTH HRC_ACTIVITY_131 HRC_NUMBER_EMPL_4 hrc_activity_131 hrc_number_empl_4
Ultimately, there are 6 tables to process in 3 different clusters. In other words, tab_multi_manager()
will need to be called three times.
The create_edges step in the metadata analysis identifies tables included within other tables. For example, XXXXX is included in XXXXX. The following code allows visualizing these inclusions using graphs to better understand the analysis procedure.
library(rtauargus) library(igraph) library(visNetwork) graph_links_tab <- function(list_desc_links){ list_desc_links %>% purrr::imap(function(ss_dem,i){ if(!is.null(ss_dem)){ nodes <- data.frame(id = unique(unlist(ss_dem))) visNetwork(nodes = nodes, edges = ss_dem, main = i) %>% visIgraphLayout() %>% visEdges(shadow = TRUE, arrows =list(to = list(enabled = TRUE, scaleFactor = 0.5)), color = list(color = "steelblue", highlight = "red")) %>% visOptions(highlightNearest = list(enabled = T, hover = T), nodesIdSelection = T) } }) } data(metadata_pizza_lettuce) detailed_analysis <- analyse_metadata(metadata_pizza_lettuce, verbose = TRUE) graph_links_tab(detailed_analysis$create_edges)
## $france_entreprises_2023.hrc_lettuce ## ## $france_entreprises_2023.to_pizza
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.