In antaldaniel/eurobarometer: Tidy Eurobarometer Survey Microdata For Retrospective Harmonization

knitr::opts_chunk$set(
  message = FALSE,
  warning = FALSE, 
  collapse = TRUE,
  comment = "#>"
)
options(width = 160)

library(eurobarometer)
library(dplyr)
library(tibble)
library(knitr)
library(kableExtra)
# The examples of this vignette can be found run with  
# source(
#  file.path("not_included", "vignette_vocabulary_examples.R")
#  )

Procedure

The procedure involves three steps:

Concept definition: Trust in institutions: general trust without speficying the concrete domains of activity where the trust applies (these specific questions would require special treatment).
Selection of survey questions that correspond to the concept by filtering on variable and value labels.
Standardizing value labels across questions.
Standardizing variable names to have the same name for all variables that correspond to the same question.

Metadata

metadata_database is a table with variable metadata created from Eurobarometer SPSS files. It has the following columns:

var_name_orig: variable name in the original dataset
class_orig: column class in the original dataset
var_label_orig: variable label in the original dataset
var_label_norm: normalized variable label
var_name_suggested: suggested variable label
factor_levels: vector of value labels as a list (they have no fixed length)
n_categories: total number of response options = length(factor_levels)
class_suggested: suggested conversion to an R class, it should correspond later to a conversion function, so that the researcher can just simply approve and get a correct R representation.
filename: original name of file as obtained from GESIS; it contains a disambigous version information, too. val_numeric_orig: numeric value code (if available, empty otherwise)
val_label_orig: value label in the original dataset corresponding to a response option val_order_alpha: alphabetical number (position) of response option in the set of value labels, aftering sorting with sort() (we use alphabetical order instead of levels() because levels may not be the same in different survey files, and we revert to a basic but disambigous sorting)
val_order_length:
val_label_norm: value label normalized with label_normalize()

metadata_database <-  readRDS( 
  file.path('..', 'data-raw',
  'eb_metadata_database_large.rds'))

Selection of variables

This is an iterative process.

Iteration 1

Let's filter out questions with trust in their variable labels.

select_metadata_vars <- c("filename", "var_name_orig", "var_label_orig", "var_label_norm", 
                          "val_label_orig", "val_label_norm" , "val_numeric_orig", 
                          "val_order_alpha", "n_categories")

trust_metadata <- metadata_database  %>%
  filter (
    grepl( "trust", var_label_norm )
  )  %>%
  select ( all_of(select_metadata_vars)) %>%
  arrange ( var_label_norm, val_label_norm, filename )

trust_metadata %>%
  head(10) %>%
  kable %>%
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed"),
                  fixed_thead = T,
                  font_size = 10 )

Scrolling through the variable names it looks like binary trust in institutions questions have in their label "trust in institutions" or "trust political parties" or "- trust" at the end. Let's take a closer look at those questions and see what response options they have.

Several variables from Eurobarometer 69.2 (ZA4744) ask more detailed questions about specific reasons for trusting or not trusting selected institutions. These do not fit together with the more general items.

Iteration 2

trust_metadata <- trust_metadata %>% 
  filter (
    grepl( "trust_in_institutions|trust_political_parties|_trust$", var_label_norm )
  ) %>%
  filter( ! (filename == "ZA4744_v5-0-0.sav" & var_name_orig %in% c("v336", "v347","v292", "v303",
                                                                    "v314", "v325", "v270", "v281")))
trust_metadata %>%
  arrange(filename, var_name_orig, val_order_alpha) %>%
  head(10) %>%
  kable %>%
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed"),
                  fixed_thead = T,
                  font_size = 10 )

Check how many unique response values do the selected variables have.

trust_metadata %>%
  arrange(val_label_norm) %>%
  select(val_label_norm, everything()) %>% 
  count(n_categories) %>%
  kable %>%
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed"),
                  fixed_thead = T,
                  font_size = 10 )

These variables have 2, 3 or 4, or as many as 13 unique response options. Let's take a look at the one with 13 unique responses:

trust_metadata %>%
  filter(n_categories == 13) %>%
  kable %>%
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed"),
                  fixed_thead = T,
                  font_size = 10 )

The question asks: Who do you trust the most to fight effectively in [country] against the European Union and its budget being defrauded? (see documentation in the GESIS documentation).

This refers to trust with regard to a specific activity, so does not fit into our concept definition. Let's see what we are left with after excluding it.

Iteration 3

trust_metadata <- trust_metadata %>%
  filter( ! (filename == "ZA3938_v1-0-1.sav" & var_name_orig == "v511"))

trust_metadata %>%
  count(val_label_norm) %>%
  kable %>%
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed"),
                  fixed_thead = T,
                  font_size = 10 )

dk stands for "don't know",
inap* stands for "inapplicable" for various reasons, which are maybe worth exploring,
mentioned and not mentioned suggest questions where the respondent is asked to list all objects to which some condition applies, e.g. "here is a list of institutions. which of those would you say you trust?". These questions require special attention, because "no" is blended with "no answer / don't know". Plus, it's a very different type of question for the respondent,
na and lt_na_gt are missing values,
tend not to trust is a negative answer to the trust question,
tend to trust is a positive answer to the trust question,
you generally do not trust stories published on online social networks is a mystery.

Let's start with the mystery.

trust_metadata %>% 
  filter( grepl("you_generally_do_not_trust", val_label_norm)) %>%
  select(-var_label_norm) %>%
  kable %>%
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed"),
                  fixed_thead = T,
                  font_size = 10 )

That's likely not what we are interested in, but still worth checking with the GESIS documentation.

The question reads "QD5 When you see or read a story published on online social networks, what makes you consider the story trustworthy?" and has several "mentioned/not mentioned" items. All those items are listed as a separate variables, and one of them ends with "trust", which is why we filtered it out. We don't want it and will remove it.

We refine the variable filter.

Iteration 4

trust_metadata <- trust_metadata %>% 
  filter( ! (filename == "ZA6861_v1-2-0.sav" & var_name_orig == "qd5.6"))

trust_metadata %>%
  count(val_label_norm) %>%
  kable %>%
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed"),
                  fixed_thead = T,
                  font_size = 10 )

Now let's at the questions that have the mentioned / not mentioned response options. It's not what we want.

exclusions <- trust_metadata %>%
  filter(val_label_norm %in% c("mentioned", "not_mentioned")) %>%
  print() %>%
  count(filename, var_name_orig)

Iteration 5

trust_metadata <- trust_metadata %>%
  anti_join(exclusions)

trust_metadata %>%
  count(val_label_norm) %>%
  kable %>%
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed"),
                  fixed_thead = T,
                  font_size = 10 )

Standardizing value labels across questions

Now we need to create a table with all value labels, as well as topics / variable types.

val_labels_trust <- trust_metadata %>%
  count(val_label_norm)

trust_vocabulary <- tibble::tibble (
  # maybe we can use a generic controlled vocabulary, for example
  # Library of Congress
  topic_1 = 'trust institutions',
  # And if we find them, we can add GESIS or TNS/Kantar keywords here
  topic_2 = 'trust, binary',
  val_label_norm = val_labels_trust %>% pull(val_label_norm), 
  level = 3 # missingness should be harmonized in character form 
)

trust_vocabulary %>%
  kable %>%
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed"),
                  fixed_thead = T,
                  font_size = 10 )

We add the harmonized values and labels, as well as a flag indicating whether the value is substantive of one of missing value codes, to create the trust_values_table.

trust_values_table <- trust_vocabulary %>%
  mutate(character_value = case_when(
      # and create a surely harmonized character representation 
      grepl("dk|inap|na", val_label_norm) ~ NA_character_, 
      substr(val_label_norm, 1,7) == "tend_to" ~ "tend_to_trust",
      substr(val_label_norm, 1,11) == "tend_not_to" ~ "tend_not_to_trust",
    TRUE ~ "ERROR"), 
    numeric_value = case_when ( 
      character_value == "tend_to_trust" ~ 1,
      character_value == "tend_not_to_trust" ~ 0,
      TRUE ~ NA_real_),
    missing = case_when (
      # it is useful for faster filtering of missingness
      # and true value labels  
      is.na(numeric_value) ~ TRUE, 
      TRUE ~ FALSE)
  ) %>%
  arrange(numeric_value)

trust_values_table %>%
  kable %>%
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed"),
                  fixed_thead = T,
                  font_size = 10 )

Standardizing variable names

The trust_variable_table will include original and harmonized variable names, as well as keywords and topics to match with the trust_values_table.

trust_variable_table <- trust_metadata %>%
  filter ( val_label_norm %in% trust_values_table$val_label_norm ) %>%
  filter ( ! grepl("_recoded", var_label_norm ) ) %>%
  mutate ( institution = gsub("trust_in_institutions_|in_|_trust|trust_|[0-9]_|^a_|^b_|^q16c_|the_", "",
                              var_label_norm ),
           institution = gsub("charitable_org", "charities", institution ),
           institution = gsub("europ_court_of_auditors", "eu_court_of_auditors", institution ),
           institution = gsub("eur_court_of_auditors", "eu_court_of_auditors", institution ),
           institution = gsub("europ_court_of_justice", "eu_court_of_justice", institution ),
           institution = gsub("european_court_of_justice", "eu_court_of_justice", institution ),
           institution = gsub("justice_legal_system", "justice", institution ),
           institution = gsub("justice_nat_legal_system", "justice", institution ),
           institution = gsub("nat_parliament", "national_parliament", institution ),
           institution = gsub("nat_government", "national_government", institution ),
           institution = gsub("non_govmnt_org|non_govnmt_org", "ngo", institution ),
           institution = gsub("polit_parties", "political_parties", institution ),
           institution = gsub("reg_loc_public_authorities", "reg_loc_authorities", institution ),
           institution = gsub("reg_local_authorities", "reg_loc_authorities", institution ),
           institution = gsub("reg_local_public_authorities", "reg_loc_authorities", institution ),
           institution = gsub("rg_lc_authorities", "reg_loc_authorities", institution ),
           institution = gsub("written_press", "press", institution ),
           institution = gsub("econ_and_social_committee", "econ_and_soc_committee", institution ),
           institution = gsub("economic_and_soc_committee", "econ_and_soc_committee", institution )) %>%
  mutate ( geo_qualifier = case_when(
    grepl("_tcc", institution) ~ "tcc",
    TRUE ~ NA_character_),
    institution = gsub("_tcc", "", institution)
  ) %>%
  mutate ( var_name_suggested = paste0("trust_",
                                    institution, "_",
                                    geo_qualifier) ) %>%
  mutate ( var_name_suggested = gsub("_NA", "", var_name_suggested),
           topic_1 = "trust institutions",
           topic_2 = "trust, binary") %>%
  select ( -contains("label"),
           -val_numeric_orig, -val_order_alpha, -n_categories) %>%
  rename ( keyword_1 = institution ) %>%
  distinct_all() %>%
  mutate(var_name_suggested = ifelse(var_name_suggested == "trust_political_partiess_tcc",
                                    "trust_political_parties_tcc", var_name_suggested)) %>%
  arrange(var_name_suggested)

trust_variable_table %>% 
  head(10) %>%
  kable %>%
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed"),
                  fixed_thead = T,
                  font_size = 10 )

How many different harmonized variables do we get? 54, each in between 1 and 42 Eurobarometer editions.

trust_variable_table %>%
  count(var_name_suggested) %>%
  # arrange(desc(n)) %>%
  # head(10) %>%
  kable %>%
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed"),
                  fixed_thead = T,
                  font_size = 10 )