In antaldaniel/eurobarometer: Tidy Eurobarometer Survey Microdata For Retrospective Harmonization

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(eurobarometer)
library(dplyr)
library(tibble)
library(knitr)
library(kableExtra)
library(rprojroot)
# The examples of this vignette can be found run with  
# source(
#  file.path("not_included", "vignette_vocabulary_examples.R")
#  )

Procedure {#procedure}

The procedure involves four steps:

Concept definition: Trust in institutions: general trust without specifying the concrete domains of activity where the trust applies (these specific questions would require special treatment).
Selection of survey questions that correspond to the concept by filtering on variable and value labels.
Standardizing value labels across questions.
Standardizing variable names to have the same name for all variables that correspond to the same question.

Metadata

metadata_database is a table with variable metadata created from Eurobarometer SPSS files. It has the following columns:

var_name_orig: variable name in the original dataset
class_orig: column class in the original dataset
var_label_orig: variable label in the original dataset
var_label_norm: normalized variable label
var_name_suggested: suggested variable label
factor_levels: vector of value labels as a list (they have no fixed length)
n_categories: total number of response options = length(factor_levels)
class_suggested: suggested conversion to an R class, it should correspond later to a conversion function, so that the researcher can just simply approve and get a correct R representation.
filename: original name of file as obtained from GESIS; it contains a disambigous version information, too. val_numeric_orig: numeric value code (if available, empty otherwise)
val_label_orig: value label in the original dataset corresponding to a response option val_order_alpha: alphabetical number (position) of response option in the set of value labels, after sorting with sort() (we use alphabetical order instead of levels() because levels may not be the same in different survey files, and we revert to a basic but disambigous sorting)
val_order_length:
val_label_norm: value label normalized with label_normalize()

# Avoid changes in the working directory when building vignettes:
metadata_rel_path_from_root <- find_root_file(
  "data-raw", "eb_metadata_database_large.rds", 
  criterion = has_file("DESCRIPTION"))

metadata_database <- readRDS( metadata_rel_path_from_root )

data("metadata_filter_example")
metadata_database <- metadata_filter_example
rm(metadata_filter_example)

Selection of variables

This is an iterative process.

Iteration 1

Let's filter out questions with trust in their variable labels.

select_metadata_vars <- c(
  "filename", "var_name_orig", "var_label_orig",
  "var_label_norm", "val_label_orig", "val_label_norm", 
  "val_numeric_orig", "val_order_alpha", "n_categories")

trust_metadata <- metadata_database  %>%
  filter (
    grepl( "trust", var_label_norm )
  )  %>%
  select ( all_of(select_metadata_vars) ) %>%
  arrange ( var_label_norm, val_label_norm, filename )

trust_metadata %>%
  head(10) %>%
  kable %>%
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed"),
                  fixed_thead = T,
                  font_size = 10 )

Scrolling through the normalized variable names (see ?label_normalize) created from the GESIS SPSS variable labels, it looks like binary value trust in institutions questions have in their label trust in institutions or trust political parties or - trust at the end. Let's take a closer look at those questions and see what response options they have.

Several variables from Eurobarometer 69.2 (ZA4744) ask more detailed questions about specific reasons for trusting or not trusting selected institutions. These are different types of questions with more categories than the more general trend questions with a binary value tend to trust - tend not to trust.

Iteration 2

trust_metadata <- trust_metadata %>% 
  filter (
    grepl( 
      "trust_in_institutions|trust_political_parties|_trust$",
      var_label_norm )
  ) %>%
  filter( ! (filename == "ZA4744_v5-0-0.sav" &
               var_name_orig %in% c("v336", "v347","v292", "v303",
                                    "v314", "v325", "v270", "v281")
             )
          )

trust_metadata %>%
  arrange(filename, var_name_orig, val_order_alpha) %>%
  head(10) %>%
  kable %>%
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed"),
                  fixed_thead = T,
                  font_size = 10 )

Check how many unique response values do the selected variables have?

trust_metadata %>%
  arrange(val_label_norm) %>%
  select(val_label_norm, everything()) %>% 
  count(n_categories) %>%
  kable %>%
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed"),
                  fixed_thead = T,
                  font_size = 10 )

These variables have 2, 3 or 4, or as many as 13 unique response options. Let's take a look at the one with 13 unique responses:

trust_metadata %>%
  filter(n_categories == 13) %>%
  kable %>%
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed"),
                  fixed_thead = T,
                  font_size = 10 )

The question asks: Who do you trust the most to fight effectively in [country] against the European Union and its budget being defrauded? (see documentation in the GESIS documentation).

This refers to trust with regard to a specific activity, so does not fit into our concept definition (see procedure.) Let's see what we are left with after excluding it.

Iteration 3

trust_metadata <- trust_metadata %>%
  filter( ! ( filename == "ZA3938_v1-0-1.sav" &
                var_name_orig == "v511")
          )

trust_metadata %>%
  count(val_label_norm) %>%
  kable %>%
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed"),
                  fixed_thead = T,
                  font_size = 10 )

dk stands for "don't know",
inap* stands for "inapplicable" for various reasons, which are maybe worth exploring,
mentioned and not mentioned suggest questions where the respondent is asked to list all objects to which some condition applies, e.g. "here is a list of institutions. which of those would you say you trust?". These questions require special attention, because "no" is blended with "no answer / don't know". Plus, it's a very different type of question for the respondent,
na and lt_na_gt are missing values,
tend not to trust is a negative answer to the trust question,
tend to trust is a positive answer to the trust question,
you generally do not trust stories published on online social networks is a mystery.

Let's start with the mystery.

trust_metadata %>% 
  filter( 
    grepl("you_generally_do_not_trust", val_label_norm)
    ) %>%
  select(-var_label_norm) %>%
  kable %>%
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed"),
                  fixed_thead = T,
                  font_size = 10 )

That's likely not what we are interested in, but still worth checking with the GESIS documentation.

The question reads QD5 When you see or read a story published on online social networks, what makes you consider the story trustworthy? and has several "mentioned/not mentioned" items. This follows the usual multiple choice question structure of Eurobarometer, where all possible choice options are coded as separate variables, and one of them ended with "trust", which is why we filtered it out. We don't want it and will remove it. This should generally be fixed with filtering out multiple choice questions because they follow a particular coding, seePrefix conventions

We refine the variable filter.

Iteration 4

trust_metadata <- trust_metadata %>% 
  filter( ! ( filename == "ZA6861_v1-2-0.sav" &
                var_name_orig == "qd5.6"
              )
          )

trust_metadata %>%
  count(val_label_norm) %>%
  kable %>%
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed"),
                  fixed_thead = T,
                  font_size = 10 )

Now let's at the questions that have the mentioned - not mentioned response options. It's not what we want.

exclusions <- trust_metadata %>%
  filter(val_label_norm %in% c("mentioned", "not_mentioned")) %>%
  print() %>%
  count(filename, var_name_orig)

Iteration 5

trust_metadata <- trust_metadata %>%
  anti_join(
    # with a filtering join exclude all from iteration 4
    exclusions, 
    by = c("filename", "var_name_orig")
    )

trust_metadata %>%
  count(val_label_norm) %>%
  kable %>%
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed"),
                  fixed_thead = T,
                  font_size = 10 )

Standardizing value labels across questions

Now we need to create a table with all value labels, as well as topics / variable types.

val_labels_trust <- trust_metadata %>%
  count(val_label_norm)

trust_vocabulary <- tibble::tibble (
  # maybe we can use a generic controlled vocabulary, for example
  # Library of Congress
  topic_1 = 'trust institutions',
  # And if we find them, we can add GESIS or TNS/Kantar keywords here
  topic_2 = 'trust, binary',
  val_label_norm = val_labels_trust %>% pull(val_label_norm), 
  level = 3 # missingness should be harmonized in character form 
)

trust_vocabulary %>%
  kable %>%
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed"),
                  fixed_thead = T,
                  font_size = 10 )

We add the harmonized values and labels, as well as a flag indicating whether the value is substantive of one of missing value codes, to create the trust_values_table.

trust_values_table <- trust_vocabulary %>%
  mutate(character_value = case_when(
      # and create a surely harmonized character representation 
      grepl("dk|inap|na", val_label_norm) ~ NA_character_, 
      substr(val_label_norm, 1,7) == "tend_to" ~ "tend_to_trust",
      substr(val_label_norm, 1,11) ==
        "tend_not_to" ~ "tend_not_to_trust",
    TRUE ~ "ERROR"), 
    numeric_value = case_when ( 
      character_value == "tend_to_trust"     ~ 1,
      character_value == "tend_not_to_trust" ~ 0,
      TRUE ~ NA_real_),
    missing = case_when (
      # it is useful for faster filtering of missingness
      # and true value labels  
      is.na(numeric_value) ~ TRUE, 
      TRUE ~ FALSE)
  ) %>%
  arrange(numeric_value)

trust_values_table %>%
  kable %>%
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed"),
                  fixed_thead = T,
                  font_size = 10 )

Standardizing variable names

The trust_variable_table will include original and harmonized variable names, as well as keywords and topics to match with the trust_values_table.

trust_variable_table <- trust_metadata %>%
  filter ( 
    val_label_norm %in% trust_values_table$val_label_norm 
    ) %>%
  filter ( ! grepl("_recoded", var_label_norm ) ) %>%
  mutate ( 
    # normalize the names of institutions
    institution = gsub(
      "trust_in_institutions_|in_|_trust|trust_|[0-9]_|^a_|^b_|^q16c_|the_", 
      "", var_label_norm ),
    institution = gsub(
      "charitable_org", "charities", institution ),
    institution = gsub(
      "europ_court_of_auditors|eur_court_of_auditors",
      "eu_court_of_auditors", institution ),
    institution = gsub(
      "europ_court_of_justice|european_court_of_justice",
      "eu_court_of_justice", institution ),
    institution = gsub(
      "justice_legal_system|justice_nat_legal_system",
      "justice", institution ),
    institution = gsub(
      "nat_parliament", "national_parliament", institution ),
    institution = gsub(
      "nat_government", "national_government", institution ),
    institution = gsub(
      "non_govmnt_org|non_govnmt_org", "ngo", institution ),
    institution = gsub(
      "polit_parties", "political_parties", institution ),
    institution = gsub(
      "reg_loc_public_authorities|reg_local_authorities|reg_local_public_authorities|rg_lc_authorities", 
      "reg_loc_authorities", institution ),
    institution = gsub(
      "written_press", "press", institution ),
    institution = gsub(
      "econ_and_social_committee|economic_and_soc_committee", 
      "econ_and_soc_committee", institution ),
  ) %>%
  mutate ( 
    ## handle exceptions from original questionnaire variants
    geo_qualifier = case_when(
      grepl("_tcc", institution) ~ "tcc", #Turkish Cypriot Community
      TRUE ~ NA_character_),
    institution = gsub("_tcc", "", institution)
  ) %>%
  mutate ( 
    ## follow exception handling for special questionnaires
    var_name_suggested = paste0(
      "trust_", institution, "_", geo_qualifier)
    ) %>%
  mutate (
    # remove _na where the no geo_qualifier is present as base case
    var_name_suggested = gsub("_NA", "", var_name_suggested),
  ) %>%
  mutate ( 
    ## add some topical keywords to our table by taste
    topic_1 = "trust institutions",
    topic_2 = "trust, binary"
    ) %>%
  select ( 
    ## remove aux variables 
    -contains("label"),
    -all_of(c("val_numeric_orig", 
              "val_order_alpha", 
              "n_categories"))
  ) %>%
  rename ( keyword_1 = institution ) %>%
  distinct_all() %>%
  mutate(var_name_suggested = ifelse(
    #one last exception to handle
    test = var_name_suggested == "trust_political_partiess_tcc",
    yes = "trust_political_parties_tcc", 
    no = var_name_suggested )
    ) %>%
  arrange(var_name_suggested)

Let's print the first 10 rows of the resulting table

trust_variable_table %>% 
  head(10) %>%
  kable %>%
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed"),
                  fixed_thead = T,
                  font_size = 10 )

How many different harmonized variables do we get? 54, each in between 1 and 42 Eurobarometer editions.

trust_variable_table %>%
  count(var_name_suggested) %>%
  kable %>%
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed"),
                  fixed_thead = T,
                  font_size = 10 )