Case Study: Working with Eurobarometer surveys"
In retroharmonize: Ex Post Survey Data Harmonization

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(retroharmonize)
load(file = system.file(
  "eurob", "eurob_vignette.rda", package = "retroharmonize"))

The goal in this case study is to analyze trust in the national and European parliaments, and in the European Commission, in Europe, with data from the Eurobarometer.

The Eurobarometer is a biannual survey conducted by the European Commission with the goal of monitoring the public opinion of populations of EU member states and -- occasionally -- also in candidate countries. Each EB wave is devoted to a particular topic, but most waves ask some "trend questions", i.e. questions that are repeated frequently in the same form. Trust in institutions is among such trend questions.

The Eurobarometer data Eurobarometer raw data and related documentation (questionnaires, codebooks, etc.) are made available by GESIS, ICPSR and through the Social Science Data Archive networks. You should cite your source, in our examples, we rely on the GESIS data files. In this case study we use nine waves of the Eurobarometer between 1996 and 2019: 44.2bis (January-March 1996), 51.0 (March-April 1999), 57.1 (March-May 2002), 64.2 (October-November 2005), 69.2 (Mar-May 2008) 75.3 (May 2011), 81.2 (March 2014), 87.3 (May 2017), and 91.2 (March 2019).

In the Afrobaromter Case Study we have shown how to merge two waves of a survey with a limited number of variables. This workflow is not feasible with Eurobarometer on a PC or laptop, because there are too many large files to handle.

library(retroharmonize)
library(dplyr)

eurobarometer_waves <- file.path("working", dir("working"))
eb_waves <- read_surveys(eurobarometer_waves, .f='read_rds')

working_dir  <- here::here("working")
eurobarometer_waves <- file.path(working_dir, dir(working_dir))
eb_waves <- read_surveys(eurobarometer_waves, .f='read_rds')

We can review if the main descriptive metadata is correctly present with document_waves().

documented_eb_waves <- document_waves(eb_waves)

Metadata map

We start by extracting metadata from the survey data files and storing them in a tidy table, where each row contains information about a variable from the survey data file. To keep the size manageable, we keep only a few variables: the row ID, the weighting variable, the country code, and variables that contain "parliament" or "commission" in their labels.

eb_trust_metadata <- lapply ( X = eb_waves, FUN = metadata_create )
eb_trust_metadata <- do.call(rbind, eb_trust_metadata)
#let's keep the example manageable:
eb_trust_metadata  <- eb_trust_metadata %>%
  filter ( grepl("parliament|commission|rowid|weight_poststrat|country_id", var_name_orig) )

head(eb_trust_metadata)

The value labels in this example are not too numerous. The only variable that stands out is the one with Can rely on and Cannot rely on labels.

collect_val_labels(eb_trust_metadata)

The following labels were marked by GESIS as missing values:

collect_na_labels(eb_trust_metadata)

We have created a helper function subset_save_survey() that programmatically reads in SPSS files, makes the necessary type conversion to labelled_spss_survey() without harmonization, and saves a small, subsetted rds file. Because this is a native R file, it is far more efficient to handle in the actual workflow.

## You will likely use your own local working directory, or
## tempdir() that will create a temporary directory for your 
## session only. 
working_directory <- tempdir()

# This code is for illustration only, it is not evaluated.
# To replicate the worklist, you need to have the SPSS file names 
# as a list, and you have to set up your own import and export path.

selected_eb_metadata <- readRDS(
  system.file("eurob", "selected_eb_waves.rds", package = "retroharmonize")
  ) %>%
  mutate ( id = substr(filename,1,6) ) %>%
  rename ( var_label = var_label_std ) %>%
  mutate ( var_name = var_label )

## This code is not evaluated, it is only an example. You are likely 
## to have a directory where you have already downloaded the data
## from GESIS after accepting their term use.

subset_save_surveys ( 
  var_harmonization = selected_eb_metadata, 
  selection_name = "trust",
  import_path = gesis_dir, 
  export_path = working_directory )

Harmonize the labels

For easier looping we adopt the harmonize_values() function with new default settings. It would be tempting to preserve the rely labels as distinct from the trust labels, but if we use the same numeric coding, it will lead to confusion. If you want to keep the difference of the two type of category labels, than the harmonization should be done in a two-step process.

harmonize_eb_trust <- function(x) {
  label_list <- list(
    from = c("^tend\\snot", "^cannot", "^tend\\sto", "^can\\srely",
             "^dk", "^inap", "na"), 
    to = c("not_trust", "not_trust", "trust", "trust",
           "do_not_know", "inap", "inap"), 
    numeric_values = c(0,0,1,1, 99997,99999,99999)
  )

  harmonize_values(x, 
                   harmonize_labels = label_list, 
                   na_values = c("do_not_know"= 99997,
                                 "declined"   = 99998,
                                 "inap"       = 99999 )
  )
}

Let's see if things did work out fine:

document_waves(eb_waves)

documented_eb_waves

To review the harmonization on a single survey use pull_survey().

test_trust <- pull_survey(eb_waves, filename = "ZA4414_trust.rds")

Before running our adapted harmonization function, we have this:

test_trust$trust_european_commission[1:16]

After performing harmonization, it would look like this:

harmonize_eb_trust(x=test_trust$trust_european_commission[1:16])

If you are satisfied with the results, run harmonize_eb_trust() through the 9 survey waves. Whenever a variable is missing from a wave, it is filled up with inapproriate missing values.

Harmonize waves

We define a selection of countries: Belgium, Hungary, Italy, Malta, the Netherlands, Poland, Slovakia, and variables.

eb_waves_selected <- lapply ( eb_waves, function(x) x %>% select ( 
  any_of (c("rowid", "country_id", "weight_poststrat", 
            "trust_national_parliament", "trust_european_commission", 
            "trust_european_parliament"))) %>%
    filter ( country_id %in% c("NL", "PL", "HU", "SK", "BE", 
                               "MT", "IT")))

harmonized_eb_waves <- harmonize_waves ( 
  waves = eb_waves_selected, 
  .f = harmonize_eb_trust )

We cannot rely on document_waves() anymore, because the result is a single data frame. Let's have a look at the descriptive metadata.

wave_attributes <- attributes(harmonized_eb_waves)
wave_attributes$id
wave_attributes$filename
wave_attributes$names

Analyze the data

The harmonized data can be analyzed in R. The labelled survey data is stored in labelled_spss_survey() vectors, which is a complex class that retains much metadata for reproducibility. Most statistical R packages do not know it. To them, the data should be presented either as numeric data with as_numeric() or as categorical with as_factor(). (See more why you should not fall back on the more generic as.factor() or as.numeric() methods in The labelled_spss_survey class vignette.)

First, let's treat the trust variables as factors. A summary of the resulting data allows us to screen for values that are outside of the expected range. In the trust variables, any values other than "trust" and "not trust" that are not defined as missing, are unacceptable. In our example, this is not the case. In our example, this is not the case.

We also see some basic information about the weighting factors, which in the selected Eurobarometer subset range from below 0.01 to almost 7. The range of these values is pretty large, which needs to be taken into account when analyzing the data.

harmonized_eb_waves %>%
  mutate_at ( vars(contains("trust")), as_factor ) %>%
  summary()

Now we convert the trust variables to numeric format, and look at the summary. Following the conversion, we lost information about the type of the missing values - now they are all lumped together as NA. What we gained is the proportion of positive responses (which ranges between 0.41 for trust in the national parliament and 0.63 for trust in the European Parliament), and the ability to, e.g., construct scales of the binary variables.

numeric_harmonization <- harmonized_eb_waves %>%
  mutate_at ( vars(contains("trust")), as_numeric )
summary(numeric_harmonization)

Finally, let's calculate weighted means of trust in the national parliament, the European Parliament, and the European Commission, for the selected countries, across all EB waves.

numeric_harmonization %>%
  group_by(country_id) %>%
  summarize_at ( vars(contains("trust")), 
                 list(~mean(.*weight_poststrat, na.rm=TRUE)))