In antaldaniel/eurobarometer: Tidy Eurobarometer Survey Microdata For Retrospective Harmonization

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(eurobarometer)
library(knitr)
library(dplyr)
library(kableExtra)

At this point this workflow if purely hypothetical, the functions are not written, and they are not evaluated.

Acquiring the data from GESIS

I do not see a lot of sense in automating this, because the new GESIS interface requires a lot of approvals and interaction. However, I think that the files should be in a single folder. Maybe this could be a tempdir() but I am not very enthusiastic about it because it causes documentation issues.

Reading in the files

Working with native SPSS files is extremely slow, and I think it would make sense to first read and re-save them as .rds files or .rda files.

import_file_names <- c(
'ZA7576_sample','ZA6863_sample')

my_survey_list <- read_surveys (
   import_file_names, .f = 'read_example_file' )

Analyse the metadata in the surveys

my_metadata <- gesis_metadata_create( my_survey_list )
names(my_metadata)

The my_metadata table is rather large, so let's review random samples from a few columns.

The variable names need to be harmonized, and you will get suggestions on how to do it. Only a very small subset of the entire table is shown here:

my_metadata %>%
  dplyr::filter( var_name_orig %in% c("doi", "isocntry", "qa14_3", "d71b_3", "filename", "d7", "w1", "p1")) %>%
  select ( all_of(c("filename", "var_name_orig", "var_label_orig", "var_name_suggested"))) %>%
  dplyr::distinct_all() %>%
  dplyr::distinct ( var_name_suggested, .keep_all =  TRUE ) %>%
  arrange ( var_name_suggested ) %>%
  kable () %>%
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed"),
                  fixed_thead = T,
                  font_size = 10 )

The metadata can help identifying questionnaire item types and it suggests conversion to R classes. Again, only a very small subset of the entire table is shown below:

subset_metadata_1 <- my_metadata %>%
  dplyr::filter( var_name_orig %in% c(
    "doi", "isocntry", "qa14_3", "d71b_3",
    "filename", "d7", "w1", "p1")) %>%
  select ( all_of(
    c("var_label_norm", "var_name_suggested",
      "conversion_suggested", "length_cat_range", 
      "length_na_range"))) %>%
  distinct_all() %>%
    distinct ( var_name_suggested, .keep_all =  TRUE )

subset_metadata_1 %>%
  arrange ( var_name_suggested ) %>%
  kable () %>%
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed"),
                  fixed_thead = T,
                  font_size = 10 )

subset_metadata_2 <- my_metadata %>%
  dplyr::filter( var_name_orig %in% c(
    "doi", "isocntry", "qa14_3", "d71b_3", 
    "filename", "d7", "w1", "p1")) %>%
  select ( all_of(
    c("qb", "var_name_orig", "var_name_suggested",
      "conversion_suggested"))) %>%
  distinct_all() %>%
    distinct ( var_name_suggested, .keep_all =  TRUE ) %>%
  arrange ( qb )

subset_metadata_2 %>%
  kable () %>%
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed"),
                  fixed_thead = T,
                  font_size = 10 )

The id groups serve only identification purposes, and will be used in the skeleton of the panel.
The metadata group relate to information about the responses. We include here the country ID because it determines the weight(s) to be used.
The weigths group shows the weights calculated by Kantar or GESIS.
The socio-demography group shows identified socio-demography variables for which eurobarometer provides a built-in harmonization tool. There may be other variables that the research would like to add into this group by overriding the qb value of certain question ids.
The trust group relates to a variable group covered by our example harmonization table trust_table.
The not_identified group contains variables for which we do not offer a full-scale built-in harmonization. However, some of our helper functions do help harmonizing these variables, too, but they require more manual programming work by the user.

Creating a skeleton panel

## filter out the most basic and omnipresent id variables, and the
## most basic weights, w1 and its projected version wex

my_panel <- panel_create ( survey_list = my_survey_list, 
               ## must be at least two, and one must be the uniqid
               ## of the file or row_id 
                          id_vars = c("uniqid", "doi")
               )

names(my_panel)

Let's have a look at 6 randomly selected rows:

sample_n( my_panel, 6) %>%
  kable() %>%  
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed"),
                fixed_thead = T,
                font_size = 10 )

This should return a very basic file for joining, a data.frame/tibble with a truly unique id id elements for joining in individual survey data, in this example, uniqid and filename must be present in all imported files. The filename was added by read_surveys().

Harmonizing various aspects of the survey

## The id's are all harmonized to character value, they are 
## not consistent in the original SPSS files. 
id_panel        <- harmonize_qb_vars( 
  survey_list = my_survey_list,
  metadata = my_metadata,
  id_vars = c("uniqid", "doi"),
  question_block = "id",
  var_name = "var_name_suggested",
  conversion = "conversion_suggested" ) 

## Weights are harmonized to numeric.
weight_panel <- harmonize_qb_vars( 
  survey_list = my_survey_list,
  metadata = my_metadata,
  question_block = "weights",
  id_vars = c("uniqid", "doi"),
  var_name = "var_name_suggested",
  conversion = "conversion_suggested" )

## Metadata is harmonized to various classes, but mainly character.
metadata_panel <- harmonize_qb_vars( 
  survey_list = my_survey_list,
  metadata = my_metadata,
  question_block = "metadata",
  var_name = "var_name_suggested",
  conversion = "conversion_suggested" ) 

## Demography panel is harmonized uniquely, but this is not well
## developed yet.
demography_panel <- harmonize_qb_vars( 
  survey_list = my_survey_list,
  metadata = my_metadata,
  id_vars = c("uniqid", "doi"),
  question_block = "socio-demography",
  var_name = "var_name_suggested",
  conversion = "conversion_suggested" )

Value Harmonization

The trust panel contains harmonized labelled variables. They can be converted to a harmonized numeric value with consistent binary values and consistent treatment of inappropriate and declined values.

This is only working with binary vars, the rest is converted to character.

trust_panel <- harmonize_qb_vars( 
  survey_list = my_survey_list,
  metadata = my_metadata,
  question_block = "trust",
  var_name = "var_name_suggested",
  conversion = "conversion_suggested" )

The trust_vector below contains the harmonized numeric values, harmonized labels alongside the original values and the original labels. This means that all conversion is reversible.

trust_vector <- unique ( trust_panel$council_of_the_eu_trust)
str(trust_vector)

The harmonize_to_numeric() function correctly gives the numeric value, considering the two sources of missingness.

trust_panel %>%
  mutate_at( vars(-all_of("panel_id")), harmonize_to_numeric ) %>%
  summarize_if ( is.numeric, mean, na.rm=TRUE) %>%
  tidyr::pivot_longer( cols = everything()) %>%
  kable()

It is likely that your computer's memory will not be enough to left_join these data tables, so bind them in long format:

panels <- list ( id_panel, 
                 trust_panel,
                 demography_panel,
                 weight_panel, 
                 metadata_panel
                 )

long_panels <- lapply (panels,
                 function(x) tidyr::pivot_longer (
                   x, -all_of("panel_id") )
                )

panel_long <- do.call(
  rbind, long_panels )

my_panel <- panel_long %>%
  tidyr::pivot_wider () %>%
  left_join ( id_panel, by = 'panel_id'  )

The advantage of this workflow is that we can separately work on the demography_panel, metadata_panel, trust_panel.

The harmonize_qb in turn cares systematically for variables types, such as multiple_choice_questions, two-level_factors, three-level_factors, numeric, character for constants, etc.

This approach is not sensitive to missing questions. If some trust questions are present in all files, and others only in a few, you will still get a full panel.

Summary

# Print the harmonized and the original value labels
labelled::val_labels(trust_panel$council_of_the_eu_trust)
attr(trust_panel$council_of_the_eu_trust, "labels_orig")

# Summarize them as factors
summary ( labelled::to_factor(trust_panel$council_of_the_eu_trust))
# Summarize them as numeric
summary ( harmonize_to_numeric(trust_panel$council_of_the_eu_trust))

You can simply revert to the original value labels:

labelled::val_labels(
  trust_panel$council_of_the_eu_trust ) <-  attr(
    trust_panel$council_of_the_eu_trust, "labels_orig")

summary(labelled::to_factor(trust_panel$council_of_the_eu_trust))

Work Documentation

num_value_range <- unique(trust_panel$trust_in_institutions_media)
as.numeric(num_value_range)
num_value_range [!is.na(num_value_range)]

summarize_values <- data.frame (
  harmonized_numeric = labelled::val_labels(
    trust_panel$trust_in_institutions_media), 
    original_numeric = attr( trust_panel$trust_in_institutions_media, "num_orig"),
  original_labels = names(attr( trust_panel$trust_in_institutions_media, "labels_orig"))
)

kable(summarize_values)  %>%
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed"),
                  fixed_thead = T,
                  font_size = 10 )

Published results

What I had been corresponding with GESIS is that from there we could have the following outputs:

When we first finish a big batch, and include it in the panel, we create a trend file that will be published on GESIS, and it will be a separate data publication with its own doi. We can place it, for example, on Zenodo or figshare, too.
We can also create a methodological publication from the standard steps we made
We can produce publications of the topical trend elements, probably with other researchers who know more about the topic, such as climate change.

Therefore, we have the possibility to first exploit whatever we create, and the workflow remains transparent and reproducible. Other users can clear up other elements, and if they are good, we can ask them to join their part in as a contribution to the package.

We can also see how much we are Eurobarometer-specific, and modify the package, or create a mutant for other surveys. I think we will not be very Eurobarometer specific.

antaldaniel/eurobarometer documentation built on Aug. 31, 2020, 10:57 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com