knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(eurobarometer)
library(knitr)
library(dplyr)
library(kableExtra)

At this point this workflow if purely hypothetical, the functions are not written, and they are not evaluated.

Acquiring the data from GESIS

I do not see a lot of sense in automating this, because the new GESIS interface requires a lot of approvals and interaction. However, I think that the files should be in a single folder. Maybe this could be a tempdir() but I am not very enthusiastic about it because it causes documentation issues.

Reading in the files

Working with native SPSS files is extremely slow, and I think it would make sense to first read and re-save them as .rds files or .rda files.

import_file_names <- c(
'ZA7576_sample','ZA6863_sample')

my_survey_list <- read_surveys (
   import_file_names, .f = 'read_example_file' )

Analyse the metadata in the surveys

my_metadata <- gesis_metadata_create( my_survey_list )
names(my_metadata)

The my_metadata table is rather large, so let's review random samples from a few columns.

The variable names need to be harmonized, and you will get suggestions on how to do it. Only a very small subset of the entire table is shown here:

my_metadata %>%
  dplyr::filter( var_name_orig %in% c("doi", "isocntry", "qa14_3", "d71b_3", "filename", "d7", "w1", "p1")) %>%
  select ( all_of(c("filename", "var_name_orig", "var_label_orig", "var_name_suggested"))) %>%
  dplyr::distinct_all() %>%
  dplyr::distinct ( var_name_suggested, .keep_all =  TRUE ) %>%
  arrange ( var_name_suggested ) %>%
  kable () %>%
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed"),
                  fixed_thead = T,
                  font_size = 10 )  

The metadata can help identifying questionnaire item types and it suggests conversion to R classes. Again, only a very small subset of the entire table is shown below:

subset_metadata_1 <- my_metadata %>%
  dplyr::filter( var_name_orig %in% c(
    "doi", "isocntry", "qa14_3", "d71b_3",
    "filename", "d7", "w1", "p1")) %>%
  select ( all_of(
    c("var_label_norm", "var_name_suggested",
      "conversion_suggested", "length_cat_range", 
      "length_na_range"))) %>%
  distinct_all() %>%
    distinct ( var_name_suggested, .keep_all =  TRUE ) 
subset_metadata_1 %>%
  arrange ( var_name_suggested ) %>%
  kable () %>%
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed"),
                  fixed_thead = T,
                  font_size = 10 )
subset_metadata_2 <- my_metadata %>%
  dplyr::filter( var_name_orig %in% c(
    "doi", "isocntry", "qa14_3", "d71b_3", 
    "filename", "d7", "w1", "p1")) %>%
  select ( all_of(
    c("qb", "var_name_orig", "var_name_suggested",
      "conversion_suggested"))) %>%
  distinct_all() %>%
    distinct ( var_name_suggested, .keep_all =  TRUE ) %>%
  arrange ( qb )

subset_metadata_2 %>%
  kable () %>%
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed"),
                  fixed_thead = T,
                  font_size = 10 )  

Creating a skeleton panel

## filter out the most basic and omnipresent id variables, and the
## most basic weights, w1 and its projected version wex

my_panel <- panel_create ( survey_list = my_survey_list, 
               ## must be at least two, and one must be the uniqid
               ## of the file or row_id 
                          id_vars = c("uniqid", "doi")
               )

names(my_panel)

Let's have a look at 6 randomly selected rows:

sample_n( my_panel, 6) %>%
  kable() %>%  
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed"),
                fixed_thead = T,
                font_size = 10 )

This should return a very basic file for joining, a data.frame/tibble with a truly unique id id elements for joining in individual survey data, in this example, uniqid and filename must be present in all imported files. The filename was added by read_surveys().

Harmonizing various aspects of the survey

## The id's are all harmonized to character value, they are 
## not consistent in the original SPSS files. 
id_panel        <- harmonize_qb_vars( 
  survey_list = my_survey_list,
  metadata = my_metadata,
  id_vars = c("uniqid", "doi"),
  question_block = "id",
  var_name = "var_name_suggested",
  conversion = "conversion_suggested" ) 

## Weights are harmonized to numeric.
weight_panel <- harmonize_qb_vars( 
  survey_list = my_survey_list,
  metadata = my_metadata,
  question_block = "weights",
  id_vars = c("uniqid", "doi"),
  var_name = "var_name_suggested",
  conversion = "conversion_suggested" )

## Metadata is harmonized to various classes, but mainly character.
metadata_panel <- harmonize_qb_vars( 
  survey_list = my_survey_list,
  metadata = my_metadata,
  question_block = "metadata",
  var_name = "var_name_suggested",
  conversion = "conversion_suggested" ) 

## Demography panel is harmonized uniquely, but this is not well
## developed yet.
demography_panel <- harmonize_qb_vars( 
  survey_list = my_survey_list,
  metadata = my_metadata,
  id_vars = c("uniqid", "doi"),
  question_block = "socio-demography",
  var_name = "var_name_suggested",
  conversion = "conversion_suggested" ) 

Value Harmonization

The trust panel contains harmonized labelled variables. They can be converted to a harmonized numeric value with consistent binary values and consistent treatment of inappropriate and declined values.

This is only working with binary vars, the rest is converted to character.

trust_panel <- harmonize_qb_vars( 
  survey_list = my_survey_list,
  metadata = my_metadata,
  question_block = "trust",
  var_name = "var_name_suggested",
  conversion = "conversion_suggested" ) 

The trust_vector below contains the harmonized numeric values, harmonized labels alongside the original values and the original labels. This means that all conversion is reversible.

trust_vector <- unique ( trust_panel$council_of_the_eu_trust)
str(trust_vector)

The harmonize_to_numeric() function correctly gives the numeric value, considering the two sources of missingness.

trust_panel %>%
  mutate_at( vars(-all_of("panel_id")), harmonize_to_numeric ) %>%
  summarize_if ( is.numeric, mean, na.rm=TRUE) %>%
  tidyr::pivot_longer( cols = everything()) %>%
  kable()

It is likely that your computer's memory will not be enough to left_join these data tables, so bind them in long format:

panels <- list ( id_panel, 
                 trust_panel,
                 demography_panel,
                 weight_panel, 
                 metadata_panel
                 )

long_panels <- lapply (panels,
                 function(x) tidyr::pivot_longer (
                   x, -all_of("panel_id") )
                )

panel_long <- do.call(
  rbind, long_panels )

my_panel <- panel_long %>%
  tidyr::pivot_wider () %>%
  left_join ( id_panel, by = 'panel_id'  )

The advantage of this workflow is that we can separately work on the demography_panel, metadata_panel, trust_panel.

The harmonize_qb in turn cares systematically for variables types, such as multiple_choice_questions, two-level_factors, three-level_factors, numeric, character for constants, etc.

This approach is not sensitive to missing questions. If some trust questions are present in all files, and others only in a few, you will still get a full panel.

Summary

# Print the harmonized and the original value labels
labelled::val_labels(trust_panel$council_of_the_eu_trust)
attr(trust_panel$council_of_the_eu_trust, "labels_orig")
# Summarize them as factors
summary ( labelled::to_factor(trust_panel$council_of_the_eu_trust))
# Summarize them as numeric
summary ( harmonize_to_numeric(trust_panel$council_of_the_eu_trust))

You can simply revert to the original value labels:

labelled::val_labels(
  trust_panel$council_of_the_eu_trust ) <-  attr(
    trust_panel$council_of_the_eu_trust, "labels_orig")

summary(labelled::to_factor(trust_panel$council_of_the_eu_trust))

Work Documentation

num_value_range <- unique(trust_panel$trust_in_institutions_media)
as.numeric(num_value_range)
num_value_range [!is.na(num_value_range)]

summarize_values <- data.frame (
  harmonized_numeric = labelled::val_labels(
    trust_panel$trust_in_institutions_media), 
    original_numeric = attr( trust_panel$trust_in_institutions_media, "num_orig"),
  original_labels = names(attr( trust_panel$trust_in_institutions_media, "labels_orig"))
) 
kable(summarize_values)  %>%
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed"),
                  fixed_thead = T,
                  font_size = 10 )

Published results

What I had been corresponding with GESIS is that from there we could have the following outputs:

Therefore, we have the possibility to first exploit whatever we create, and the workflow remains transparent and reproducible. Other users can clear up other elements, and if they are good, we can ask them to join their part in as a contribution to the package.



antaldaniel/eurobarometer documentation built on Aug. 31, 2020, 10:57 p.m.