In antaldaniel/eurobarometer: Tidy Eurobarometer Survey Microdata For Retrospective Harmonization

knitr::opts_chunk$set(
  message = FALSE,
  warning = FALSE, 
  eval=FALSE,
  collapse = TRUE,
  comment = "#>"
)

library(eurobarometer)
library(dplyr)
library(tibble)
library(knitr)
library(kableExtra)
# The examples of this vignette can be run with
# source(
#  file.path("not_included", "vignette_vocabulary_examples.R")
#  )

There is no final metadata solution yet. There are two main functions: gesis_metadata_create, which mainly helps with normalizing the variable names, and gesis_vocabulary_create (wrapped by gesis_corpus_create), which helps consolidate the vocabulary of questionnaire items.

Both rely on normalize_names to normalize labels. The var_name_suggested column is a very early attempt to give a canonical form to identical questions that appear under different normalized forms.

metadata_database is a table with variable metadata created from 28 Eurobarometer SPSS files. It has the following columns:

var_name_orig: variable name in the original dataset
class_orig: column class in the original dataset
var_label_orig: variable label in the original dataset
var_label_norm: normalized variable label
var_name_suggested: suggested variable name
factor_levels: vector of value labels as a list (they have no fixed length)
n_categories: total number of response options = length(factor_levels)
class_suggested: suggested conversion to an R class; it should later correspond to a conversion function, so that the researcher can simply approve it and get a correct R representation.
filename: original name of the file as obtained from GESIS; it also contains unambiguous version information.
val_numeric_orig: numeric value code (if available, empty otherwise)
val_label_orig: value label in the original dataset corresponding to a response option
val_label_norm: value label normalized with label_normalize()
val_order_alpha: alphabetical position of the response option in the set of value labels, after sorting with sort(). We use alphabetical order instead of levels() because levels may not be the same in different survey files, so we revert to a basic but unambiguous sorting.
val_order_length: what's this, and why is it equal to n_categories - 1?

Can we also have the number of categories of non-missing response options? For binary variables this would be 2.

That is a very good question, and relates back to the treatment of NA vars. val_order_alpha shows the distinct values after converting from haven_labelled, and n_categories the number of labels. I fear that sometimes certain value labels have no numeric value in SPSS, and they convert to NA, in other cases they have a numeric value like 9 or 999 and they convert to a number.

I believe that we must put a lot of energy into understanding the various forms of 'missingness' and refused answers, and come up with a coding that is practical for programmatic use, applicable in the R language, and gives a faithful representation of the original survey.
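The question about the number of non-missing response options can be illustrated with a minimal base R sketch. The labels and the regular expression for non-response labels below are assumptions made for this example only; they are not taken from the metadata database, and a real non-response vocabulary would be longer.

```r
# Hypothetical normalized value labels of one two-level trust item
val_label_norm <- c("tend_to_trust", "tend_not_to_trust", "dk", "inap")

# Alphabetical position of each label, in the spirit of val_order_alpha
val_order_alpha <- match(val_label_norm, sort(val_label_norm))
val_order_alpha
#> [1] 4 3 1 2

# Assumed pattern for non-response labels (illustration only)
is_missing_label <- grepl("^(dk|inap|declined|refus)", val_label_norm)

# All response options versus the substantive ones
n_categories <- length(val_label_norm)
n_valid_categories <- sum(!is_missing_label)
n_valid_categories
#> [1] 2
```

With such a count, a binary variable would report 2 valid categories regardless of how many different decline codes a given survey file uses.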

metadata_database <- readRDS(
  file.path("..", "data-raw", "eb_metadata_database.rds")
  )

Let's select a typical question type that has Tend to trust and Tend not to trust answer options, with differently encoded Declined to answer options.

select_metadata_vars <- c("filename", "var_name_orig",
                          "var_label_norm", 
                          "var_name_suggested",
                          "val_label_norm" , "val_label_orig",
                          "val_order_alpha", "val_order_length")

trust_metadata <- metadata_database  %>%
  filter (
    grepl( "tend_to_trust|tend_not_to_trust", val_label_norm )
  )  %>%
  select ( all_of(select_metadata_vars)) %>%
  arrange ( var_label_norm, val_label_norm, filename )

To get unique questions, let's leave out the Tend not to trust negative answer options.

require(kableExtra)
trust_metadata %>% filter ( 
   grepl( "tend_to_trust", val_label_norm )
  ) %>%
  sample_n(15) %>% # Print only a sample of 15 rows
kable() %>%
  kable_styling(bootstrap_options =
                c("striped", "hover", "condensed"), 
                fixed_thead = T, 
                font_size = 7)

Let's add back all value labels:

trust_metadata %>%
  # previously we filtered on tend (not) to trust; keep only the
  # keys that identify a variable within a file
  distinct ( filename, var_name_orig ) %>%
  left_join ( 
    # add back all value labels, i.e. various declines
    metadata_database %>%
                select ( all_of(select_metadata_vars)), 
    by = c("filename", "var_name_orig")
    )  %>% 
  sample_n(15) %>% # Print only a sample of 15 rows
kable() %>%
  kable_styling(bootstrap_options = 
                c("striped", "hover", "condensed"), 
                fixed_thead = T, 
                font_size = 9)

Harmonization of Value Labels

Let's review the problems here.

trust_with_decline  <- metadata_database %>%
  select ( all_of(select_metadata_vars) ) %>%
  semi_join ( trust_metadata %>%
                select ( all_of(c("filename", "var_name_orig"))), 
              by = c("filename", "var_name_orig"))

trust_var_labels <- trust_with_decline %>%
  select ( all_of(c("val_label_orig", "val_label_norm",
                    "val_order_alpha", "val_order_length"))) %>%
  distinct_all () %>%
  arrange ( val_label_norm, val_label_orig, val_order_length)

trust_var_labels %>% 
kable() %>%
  kable_styling(bootstrap_options = 
                c("striped", "hover", "condensed"), 
                fixed_thead = T, 
                font_size = 9 )

We are dealing with different scales here. Let's narrow down.

Some variables appear to have four categories, because the declined answers are differently coded in the Turkish Community of Cyprus. In other cases, though, this is just a different scale.

trust_2_with_decline  <- trust_with_decline %>%
  filter ( val_order_length <= 4)

trust_2_value_labels <- trust_2_with_decline %>%
  select ( all_of(c("val_label_orig", "val_label_norm", 
                    "val_order_alpha", "val_order_length"))
           ) %>%
  distinct_all () %>%
  arrange ( val_label_norm, val_label_orig, val_order_length)

trust_2_value_labels %>% 
kable() %>%
  kable_styling(bootstrap_options = 
                c("striped", "hover", "condensed"), 
                fixed_thead = T, 
                font_size = 10 )

My approach would be this:

Filter out all questions with normalized answer options tend_to_trust and tend_not_to_trust. These can be safely harmonized.
A more conservative approach would be to screen the variables in terms of variable labels and value labels, to get an understanding of how these labels are constructed and how they reflect the content of the variable. Consulting the questionnaires might be inevitable in some cases.
There is clearly a lot of work to be done with the non-response options (don't knows, refusals, logical skips / inapplicables, etc.). I think that this would merit a different function and a different vocabulary, because missingness should be treated consistently anyway. The treatment of missingness in the R language will also raise issues, because NA values are not class independent.
Agree on a class representation of the value labels and the missingness. A binary variable would lend itself to be numerically represented, but if we want to treat trust as a topic, which is sometimes coded differently, then a factor or character representation is better (and we should probably use their more modern vctrs versions).
A factor representation is useful if we want to use the ordinal scale of the variable. In this case we must make sure that at each table join, the number and ordering of the factor levels are consistent.
I tend to believe that it is better to add back the ordinal scaling later, because it is not always necessary. (For example, the researcher may want to use these variables in PCA or select a few of them as indicators/dummies.) It is far simpler to give a character representation to these labels.
Another idea would be to have everything stored as string variables, which would force the user to decide for themselves what they want to do.
Manually find out what is going on with tend_to_trust_7_8_in_qa8 and tend_not_to_trust_1_4_in_qa8. If they can be safely treated as an exception and recoded to tend_to_trust and tend_not_to_trust, add them to the controlled vocabulary.
Leave out anything that has different labels than tend_to_trust, tend_not_to_trust or their missingness value.
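The point about keeping factor levels consistent across joins can be sketched as follows. This is only an illustration in base R: the two "waves" and their answers are invented, and the shared level vector is an assumed convention, not something the package defines.

```r
# Assumed shared level set, defined once and reused for every file
trust_levels <- c("tend_not_to_trust", "tend_to_trust")

# Hypothetical answers from two survey files, stored as characters
wave_a <- c("tend_to_trust", "tend_not_to_trust")
wave_b <- c("tend_not_to_trust", "tend_to_trust", "tend_to_trust")

# Combine as characters first, then apply the ordering in one place
combined <- factor(c(wave_a, wave_b), levels = trust_levels)

levels(combined)
#> [1] "tend_not_to_trust" "tend_to_trust"
```

Keeping the answers as characters until the last step means that the number and ordering of levels is decided only once, which is the main argument for adding the ordinal scaling back later.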

trust_vocabulary <- tibble::tibble (
  # maybe we can use a generic controlled vocabulary, for example
  # Library of Congress
  topic_1 = 'trust',
  # And if we find them, we can add GESIS or TNS/Kantar keywords here
  topic_2 = 'trust, binary',
  val_label_norm = trust_2_value_labels %>%
    filter ( 
      grepl("tend_to|tend_not_to|inap|dk", val_label_norm)) %>%
    distinct ( val_label_norm ) %>%
    unlist () %>%
    as.character(), 
  level = 3 # missingness should be harmonized in character form 
) 

trust_table <- trust_vocabulary %>%
  mutate( 
    character_value = case_when(
      # and create a surely harmonized character representation 
      grepl("dk|inap", val_label_norm) ~ NA_character_, 
      substr(val_label_norm, 1,7) == "tend_to" ~ "tend_to_trust", 
    TRUE ~ "tend_not_to_trust"), 
    numeric_value = case_when ( 
      # compare with the full harmonized labels created above
      character_value == "tend_to_trust" ~ 1,
      character_value == "tend_not_to_trust" ~ 0,
      TRUE ~ NA_real_),
    missing = case_when (
      # it is useful for faster filtering of missingness
      # and true value labels  
      is.na(numeric_value) ~ TRUE, 
      TRUE ~ FALSE)
  )

trust_table %>%
  kable %>%
  kable_styling(bootstrap_options = 
                  c("striped", "hover", "condensed"), 
                  fixed_thead = T, 
                  font_size = 10 ) %>%
    add_header_above(c("Keywords" = 2, 
                       "Label Identification" = 2, 
                       "Value Harmonization" = 3))

As far as I can see, this can be a building block of an important metadata table, for example, a table of two-level categories. This table is capable of harmonizing the two-level trust variables in all 28 Eurobarometer files I checked.

The declined answers should be consistently added to the levels. Currently I use NA_character_ in the character representation, but I think that a harmonized character value, such as the often used declined, would be better. This could be consistently treated as NA_real_ in a suggested numerical representation, whenever that is applicable.
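A minimal sketch of this idea in base R, assuming a single harmonized label declined for all non-response options. The input labels are hypothetical and the dk|inap pattern is the same simplification used in the table above.

```r
# Hypothetical normalized labels, not taken from the metadata database
val_label_norm <- c("tend_to_trust", "tend_not_to_trust", "dk", "inap_not_asked")

# Character representation with an explicit, harmonized non-response label
character_value <- ifelse(grepl("dk|inap", val_label_norm),
                          "declined", val_label_norm)

# Suggested numeric representation: labels without a numeric code,
# such as "declined", are looked up as NA_real_
numeric_codes <- c(tend_to_trust = 1, tend_not_to_trust = 0)
numeric_value <- unname(numeric_codes[character_value])

numeric_value
#> [1]  1  0 NA NA
```

The character column keeps the information that an answer was declined, while the numeric column can be used directly where a binary indicator is needed.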

saveRDS(trust_table, file.path(
  "..",   # we're in vignettes
  "data-raw", "trust_value_labels.rds"),
  version = 2 # downward compatibility on CRAN
  )

Once we have a few more bits, for example, similar elements of other 2-value factors, we'll bind them together in the R file data-raw/create_categorical_var.R.

Only the dataset maintainer should run usethis::use_data(), which creates the final file data/category_labels_2.

When the data file is updated, you should check the documentation file R/data-category_label_2.R. This will be used to automatically update the data documentation. Finally, devtools::document() will automatically update the man/data-category_label_2.Rd manual file. Never change the contents of the man directory by hand!

The data documentation from the manual file can be reviewed by ?category_labels_2.

The category_labels_2 should be extended by similar, two-level categorical variables later. The current stage can be reviewed with data(category_labels_2).

Harmonization of Variable Names

Please review the Harmonizing Variable Names vignette. Pull requests are welcome: if you want to make changes or add details, please modify the variable_names.Rmd file, and not the .Rd, .html or other compiled files. If you are unsure whether we are working in parallel, just save a copy of the file with your version to not_included.

trust_variable_table <- trust_with_decline %>%
  filter ( val_label_norm %in% trust_table$val_label_norm ) %>%
  filter ( ! grepl("_recoded", var_label_norm ) ) %>%
  distinct ( val_label_norm, .keep_all = TRUE ) %>%
  mutate ( institution = gsub("trust_in_institutions_|trust_in_|_trust", "",
                              var_label_norm )) %>%
  mutate ( geo_qualifier = case_when(
    grepl("_tcc", institution) ~ "tcc",
    TRUE ~ NA_character_),
    institution = gsub("_tcc", "", institution)
  ) %>%
  mutate ( var_name_suggested = paste0("trust_in_",
                                    institution, "_",
                                    geo_qualifier) ) %>%
  mutate ( var_name_suggested = gsub("_NA", "", var_name_suggested)) %>%
  select ( filename, var_name_orig, var_label_norm, 
           var_name_suggested, institution, geo_qualifier) %>%
  rename ( keyword_1 = institution )

trust_variable_table %>%
  kable %>%
  kable_styling(bootstrap_options = 
                  c("striped", "hover", "condensed"), 
                  fixed_thead = T, 
                  font_size = 10 ) %>%
    add_header_above(c("Filtering" = 3, 
                       "Preferred Term" = 1, 
                       "Keywords" = 2)
                     )

I think that this can be the basis of a metadata table to use canonical names for 2-level factors.
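The naming steps of the pipeline above can be tried out on a couple of labels without the metadata database. The sketch below repeats the same gsub() and paste0() logic in base R; the helper name suggest_trust_name and the two input labels are hypothetical and only serve the illustration.

```r
# Hypothetical helper that mirrors the renaming steps used above
suggest_trust_name <- function(var_label_norm) {
  # Strip the trust prefixes and suffixes to isolate the institution
  institution <- gsub("trust_in_institutions_|trust_in_|_trust", "",
                      var_label_norm)
  # Flag items asked in the Turkish Cypriot sample
  geo_qualifier <- ifelse(grepl("_tcc", institution), "tcc", NA_character_)
  institution <- gsub("_tcc", "", institution)
  # paste0() turns a missing qualifier into the string "NA", remove it
  suggested <- paste0("trust_in_", institution, "_", geo_qualifier)
  gsub("_NA", "", suggested)
}

suggested <- suggest_trust_name(
  c("trust_in_institutions_army",
    "trust_in_institutions_european_parliament_tcc")
)
suggested
#> [1] "trust_in_army"                   "trust_in_european_parliament_tcc"
```

Note that removing "_NA" after pasting is case sensitive, so it would only misfire on an institution keyword that itself contains an upper-case "_NA".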

We should make an effort to harmonize the keyword(s) with accepted thesauri, in this case, for example, with the ELSST.

saveRDS(trust_variable_table, file.path(
  "..",   # we're in vignettes
  "data-raw", "trust_variables.rds"),
  version = 2 # downward compatibility on CRAN
  )

Once we have a few more bits, for example, similar elements of other 2-value factors, we'll bind them together in the R file data-raw/create_categorical_var. [Same as the previous data file, but the data files will be separate.]

Only the dataset maintainer should run usethis::use_data(), which creates the final file data/categorical_variables_2.

When the data file is updated, you should check the documentation file R/data-categorical_variables_2.R. This will be used to automatically update the data documentation. As in the previous example, devtools::document() will automatically update the man/data-categorical_variables_2.Rd manual file. Never change the contents of the man directory by hand!

The data documentation from the manual file can be reviewed by ?categorical_variables_2.

The categorical_variables_2 should be extended by similar, two-level categorical variables later. The current stage can be reviewed with data(categorical_variables_2).

Renaming Suggestions

If you are unhappy with any names used in this package itself (see all functions and arguments on the reference page), please, let me know in an email. We should agree on all column names of metadata files, function names and function parameters as early as possible, because it will be extremely difficult to make name changes later as more and more code piles up.

antaldaniel/eurobarometer documentation built on Aug. 31, 2020, 10:57 p.m.
