knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(eurobarometer)
library(dplyr)
library(tibble)
library(knitr)
library(kableExtra)
library(rprojroot)
# The examples of this vignette can be run with
# source(
#   file.path("not_included", "vignette_vocabulary_examples.R")
# )
The procedure involves four steps:
1. Concept definition. Trust in institutions: general trust, without specifying the concrete domains of activity where the trust applies (questions about specific domains would require special treatment).
2. Selection of survey questions that correspond to the concept, by filtering on variable and value labels.
3. Standardizing value labels across questions.
4. Standardizing variable names, so that all variables that correspond to the same question have the same name.
metadata_database is a table with variable metadata created from Eurobarometer SPSS files. It has the following columns:
- var_name_orig: variable name in the original dataset
- class_orig: column class in the original dataset
- var_label_orig: variable label in the original dataset
- var_label_norm: normalized variable label
- var_name_suggested: suggested variable name
- factor_levels: vector of value labels, stored as a list (they have no fixed length)
- n_categories: total number of response options, i.e. length(factor_levels)
- class_suggested: suggested conversion to an R class. It should later correspond to a conversion function, so that the researcher can simply approve the suggestion and get a correct R representation.
- filename: original file name as obtained from GESIS; it also contains unambiguous version information.
- val_numeric_orig: numeric value code (if available, empty otherwise)
- val_label_orig: value label in the original dataset corresponding to a response option
- val_order_alpha: alphabetical position of the response option in the set of value labels, after sorting with sort(). We use alphabetical order instead of levels() because levels may differ between survey files, so we fall back on a basic but unambiguous sorting.
- val_order_length:
- val_label_norm: value label normalized with label_normalize()
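To make the val_order_alpha definition more concrete, here is a minimal base R sketch with made-up value labels. The object names, the helper alpha_position(), and the use of method = "radix" (to get a locale-independent sort) are assumptions of this sketch, not the package's implementation. It shows that factor levels depend on how each file declared them, while the position after sorting depends only on the labels themselves.

```r
# Two hypothetical files declare the same labels with different level orders.
labels_file_a <- factor(
  c("Tend to trust", "Tend not to trust", "DK"),
  levels = c("Tend to trust", "Tend not to trust", "DK")
)
labels_file_b <- factor(
  c("Tend to trust", "Tend not to trust", "DK"),
  levels = c("DK", "Tend not to trust", "Tend to trust")
)

# levels() reflects each file's own declaration, so the two orders differ.
same_levels <- identical(levels(labels_file_a), levels(labels_file_b))

# Hypothetical helper: position of each label in the sorted set of labels.
# method = "radix" sorts character vectors in the C locale (an assumption
# made here so that the result does not depend on the machine's locale).
alpha_position <- function(x) {
  x <- as.character(x)
  match(x, sort(unique(x), method = "radix"))
}

order_a <- alpha_position(labels_file_a)
order_b <- alpha_position(labels_file_b)
same_alpha_order <- identical(order_a, order_b)
```

Here same_levels is FALSE while same_alpha_order is TRUE, which is the property the alphabetical position is meant to provide.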
# Avoid changes in the working directory when building vignettes:
metadata_rel_path_from_root <- find_root_file(
  "data-raw", "eb_metadata_database_large.rds",
  criterion = has_file("DESCRIPTION")
)
metadata_database <- readRDS(metadata_rel_path_from_root)
data("metadata_filter_example")
metadata_database <- metadata_filter_example
rm(metadata_filter_example)
This is an iterative process.
Let's start by selecting the questions that have trust in their variable labels.
select_metadata_vars <- c(
  "filename", "var_name_orig", "var_label_orig", "var_label_norm",
  "val_label_orig", "val_label_norm", "val_numeric_orig",
  "val_order_alpha", "n_categories"
)

trust_metadata <- metadata_database %>%
  filter(grepl("trust", var_label_norm)) %>%
  select(all_of(select_metadata_vars)) %>%
  arrange(var_label_norm, val_label_norm, filename)

trust_metadata %>%
  head(10) %>%
  kable() %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    fixed_thead = TRUE, font_size = 10
  )
Scrolling through the normalized variable labels (see ?label_normalize) created from the GESIS SPSS variable labels, it looks like the binary trust in institutions questions contain trust in institutions or trust political parties in their label, or end in trust. Let's take a closer look at those questions and see what response options they have.
Several variables from Eurobarometer 69.2 (ZA4744) ask more detailed questions about specific reasons for trusting or not trusting selected institutions. These are a different type of question, with more categories than the general trend questions that have the binary values tend to trust and tend not to trust, so we exclude them by file and variable name.
trust_metadata <- trust_metadata %>%
  filter(
    grepl(
      "trust_in_institutions|trust_political_parties|_trust$",
      var_label_norm
    )
  ) %>%
  filter(
    !(filename == "ZA4744_v5-0-0.sav" &
      var_name_orig %in% c(
        "v336", "v347", "v292", "v303",
        "v314", "v325", "v270", "v281"
      ))
  )

trust_metadata %>%
  arrange(filename, var_name_orig, val_order_alpha) %>%
  head(10) %>%
  kable() %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    fixed_thead = TRUE, font_size = 10
  )
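To see what the label pattern does and does not match, here is a small base R sketch. The normalized labels below are invented for illustration and are not taken from the metadata; only the regular expression is the one used above.

```r
# Made-up normalized variable labels, for illustration only.
var_label_norm <- c(
  "trust_in_institutions_press",
  "trust_political_parties",
  "army_trust",
  "trustworthy_news_sources",
  "army_trust_recoded"
)

pattern <- "trust_in_institutions|trust_political_parties|_trust$"

# grepl() is TRUE where any of the three alternatives matches;
# "_trust$" only matches labels that end in "_trust".
keep <- grepl(pattern, var_label_norm)
kept_labels <- var_label_norm[keep]
```

The first three labels are kept. The last two contain "trust" and would pass the first, broader filter, but they match none of the three alternatives because "_trust$" is anchored at the end of the label.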
How many unique response values do the selected variables have?
trust_metadata %>% arrange(val_label_norm) %>% select(val_label_norm, everything()) %>% count(n_categories) %>% kable %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"), fixed_thead = T, font_size = 10 )
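One thing to keep in mind when reading such a count: the metadata appears to hold one row per response option, so counting rows by n_categories is not the same as counting variables. The following base R sketch uses an invented toy table (file and variable names are hypothetical) to show the difference.

```r
# Toy metadata in the long layout: one row per variable and value label.
toy_metadata <- data.frame(
  filename = c(rep("file_a.sav", 4), rep("file_b.sav", 3)),
  var_name_orig = c("v1", "v1", "v2", "v2", "v1", "v1", "v1"),
  val_label_norm = c(
    "tend_to_trust", "tend_not_to_trust",
    "tend_to_trust", "tend_not_to_trust",
    "tend_to_trust", "tend_not_to_trust", "dk"
  ),
  n_categories = c(2, 2, 2, 2, 3, 3, 3),
  stringsAsFactors = FALSE
)

# Counting rows: each variable contributes one row per value label.
rows_per_n <- table(toy_metadata$n_categories)

# Counting variables: keep one row per (filename, var_name_orig) first.
one_row_per_variable <- unique(
  toy_metadata[, c("filename", "var_name_orig", "n_categories")]
)
variables_per_n <- table(one_row_per_variable$n_categories)
```

In the toy table, rows_per_n reports 4 and 3 rows for 2 and 3 categories, while variables_per_n reports 2 variables and 1 variable.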
These variables have 2, 3 or 4, or as many as 13 unique response options. Let's take a look at the one with 13 unique responses:
trust_metadata %>% filter(n_categories == 13) %>% kable %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"), fixed_thead = T, font_size = 10 )
The question asks: Who do you trust the most to fight effectively in [country] against the European Union and its budget being defrauded? (see the GESIS documentation).
This refers to trust with regard to a specific activity, so it does not fit our concept definition (see the procedure above). Let's see what we are left with after excluding it.
trust_metadata <- trust_metadata %>% filter( ! ( filename == "ZA3938_v1-0-1.sav" & var_name_orig == "v511") )
trust_metadata %>% count(val_label_norm) %>% kable %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"), fixed_thead = T, font_size = 10 )
dk stands for "don't know",
inap* stands for "inapplicable" for various reasons, which may be worth exploring,
mentioned and not mentioned suggest questions where the respondent is asked to list all objects to which some condition applies, e.g. "Here is a list of institutions. Which of those would you say you trust?". These questions require special attention, because "no" is blended with "no answer / don't know". Plus, it's a very different type of question for the respondent,
na and lt_na_gt are missing values,
tend not to trust is a negative answer to the trust question,
tend to trust is a positive answer to the trust question,
you generally do not trust stories published on online social networks is a mystery.
Let's start with the mystery.
trust_metadata %>% filter( grepl("you_generally_do_not_trust", val_label_norm) ) %>% select(-var_label_norm) %>% kable %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"), fixed_thead = T, font_size = 10 )
That's likely not what we are interested in, but it is still worth checking against the GESIS documentation.
The question reads QD5 When you see or read a story published on online social networks, what makes you consider the story trustworthy? and has several "mentioned/not mentioned" items. This follows the usual multiple-choice question structure of Eurobarometer, where every choice option is coded as a separate variable. The label of one of them ends with "trust", which is why our filter picked it up. We don't want it and will remove it. In general, this should be fixed by filtering out multiple-choice questions, because they follow a particular coding; see Prefix conventions.
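A generic filter for multiple-choice items could be sketched along the following lines, in base R and on an invented toy table. The file name, variable names, and the assumption that such items can be recognized by a mentioned or not_mentioned value label are illustrative; this is not how the package does it.

```r
# Toy long-format metadata; file and variable names are invented.
toy_metadata <- data.frame(
  filename = rep("file_a.sav", 5),
  var_name_orig = c("qa1", "qa1", "qa1", "qd5_6", "qd5_6"),
  val_label_norm = c(
    "tend_to_trust", "tend_not_to_trust", "dk",
    "you_generally_do_not_trust_stories", "not_mentioned"
  ),
  stringsAsFactors = FALSE
)

# Assumption: these labels mark one option of a multiple-choice question.
multiple_choice_labels <- c("mentioned", "not_mentioned")

# One key per variable; a variable is flagged when any of its value
# labels is one of the multiple-choice labels.
variable_key <- paste(toy_metadata$filename, toy_metadata$var_name_orig)
is_multiple_choice <- tapply(
  toy_metadata$val_label_norm %in% multiple_choice_labels,
  variable_key,
  any
)

flagged_keys <- names(is_multiple_choice)[as.vector(is_multiple_choice)]
filtered <- toy_metadata[!variable_key %in% flagged_keys, ]
```

Because the flag is set per variable rather than per row, the row with the item-specific label is dropped together with its not_mentioned counterpart, and only the three qa1 rows remain.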
We refine the variable filter.
trust_metadata <- trust_metadata %>% filter( ! ( filename == "ZA6861_v1-2-0.sav" & var_name_orig == "qd5.6" ) )
trust_metadata %>% count(val_label_norm) %>% kable %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"), fixed_thead = T, font_size = 10 )
Now let's look at the questions that have the mentioned and not mentioned response options. They are not what we want.
exclusions <- trust_metadata %>% filter(val_label_norm %in% c("mentioned", "not_mentioned")) %>% print() %>% count(filename, var_name_orig)
trust_metadata <- trust_metadata %>%
  anti_join(
    # with a filtering join, exclude all from iteration 4
    exclusions,
    by = c("filename", "var_name_orig")
  )
trust_metadata %>% count(val_label_norm) %>% kable %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"), fixed_thead = T, font_size = 10 )
Now we need to create a table with all value labels, as well as topics / variable types.
val_labels_trust <- trust_metadata %>%
  count(val_label_norm)

trust_vocabulary <- tibble::tibble(
  # maybe we can use a generic controlled vocabulary, for example
  # Library of Congress
  topic_1 = "trust institutions",
  # And if we find them, we can add GESIS or TNS/Kantar keywords here
  topic_2 = "trust, binary",
  val_label_norm = val_labels_trust %>% pull(val_label_norm),
  level = 3
  # missingness should be harmonized in character form
)
trust_vocabulary %>% kable %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"), fixed_thead = T, font_size = 10 )
We add the harmonized values and labels, as well as a flag indicating whether the value is substantive or one of the missing value codes, to create the trust_values_table.
trust_values_table <- trust_vocabulary %>%
  mutate(
    # and create a surely harmonized character representation
    character_value = case_when(
      grepl("dk|inap|na", val_label_norm) ~ NA_character_,
      substr(val_label_norm, 1, 7) == "tend_to" ~ "tend_to_trust",
      substr(val_label_norm, 1, 11) == "tend_not_to" ~ "tend_not_to_trust",
      TRUE ~ "ERROR"
    ),
    numeric_value = case_when(
      character_value == "tend_to_trust" ~ 1,
      character_value == "tend_not_to_trust" ~ 0,
      TRUE ~ NA_real_
    ),
    # it is useful for faster filtering of missingness
    # and true value labels
    missing = case_when(
      is.na(numeric_value) ~ TRUE,
      TRUE ~ FALSE
    )
  ) %>%
  arrange(numeric_value)
trust_values_table %>% kable %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"), fixed_thead = T, font_size = 10 )
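To illustrate how such a values table could later be used, here is a base R sketch that recodes a vector of normalized responses with a lookup. The small table and the responses are hand-written stand-ins, not the real trust_values_table, and the lookup via match() is one possible approach rather than the package's method.

```r
# A hand-written stand-in for the values table (illustrative values only).
values_table <- data.frame(
  val_label_norm = c("tend_to_trust", "tend_not_to_trust", "dk", "inap_not_asked"),
  character_value = c("tend_to_trust", "tend_not_to_trust", NA, NA),
  numeric_value = c(1, 0, NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical normalized responses from one survey variable.
responses <- c("tend_to_trust", "dk", "tend_not_to_trust", "tend_to_trust")

# match() gives the row of the lookup table for each response.
row_index <- match(responses, values_table$val_label_norm)
harmonized <- data.frame(
  response = responses,
  numeric_value = values_table$numeric_value[row_index],
  missing = is.na(values_table$numeric_value[row_index]),
  stringsAsFactors = FALSE
)

# Share of "tend to trust" among the substantive answers.
share_trust <- mean(harmonized$numeric_value, na.rm = TRUE)
```

Note that in this sketch a response that is absent from the lookup table would also end up as NA, so it would be worth checking for unmatched labels before relying on the missing flag.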
The trust_variable_table will include original and harmonized variable names, as well as keywords and topics to match with the trust_values_table.
trust_variable_table <- trust_metadata %>%
  filter(val_label_norm %in% trust_values_table$val_label_norm) %>%
  filter(!grepl("_recoded", var_label_norm)) %>%
  mutate(
    # normalize the names of institutions
    institution = gsub(
      "trust_in_institutions_|in_|_trust|trust_|[0-9]_|^a_|^b_|^q16c_|the_",
      "", var_label_norm
    ),
    institution = gsub("charitable_org", "charities", institution),
    institution = gsub(
      "europ_court_of_auditors|eur_court_of_auditors",
      "eu_court_of_auditors", institution
    ),
    institution = gsub(
      "europ_court_of_justice|european_court_of_justice",
      "eu_court_of_justice", institution
    ),
    institution = gsub(
      "justice_legal_system|justice_nat_legal_system",
      "justice", institution
    ),
    institution = gsub("nat_parliament", "national_parliament", institution),
    institution = gsub("nat_government", "national_government", institution),
    institution = gsub("non_govmnt_org|non_govnmt_org", "ngo", institution),
    institution = gsub("polit_parties", "political_parties", institution),
    institution = gsub(
      "reg_loc_public_authorities|reg_local_authorities|reg_local_public_authorities|rg_lc_authorities",
      "reg_loc_authorities", institution
    ),
    institution = gsub("written_press", "press", institution),
    institution = gsub(
      "econ_and_social_committee|economic_and_soc_committee",
      "econ_and_soc_committee", institution
    )
  ) %>%
  mutate(
    ## handle exceptions from original questionnaire variants
    geo_qualifier = case_when(
      grepl("_tcc", institution) ~ "tcc", # Turkish Cypriot Community
      TRUE ~ NA_character_
    ),
    institution = gsub("_tcc", "", institution)
  ) %>%
  mutate(
    ## follow exception handling for special questionnaires
    var_name_suggested = paste0("trust_", institution, "_", geo_qualifier)
  ) %>%
  mutate(
    # remove _NA where no geo_qualifier is present (the base case)
    var_name_suggested = gsub("_NA", "", var_name_suggested)
  ) %>%
  mutate(
    ## add some topical keywords to our table by taste
    topic_1 = "trust institutions",
    topic_2 = "trust, binary"
  ) %>%
  select(
    ## remove aux variables
    -contains("label"),
    -all_of(c("val_numeric_orig", "val_order_alpha", "n_categories"))
  ) %>%
  rename(keyword_1 = institution) %>%
  distinct_all() %>%
  mutate(
    var_name_suggested = ifelse( # one last exception to handle
      test = var_name_suggested == "trust_political_partiess_tcc",
      yes = "trust_political_parties_tcc",
      no = var_name_suggested
    )
  ) %>%
  arrange(var_name_suggested)
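The "_NA" clean-up step above exists because paste0() turns a missing geo_qualifier into the literal string "NA". The following base R sketch, with made-up institution names, shows that behaviour and one possible alternative that only appends the qualifier where it exists. The anchored pattern "_NA$" and the ifelse() variant are suggestions of this sketch, not part of the original pipeline.

```r
# Made-up inputs: only the second variable has a geographic qualifier.
institution <- c("army", "political_parties", "press")
geo_qualifier <- c(NA, "tcc", NA)

# paste0() converts the missing qualifier into the string "NA".
pasted <- paste0("trust_", institution, "_", geo_qualifier)

# Clean-up in the spirit of the pipeline: strip the "_NA" suffix afterwards.
# Anchoring with "$" (an addition in this sketch) leaves any "_NA" in the
# middle of a name untouched.
stripped <- sub("_NA$", "", pasted)

# Alternative: only append the qualifier where it is present.
conditional <- ifelse(
  is.na(geo_qualifier),
  paste0("trust_", institution),
  paste0("trust_", institution, "_", geo_qualifier)
)
```

Both stripped and conditional give the same names for this input; the conditional form simply avoids creating the "_NA" suffix in the first place.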
Let's print the first 10 rows of the resulting table:
trust_variable_table %>% head(10) %>% kable %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"), fixed_thead = T, font_size = 10 )
How many different harmonized variables do we get? 54, each appearing in between 1 and 42 Eurobarometer editions.
trust_variable_table %>% count(var_name_suggested) %>% kable %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"), fixed_thead = T, font_size = 10 )