knitr::opts_chunk$set( message = FALSE, warning = FALSE, collapse = TRUE, comment = "#>" ) options(width = 160)
library(eurobarometer) library(rio) library(tidyverse) library(tibble) library(knitr) library(kableExtra) library(tidytext) library(janitor) library(stringi)
An important aspect of survey data harmonization -- in addition to standardizing response coding and value labels -- is the identification of variables that measure the same thing in the same way, and standardizing the names of these variables. The process is faciliated by the fact that variables typically have labels that succintly describe the content of the variable, and sometimes also contain information about the coding of responses (e.g., length of scale), the unit (e.g., age in years or categories) and whether the variable contains "original" respondent responses or some transformation of responses (e.g. recoding or index). Variable labels, on the other hand, provide information about the response options, including their number (corresponding to the number of substantive response options) and content.
Thus, there is quite a lot of information in the variable metadata that can be used to identify:
variables that have the same or very similar variable labels,
variables of the same type in terms of item design in terms of value labels - especially items of the "mentioned / not mentioned" type that are used to represent questions with multiple items where the Respondent chooses more than one option, but also -- occasionally -- questions, where the Respondent chooses only one option.
Thanks to variable and value labels, it becomes possible to identify variables that measure the same concept in the same way in different EB waves without studying the documentation.
This vignette outlines these procedures, including the tokenization of normalized variable labels (var_label_norm
), which are created by label_normalize
as part of gesis_metadata_create
. The logic is similar to text analysis, in that a document-term matrix is created with each variable label treated as a (very short) document, and terms treated as keywords.
Read survey and variable metadata
a. Match variables to variable groups as defined by GESIS-ZACAT, filter out technical and protocol variables
Standardize variable labels
a. Standardize the spelling of certain often abbreviated words (e.g., 'gov', 'govnmt', and 'govmnt' for 'government')
b. Tokenize, eliminate stop words (but be careful to not eliminate too many, e.g. 'right' or 'working')
c. Create a document-term matrix (actually, variable label - keyword matrix)
d. Sort the columns according to the declining frequency of the keywords
e. Create standardized variable labels by concatenating keywords
Identify identical or equivalent sets of value labels
The table eb_metadata_var_groups
contains metadata from 96 Eurobarometer files (each file corresponding to one EB round) combined with information on variable groups from GESIS ZACAT.
# uses eb_metadata_database_20200628.rds eb_metadata_var_groups <- readRDS( file.path('../data-raw', 'eb_metadata_var_groups.rds')) %>% mutate(archive_id = substr(filename, 1, 6)) %>% distinct()
eb_metadata_var_groups
is created by joining eb_var_groups_zacat.rds
with 'eb_metadata_database_20200628.rds' after some cleaning of var_label_orig
.
eb_var_groups_zacat.rds
contains info on variable groups for 157 EB waves from 44.2bis (ZA2828) to 92.1 (ZA7579).
Code for this (not run):
metadata_database <- readRDS( file.path('../data-raw', 'eb_metadata_database_20200628.rds')) %>% mutate(archive_id = substr(filename, 1, 6)) eb_var_groups_zacat <- readRDS( file.path('../data-raw', 'eb_var_groups_zacat.rds') ) %>% distinct() archive_ids <- eb_var_groups_zacat %>% distinct(archive_id) %>% pull(archive_id) eb_metadata_var_groups <- metadata_database %>% # filter only waves for which variable groups are available filter(archive_id %in% archive_ids, # exclude this strange variable var_label_orig != "filename") %>% # valid_range is a list, which messes things up, so let's make it a lower case string mutate(valid_range_str = tolower(as.character(valid_range))) %>% # keep one row per variable distinct(filename, qb, var_name_orig, var_name_orig, var_label_orig, var_label_norm, archive_id, valid_range_str, length_cat_range) %>% # original variable labels in the data and in GESIS ZACAT are not exactly the same # --> need some cleaning mutate(var_label_orig2 = gsub("^([A-Z])*[0-9]{1,3}([A-Z]{1,2})* ", "", var_label_orig), var_label_orig2 = gsub("^[A-Z][0-9] [0-9]{1,2} ", "", var_label_orig2), var_label_orig2 = gsub("^W3A/W4A ", "", var_label_orig2), var_label_orig2 = gsub("^[A-Z]{1,2}[0-9]{1,2}([A-Z])*_[A-Z]{1,2} ", "", var_label_orig2), var_label_orig2 = gsub("^\\s+|\\s+$", "", var_label_orig2), var_label_orig2 = gsub('“)', "", var_label_orig2), var_label_orig2 = gsub(" ", " ", var_label_orig2), # these 2 waves have problems and labels in data and in ZACAT are different var_label_orig2 = ifelse(archive_id %in% c("ZA6695", "ZA6697"), gsub("\\s*\\([^\\)]+\\)","", var_label_orig2), var_label_orig2)) %>% # join with table with variable groups left_join(eb_var_groups_zacat, by = c("var_label_orig2" = "var_label_zacat", "archive_id"))
In many multi-wave surveys, such as the Eurobarometer, there is a degree of consistency in how variables are labelled in different editions of the survey, and other survey projects also try to follow some labelling conventions.
This consistency is limited and conventions are not followed strictly, which makes standardizing labels a semi-manual task. Some of the issues stem from the limits in the number of characters of variable labels in common statistical software packages (e.g., SPSS), which leads to non-standard abbreviations.
The dictionary of variable labels is a project-specific (i.e., specific for the Eurobarometer) table that standardizes the spelling of words and abbreviations used in variable names.
var_dict <- import( file.path('../data-raw', 'var_lab_dict.xlsx'), sheet = "dict") %>% gather(id, from, 2:6) %>% select(from, to) %>% drop_na(from) head(var_dict) %>% kable %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"), fixed_thead = T, font_size = 10 )
# this part now became very fast with stri_replace_all_regex dat2 <- eb_metadata_var_groups %>% # remove variables in the following groups filter(!grepl("protocol|weight|nationality|nation id|original id|id variable|country group", tolower(var_group)), # remove recoded variables !grepl("recoded|(rec)", var_label_norm)) %>% mutate(var_id = paste0(filename, " ", var_name_orig), # leftover question numbers at the beginning of value labels var_label_norm1 = gsub("^w[0-9]{1,2}", "", var_label_norm), var_label_norm1 = gsub("^[0-9]{1-3}_", "", var_label_norm1), var_label_norm1 = gsub("^[a-z][0-9]{2}[a-z]_", "", var_label_norm1), var_label_norm1 = gsub("^[a-z]{1,2}[0-9]{1,2}_", "", var_label_norm1), var_label_norm1 = gsub("^[a-z]{1,2}[0-9]{1,2}[a-z]_", "", var_label_norm1), var_label_norm1 = gsub("^[0-9]{1,2}_", "", var_label_norm1), var_label_norm1 = gsub("^[a-z]_", "", var_label_norm1), var_label_norm1 = gsub("^[a-z][0-9]_[0-9]{1,2}_", "", var_label_norm1)) %>% # apply the variable label dictionary mutate(var_label_norm1 = stri_replace_all_regex(var_label_norm1, pattern = var_dict$from, replacement = var_dict$to, vectorize_all = FALSE)) %>% ungroup() %>% # select the needed variables select(filename, qb, var_label_orig, var_label_orig2, var_name_orig, var_label_norm, var_label_norm1, var_id, var_group, valid_range_str, length_cat_range)
tidy <- dat2 %>% # tokenize unnest_tokens(word, var_label_norm1, token = "regex", pattern = "_", drop = FALSE) %>% # eliminate stop words, with some exceptions anti_join(stop_words %>% filter( !word %in% c("working", "right", "other") )) %>% # eliminate numbers filter( !grepl("\\b\\d+\\b", word))
tidy %>% select(var_id, var_label_norm1, word) %>% head() %>% kable %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"), fixed_thead = T, font_size = 10 )
tidy %>% count(word) %>% arrange(desc(n)) %>% head(., 10) %>% kable %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"), fixed_thead = T, font_size = 10 )
First, we filter out labels that have words that appear in < 5 different waves. If we want to create a trend file, we're only interested in variables that are repeated often.
tidy_small <- tidy %>% group_by(word) %>% # number of distinct file names (waves) per word mutate(n_wave = n_distinct(filename)) %>% # flag words that appear in < 5 different files mutate(to_drop = as.numeric(n_wave < 5)) %>% # group by variable ID (file x variable name) group_by(var_id) %>% # copy to_drop across words in the same variable label mutate(to_drop = max(to_drop, na.rm = TRUE)) %>% # drop variable labels with the to_drop identifier filter(to_drop != 1) %>% # remove auxiliary variables select(-to_drop, -n_wave) %>% ungroup()
The next steps are performed separately within variable groups, and programmed as a loop.
Note that a change to the earlier process is that selected words (so far only "cat" which stands for "categories", I presume) are placed at the end of the document-term matrix, even though they tend to occur frequently, so that they are always at the end.
# split into list by `group` tidy_small_list <- split( tidy_small , f = tidy_small$var_group ) # list of words that go last words_last <- c("cat") # create empty list for standardized variable names dtm2_list <- list() for (i in names(tidy_small_list)) { # create a document-term matrix within each variable group dtm <- tidy_small_list[[i]] %>% distinct(var_id, var_label_norm1, word) %>% mutate(n = word) %>% spread(word, n) # find the order of declining frequency of the keywords order <- dtm %>% summarise_all( list(~ sum(is.na(.))/length(.)) ) %>% t() %>% as.vector() names(order) <- names(dtm) # concatenate keywords to create standardized variable names dtm2_list[[i]] <- dtm %>% select( names(sort(order)), # !!! put selected words last -any_of("cat"), any_of(words_last) ) %>% mutate_at( vars(3:ncol(dtm) ), list( ~ifelse(is.na(.), "", .)) ) %>% unite("var_label_std", 3:ncol(dtm)) %>% mutate(var_label_std = gsub( "(\\_)\\1+", "\\1", var_label_std) ) %>% mutate(var_label_std = gsub( "^_|_$", "", var_label_std) ) } dtm2 <- bind_rows(dtm2_list)
Standardized variable labels var_label_std
are concatenated keywords, where the keywords occur in the order of their frequency in the entire corpus of variable labels. As a consequence, more common - likely broader - terms will come first, and less common - more specific - terms will come later.
metadata_standard_var_names <- dtm2 %>% left_join(dat2)
metadata_standard_var_names %>% select(var_id, var_label_std, var_group) %>% head(., 25) %>% kable %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"), fixed_thead = T, font_size = 10 )
Print the most common standardized variable labels.
metadata_standard_var_names %>% count(var_label_std) %>% arrange(desc(n)) %>% head(., 25) %>% kable %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"), fixed_thead = T, font_size = 10 )
Filter on substrings of standardized variable labels var_label_std
, and count the repetitions across datasets.
# filter on var_label_std metadata_standard_var_names %>% filter( grepl("trust|satisf|left-right|interest|class", var_label_std)) %>% count(var_label_std) %>% arrange(desc(n)) %>% head(., 25) %>% kable %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"), fixed_thead = T, font_size = 10 )
The dictionary of value labels identifies identical or equivalent sets of value labels.
val_dict <- import( file.path('../data-raw', 'val_lab_dict.xlsx'), sheet = "dict") %>% mutate(from = tolower(from)) %>% select(from, to) head(val_dict) %>% kable %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"), fixed_thead = T, font_size = 10 )
So far, seven types of response scales are identified:
mentioned / not mentioned
- coding of responses to questions where the Respondent is asked to choose one or more elements from a list of possibilities; these variables account for the majority of variables in the Eurobarometer, but -- interestingly -- are not generally found in other cross-national survey projects, trust / no trust
- binary response scales to questions about trust in institutions, national government / European Union
- binary responses to questions about whether national governments or the European Union have authority over a given issue domain, yes / no
- binary response scales (exclude types 1-3), positive / negative / neutral
- three-point scales, where the order of options is: positive, negative, neutral, positive / neutral / negative
- three-point scales, where the order of options is: positive, neutral, negative, ordinal, 4 points
- four-point scales that contain ordinal evaluations of a phenomenon, starting from the most positive, typically on a bipolar scale, e.g.: very positive, fairly positive, fairly negative, very negative. metadata_standard_var_names_scales <- metadata_standard_var_names %>% # use values dictionary to tag variables to values_type mutate(values_type = stri_replace_all_fixed(as.character(valid_range_str), pattern = val_dict$from, replacement = val_dict$to, vectorize_all = FALSE)) %>% # identify "mentioned/not mentioned" as a separate type mutate(values_type = ifelse( grepl(".*not mentioned.*", valid_range_str), "mentioned", values_type )) %>% # if type == valid_range, nothing has been learnt, so replace with NA mutate(values_type = ifelse( values_type == as.character(valid_range_str), NA, values_type) ) metadata_standard_var_names_scales %>% count(values_type) %>% kable %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"), fixed_thead = T, font_size = 10 )
Filter on substrings of standardized variable labels var_label_std
, and count the repetitions across datasets.
# filter on var_label_std metadata_standard_var_names_scales %>% filter( grepl("trust|satisf|left-right|interest|class", var_label_std)) %>% count(var_label_std, values_type) %>% arrange(desc(n)) %>% head(., 25) %>% kable %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"), fixed_thead = T, font_size = 10 )
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.