knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(eurobarometer)
library(knitr)
library(dplyr)
library(kableExtra)
At this point this workflow is purely hypothetical: the functions are not yet written, and the code below is not evaluated.
I do not see much sense in automating this step, because the new GESIS interface requires a lot of approvals and interaction. However, I think that the files should be kept in a single folder. This could be a tempdir(), but I am not very enthusiastic about that because it causes documentation issues.
Working with native SPSS files is extremely slow, and I think it would make sense to first read and re-save them as .rds files or .rda files.
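The re-saving idea can be sketched as follows. This is only an illustration under stated assumptions: it uses the haven package to read the SPSS file (the vignette does not say which reader the package will use), and the toy data and file names are hypothetical stand-ins for a real GESIS download.

```r
library(haven)

# Toy stand-in for a GESIS download; data and file names are hypothetical.
toy_survey <- data.frame(
  uniqid = c(1, 2, 3),
  isocntry = c("BE", "DE", "FR"),
  stringsAsFactors = FALSE
)
sav_file <- file.path(tempdir(), "toy_survey.sav")
rds_file <- file.path(tempdir(), "toy_survey.rds")
write_sav(toy_survey, sav_file)

# Slow step, done once: parse the native SPSS file and cache it as .rds.
survey <- read_sav(sav_file)
saveRDS(survey, rds_file)

# Fast step, repeated in later sessions: read the cached copy.
survey_cached <- readRDS(rds_file)
nrow(survey_cached)
```

The .rds copy keeps the column attributes that haven attaches, so value labels should survive the round trip.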
import_file_names <- c( 'ZA7576_sample','ZA6863_sample')
my_survey_list <- read_surveys ( import_file_names, .f = 'read_example_file' )
my_metadata <- gesis_metadata_create( my_survey_list )
names(my_metadata)
The my_metadata table is rather large, so let's review random samples from a few columns.
The variable names need to be harmonized, and you will get suggestions on how to do it. Only a very small subset of the entire table is shown here:
my_metadata %>%
  dplyr::filter( var_name_orig %in% c("doi", "isocntry", "qa14_3", "d71b_3",
                                      "filename", "d7", "w1", "p1")) %>%
  select ( all_of(c("filename", "var_name_orig",
                    "var_label_orig", "var_name_suggested"))) %>%
  dplyr::distinct_all() %>%
  dplyr::distinct ( var_name_suggested, .keep_all = TRUE ) %>%
  arrange ( var_name_suggested ) %>%
  kable () %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                fixed_thead = TRUE, font_size = 10 )
The metadata can help identify questionnaire item types, and it suggests conversions to R classes. Again, only a very small subset of the entire table is shown below:
subset_metadata_1 <- my_metadata %>% dplyr::filter( var_name_orig %in% c( "doi", "isocntry", "qa14_3", "d71b_3", "filename", "d7", "w1", "p1")) %>% select ( all_of( c("var_label_norm", "var_name_suggested", "conversion_suggested", "length_cat_range", "length_na_range"))) %>% distinct_all() %>% distinct ( var_name_suggested, .keep_all = TRUE )
subset_metadata_1 %>% arrange ( var_name_suggested ) %>% kable () %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"), fixed_thead = T, font_size = 10 )
subset_metadata_2 <- my_metadata %>%
  dplyr::filter( var_name_orig %in% c( "doi", "isocntry", "qa14_3", "d71b_3",
                                       "filename", "d7", "w1", "p1")) %>%
  select ( all_of( c("qb", "var_name_orig", "var_name_suggested",
                     "conversion_suggested"))) %>%
  distinct_all() %>%
  distinct ( var_name_suggested, .keep_all = TRUE ) %>%
  arrange ( qb )
subset_metadata_2 %>%
  kable () %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                fixed_thead = TRUE, font_size = 10 )
The id group serves only identification purposes, and will be used in the skeleton of the panel.
The metadata group relates to information about the responses. We include the country ID here because it determines the weight(s) to be used.
The weights group shows the weights calculated by Kantar or GESIS.
The socio-demography group shows the identified socio-demographic variables for which eurobarometer provides a built-in harmonization tool. There may be other variables that the researcher would like to add to this group by overriding the qb value of certain question ids.
The trust group relates to a variable group covered by our example harmonization table, trust_table.
The not_identified group contains variables for which we do not offer a full-scale built-in harmonization. Some of our helper functions do help with harmonizing these variables, too, but they require more manual programming work by the user.
## filter out the most basic and omnipresent id variables, and the
## most basic weights, w1 and its projected version wex
my_panel <- panel_create (
  survey_list = my_survey_list,
  ## must be at least two, and one must be the uniqid
  ## of the file or row_id
  id_vars = c("uniqid", "doi") )
names(my_panel)
Let's have a look at 6 randomly selected rows:
sample_n( my_panel, 6) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"), fixed_thead = T, font_size = 10 )
This should return a very basic file for joining, a data.frame/tibble with:
a truly unique id;
id elements for joining in individual survey data. In this example, uniqid and filename must be present in all imported files. The filename was added by read_surveys().
## The id's are all harmonized to character value, they are
## not consistent in the original SPSS files.
id_panel <- harmonize_qb_vars(
  survey_list = my_survey_list,
  metadata = my_metadata,
  id_vars = c("uniqid", "doi"),
  question_block = "id",
  var_name = "var_name_suggested",
  conversion = "conversion_suggested" )

## Weights are harmonized to numeric.
weight_panel <- harmonize_qb_vars(
  survey_list = my_survey_list,
  metadata = my_metadata,
  question_block = "weights",
  id_vars = c("uniqid", "doi"),
  var_name = "var_name_suggested",
  conversion = "conversion_suggested" )

## Metadata is harmonized to various classes, but mainly character.
metadata_panel <- harmonize_qb_vars(
  survey_list = my_survey_list,
  metadata = my_metadata,
  question_block = "metadata",
  var_name = "var_name_suggested",
  conversion = "conversion_suggested" )

## Demography panel is harmonized uniquely, but this is not well
## developed yet.
demography_panel <- harmonize_qb_vars(
  survey_list = my_survey_list,
  metadata = my_metadata,
  id_vars = c("uniqid", "doi"),
  question_block = "socio-demography",
  var_name = "var_name_suggested",
  conversion = "conversion_suggested" )
The trust panel contains harmonized labelled variables. They can be converted to a harmonized numeric value with consistent binary values and consistent treatment of inappropriate and declined values.
This currently works only with binary variables; the rest are converted to character.
trust_panel <- harmonize_qb_vars( survey_list = my_survey_list, metadata = my_metadata, question_block = "trust", var_name = "var_name_suggested", conversion = "conversion_suggested" )
The trust_vector below contains the harmonized numeric values and harmonized labels alongside the original values and the original labels. This means that all conversion is reversible.
trust_vector <- unique ( trust_panel$council_of_the_eu_trust)
str(trust_vector)
The harmonize_to_numeric() function correctly gives the numeric value, considering the two sources of missingness.
trust_panel %>% mutate_at( vars(-all_of("panel_id")), harmonize_to_numeric ) %>% summarize_if ( is.numeric, mean, na.rm=TRUE) %>% tidyr::pivot_longer( cols = everything()) %>% kable()
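To make the idea of two sources of missingness concrete, here is a minimal base R sketch. It is not the package's harmonize_to_numeric(); the helper to_numeric_sketch(), the label names and the codes 9998 and 9999 are all assumptions made for illustration.

```r
# Hypothetical stand-in for harmonize_to_numeric(); not the package's code.
# Values whose label marks them as missing are mapped to NA.
to_numeric_sketch <- function(x, na_labels = c("inappropriate", "declined")) {
  labels <- attr(x, "labels")
  na_codes <- unname(labels[names(labels) %in% na_labels])
  values <- as.numeric(x)  # as.numeric() drops the label attributes
  values[values %in% na_codes] <- NA_real_
  values
}

# Toy binary trust item; the codes 9998 and 9999 are assumed, not from GESIS.
trust_example <- structure(
  c(1, 0, 9998, 1, 9999, 0),
  labels = c(not_trust = 0, trust = 1, declined = 9998, inappropriate = 9999)
)

trust_numeric <- to_numeric_sketch(trust_example)
mean(trust_numeric, na.rm = TRUE)
```

Because both kinds of missing answers become NA, the mean is computed over the valid binary answers only.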
It is likely that your computer's memory will not be enough to left_join these data tables, so bind them in long format:
panels <- list ( id_panel, trust_panel, demography_panel,
                 weight_panel, metadata_panel )
long_panels <- lapply (panels, function(x) tidyr::pivot_longer ( x, -all_of("panel_id") ) )
panel_long <- do.call( rbind, long_panels )
my_panel <- panel_long %>%
  tidyr::pivot_wider () %>%
  left_join ( id_panel, by = 'panel_id' )
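The long-format binding can be tried on toy data. This sketch assumes every panel shares a panel_id column; the toy tables and the to_long() helper are hypothetical. Because a long table has a single value column, the sketch casts all values to character first, which is a simplification and not necessarily what the package will do.

```r
library(dplyr)
library(tidyr)

# Toy panels that share panel_id; names and values are illustrative only.
trust_toy <- tibble(
  panel_id = c("a_1", "a_2", "b_1"),
  trust_in_parliament = c(1, 0, 1)
)
weight_toy <- tibble(
  panel_id = c("a_1", "a_2", "b_1"),
  w1 = c(0.9, 1.1, 1.0)
)

# Hypothetical helper: cast values to a common type, then pivot to long form.
to_long <- function(x) {
  x %>%
    mutate(across(-all_of("panel_id"), as.character)) %>%
    pivot_longer(cols = -all_of("panel_id"))
}

panel_long_toy <- bind_rows(lapply(list(trust_toy, weight_toy), to_long))

# pivot_wider() uses the default 'name' and 'value' columns.
panel_wide_toy <- pivot_wider(panel_long_toy)
names(panel_wide_toy)
```

After widening, the columns are character and would need to be converted back to their intended classes.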
The advantage of this workflow is that we can work separately on the demography_panel, metadata_panel, and trust_panel.
The harmonize_qb_vars() function, in turn, systematically takes care of variable types, such as multiple_choice_questions, two-level_factors, three-level_factors, numeric, character for constants, etc.
This approach is not sensitive to missing questions. If some trust questions are present in all files, and others only in a few, you will still get a full panel.
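One way such behaviour can arise is shown below with dplyr::bind_rows(), which fills columns missing from one table with NA. This is only an illustration on hypothetical toy data; it does not claim to be how the package will combine surveys.

```r
library(dplyr)

# survey_b lacks the trust_in_media question; all names are illustrative.
survey_a <- tibble(
  panel_id = c("a_1", "a_2"),
  trust_in_parliament = c(1, 0),
  trust_in_media = c(0, 1)
)
survey_b <- tibble(
  panel_id = "b_1",
  trust_in_parliament = 1
)

# Rows from survey_b get NA for the question that was not asked.
trust_toy_panel <- bind_rows(survey_a, survey_b)
trust_toy_panel
```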
# Print the harmonized and the original value labels
labelled::val_labels(trust_panel$council_of_the_eu_trust)
attr(trust_panel$council_of_the_eu_trust, "labels_orig")
# Summarize them as factors
summary ( labelled::to_factor(trust_panel$council_of_the_eu_trust))
# Summarize them as numeric
summary ( harmonize_to_numeric(trust_panel$council_of_the_eu_trust))
You can simply revert to the original value labels:
labelled::val_labels( trust_panel$council_of_the_eu_trust ) <- attr(
  trust_panel$council_of_the_eu_trust, "labels_orig")
summary(labelled::to_factor(trust_panel$council_of_the_eu_trust))
num_value_range <- unique(trust_panel$trust_in_institutions_media)
as.numeric(num_value_range)
num_value_range [!is.na(num_value_range)]
summarize_values <- data.frame (
  harmonized_numeric = labelled::val_labels(
    trust_panel$trust_in_institutions_media),
  original_numeric = attr(
    trust_panel$trust_in_institutions_media, "num_orig"),
  original_labels = names(attr(
    trust_panel$trust_in_institutions_media, "labels_orig"))
)
kable(summarize_values) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"), fixed_thead = T, font_size = 10 )
In my correspondence with GESIS, we discussed that this workflow could lead to the following outputs:
When we first finish a big batch, and include it in the panel, we create a trend file that will be published on GESIS, and it will be a separate data publication with its own doi. We can place it, for example, on Zenodo or figshare, too.
We can also create a methodological publication from the standard steps we took.
We can produce publications of the topical trend elements, probably with other researchers who know more about the topic, such as climate change.
Therefore, we can be the first to exploit whatever we create, while the workflow remains transparent and reproducible. Other users can clean up other elements, and if their work is good, we can ask them to contribute their part to the package.