knitr::opts_chunk$set(echo = TRUE)
knitr::write_bib(c(
  .packages(), 'haven', 'labelled', 'vctrs'
), 'packages.bib')

Read the current knitted version on RPubs.

The increasing availability of microdata creates new opportunities for the joint use of these resources in comparative research. Before joint analysis is possible, the data need to be harmonized, i.e. transformed and combined into a single data set in a way that matches the measurement of the same characteristics and standardizes the coding of responses. We introduce the retroharmonize and eurobarometer packages, which support the reproducible ex-post harmonization of survey data stored in separate files with rich metadata. We illustrate the proposed workflow using the case of the Eurobarometer surveys, which provide rich data on European publics in the form of single-wave SPSS files from a period of over 25 years. Our tools are flexible and can be applied to other survey projects.

The Eurobarometer case is particularly interesting because each wave of the Eurobarometer project includes 27-30 nationally representative surveys conducted in about 20 languages, based on a generic English and French questionnaire. The waves contain trend variables, which are asked repeatedly in many waves, so merely joining the same variables from two Eurobarometer waves, for example on trust in institutions, requires the harmonization of about 60 complex data assets. Within each wave, the surveys are harmonized ex ante, which covers fieldwork organization, the processing of the interview data, and their combination into a single data file. These procedures differ between waves, especially between waves that are far apart in time, for example 1995 and 2019. When we join data from different survey waves ex post, we have to understand and partly reconstruct the ex ante harmonization steps.

Ex-post data harmonization refers to procedures applied to existing data that were not created in a way that ensures comparability, with the aim of making the resulting harmonized data suitable for joint analysis. In practice, ex-post harmonization means taking data as they were published by research projects and recoding them to ensure that codes have the same meaning across all data sources.

For the harmonization to be reproducible, all data transformations and decisions must be documented with procedures that enable the reproduction of the end result. Reproducible recoding can take the form of code in a programming language, where running the code performs all the necessary data transformations. As the number of recodes or source data sets increases, the code quickly becomes long, complicated, and difficult to follow for anyone but its creator. The reproducible workflow we propose has instead been designed with readability in mind, with documentation built into all steps and procedures.

Put simply, when we create a multi-wave integrated data set from two (or more) Eurobarometer surveys stored in separate SPSS files, we aim to solve the following problems:

-- Importing data from proprietary statistical software formats and creating a faithful, practical representation of the data and metadata contents for later harmonization and joining.

-- Ex-post harmonization

-- Creating a joint data table in which each variable has the same class and metadata attributes across all sources; documenting all the changes; and providing methods that re-export the data files to SPSS, Stata, or SAS.

Our workflow generally fits into the tidyverse.

The Eurobarometer surveys

The Eurobarometer is a biannual survey conducted by the European Commission with the goal of monitoring public opinion in the populations of EU member states and, occasionally, also in candidate countries. The surveys are administered via face-to-face interviews to samples (selected via multi-stage probability sampling) of the adult populations of the respective countries. Sample sizes hover around 1000 respondents in most countries, and around 500 respondents in smaller nations. Target populations in the Eurobarometer are defined as the “[P]opulation of the respective nationalities of the European Union Member States and other EU nationals, resident in each of the 28 Member States and aged 15 years and over”. In non-EU countries, the target population includes “the national population of citizens and the population of citizens of all the European Union Member States that are residents in these countries and have a sufficient command of the national languages to answer the questionnaire” [@EuropeanCommission2017].

Each EB wave is devoted to a particular topic, and most waves ask some "trend questions", i.e. questions that are repeated frequently in the same form. These trend questions in particular create unique opportunities for longitudinal research. Each wave is published as a separate file that contains responses from all countries that participated in that wave, i.e. all EU countries and sometimes also candidate countries. As a consequence, the data are ex-ante harmonized within waves, which makes cross-country comparisons readily possible, and also reduces the burden of ex-post harmonization by a factor of about 20 (the number of countries in an average EB wave).

To give an idea of the size of the EB data sets, each file consists of between around 150 and 4,000 variables, with an average of some 750 among the test files we used, and between 15,000 and around 65,000 cases corresponding to individual respondents.

Import files {#import}

Complex microdata, such as primary survey data, is often stored in a format used by statistical software. In our article, we use examples for SPSS’s .sav format, but other proprietary file formats pose similar challenges. These formats usually do not only represent the data, but also contain rich metadata that provide additional information to the researcher, such as which observations should be treated as missing, or what is the valid range of a variable's values.

In our case, we are dealing with highly structured survey data that contain a lot of metadata, partly about the ex ante harmonization process, which is vital for using the data. For example, an archived Eurobarometer wave file from recent years contains more than 28 national samples: one for each EU member state, plus two for Germany (East and West), two for the former member state United Kingdom (Great Britain and Northern Ireland), and often samples for European Economic Area or candidate countries, such as North Macedonia or Albania. The way these subsamples were merged and handled must be preserved.

For importing, we rely almost entirely on two packages, haven (part of the tidyverse) and labelled [cf. @R-haven; @R-labelled; @R-base]. They provide classes and methods that give a faithful representation of imported statistical software files, but these need to be extended with reproducibility and harmonization information for later joins. Our version of the function, retroharmonize::read_spss(), records the parameters of the haven::read_spss() call for reproducibility.
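The idea of recording the import call can be illustrated with a minimal sketch. It is not the retroharmonize implementation: the wrapper name `read_spss_recorded()` and the attribute names `filename` and `user_na` are assumptions made for this example, and the toy `.sav` file is written to a temporary path only to keep the example self-contained.

```r
library(haven)

# Hypothetical wrapper (not the retroharmonize function): read an SPSS file
# with haven and store the call parameters as attributes of the result.
read_spss_recorded <- function(file, user_na = TRUE, id = basename(file)) {
  survey <- haven::read_spss(file, user_na = user_na)
  attr(survey, "id") <- id                    # survey name, file name, DOI, ...
  attr(survey, "filename") <- basename(file)  # assumed attribute name
  attr(survey, "user_na") <- user_na          # assumed attribute name
  survey
}

# Build a tiny .sav file so that the example does not need external data.
toy <- data.frame(trust = c(1, 2, 1))
toy$trust <- haven::labelled(
  toy$trust,
  labels = c("Tend to trust" = 1, "Tend not to trust" = 2)
)
tmp <- tempfile(fileext = ".sav")
haven::write_sav(toy, tmp)

survey <- read_spss_recorded(tmp, id = "toy_wave")
attr(survey, "id")
attr(survey, "filename")
```

Because the extra information lives in attributes, the returned object can still be used like any other tibble.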

Ex-post harmonization {#expost}

Ex-post (or retrospective) data harmonization refers to procedures applied to existing data to improve the comparability and inferential equivalence of measures collected by different studies [cf. @Fortier2017a]. Ex-post survey data harmonization involves applying these procedures to survey data sets that were not created with comparability in mind or are otherwise not compatible enough to be analyzed as a single data source.

Ex-post harmonization is both theory-informed and data-driven. Theory provides concepts and definitions, while data availability to a large extent determines what ends up being harmonized and how. Additionally, methodological research provides evidence about limits of ex-post harmonization with regard to achieving data comparability.

While the exact steps vary by application case, a survey data harmonization project can be generally described as consisting of the following steps: (1) definition of the concepts of interest, (2) identifying and obtaining the data, (3) data transformations, and (4) verification and documentation [cf. @Granda2016; @Wolf2016; @Fortier2017a; @Slomczynski2018; @Kolczynska2020a]. It is worth emphasizing that "[R]etrospective harmonization of individual participant data is an iterative process composed of a series of key closely related, inter-dependant and often integrated steps" [@Fortier2017b: 1].

The scope and complexity of the steps involved in survey data harmonization depend on the extent to which the existing data are already comparable with regard to (a) the sample and (b) the measurement. The comparability of survey samples includes the definition of the target population, the sampling frame and sample type, as well as the response (or other outcome) rate. Measurement deals with the content and design of the survey questions, as well as the recording of responses and the data entry and data validation procedures applied following fieldwork. The more similar the procedures in both areas, the higher the chances of a high-quality end product.

In the case of the Eurobarometer, the harmonization process is greatly facilitated by the fact that all Eurobarometer waves are part of the same project with largely standardized procedures for many stages of the survey process, including questionnaire design, sampling, fieldwork, data processing and dissemination. Additionally, within each Eurobarometer wave, data for all countries are harmonized ex ante. Thus, in this case the harmonization of the data across multiple Eurobarometer surveys boils down to data transformation and documentation. For the purposes of presenting the process of harmonization with the retroharmonize and eurobarometer packages, we use the case of the Eurobarometer without loss of generality of the described procedures, since it includes all data processing steps involved in harmonization.

Thesaurus, taxonomy {#taxonomy}

Harmonization of variable names {#harmonize-var-names}

While many EB waves include the same trend questions, the naming conventions of these variables and their variable labels have changed over time and are not consistent across all waves. For example, questions about trust in parliament are labelled as "TRUST IN INSTITUTIONS: NAT PARLIAMENT" or "TRUST IN INSTITUTIONS: NATIONAL PARLIAMENT", and the label typically also begins with the number of the question in the questionnaire, which changes from wave to wave. Similarly, the question about the respondent's occupation is sometimes labelled as "RESPONDENT OCCUPATION SCALE" and other times "OCCUPATION OF RESPONDENT SCALE".

Identifying the same variables behind these different labels can be automated in cases when the labels contain the same words but in a different order or separated by different punctuation or stop words.
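A minimal sketch of such automated matching, in base R, reduces each label to an order-independent key. The helper `normalize_label()`, its stop word list, and the rule that any token containing a digit is a question number are assumptions made for illustration, not the packages' implementation.

```r
# Hypothetical helper: turn a variable label into an order-independent key.
normalize_label <- function(label) {
  stop_words <- c("of", "the", "in", "and")        # illustrative list only
  key <- gsub("[[:punct:]]+", " ", tolower(label)) # punctuation to spaces
  words <- strsplit(trimws(key), "[[:space:]]+")[[1]]
  words <- words[!grepl("[0-9]", words)]           # assumed: digits mark question numbers
  words <- setdiff(words, stop_words)              # drop stop words and duplicates
  paste(sort(words), collapse = " ")
}

label_a <- "RESPONDENT OCCUPATION SCALE"
label_b <- "OCCUPATION OF RESPONDENT SCALE"

normalize_label(label_a)
normalize_label(label_b)

# Labels that share a key are candidates for the same harmonized variable.
normalize_label(label_a) == normalize_label(label_b)
```

Sorting the words makes the key insensitive to word order, so the comparison only depends on which content words a label contains.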

Another part of the standardization of variable names can be performed with a dictionary that maps abbreviations (often non-standard ones due to limits in the length of variable labels) to standardized terms. For example, 'gov', 'govnmt', and 'govmnt' are all variants of the term 'government'.
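The dictionary step can be sketched with a named character vector. Only the 'government' variants are taken from the text; the 'nat' entry is inferred from the label examples above, and the helper name `expand_abbreviations()` is hypothetical.

```r
# Illustrative dictionary: names are abbreviations, values are standard terms.
abbreviations <- c(
  gov    = "government",
  govnmt = "government",
  govmnt = "government",
  nat    = "national"   # assumed entry, inferred from the label examples
)

# Hypothetical helper: replace every known abbreviation in a label.
expand_abbreviations <- function(label, dictionary = abbreviations) {
  words <- strsplit(tolower(label), "[[:space:]]+")[[1]]
  hits <- words %in% names(dictionary)
  words[hits] <- unname(dictionary[words[hits]])
  paste(words, collapse = " ")
}

short_label <- "TRUST IN INSTITUTIONS: NAT PARLIAMENT"
long_label  <- "TRUST IN INSTITUTIONS: NATIONAL PARLIAMENT"

expand_abbreviations(short_label) == expand_abbreviations(long_label)
```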

Applying only these two steps enables the standardization of a number of variables referring to socio-demographic characteristics (age, gender, education, occupation, household size, marital status, community type), well-being (life satisfaction), political attitudes and engagement (left-right placement, trust in institutions, discussing politics), and living standard (ownership of durables, internet use).

Note: We should keep an eye on snakecase and maybe janitor, too.

Harmonization of missing values {#harmonize-missing}

The harmonization of the value labels is a three-step process in our case. Because we work with microdata that are not native R data, we must make sure that the variables we want to join have an unambiguous type casting, that we apply the same range of category or value labels, and that we treat missing values consistently.

We start with the missing values, because they are treated differently in our main import format, SPSS, and in the R language. In SPSS, the user can define certain values (and labels) to be treated as if they were missing, for example do not know answers, so that they are excluded from the calculation of an arithmetic mean or a median.

The haven package already supports importing user-defined missing values, or a user-defined missing value range, through the class haven_labelled_spss. While this class generally works well with the simpler haven_labelled methods, haven implements few methods for it. In our case, we had to make sure that when we concatenate two vectors with user-defined missing values or ranges, they remain consistent after the join. Our retroharmonize_labelled_spss_survey class is an extension of haven_labelled_spss. It relies on the vctrs package, which contains helper methods for creating new vector classes [cf. @R-vctrs], and which also serves as the foundation of haven's own class system.

Our choice in this case is to use integer values in the range 99900:99999, far outside the range of the categorical integer codes, for a consistent numeric coding of the values to be treated as missing, and then to label them consistently. The original missing labels and missing values are preserved as attributes in our labelled_spss_survey class. For consistency, we add missing labels even to data that do not have missing values.
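The recoding idea can be sketched in base R with plain numeric vectors. The helper `harmonize_missing()`, the code 99997, the label `do_not_know`, and the attribute names are assumptions made for this example; they do not reproduce the labelled_spss_survey class.

```r
# Hypothetical helper: move one user-defined missing code into the
# 99900:99999 range, relabel it, and keep the original coding.
harmonize_missing <- function(x, labels, na_value = NULL,
                              new_value = 99997, new_label = "do_not_know") {
  stopifnot(new_value %in% 99900:99999)
  original <- list(values = x, labels = labels)
  x[which(x == na_value)] <- new_value
  is_missing_label <- labels == na_value
  if (any(is_missing_label)) {
    labels[is_missing_label] <- new_value
    names(labels)[is_missing_label] <- new_label
  } else {
    # Add the missing label even if the source never defined one.
    labels <- c(labels, stats::setNames(new_value, new_label))
  }
  structure(x, labels = labels, na_values = new_value, original = original)
}

# Two toy waves that code "do not know" differently (made-up data).
wave_a <- harmonize_missing(
  x = c(1, 2, 3, 1),
  labels = c("Tend to trust" = 1, "Tend not to trust" = 2, "DK" = 3),
  na_value = 3
)
wave_b <- harmonize_missing(
  x = c(2, 9, 1),
  labels = c("Tend to trust" = 1, "Tend not to trust" = 2, "Don't know" = 9),
  na_value = 9
)

# After harmonization both waves share the same labels and missing code.
identical(attr(wave_a, "labels"), attr(wave_b, "labels"))
attr(wave_b, "original")$labels
```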

The haven_labelled_spss class does not have many implemented coercion and casting methods, because there is no objective way of handling the different missing values. We had to make some subjective but consistent choices to allow a consistent re-casting with the as_numeric, as_factor and as_character methods. For full reproducibility, before replacing the user-defined missing values with the type-specific NA values of the R language (NA_real_ or NA_character_), we carefully record the labels of these conversions, and even their pre-harmonization, historical labelling. After this conversion, R arithmetic and statistical functions can be applied without compromising reproducibility: all metadata are present for a consistent export to SPSS or Stata, for documentation, or even for reverting to an earlier labelling.

The lack of clear coercion and casting rules prevents the concatenation or binding of haven_labelled_spss vectors in a data frame. Our subjective choice of pushing all user-defined missing values into the numeric range 99900:99999 with the same labelling, and applying exactly the same coding and labelling to the valid range, allows us to use vctrs::vec_c and vctrs::vec_rbind for binding the variables together.

SPSS and the labelled_spss class allow the parallel use of two attributes: na_values for the enumeration of all values that should be treated as missing, and na_range for declaring a continuous range as missing. In survey research, the use of na_values is more practical; nevertheless, we must check the consistency of these attributes when we want to bind together several vectors. The na_range_to_values function creates this consistency in case na_range is present in a vector. We did not implement the reverse, na_values to na_range, as it is unlikely to be used for survey harmonization. The function gives a warning in case of inconsistency. In such a case we recommend a manual review of the valid and missing ranges of the vector, as this is more of a logical than a syntactical error in the survey file.
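The logic of converting a missing range into enumerated missing values can be sketched as follows. This is not the na_range_to_values implementation; the function `range_to_values()` and its arguments are hypothetical and operate on plain numeric vectors.

```r
# Hypothetical sketch: enumerate the codes that fall inside a missing range
# and warn if separately declared na_values are not covered by that range.
range_to_values <- function(x, labels, na_range, na_values = NULL) {
  stopifnot(length(na_range) == 2, na_range[1] <= na_range[2])
  candidates <- sort(unique(c(x, labels)))  # observed and labelled codes
  from_range <- candidates[candidates >= na_range[1] &
                             candidates <= na_range[2]]
  if (!is.null(na_values) && !all(na_values %in% from_range)) {
    warning("na_values and na_range are inconsistent; review the variable.")
  }
  from_range
}

# Toy data: codes 8 and 9 are declared missing through a range.
codes  <- c(1, 2, 8, 9, 1)
labels <- c(yes = 1, no = 2, refused = 8, dk = 9)

enumerated <- range_to_values(codes, labels, na_range = c(8, 9))
enumerated
```

Collecting candidates from both the observed values and the labels means that a labelled missing code is enumerated even if no respondent happens to have it.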

Harmonization of categories {#harmonize-labels}

The function harmonize_values() does three critically important steps:

Working with harmonize_values() is relatively easy if the researcher wants to join a few variables in a few survey files, but it is not really suitable for large-scale, programmatic use. (See vignette)

We create various helper functions that allow a massive relabelling based on pre-defined harmonized value labels. [... later ....]

The retroharmonize package only deals with a simple pre-processing step, i.e. a soft normalization of the value labels. The use of value labelling is very typical for a survey program, such as Eurobarometer, and a data archive, such as GESIS, so we implement most of the value harmonization (and its helper functions) in the eurobarometer package. This implementation can be used as a template for other survey programs.

Documenting the process {#documentation}

Our class definitions partly extend the S3 classes of haven for reproducibility and for automatic documentation. The haven package is designed to work with data from a single external resource: it does not provide functions to join data frames from different sources, and it does not record the call parameters of its functions. This is not a handicap when the researcher works with a single data file, but in our case, we aim to join data from potentially dozens of SPSS files.

In the first step, we modified read_spss() from haven to record some of the call parameters [not yet all parameters are recorded, but the filename is], most importantly the source file's name. Since we aim to work with trust_in_eu_parlament variables from different sources, for harmonization and validation we must be able to reference their source. We also create a simple id attribute, which may be the name of the survey, or the file name, a DOI, or any unique identifier of the original file. The result is a simple tibble (data.frame-like) object of class [survey missing from documentation*], which has at least an id attribute. This attribute serves only identification and documentation purposes, therefore survey objects work exactly as tibbles in all situations.

In the next step, retroharmonize::harmonize_labels() changes the labels attribute and the base values of the labelled_spss() vector, and at the same time we record the original coding and labelling of the vector. For example, if we created the id ZA7976 for the respective Eurobarometer survey file, then the original labels will be stored in the attribute ZA7976_labels, and the original (character, integer or double) values in ZA7976_values.

When this vector is concatenated with the same vector of observations from the ZA6968 survey, then apart from sharing the same labels and missing value ranges, the new vector will carry ZA7976_labels, ZA6968_labels, ZA7976_values and ZA6968_values. The flexible metadata system of R allows us to extend this further as we concatenate observations from more and more files.
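A base R sketch of this bookkeeping is shown below. The helper names `record_origin()` and `concatenate_items()` are hypothetical, the toy codes and labels are made up, and the real classes handle more than these two attributes per source.

```r
# Hypothetical helper: attach the original coding under id-prefixed names.
record_origin <- function(x, id, original_values, original_labels) {
  attr(x, paste0(id, "_values")) <- original_values
  attr(x, paste0(id, "_labels")) <- original_labels
  x
}

# Hypothetical helper: concatenate two harmonized vectors and carry over
# the provenance attributes of both sources.
concatenate_items <- function(a, b) {
  out <- c(as.vector(a), as.vector(b))
  provenance <- c(attributes(a), attributes(b))
  keep <- grepl("_(values|labels)$", names(provenance))
  attributes(out) <- provenance[keep]
  out
}

# Toy harmonized vectors with made-up original codings.
item_7976 <- record_origin(
  c(1, 2, 99997), "ZA7976",
  original_values = c(1, 2, 3),
  original_labels = c(trust = 1, distrust = 2, DK = 3)
)
item_6968 <- record_origin(
  c(2, 99997), "ZA6968",
  original_values = c(2, 9),
  original_labels = c(trust = 1, distrust = 2, DK = 9)
)

joined <- concatenate_items(item_7976, item_6968)
names(attributes(joined))
```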

Two simple helper functions help the user retrieve this information easily and, in fact, print an entire codebook of the harmonization process.

The document_survey_item function exports the entire history of the harmonization process along with a codebook for the variable. ... the looping version is not yet designed....

Joining the data {#datajoin}

At this stage, we have rich data tables with many metadata attributes that are ready to be joined in R. We have kept all the information that is necessary for the later step, i.e. the actual statistical analysis of the data, so that it can still be carried out in SPSS, Stata, or R.

At this point the new data table is ready for analysis. We provide some methods for exporting them (back) to SPSS or Stata, or flat tables, before turning to analyzing the data in R.

The advantage of defining a new S3 type with rather strict coercion rules is that, after harmonization and validation, it allows for a safe and quick joining of the variables as vectors with vctrs::vec_c() or even the base R c() method, and a fast rbind() method.

In our experience, this is usually not possible with SPSS files imported by haven alone. Haven's two classes, labelled and labelled_spss, cannot be concatenated, because it is not guaranteed that the missing values are defined in the same way. If in one Eurobarometer wave the person creating the final SPSS version did not explicitly define missing values, but in a later wave they were defined, the data join will fail even if the integer coding and the labelling match perfectly. This is the reason for creating our own coercion and recasting method, which is very strict, and defines a missing range in the labelled variables even if there are no missing values present.

However, in our experience, the lack of user-defined missing values does not mean that they should not be defined. There is no reason why certain Eurobarometer surveys indicate do not know answer options as a missing value category while others do not. We implement a subjective solution for their harmonization in eurobarometer. Our implementation can be easily overridden.

Joining entire survey objects, i.e. identified tibbles, would not be feasible, as they may contain any kind of variable, with unknown consequences for the join. To avoid concatenating vector by vector, i.e. by simple questionnaire items, we created methods and functions to join question blocks, i.e. sufficiently similar parts of survey files.

Naturally, surveys can be bound together if the same variables have the same names and, if they are labelled with categories or missing values, the labelling is consistent. This element of the workflow is rather idiosyncratic to the survey program in question. We divided the implementation so that retroharmonize provides the joining and binding infrastructure, while the survey-specific eurobarometer provides the variable name and label harmonization vocabulary and the validation.

Methods for exporting {#exportmethods}

Our labelled_spss_survey is a slightly extended labelled_spss class. It can be exported to SPSS or Stata with some information loss. The lost information is the metadata of the harmonization history, which can be exported separately for documentation purposes.

Unfortunately, the C++ code that writes from haven to sav (behind write_sav) is not implemented for the haven_labelled_spss class and does not record back user-defined NA values. I was not able to amend the code, but I wrote a detailed bug report.

Currently, for testing and demonstration purposes, we must rely on the original GESIS files, or, if we want to show smaller or integrated examples, their rds representations.

See read_rds for the concept, the write_rds pair is not yet implemented, but easy to imagine. My read_spss does what I imagined, but I cannot modify the haven::write_sav in C++ so until it is fixed, we cannot rely on that.

End of comments.

Methods for analyzing in R {#rmethods}

As Joseph Larmarange describes in Introduction to labelled [cf. @R-labelled], working with non-native microdata can follow two logical workflows, of which we chose workflow B. It separates the data processing work from the analysis, and therefore allows re-exporting the processed data (with some information loss) to SPSS or Stata.

In the R language, almost all arithmetic and statistical functions work with two input types: numeric variables (integers or doubles) and factors (sometimes simplified to character). In order to use our harmonized data in R, we must provide methods to bring them to these types in a consistent way --- making sure that factor levels are harmonized and missing values are treated consistently.

Our simple methods, as_factor() (imported from haven) and as_numeric(), do precisely this. While we would not advocate the use of character vectors for statistical analysis, they may be more practical for data visualization, and as_character() will display the value labels as a character vector.
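What these conversions need to achieve can be sketched in base R. The helpers `to_numeric()` and `to_factor()` are hypothetical stand-ins that work on plain vectors; they are not the package's as_numeric() and as_factor() methods, and the codes and labels are toy values.

```r
# Toy harmonized item: codes, shared value labels, and a missing code.
codes     <- c(1, 2, 99997, 1)
labels    <- c(trust = 1, distrust = 2, do_not_know = 99997)
na_values <- 99997

# Hypothetical helper: numeric representation, missing codes become NA_real_.
to_numeric <- function(x, na_values) {
  x <- as.numeric(x)
  x[x %in% na_values] <- NA_real_
  x
}

# Hypothetical helper: factor representation with harmonized levels.
to_factor <- function(x, labels) {
  factor(x, levels = unname(labels), labels = names(labels))
}

numeric_item <- to_numeric(codes, na_values)
factor_item  <- to_factor(codes, labels)

mean(numeric_item, na.rm = TRUE)  # the missing code no longer distorts the mean
table(factor_item)                # every wave shares the same factor levels
```

Using the full label set as factor levels keeps the levels identical across waves, even when a category does not occur in a particular file.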

Our package implements simple arithmetic methods, such as median, quantile, mean, sum, weighted mean, for providing useful summary, printing and documentation methods, and for weighting without metadata loss. These methods are not exported, because we expect that the R user will analyze the data with their numeric or factor representations.

References



antaldaniel/eurobarometer documentation built on Aug. 31, 2020, 10:57 p.m.