knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(retroharmonize)
Survey data harmonization refers to procedures that improve the comparability or the inferential capacity of data from multiple surveys. Ex ante survey harmonization refers to planning and design steps that make sure that questionnaires not yet answered can be better compared, or that the data derived from them can be joined and integrated. Such procedures include the harmonization of the questionnaire, the harmonization of the sample design, and other aspects of carrying out multiple surveys. Ex post, or retrospective, harmonization refers to procedures applied to data that has already been derived from surveys, i.e., surveys that have already been carried out.
Naturally, better ex ante harmonization makes eventual data integration or comparison easier; yet we can often still retrospectively harmonize survey data that was not carefully pre-harmonized before respondents answered the questionnaire items.
Our aim with the retroharmonize R package is to support a reproducible research workflow for the important computational aspects of retrospective survey harmonization.
Let's start with a very simple example.
library(labelled)

survey_1 <- data.frame(
  sex = labelled(c(1, 1, 0, NA_real_), c(Male = 1, Female = 0))
)
attr(survey_1, "id") <- "Survey 1"

survey_2 <- data.frame(
  gender = labelled(c(1, 3, 9, 1, 2),
                    c(male = 1, female = 2, other = 3, declined = 9))
)
attr(survey_2, "id") <- "Survey 2"

library(dplyr, quietly = TRUE)

survey_1 %>%
  mutate(
    sex_numeric = as_numeric(.data$sex),
    sex_factor  = as_factor(.data$sex)
  )
The order of steps in a survey harmonization workflow is flexible; even the same researcher is likely to choose a different workflow for a small, simple harmonization task than for a complex one.
The data science aspect of a successful survey harmonization task is the creation of a consistent data frame that contains harmonized information from multiple surveys. In practice, this means that questionnaire items are mapped to variables with a consistent numerical coding, consistent descriptive metadata (variable and value labels), and a consistent handling of missing and special values. This can be a very laborious task when surveys are conducted in different years, are saved in different file formats with different metadata structures, handle missing and special values differently, and carry metadata with different natural language descriptions or spellings.
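To make the point about special values concrete, here is a minimal sketch of one way to mark a code such as 9 (declined) as missing. It uses only the labelled package, not the retroharmonize harmonization functions, and it rebuilds the Survey 2 toy variable so that it runs on its own; the object names gender_spss and gender_clean are introduced here purely for illustration.

```r
library(labelled)

# Survey 2 toy variable, with 9 declared as a user-defined missing value.
gender_spss <- labelled_spss(
  c(1, 3, 9, 1, 2),
  labels    = c(male = 1, female = 2, other = 3, declined = 9),
  na_values = 9
)

# User-defined missing values are reported as missing by is.na() ...
declined_flag <- is.na(gender_spss)

# ... and can be converted to regular NA values before analysis.
gender_clean <- user_na_to_na(gender_spss)
n_missing    <- sum(is.na(gender_clean))
```

Declaring the special code once, in the metadata, keeps the original value available while still letting numeric summaries skip it.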
Survey 1 labels the sex of respondents as Male and Female, and has cases that are neither Male nor Female, but we do not know why.
survey_2 %>%
  mutate(
    gender_numeric = as_numeric(.data$gender),
    gender_factor  = as_factor(.data$gender)
  )
Survey 2 records gender, which contains the same information as sex in Survey 1 (Male and Female), but allows people to identify as Other, and labels cases when people decline to identify with any of these three categories.
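Before harmonizing, it can help to compare the value labels of the two variables side by side. The following is a minimal sketch using val_labels() from the labelled package; the toy variables are rebuilt so that the snippet runs on its own, and the object names (labels_1, labels_2, code_comparison) are illustrative only.

```r
library(labelled)

# Same toy data as above, rebuilt so the snippet is self-contained.
sex    <- labelled(c(1, 1, 0, NA_real_), c(Male = 1, Female = 0))
gender <- labelled(c(1, 3, 9, 1, 2),
                   c(male = 1, female = 2, other = 3, declined = 9))

# val_labels() returns the named vector of codes attached to a variable.
labels_1 <- val_labels(sex)
labels_2 <- val_labels(gender)

# Normalize the spelling of the labels before comparing them.
names(labels_1) <- tolower(names(labels_1))

# Labels present in both surveys, and whether their codes agree.
shared <- intersect(names(labels_1), names(labels_2))
code_comparison <- data.frame(
  label    = shared,
  survey_1 = unname(labels_1[shared]),
  survey_2 = unname(labels_2[shared])
)
code_comparison$consistent <-
  code_comparison$survey_1 == code_comparison$survey_2

code_comparison
```

The comparison flags female as inconsistent, because it is coded 0 in Survey 1 and 2 in Survey 2, while male is coded 1 in both.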
In practice, you want to end up with the following joined representation of your survey:
survey_joined <- data.frame(
  id     = c(1, 2, 3, 4, 1, 2, 3, 4, 5),
  survey = c(rep(1, 4), rep(2, 5)),
  gender = labelled(c(1, 1, 0, 9, 1, 3, 9, 1, 0),
                    c(male = 1, female = 0, other = 3, declined = 9))
)

survey_joined %>%
  mutate(
    id             = paste0("survey_", .data$survey, "_", .data$id),
    gender_numeric = c(1, 1, 0, NA_real_, 1, 3, NA_real_, 1, 0),
    gender_factor  = as_factor(.data$gender),
    is_female      = ifelse(.data$gender_numeric == 0, 1, 0)
  )
- We want to join survey_1 with survey_2, or we want to concatenate survey_1$sex with survey_2$gender.
- survey_1$sex may come with a variable label something like SEX OF RESPONDENT, and survey_2$gender may be labelled as GENDER IDENTIFICATION. This label should be harmonized to Sex or gender of the respondent.
- survey_2$gender coded with a numeric 2 must be changed to a numeric 0.
- survey_1$sex Female respondents and survey_2$gender female respondents will be consistently labelled as female.
- survey_1$sex and survey_2$gender can be technically concatenated, but before harmonization this will create logical errors, because females will be coded either with 0 or with 2. The as_numeric() and as_factor() methods of our labelled_spss_survey class handle consistency issues.
- The survey itself is a data frame that inherits from data.frame(). It contains various descriptive metadata about the survey among its attributes.

Joining the unharmonized datasets results in the following data frame.
library(dplyr)

survey_1 %>%
  mutate(
    survey      = 1,
    sex_numeric = as_numeric(.data$sex),
    sex_factor  = as_factor(.data$sex)
  ) %>%
  full_join(
    survey_2 %>%
      mutate(
        survey         = 2,
        gender_numeric = as_numeric(.data$gender),
        gender_factor  = as_factor(.data$gender)
      )
  )
Performing only variable harmonization yields a data frame that has the correct dimensions, but that is not usable for statistical analysis.
library(dplyr)

survey_var_harmonized <- survey_1 %>%
  # rename() uses tidy selection, so the column is named directly
  rename(gender = sex) %>%
  mutate(
    survey         = 1,
    gender_numeric = as_numeric(.data$gender),
    gender_factor  = as_factor(.data$gender)
  ) %>%
  full_join(
    survey_2 %>%
      mutate(
        survey         = 2,
        gender_numeric = as_numeric(.data$gender),
        gender_factor  = as_factor(.data$gender)
      ),
    by = c("gender", "survey", "gender_numeric", "gender_factor")
  )
Apart from the simple, descriptive variable that identifies the survey, none of the descriptive statistics are meaningful.
summary(survey_var_harmonized)
The variable labels must be harmonized for a successful factor representation. The numerical coding must be harmonized, and the missing cases must be consistently handled to achieve any useful numerical representation.
survey_joined %>%
  mutate(
    id             = paste0("survey_", .data$survey, "_", .data$id),
    gender_numeric = c(1, 1, 0, NA_real_, 1, 3, NA_real_, 1, 0),
    gender_factor  = as_factor(.data$gender),
    female_ratio   = ifelse(.data$gender_numeric == 0, 1, 0)
  ) %>%
  summary()
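The harmonized codes in survey_joined were typed in by hand. As a rough sketch of how the same recoding could be derived from the value labels, the snippet below uses only labelled and base R. The helper harmonize_codes() and the target vector are hypothetical names introduced for illustration and are not part of retroharmonize; following the joined example, the unexplained missing case in Survey 1 is assumed to receive the declined code 9.

```r
library(labelled)

# Toy variables from the two surveys, rebuilt so the snippet is self-contained.
sex    <- labelled(c(1, 1, 0, NA_real_), c(Male = 1, Female = 0))
gender <- labelled(c(1, 3, 9, 1, 2),
                   c(male = 1, female = 2, other = 3, declined = 9))

# Target coding shared by both surveys (assumption: same as survey_joined).
target <- c(male = 1, female = 0, other = 3, declined = 9)

# Hypothetical helper: translate old codes to target codes via the labels.
harmonize_codes <- function(x, target) {
  old <- val_labels(x)
  names(old) <- tolower(names(old))           # harmonize label spelling
  codes <- as.vector(unclass(x))              # plain numeric codes
  label_of_code <- names(old)[match(codes, old)]
  new_codes <- unname(target[label_of_code])  # look up the target code
  new_codes[is.na(codes)] <- target[["declined"]]
  new_codes
}

harmonized_codes <- c(harmonize_codes(sex, target),
                      harmonize_codes(gender, target))
gender_harmonized <- labelled(harmonized_codes, target)

# Numeric representation: the special code 9 is treated as missing.
gender_numeric <- ifelse(harmonized_codes == 9, NA_real_, harmonized_codes)
```

Recoding through the labels rather than through the raw numbers means that the female code 2 of Survey 2 and the female code 0 of Survey 1 end up with the same value without hard-coding either one.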
The data importing functions make sure that survey data and metadata are carefully translated to R data classes and variable types.
The metadata functions help the analysis, normalization and joining of the metadata aspects (variable and value labels, original variable names, unique identifiers) across surveys.
Harmonization functions help the harmonization of responses to questionnaire items, i.e. making sure that coded values, the labelling of values, and missing data are handled consistently across multiple surveys.
Our package was tested on multiple international, harmonized surveys, particularly the Eurobarometer, the Afrobarometer and the Arab Barometer survey programs. Different users and different tasks call for different workflows, so we created a number of helper functions to assist various workflows.