This is an Rmd version of the partially filled out .docx template (in the same data-raw/JORS/ subfolder). It is easier to work with this, the .docx should only be used for the final submission.

knitr::opts_chunk$set(echo = TRUE)

Software paper for submission to the Journal of Open Research Software

To complete this template, please replace the blue text with your own. The paper has three main sections: (1) Overview; (2) Availability; (3) Reuse potential.

Please submit the completed paper to: editor.jors@ubiquitypress.com

Overview

Title

The title of the software paper should focus on the software, e.g. “Text mining software from the X project”. If the software is closely linked to a specific research paper, then “Software from Paper Title” is appropriate. The title should be factual, relating to the functionality of the software and the area it relates to rather than making claims about the software, e.g. “Easy-to-use”.

Paper Authors

  1. Antal, Dániel; (Lead/corresponding author first)
  2. Last name, first name; etc.

Paper Author Roles and Affiliations

  1. First author role and affiliation
  2. Second author role and affiliation etc.

Abstract

A short (ca. 100 word) summary of the software being described: what problem the software addresses, how it was implemented and architected, where it is stored, and its reuse potential.

Survey data harmonization refers to procedures that improve the data comparability or the inferential capacity of multiple surveys conducted in different periods of time, or in different geographical locations, potentially using different languages. The retroharmonize package support various data processing, documentation, file/type conversion aspects of various survey harmonization workflows.

Keywords

survey data; survey harmonization; open data Keywords should make it easy to identify who and what the software will be useful for.

Introduction

An overview of the software, how it was produced, and the research for which it has been used, including references to relevant research articles. A short comparison with software which implements similar functionality should be included in this section.

Surveys, i.e., systematic primary observation and data collections are important data sources of both social and natural sciences. Survey data harmonization refers to procedures that improve the data comparability or the inferential capacity of multiple surveys conducted in different periods of time, or in different geographical locations, using different languages. Some surveys are intended to harmonized, and ex ante harmonized with using a similar survey design, including questionnaire, others can potentially be ex post or recycled harmonized if they employed sufficiently similar methodology and asked questions and coded in a similar way. In our experience, data output from survey programs such as Eurobarometer, which is designed to be harmonized and has been running for almost 50 years require a very significant ex post or retrospective harmonization effort to yield harmonized output (such as comparable mean or median statistics, comparable indicators derived from the survey data.)

The use of surveys is so widespread in various scientific domains that survey harmonization has no single recognized interdisciplinary methodology or even terminology. The retroharmonize R package support various data processing, documentation, file/type conversion aspects of survey output harmonization. Our approach is retrospective, because we take it granted that they survey has been conducted and the responses were collected into a single, coded data file. Our software has been tested on international policy surveys (xxxx) and Cultural Access and Participation Surveys. Even though these surveys seem to be rather homogenous, their data was collected in about 70 natural languages, over the timespan of about 30 years and stored following different data management principles, and software given in different countries and periods in time. Our approach therefore needed to be sufficiently generalized that we can confidently make the claim about retroharmonize's general applicability in a wide range of social, medical and natural sciences that require harmonized output from surveys that employed standard questionnaires or textbooks.

The first aim of a survey harmonization project is to create a harmonized dataset. The harmonized dataset must join sufficiently similar variables in a way that ....

Our workflow is centerred around reproducibility in re-coding. We have built on previous work that was designed to for single input (such as importing a single survey's data) and amended it in ways that allowed us to build harmonized output.

  1. Importing: We have modified file input functions, particularly from haven:: that are designed to read in one SPSS or STATA file (the most frequently used format for coded survey data), CSV files and saved R objects in RDS. Our inherited survey class from the tidyverse tibble (which is a modernized version of the data frame) contains metadata that can help subsequent harmonization steps. Our inherited labelled_spss_survey extends the methods of haven's labelled_spss, in itself derived from labelled::labelled with checking for the consistency of labels, particularly special value and missing value labels across files. Even the SPSS or STATA files published by xxxxx about the ex ante harmonized Eurobarometer files have serious inconistencies in the coding and labelling of questionniare items across surveys taken in different years (but not within year.) This makes the concatenation of labelled variables, or data frames containing such variables either logically, or often sytanactically impossible.
  2. Concept harmonization: We created functions that xxxx.

  3. Crosswalk: we use a simple crosswalk schema, or crossswalk table to define variable harmonization, variable code and label harmonization, and R type conversion steps in the output harmonization. There are several R packages that are tacking a similar problem.

  4. Type conversion: When the harmonized output is not simple a harmonized dataset, but a harmonized statistic or a harmonized indicator, the harmonized dataset needs further statistical processing. To unleash the vast potential of R's statistical package ecosystems, it is important to harmonize the data into R's base classes, numeric for numerical statistics and factor for categorical statististc. (Often the same data can be given both numeric and factor representation.) Our inherited tidyverse s3 classes, survey and labelled_spss_survey were designed to make these procedures well documented, reproducible, and unambiguous. The most frequently used file format for survey data is SPSS, which uses a code/label representation that, when imported without our added consistency checks, can be coerced to numerical or factor formats that result in serious logical errors. (This is mainly due to the cause that responses which must be translated to NA_real_ in numeric format are always coded as integers in SPSS--and their interger values and labels indicating missigness are not harmonized.)

The R package crosswalkr strongly overlaps with our approach to crosswalk, but critically, it does not work with SPSS files. The data and metadata from SPSS cannot be directly mapped to R's base classes (hence the conversion to labelled, labelled_spss, and our labelled_spss_survey). Our approach to crosswalk basically replicates the same functionality, but adding separate steps for the harmonization of special (missing) labels and selection of type conversion to numeric or factor representation.

The DDI Alliance has released the Structured Data Transformation Language (SDTL) has been designed for standardising an intermediate language for representing data transformation commands. Statistical analysis packages (e.g., SPSS, Stata, SAS, and R) provide similar functionality, but each one has its own proprietary language. Since the first, peer-reviewed CRAN release of retroharmonize, the DDIwR package has been developed, but not yet released on CRAN and not yet fully documented. DDIwR solves similar problems that we solved with the introduction of the inherited s3 class labelled_spss_survey with a far more general and ambitious goal. We foresee that in the future we will create full interoperability with that package, and indirectly with the survey harmonization efforts of the international DDI Alliance.

The software was developed and tested with social sciences surveys using questionnaires, but various other surveying modes (inflation surveys with price scanning, […]) could be harmonized with our approach.

Various R packages support aspects of the workflow of survey harmonization.

Importing data from survey files (containing numeric codes and value labels), harmonizing concepts, then variables, labels, and coding, and eventually bringing them to the same variable types for binding or joining into a single, tidy dataset. The haven package in the tidyverse imports single SPSS and STATA files (which are almost always used for social sciences surveys.) The haven package is in turn builds on the labelled package for using variables that have both numeric and labelled representations. Whilst haven and label work perfectly with single surveys, they do not crosscheck the potential conflicts of conflicting labels, particularly conflicting special and missing value labels across files. We created inherited classes from these packages that create truly unique identifiers across several files and contain methods that prevent type conversion logical or syntax errors with inconsistent labelling.

“Data is only potential information without metadata”. This statement is painfully clear when you work with several surveys, which may contain the measurement of the same concept in differently named variables, held in different file types, using different numerical codes, value labels, and conflicting special characters. Retroharmonize extents the tidyverse packages for consistently mapping the imbedded metadata of various SPSS, STATA or even CSV files, and use this information for systematically change the names of variables, xxxx,xxxx.

There are several R packages that do a similar job, but have a less ambitious aim. Xxxxx., xxxxxx, xxxxx.

Implementation and architecture

How the software was implemented, with details of the architecture where relevant. Use of relevant diagrams is appropriate. Please also describe any variants and associated implementation differences.

Retroharmonize was developed over several years with implementing more and more harmonization tasks in various ex ante harmonized surveys, starting with the European Eurobarometer series, then adding Afrobarometer, Lationbarometro and private surveys. These international survey research programs provide access to their harmonized surveys in “waves”. Usually, they call a way a set of ex ante harmonized surveys (containing the same questionnaire in several languages) in one data collection period. Our added value has been that we further harmonize data among waves (when data is not fully ex ante harmonized and requires ex post or retrospective harmonization.)

Quality control

Detail the level of testing that has been carried out on the code (e.g. unit, functional, load etc.), and in which environments. If not already included in the software documentation, provide details of how a user could quickly understand if the software is working (e.g. providing examples of running the software with sample input and output data).

Retroharmonize was extensively tested on privately conducted surveys, and three large, international, ex ante harmonized survey programs (questionnaire-based social science research aimed for ex post or retrospective harmonization across countries and years): Eurobarometer, Afrobarometer and Latinobarometro.

Our aim was to create a package that can accompany a social scientist working with surveys on a personal computer. We soon realized that working with ex ante harmonized surveys may potentially lead plenty of resources, particularly because the typical file format used for surveys, SPSS, due to its dual data-metadata coding, is not very efficiently imported with tidyverse's haven to R. All important functions were designed to work either with a list of surveys being documented, subsetted, renamed, recoded, or sequentially. These functions can take an optional survey_paths (full path) or survey_path and import_path (directory) input, in which case each task is performed sequentially. The optinal export_path, when given, saves the sequentially intermediate or final outputs with saveRDS as R objects.

Our aim was to create a package that can accompany a social scientist working with surveys on a personal computer. We soon realized that working with ex ante harmonized surveys may potentially lead plenty of resources, particularly because the typical file format used for surveys, SPSS, due to its dual data-metadata coding, is not very efficiently imported with tidyverse's haven to R. All important functions were designed to work either with a list of surveys being documented, subsetted, renamed, recoded, or sequentially. These functions can take an optional survey_paths (full path) or survey_path and import_path (directory) input, in which case each task is performed sequentially. The optinal export_path, when given, saves the sequentially intermediate or final outputs with saveRDS as R objects.

This allows a much faster looping when sufficient memory is present, or a slower looping over files. We also included simple functions for resource planning, and tutorials to show the optimal workflow (usually subsetting of many SPSS files should be done sequentially but the later stages of harmonization can take place in memory.)

For unit testing, we included in the R package three subsets of published Eurobarometer surveys. The package’s unit testing consists of about 130-unit tests made with this real-life survey excepts.

Availability

Operating system

The retroharmonize R package is tested to run on several different operating systems. According to Microsoft, R 3.5.0 is tested and guaranteed to run on the following platforms: Windows® 7.0 SP1 or later, Ubuntu 14.04 or later, CentOS / Red Hat Enterprise Linux 6.5 or later, SUSE Linux Enterprise Server 11 or later, Mac OS X El Capitan (10.11) or later macOS versions.

Programming language

The retroharmonize R package depends on R version 3.5.0 or higher.

Additional system requirements

According to Microsoft, minimum system requirements for R 3.5.0 are 64-bit processor with x86-compatible architecture, 250 MB of free disk space and at least 1 GB of RAM. These requirements are met by most computers sold in the last 10 years.

On more modern R versions, the package is tested to run on a wide variety of operating systems and system configurations, including ARM-based Macs.

Dependencies

The retroharmonize R package depends only on R (version 3.5.0 or greater). The package imports functions from the following packages: R Core packages: methods, stats, utils; tidyverse packages: dplyr (1.0.0 or greater), glue, haven, magrittr, stringr, tibble, tidyr, purrr; R infrastructure (r-lib) packages: fs, here, pillar, rlang, tidyselect, vctrs; and other R packages: assertthat, labelled, snakecase

The retroharmonize R package is practically a very thorough extension of the R tidyverse packages: it depends on haven (and labelled) for working with coded survey files. It uses dplyr, tidyr (and their common, deep level rlang, vctrs) dependencies for variable manipulation within a single survey (preparation for harmonization) and purrr for functional programming task with several surveys.

List of contributors

Please list anyone who helped to create the software (who may also not be an author of this paper), including their roles and affiliations.

Marta Kolcynska () as a survey harmonization expert contributed to the conceptual development of the first documentation, building the first use cases and exploring various survey harmonization workflows that may need a reproducible and computational support.

Software location:

Archive (e.g. institutional repository, general repository) (required – please see instructions on journal website for depositing archive copy of software in a suitable repository) Name: CRAN Persistent identifier: https://CRAN.R-project.org/package=retroharmonize Licence: GPL-3 Publisher: Daniel Antal Version published: 0.2.0 Date published: 02/11/21 Code repository (e.g. SourceForge, GitHub etc.) (required) Name: retroharmonize Identifier: https://github.com/rOpenGov/retroharmonize Licence: GPL-3 Date published: 15/12/21 Emulation environment (if appropriate) Name: The name of the emulation environment Identifier: The identifier (or URI) used by the emulator Licence: Open license under which the software is licensed here Date published: dd/mm/yy

Language

Language of repository, software and supporting files

English

Reuse potential

Please describe in as much detail as possible the ways in which the software could be reused by other researchers both within and outside of your field. This should include the use cases for the software, and also details of how the software might be modified or extended (including how contributors should contact you) if appropriate. Also you must include details of what support mechanisms are in place for this software (even if there is no support).

The retroharmonize R package aims to provide a versatile support for various survey harmonization workflows. Because surveys are so fundamental to quantitative social science research and play an important role in many natural science fields, not to mention commercial applications of market research or pharmaceutical research, the package’s main reuse potential is to be a foundation of further reproducible research software aimed to automate research and harmonization aspects of specific survey programs.

The authors of this package started the development work to be able to harmonize surveys from harmonized data collections of the European Union: namely the Eurobarometer and AES surveys programs. After working with various surveys (also outside these programs) it became clear that retroharmoinze should aim to be a common demoninator to a family of similar software that solves more specific problems. The world’s largest and oldest international public policy survey series, Eurobarometer. This program alone has conducted already thousands of surveys in more than 20 natural languages over more than 40 years, following various documentation, data management, coding practices that were not independent of the software tools available over this long period of time. The first verion of retroharmonize was separated to the retroharmonize and the eurobarometer R packages – retroharmonize providing a more general framework that has been able to serve Eurobarometer’s, Afrobarometer’s and the Arab Barometer’s different needs.

In our view, the retroharmonize package has the potential to become a general supporting software for more specific codes aimed at harmonizing surveys based first on questionnaires, later on different data inputs, such as price scanning, laboratory tests, and other standardized, discrete inputs that are carried out in different locations, with different recording tools, and with different coding (for example, because of natural languages differences, as it is the case in the social science surveys used for the testing of our software.)

Acknowledgements

Please add any relevant acknowledgements to anyone else who supported the project in which the software was created, but did not work directly on the software itself.

Funding statement

There was no funding available for the development of this software.

Competing interests

“The authors declare that they have no competing interests.”

References



antaldaniel/retroharmonize documentation built on Dec. 11, 2023, 10:49 p.m.