In antaldaniel/surveyreader: Data cleaning and integration for survey files

My starting point for using the GESIS archive is the (latest) SPSS file. Using SPSS files programmatically in R has some problems because of the way SPSS represents variables. The package relies on haven to read and handle SPSS files.

Variable names

In SPSS, each variable is described by a short name (a code, like D41) and a variable label or description. Neither is very useful in R. The short names carry no information and vary from file to file. The long labels contain spaces and special characters that cannot be used in program code. So the package creates new variable names, following the naming recommendation of rOpenSci. The variable names are in snake_case, almost always created from the SPSS labels, with a few exceptions.

All special characters that may cause problems in regular expressions (regex) or programmatic approaches are removed or replaced, so, for example, the label REDUCE GREENHOUSE EMISSIONS BY 20% becomes reduce_greenhouse_emission_by_20pct in the R data frame.
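The renaming idea can be sketched in a few lines of base R. This is only an illustration: `label_to_snake()` is a hypothetical helper, not a function of surveyreader, and the package's own rules may differ in detail from this sketch.

```r
# Hypothetical helper (not part of surveyreader): turn an SPSS variable
# label into a snake_case name that is safe to use in R code.
label_to_snake <- function(label) {
  x <- tolower(label)
  # Keep the meaning of the percent sign instead of dropping it
  x <- gsub("%", "pct", x, fixed = TRUE)
  # Collapse every other run of non-alphanumeric characters into one underscore
  x <- gsub("[^a-z0-9]+", "_", x)
  # Trim underscores left over at the start or end
  gsub("^_+|_+$", "", x)
}

new_name <- label_to_snake("REDUCE GREENHOUSE EMISSIONS BY 20%")
print(new_name)
```

The resulting name contains only lower-case letters, digits and underscores, so it can be used as a column name without backticks and matched in regular expressions without escaping.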

Multiple choice variables

Multiple choice questions are represented in several columns, one per answer option. Each column is binary: the answer option was either mentioned (selected) or not mentioned (not selected). To clearly indicate that the variable is a subvariable and in some cases must be analyzed with the rest of the question, all these variable names start with the prefix mc_, for example mc_nationality.
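A minimal base R sketch of this layout follows. The answer options, the respondents' answers and the column names built from the mc_nationality stem are invented for illustration and are not taken from a GESIS file.

```r
# Invented example data: each respondent may select several answer options.
answers <- list(
  c("hungarian", "german"),  # respondent 1 selected two options
  "german",                  # respondent 2 selected one option
  character(0)               # respondent 3 selected none
)
answer_options <- c("hungarian", "german", "other")

# One binary column per answer option, all sharing the mc_ prefix
# (the suffixes after mc_nationality are a hypothetical naming choice).
col_names <- paste0("mc_nationality_", answer_options)
mc <- as.data.frame(lapply(
  setNames(answer_options, col_names),
  function(option) {
    vapply(answers, function(a) as.integer(option %in% a), integer(1))
  }
))

print(mc)
```

Because the columns share a prefix, they can be selected together, for instance with `mc[, startsWith(names(mc), "mc_")]`, when the question has to be analyzed as a whole.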

Class representation

Haven provides a useful class for SPSS variables, which contains a value and a value label, such as 1 (yes), 2 (no), 3 (Refusal). This labelled class is, however, not very useful in most statistical applications or visualizations, so the package, after reading the SPSS file into labelled variables, further transforms them into atomic R variables: numeric, factor and in some cases character variables. Numeric variables are always treated as numeric, and non-repeating variables are treated as factors. Repeating variables can be rescaled and releveled in a consistent, uniform way.
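The step from a value plus value label pair to a factor can be illustrated without haven, using a plain numeric vector and a named vector of labels. This is a simplified stand-in for a labelled variable, not the package's own conversion code; with haven itself, `haven::as_factor()` performs a comparable conversion.

```r
# Simplified stand-in for a labelled SPSS variable:
# the stored codes and the value labels that describe them.
values <- c(1, 2, 3, 1)
value_labels <- c(yes = 1, no = 2, Refusal = 3)

# Map each code to its label; the level order follows the SPSS codes.
answer <- factor(values, levels = value_labels, labels = names(value_labels))

print(answer)
print(levels(answer))
```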

Uniform factor levels

To facilitate the use of multiple files in one analysis, the package reads in labelled classes, which may be labelled differently across SPSS files, and creates a uniform description for them. For example, the yes, no, refusal factor levels will always be the same.
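As a rough sketch of what such harmonisation involves, the function below maps differently spelled labels onto one fixed set of levels. Both the function name `harmonise_yes_no()` and the label variants are assumptions made up for this example; they do not come from the package or from a particular GESIS file.

```r
# Hypothetical helper (not part of surveyreader): map label variants
# from different files onto the same three factor levels.
harmonise_yes_no <- function(x) {
  x <- tolower(trimws(as.character(x)))
  out <- ifelse(grepl("^yes", x), "yes",
         ifelse(grepl("^no", x), "no",
         ifelse(grepl("refus", x), "refusal", NA_character_)))
  # Fixed level order, independent of what occurs in the data
  factor(out, levels = c("yes", "no", "refusal"))
}

# Invented label variants, as they might differ between two files
file_a <- c("Yes", "No", "Refusal")
file_b <- c("YES", "REFUSAL (SPONT.)", "NO ")

a <- harmonise_yes_no(file_a)
b <- harmonise_yes_no(file_b)
print(levels(a))
print(levels(b))
```

With identical levels in both files, the variables can be stacked or compared without relabelling.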

Rescaling the variables

To facilitate data integration with other GESIS files and other survey results, a uniform rescaling is offered for all repeating answer options. Depending on the subsequent statistical or visualization task, a numeric, factor or character representation may be desirable for the answers. For each regularly repeating question type there is a wrapper function that helps to rescale the variables in a consistent, uniform way.

* A character representation may be desired when translating to other languages, or in some visualization or printing problems. Either a space or an underscore may be desirable as a separator, so the rescaling functions offer the underscore option (by default, always FALSE).
* A factor representation is the most adequate in most cases.
* In some cases, a numeric representation is more desirable, but it must be kept in mind that, apart from alphanumerical variables, all the categorical variables are ordinal or nominal variables. Nevertheless, especially with binary variables, a binary 1-0 representation makes work easier. Numeric representation also makes the integration of survey data collected in various natural languages easier. For example, the gender description for female respondents in many languages requires special characters, which makes programmatic data integration very tedious.
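The three representations can be sketched with a single wrapper for a mentioned / not mentioned question. The function `rescale_mentioned()` and its `to` argument are hypothetical and only mimic the behaviour described above; the `underscore` argument follows the option named in the text, with the same default of FALSE.

```r
# Hypothetical wrapper (not the package's API): rescale a binary
# "mentioned" / "not mentioned" answer to a chosen representation.
rescale_mentioned <- function(x,
                              to = c("factor", "numeric", "character"),
                              underscore = FALSE) {
  to <- match.arg(to)
  x <- tolower(trimws(as.character(x)))
  # Anything that is neither of the two answers becomes NA
  is_mentioned <- ifelse(x == "mentioned", TRUE,
                  ifelse(x == "not mentioned", FALSE, NA))
  # Binary 1-0 coding for numeric work
  if (to == "numeric") return(as.numeric(is_mentioned))
  lev <- c("mentioned", "not mentioned")
  if (underscore) lev <- gsub(" ", "_", lev, fixed = TRUE)
  out <- ifelse(is_mentioned, lev[1], lev[2])
  if (to == "character") out else factor(out, levels = lev)
}

raw <- c("Mentioned", "Not mentioned", "DK")
as_num <- rescale_mentioned(raw, to = "numeric")
as_chr <- rescale_mentioned(raw, to = "character", underscore = TRUE)
as_fct <- rescale_mentioned(raw)
print(as_num)
print(as_chr)
print(as_fct)
```

Keeping the choice of representation in one argument means the same call can serve modelling (numeric), plotting (factor) and printing or translation (character) without changing how the raw answers are interpreted.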

antaldaniel/surveyreader documentation built on May 16, 2019, 2:29 a.m.
