In beckerbenj/eatFDZ: Data checking and validation

This vignette illustrates how the Research Data Centre (FDZ) at the Institute for Educational Quality Improvement (IQB) cleans and tidies data for secondary usage purposes. For secondary usage, data is expected to be of high quality, traceable, comprehensible, and anonymized. Furthermore, meaningful metadata should enable secondary users to understand the data. For this purpose, operations such as recoding of identifier variables, emptying variables, changing labels, creating grouping variables, and many more are performed and therefore discussed in this vignette. Overall, the following aspects should be adressed when cleaning data:

Removing all non-ASCII characters
Setting appropriate missing tags
Adding non-informative identifier variables
Converting strings to labeled numeric variables
Anonymization of variables
Adding variable and value labels

knitr::opts_chunk$set(echo = TRUE, eval = TRUE)

The setup for this illustration is the following: ``` {r path}

if necessary, install eatGADS and eatFDZ

install.packages("eatGADS")

remotes::install_github("beckerbenj/eatFDZ")

load eatGADS

library(eatGADS)

path to example data set

sav_path <- system.file("extdata", "example_data.sav", package = "eatFDZ")

`eatGADS` is the R package used by the FDZ at IQB for data cleaning purposes. It is available on [CRAN](https://cran.r-project.org/package=eatGADS) and extensively documented via vignettes (e.g., the [Comprehensive Data Cleaning Guide](https://beckerbenj.github.io/eatGADS/articles/data_cleaning.html)).

For illustrative purposes, a small, artificial SPSS example data set of secondary school students in Berlin is used. The data set is intentionally flawed and messy. The example data set is part of the `eatFDZ` package and installed alongside. 

## Loading the data
```r
dat <- import_spss(sav_path)
str(dat$dat) # for a small overview in the variables of the data set

The data set is loaded via the import_spss() command. The data set contains 14 variables with 10 observations, including demographic variables such as age and sex of the students, their family situation and selected grades.

Note that the data set is imported as a GADSdat object. For detailed explanations on the corresponding object structure and technical details, we refer readers to the eatGADS documentation.

Removing all non-ASCII characters

Non-ASCII characters frequently lead to issues depending on the operating system and statistical software used. If your data contains non-ASCII characters, it might cause compatibility issues or errors when processing the data. fixEncoding() removes non-ASCII characters from a character vector or a GADSdat object.

dat$dat$ID_name
dat <- fixEncoding(dat, input = "other")
dat$dat$ID_name

The non-ASCII character in the ID name has been changed to ae.

Defining missings semi-automatically

A crucial meta data property for SPSS data are missing tags. Sometimes not all missing values are correctly tagged as missing. Via eatGADS, missing tags can be corrected semi-automatically. This can either be done based on wordings of existing value labels (checkMissings()) or based on a value range of existing value labels (checkMissingsByValues()).

checkMissings() tests whether a set of specified regular expressions occurs in the value labels and compares the corresponding missing tags. For instance the variable books contains a value label omitted, which has not been tagged as missing.

extractMeta(dat, vars = c("books")) # usually, we wouldn't know yet that this variable contains these value labels. This is for demonstration purposes to see the before and after effect of the function
missinglabels <- paste("missing", "unknown",
                       "omitted",
                       sep = "|")
dat1 <- checkMissings(dat, missingLabel = missinglabels)
extractMeta(dat1, vars = c("books"))

checkMissings() has now automatically tagged the value -99 with the label omitted as missing (miss).

checkMissingsByValues() checks whether value labels exist for a specified range of numeric values and whether missing tags are provided accordingly:

extractMeta(dat1, vars = c("school")) # usually, we wouldn't know yet that this variable contains these value labels. This is for demonstration purposes to see the before and after effect of the function
dat2 <- checkMissingsByValues(dat1, missingValues = -50:-99)
extractMeta(dat2, vars = c("school"))

The previous function had already tagged two of the missings correctly, but the value label -98 with the value label unclear answer was not used. Now, it is contained in the specified range and therefore tagged as a missing.

Defining missing values by hand

At times, only individual missings need to be tagged. changeMissings() tags missing values as missings by specifying them under value and marks them as miss under missings.

extractMeta(dat2, vars = c("info"))
dat3 <- changeMissings(dat2, varName = "info", value = c("-97", "-98", "-99"), missings = c("miss", "miss", "miss"))
extractMeta(dat3, vars = c("info"))

The values -97, -98 and -99 are now tagged as miss.

Changing the missing tags is also possible for multiple variables at the same time using changeMissings().

extractMeta(dat2, vars = c("info", "career"))
dat3 <- changeMissings(dat2, varName = c("info", "career"), value = c("-97", "-98", "-99"), missings = c("miss", "miss", "miss"))
extractMeta(dat3, vars = c("info", "career"))

Here, the values for both variables were simultaneously tagged as missings.

Recoding IDs

ID variables are recoded so no hidden identifiers such as state, school or even name codes allow conclusions to be drawn about individual persons. To ensure that the IDs remain the same across multiple subsets and describe the same person, they are recoded using a template. First, a path for the template is created.

f <- tempfile()
ID_recode_template <- file.path("f") # use your own file path instead of the tempfile

Then there are two options: Creating a new template (option 1 - for the first subset) or applying an existing template (option 2 - for the following subsets).

Option 1: Recoding IDs with creation of a template as .csv

autoRecode() creates a recoding template for the variable ID, saved as a .csv file, and in the course of this a new variable with the variable suffix _FDZ and an explanatory label suffix.

dat3 <- autoRecode(dat3, var = "ID", var_suffix = "_FDZ", csv_path = ID_recode_template, label_suffix = " (recoded by FDZ)")

Option 2: Recoding IDs with application of a template (e.g. created by a previously prepared, different data set).

Already existing templates can be used for following subsets. read.csv() imports the template, which can be applied with autoRecode()

template <- read.csv(ID_recode_template)
dat3 <- autoRecode(dat3, var = "ID", suffix = "_FDZ", csv_path = ID_recode_template, template = template,
                   label_suffix = "(rekodiert FDZ)") # Apply template and overwrite

removeVars() deletes variables from the data set. In this case, the old ID variable is deleted.

dat3 <- removeVars(dat3, vars = c("ID"))

Converting string variables to numeric variables

For (semi)open variables there are two procedures. If it is possible (e.g., for a small data set) to go through all responses to check whether there are no strong privacy violations or insults included, the variables are converted into numeric ones and then emptied in the Scientific Use Files. Thereby, the information remain accessible on a meta-level, but can no longer be assigned to a person. If there are too many responses to check or privacy cannot be granted, the variable is emptied entirely in the Scientific Use Files. To convert the string variables into numeric ones, all string variables are identified first.

all_types <- do.call(rbind, lapply(namesGADS(dat3), function(nam) data.frame(varName = nam, type = class(dat3$dat[[nam]]))))
all_types[all_types$type == "character", ]

In this case the variables ID_name, info, subjfav and birth were recognized as string variables. In the next step they are converted.

Option 1: Converting all string variables automatically (overwrite variables)

multiChar2fac() converts the string variables identified in advance into numeric variables using a for loop.

charNames <- all_types[all_types$type == "character", "varName"]
for(charName in charNames) {
  dat4 <- multiChar2fac(dat3, vars = charName, var_suffix = "", label_suffix = "")
}

Option 2: Converting string variable individually

multiChar2fac() converts the string variables named in vars individually into numeric variables. If the option var_suffix is used, a new variable is created.

dat4 <- multiChar2fac(dat4, vars = c("ID_name", "info", "subjfav", "birth"), var_suffix = "_FDZ", label_suffix = "")

If the var_suffix option is left blank, the existing variable is overwritten.

dat4 <- multiChar2fac(dat4, vars = c("ID_name", "info", "subjfav", "birth"), var_suffix = "", label_suffix = "")

Emptying variables

Variables that may not even be offered via a remote computing system due to strong data protection concerns are emptied for all data sets. emptyTheseVariables() empties the listed variables and adds an explanation to the variable label on why the information was removed.

extractMeta(dat4, vars = c("info", "ID_name"))
dat4_empty <- c("info",
                "ID_name")
dat5 <- emptyTheseVariables(dat4,
                            vars = dat4_empty,
                            label_suffix = " (Zur Anonymisierung geleert (FDZ))")
extractMeta(dat5, vars = c("info", "ID_name"))

dat4 corresponds to the data set, vars lists the variables to be emptied and label_suffix automatically adds the explanation to the variable label indicating that the data has been emptied for anonymization reasons.

Changing variable and value labels

Variable and value labels help us with the traceability and comprehensibility of a data set. ChangeVarLabels() selects the variable to be changed with varName and assigns the corresponding label with varLabel. ChangeValLabels() works in a similar way, the corresponding value is selected in value and the label for this value is created with valLabel.

Changing variable labels

extractMeta(dat5, vars = c("school"))
dat6 <- changeVarLabels(dat5, varName = "school", varLabel = "School type")
extractMeta(dat6, vars = c("school"))

Changing value labels ``` {r changeValLabels} extractMeta(dat6, vars = c("books")) dat7 <- changeValLabels(dat6, varName = "books", value = 5, valLabel = "201-500 books") extractMeta(dat7, vars = c("books"))

## Forming grouping variables
Small case numbers pose a risk of anonymity. For variables with a number of cases <6, which are not ratings, feelings or opinions and that contain sensitive information about individuals, it is advisable to combine small numbers of cases into meaningful groups. To do so, `cloneVariable()` creates a new _FDZ variable by cloning the variable to be grouped. For `varName` the original variable name is entered, for `new_varName` the variable name including the appendix "_FDZ". The variable label is automatically adjusted with the option `label_suffix`.
`recodeGADS()` replaces the old values (`oldValues`) by new ones (`newValues`) and `changeValLabels()` changes the value labels of the new groupings.
```r
extractMeta(dat7, vars = c("age"))
dat8 <- cloneVariable(dat7, varName = "age", new_varName = "age_FDZ", label_suffix = " (Zur Anonymisierung aggregiert (FDZ))")
dat8 <- recodeGADS(dat8, varName = "age_FDZ", oldValues = c(13, 14, 15, 17),
                  newValues = c(13, 13, 15, 15))
dat8 <- changeValLabels(dat8, varName = "age_FDZ", value = c(13, 15), valLabel = c("13-14", "15-17"))
extractMeta(dat8, vars = c("age_FDZ"))

Groupings for multiple variables can also be performed using a loop. For this the variables are given a temporary name consisting of the variable name and the suffix "_FDZ". varname represents each individual variable, which was summarized before in gr_var. The individual steps (cloneVariable(), recodeGADS(),changeValLabels()) are therefore performed for each of the variables in turn.

extractMeta(dat8, vars = c("grade_math"))
dat9 <- dat8
gr_var <- c("grade_math", "grade_germ", "grade_eng")
for (varname in gr_var) {
  tempnewname <- paste0(varname, "_FDZ")
  dat9 <- cloneVariable(dat9, varName = varname,
                        new_varName = tempnewname, label_suffix = " (Zur Anonymisierung aggregiert (FDZ))")
  dat9 <- recodeGADS(dat9, varName = tempnewname,
                      oldValues = c(1:6),
                      newValues = c(rep(1,4),rep(2,2)))
  dat9 <- changeValLabels(dat9, varName = tempnewname, value = c(1, 2),
                           valLabel = c("passed", "not passed"))
}
extractMeta(dat9, vars = c("grade_math_FDZ"))

Now two groups have been created. The first group includes the students with grades from 1-4, which means they have passed. In the other group, there are the students with grades 5 and 6, which means they have not passed. These groups were created for all three subjects Math, German and English at once.

Creating the version variable

To indicate the state of the data set, each data set is provided with a version variable. It contains the version of the data record and the date of the last change. createVariable() creates this variable whereas relocateVariable() places it at the beginning of the record so it has a consistent, visible place.

version_name <- "Version_v1_2023_11"
dat10 <- createVariable(dat9, varName = version_name)
dat10 <- relocateVariable(dat10, var = version_name, after = NULL)

Changing variable order

Newly created variables are initially placed at the end of a data set. orderLike() rearranges the variables by specifying the variables in their new order.

dat11 <- orderLike(dat10, newOrder = c("Version_v1_2023_11", "ID", "ID_name", "info", "home", "birth", "age", "age_FDZ", "sex", "siblings", "books", "school", "subjfav", "grade_math", "grade_math_FDZ", "grade_germ", "grade_germ_FDZ", "grade_eng", "grade_eng_FDZ", "career"))

Often it is only a matter of individual variables that are supposed to be sorted in, such as newly created grouping variables behind their emptied original variable. relocateVariable() sets the variables after the selected one.

dat11 <- relocateVariable(dat11, var = "age_FDZ", after = "age")

Exporting the data

Once the data set has been cleaned, it can be saved. The export functions write_spss() and write_stata() are offered for this purpose. The file paths must be adapted in each case.

write_spss(dat11, filePath = "filePath.sav")
write_stata(dat11, filePath = "filePath.dta")

beckerbenj/eatFDZ documentation built on June 11, 2025, 6:49 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

beckerbenj/eatFDZ
Data checking and validation

In beckerbenj/eatFDZ: Data checking and validation

if necessary, install eatGADS and eatFDZ

install.packages("eatGADS")

remotes::install_github("beckerbenj/eatFDZ")

load eatGADS

path to example data set

Removing all non-ASCII characters

Defining missings semi-automatically

Defining missing values by hand

Recoding IDs

Option 1: Recoding IDs with creation of a template as .csv

Option 2: Recoding IDs with application of a template (e.g. created by a previously prepared, different data set).

Converting string variables to numeric variables

Option 1: Converting all string variables automatically (overwrite variables)

Option 2: Converting string variable individually

Emptying variables

Changing variable and value labels

Creating the version variable

Changing variable order

Exporting the data

R Package Documentation

Browse R Packages

We want your feedback!

beckerbenj/eatFDZ Data checking and validation

In beckerbenj/eatFDZ: Data checking and validation

if necessary, install eatGADS and eatFDZ

install.packages("eatGADS")

remotes::install_github("beckerbenj/eatFDZ")

load eatGADS

path to example data set

Removing all non-ASCII characters

Defining missings semi-automatically

Defining missing values by hand

Recoding IDs

Option 1: Recoding IDs with creation of a template as .csv

Option 2: Recoding IDs with application of a template (e.g. created by a previously prepared, different data set).

Converting string variables to numeric variables

Option 1: Converting all string variables automatically (overwrite variables)

Option 2: Converting string variable individually

Emptying variables

Changing variable and value labels

Creating the version variable

Changing variable order

Exporting the data

R Package Documentation

Browse R Packages

We want your feedback!

beckerbenj/eatFDZ
Data checking and validation