This vignette illustrates how the Research Data Centre (FDZ) at the Institute for Educational Quality Improvement (IQB) cleans and tidies data for secondary usage purposes. For secondary usage, data is expected to be of high quality, traceable, comprehensible, and anonymized. Furthermore, meaningful metadata should enable secondary users to understand the data. For this purpose, operations such as recoding of identifier variables, emptying variables, changing labels, creating grouping variables, and many more are performed and therefore discussed in this vignette. Overall, the following aspects should be adressed when cleaning data:
knitr::opts_chunk$set(echo = TRUE, eval = TRUE)
The setup for this illustration is the following: ``` {r path}
library(eatGADS)
sav_path <- system.file("extdata", "example_data.sav", package = "eatFDZ")
`eatGADS` is the R package used by the FDZ at IQB for data cleaning purposes. It is available on [CRAN](https://cran.r-project.org/package=eatGADS) and extensively documented via vignettes (e.g., the [Comprehensive Data Cleaning Guide](https://beckerbenj.github.io/eatGADS/articles/data_cleaning.html)). For illustrative purposes, a small, artificial SPSS example data set of secondary school students in Berlin is used. The data set is intentionally flawed and messy. The example data set is part of the `eatFDZ` package and installed alongside. ## Loading the data ```r dat <- import_spss(sav_path) str(dat$dat) # for a small overview in the variables of the data set
The data set is loaded via the import_spss()
command. The data set contains 14 variables with 10 observations, including demographic variables such as age and sex of the students, their family situation and selected grades.
Note that the data set is imported as a GADSdat
object. For detailed explanations on the corresponding object structure and technical details, we refer readers to the eatGADS
documentation.
Non-ASCII characters frequently lead to issues depending on the operating system and statistical software used. If your data contains non-ASCII characters, it might cause compatibility issues or errors when processing the data. fixEncoding()
removes non-ASCII characters from a character vector or a GADSdat
object.
dat$dat$ID_name dat <- fixEncoding(dat, input = "other") dat$dat$ID_name
The non-ASCII character in the ID name has been changed to ae
.
A crucial meta data property for SPSS data are missing tags.
Sometimes not all missing values are correctly tagged as missing. Via eatGADS
, missing tags can be corrected semi-automatically. This can either be done based on wordings of existing value labels (checkMissings()
) or based on a value range of existing value labels (checkMissingsByValues()
).
checkMissings()
tests whether a set of specified regular expressions occurs in the value labels and compares the corresponding missing tags. For instance the variable books
contains a value label omitted
, which has not been tagged as missing.
extractMeta(dat, vars = c("books")) # usually, we wouldn't know yet that this variable contains these value labels. This is for demonstration purposes to see the before and after effect of the function missinglabels <- paste("missing", "unknown", "omitted", sep = "|") dat1 <- checkMissings(dat, missingLabel = missinglabels) extractMeta(dat1, vars = c("books"))
checkMissings()
has now automatically tagged the value -99
with the label omitted
as missing (miss
).
checkMissingsByValues()
checks whether value labels exist for a specified range of numeric values and whether missing tags are provided accordingly:
extractMeta(dat1, vars = c("school")) # usually, we wouldn't know yet that this variable contains these value labels. This is for demonstration purposes to see the before and after effect of the function dat2 <- checkMissingsByValues(dat1, missingValues = -50:-99) extractMeta(dat2, vars = c("school"))
The previous function had already tagged two of the missings correctly, but the value label -98
with the value label unclear answer
was not used. Now, it is contained in the specified range and therefore tagged as a missing.
At times, only individual missings need to be tagged. changeMissings()
tags missing values as missings by specifying them under value
and marks them as miss
under missings.
extractMeta(dat2, vars = c("info")) dat3 <- changeMissings(dat2, varName = "info", value = c("-97", "-98", "-99"), missings = c("miss", "miss", "miss")) extractMeta(dat3, vars = c("info"))
The values -97
, -98
and -99
are now tagged as miss
.
Changing the missing tags is also possible for multiple variables at the same time using changeMissings()
.
extractMeta(dat2, vars = c("info", "career")) dat3 <- changeMissings(dat2, varName = c("info", "career"), value = c("-97", "-98", "-99"), missings = c("miss", "miss", "miss")) extractMeta(dat3, vars = c("info", "career"))
Here, the values for both variables were simultaneously tagged as missings.
ID variables are recoded so no hidden identifiers such as state, school or even name codes allow conclusions to be drawn about individual persons. To ensure that the IDs remain the same across multiple subsets and describe the same person, they are recoded using a template. First, a path for the template is created.
f <- tempfile() ID_recode_template <- file.path("f") # use your own file path instead of the tempfile
Then there are two options: Creating a new template (option 1 - for the first subset) or applying an existing template (option 2 - for the following subsets).
autoRecode()
creates a recoding template for the variable ID
, saved as a .csv file, and in the course of this a new variable with the variable suffix _FDZ
and an explanatory label suffix.
dat3 <- autoRecode(dat3, var = "ID", var_suffix = "_FDZ", csv_path = ID_recode_template, label_suffix = " (recoded by FDZ)")
Already existing templates can be used for following subsets. read.csv()
imports the template, which can be applied with autoRecode()
template <- read.csv(ID_recode_template) dat3 <- autoRecode(dat3, var = "ID", suffix = "_FDZ", csv_path = ID_recode_template, template = template, label_suffix = "(rekodiert FDZ)") # Apply template and overwrite
removeVars()
deletes variables from the data set. In this case, the old ID variable is deleted.
dat3 <- removeVars(dat3, vars = c("ID"))
For (semi)open variables there are two procedures. If it is possible (e.g., for a small data set) to go through all responses to check whether there are no strong privacy violations or insults included, the variables are converted into numeric ones and then emptied in the Scientific Use Files. Thereby, the information remain accessible on a meta-level, but can no longer be assigned to a person. If there are too many responses to check or privacy cannot be granted, the variable is emptied entirely in the Scientific Use Files. To convert the string variables into numeric ones, all string variables are identified first.
all_types <- do.call(rbind, lapply(namesGADS(dat3), function(nam) data.frame(varName = nam, type = class(dat3$dat[[nam]])))) all_types[all_types$type == "character", ]
In this case the variables ID_name
, info
, subjfav
and birth
were recognized as string variables. In the next step they are converted.
multiChar2fac()
converts the string variables identified in advance into numeric variables using a for loop.
charNames <- all_types[all_types$type == "character", "varName"] for(charName in charNames) { dat4 <- multiChar2fac(dat3, vars = charName, var_suffix = "", label_suffix = "") }
multiChar2fac()
converts the string variables named in vars
individually into numeric variables. If the option var_suffix
is used, a new variable is created.
dat4 <- multiChar2fac(dat4, vars = c("ID_name", "info", "subjfav", "birth"), var_suffix = "_FDZ", label_suffix = "")
If the var_suffix
option is left blank, the existing variable is overwritten.
dat4 <- multiChar2fac(dat4, vars = c("ID_name", "info", "subjfav", "birth"), var_suffix = "", label_suffix = "")
Variables that may not even be offered via a remote computing system due to strong data protection concerns are emptied for all data sets. emptyTheseVariables()
empties the listed variables and adds an explanation to the variable label on why the information was removed.
extractMeta(dat4, vars = c("info", "ID_name")) dat4_empty <- c("info", "ID_name") dat5 <- emptyTheseVariables(dat4, vars = dat4_empty, label_suffix = " (Zur Anonymisierung geleert (FDZ))") extractMeta(dat5, vars = c("info", "ID_name"))
dat4
corresponds to the data set, vars
lists the variables to be emptied and label_suffix
automatically adds the explanation to the variable label indicating that the data has been emptied for anonymization reasons.
Variable and value labels help us with the traceability and comprehensibility of a data set. ChangeVarLabels()
selects the variable to be changed with varName
and assigns the corresponding label with varLabel
. ChangeValLabels()
works in a similar way, the corresponding value is selected in value
and the label for this value is created with valLabel
.
extractMeta(dat5, vars = c("school")) dat6 <- changeVarLabels(dat5, varName = "school", varLabel = "School type") extractMeta(dat6, vars = c("school"))
## Forming grouping variables Small case numbers pose a risk of anonymity. For variables with a number of cases <6, which are not ratings, feelings or opinions and that contain sensitive information about individuals, it is advisable to combine small numbers of cases into meaningful groups. To do so, `cloneVariable()` creates a new _FDZ variable by cloning the variable to be grouped. For `varName` the original variable name is entered, for `new_varName` the variable name including the appendix "_FDZ". The variable label is automatically adjusted with the option `label_suffix`. `recodeGADS()` replaces the old values (`oldValues`) by new ones (`newValues`) and `changeValLabels()` changes the value labels of the new groupings. ```r extractMeta(dat7, vars = c("age")) dat8 <- cloneVariable(dat7, varName = "age", new_varName = "age_FDZ", label_suffix = " (Zur Anonymisierung aggregiert (FDZ))") dat8 <- recodeGADS(dat8, varName = "age_FDZ", oldValues = c(13, 14, 15, 17), newValues = c(13, 13, 15, 15)) dat8 <- changeValLabels(dat8, varName = "age_FDZ", value = c(13, 15), valLabel = c("13-14", "15-17")) extractMeta(dat8, vars = c("age_FDZ"))
Groupings for multiple variables can also be performed using a loop. For this the variables are given a temporary name consisting of the variable name and the suffix "_FDZ". varname
represents each individual variable, which was summarized before in gr_var
. The individual steps (cloneVariable()
, recodeGADS()
,changeValLabels()
) are therefore performed for each of the variables in turn.
extractMeta(dat8, vars = c("grade_math")) dat9 <- dat8 gr_var <- c("grade_math", "grade_germ", "grade_eng") for (varname in gr_var) { tempnewname <- paste0(varname, "_FDZ") dat9 <- cloneVariable(dat9, varName = varname, new_varName = tempnewname, label_suffix = " (Zur Anonymisierung aggregiert (FDZ))") dat9 <- recodeGADS(dat9, varName = tempnewname, oldValues = c(1:6), newValues = c(rep(1,4),rep(2,2))) dat9 <- changeValLabels(dat9, varName = tempnewname, value = c(1, 2), valLabel = c("passed", "not passed")) } extractMeta(dat9, vars = c("grade_math_FDZ"))
Now two groups have been created. The first group includes the students with grades from 1-4, which means they have passed. In the other group, there are the students with grades 5 and 6, which means they have not passed. These groups were created for all three subjects Math, German and English at once.
To indicate the state of the data set, each data set is provided with a version variable. It contains the version of the data record and the date of the last change. createVariable()
creates this variable whereas relocateVariable()
places it at the beginning of the record so it has a consistent, visible place.
version_name <- "Version_v1_2023_11" dat10 <- createVariable(dat9, varName = version_name) dat10 <- relocateVariable(dat10, var = version_name, after = NULL)
Newly created variables are initially placed at the end of a data set. orderLike()
rearranges the variables by specifying the variables in their new order.
dat11 <- orderLike(dat10, newOrder = c("Version_v1_2023_11", "ID", "ID_name", "info", "home", "birth", "age", "age_FDZ", "sex", "siblings", "books", "school", "subjfav", "grade_math", "grade_math_FDZ", "grade_germ", "grade_germ_FDZ", "grade_eng", "grade_eng_FDZ", "career"))
Often it is only a matter of individual variables that are supposed to be sorted in, such as newly created grouping variables behind their emptied original variable. relocateVariable()
sets the variables after the selected one.
dat11 <- relocateVariable(dat11, var = "age_FDZ", after = "age")
Once the data set has been cleaned, it can be saved. The export functions write_spss()
and write_stata()
are offered for this purpose. The file paths must be adapted in each case.
write_spss(dat11, filePath = "filePath.sav") write_stata(dat11, filePath = "filePath.dta")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.