knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
Linking survey data with SGICs (Subject Generated Identification-Codes)? Awesome! Just remember, you need to validate those IDs. That's how you get clean data and make sure the link-up goes smoothly.
This vignette shows you:
How to perform plausibility checks on different SGIC components.
How to perform plausibility checks on non-SGIC variables that may serve as additional identifiers.
How to detect duplicate cases using a combination of variables as unique identifiers.
To check the plausibility of ID-related variables in a dataset, trustmebro provides several functions beginning with the prefix inspect. Every inspect function returns a boolean value indicating whether a value has passed or failed the plausibility check.
We'll start by loading trustmebro and dplyr:
library(trustmebro)
library(dplyr)
The survey data we use is the trustmebro::sailor_students dataset. It contains fictional assessment data for students from the Sailor Moon universe.
sailor_students
The variable sgic stores SGICs created by students. Each SGIC is a seven-character string created according to the following instructions:
Characters 1-3 (letters):
First letter of given name (1st character)
Last letter of given name (2nd character)
First letter of family name (3rd character)
Characters 4-7 (digits):
Day of birth (4th and 5th characters)
Month of birth (6th and 7th characters)
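To make the construction rule concrete, here is a minimal sketch that assembles an SGIC from its parts. The helper make_sgic is hypothetical and not part of trustmebro; converting the letters to upper case and zero-padding the day and month are assumptions made for this illustration, and the name and date are made-up example values.

```r
# Hypothetical helper (not part of trustmebro): builds an SGIC from its parts.
make_sgic <- function(given, family, day, month) {
  letters_part <- paste0(
    substr(given, 1, 1),                        # first letter of given name
    substr(given, nchar(given), nchar(given)),  # last letter of given name
    substr(family, 1, 1)                        # first letter of family name
  )
  # Assumption: letters are upper-cased, day and month are zero-padded
  paste0(
    toupper(letters_part),
    sprintf("%02d%02d", as.integer(day), as.integer(month))
  )
}

make_sgic("Usagi", "Tsukino", day = 30, month = 6)
#> [1] "UIT3006"
```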
We can use trustmebro::inspect_characterid to check if the provided SGICs adhere to the expected pattern of three letters followed by four digits. The expected structure can be defined using the regular expression "^[A-Za-z]{3}[0-9]{4}$", which we then pass to the function using the pattern argument. For seamless integration into your data workflow, this function can be combined with dplyr::mutate:
sailor_students %>%
  mutate(structure_check = inspect_characterid(sgic, pattern = "^[A-Za-z]{3}[0-9]{4}$")) %>%
  select(sgic, structure_check)
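To see what the regular expression itself accepts and rejects, it can be tried on a few strings with base R's grepl. This only illustrates the pattern on invented example values; it is not a description of how inspect_characterid works internally.

```r
# Base R illustration of the pattern; the example strings are invented
sgic_pattern <- "^[A-Za-z]{3}[0-9]{4}$"
examples <- c(
  "UIT3006",  # three letters, four digits
  "UI3006",   # only two letters
  "UIT30O6",  # letter "O" where a digit is expected
  "uit3006"   # lower-case letters are allowed by the pattern
)

grepl(sgic_pattern, examples)
#> [1]  TRUE FALSE FALSE  TRUE
```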
We created trustmebro::inspect_characterid with SGICs in mind, but any other string can be checked the same way by supplying a suitable regular expression.
Since the SGIC should end with a date of birth, you can verify the plausibility of this date using trustmebro::inspect_birthdaymonth. This function checks if a string contains exactly four digits representing a valid date of birth. As before, you can combine trustmebro::inspect_birthdaymonth with dplyr::mutate to generate a plausibility check variable:
sailor_students %>%
  mutate(birthdate_check = inspect_birthdaymonth(sgic)) %>%
  select(sgic, birthdate_check)
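The idea behind such a date check can be sketched in base R. The helper check_daymonth below is hypothetical and only approximates the idea; the exact rules applied by inspect_birthdaymonth may differ. The sketch assumes the day comes first, the month second, and that 29 February counts as plausible.

```r
# Hypothetical helper (not part of trustmebro): rough day-month plausibility
check_daymonth <- function(x) {
  digits <- gsub("[^0-9]", "", x)                      # keep the digits only
  four   <- !is.na(digits) & nchar(digits) == 4        # exactly four digits
  day    <- suppressWarnings(as.integer(substr(digits, 1, 2)))
  month  <- suppressWarnings(as.integer(substr(digits, 3, 4)))
  month_ok <- !is.na(month) & month >= 1 & month <= 12
  # Maximum day per month; February is allowed 29 days (assumption)
  max_day <- c(31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
  limit <- max_day[ifelse(month_ok, month, NA_integer_)]
  four & month_ok & !is.na(day) & day >= 1 & !is.na(limit) & day <= limit
}

# Invented examples: valid date, month 13, 30 February, all zeros
check_daymonth(c("UIT3006", "UIT3113", "UIT3002", "UIT0000"))
#> [1]  TRUE FALSE FALSE FALSE
```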
Some SGICs only use the day or the month a person was born. In this case, you can use trustmebro::inspect_birthday or trustmebro::inspect_birthmonth accordingly.
Besides an SGIC, other variables in a given dataset might be used to identify cases. As mentioned above, trustmebro::inspect_characterid can be used for any string that should follow a specific pattern. This package also provides functions for checking data types other than strings.
We can use trustmebro::inspect_numberid to check if a number matches an expected length. In our dataset, school should be a five-digit number. Combined with dplyr::mutate, we can add a plausibility variable for the school number, just as we did before:
sailor_students %>%
  mutate(school_check = inspect_numberid(school, 5)) %>%
  select(school, school_check)
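As a rough base R illustration of what "a number of expected length" means, the hypothetical helper has_n_digits below (not part of trustmebro) tests whether a value is a whole number with exactly n digits. It works on the numeric value rather than on its printed form, and the school numbers are invented. How inspect_numberid treats missing or non-integer values is not covered here; returning FALSE for them is an assumption of this sketch.

```r
# Hypothetical helper (not part of trustmebro): whole number with n digits?
has_n_digits <- function(x, n) {
  # Assumption: missing and non-integer values count as failed checks
  !is.na(x) & x == floor(x) & x >= 10^(n - 1) & x < 10^n
}

has_n_digits(c(12345, 1234, 123456, NA), 5)
#> [1]  TRUE FALSE FALSE FALSE
```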
When non-SGIC variables are used as identifiers, categorical data is often recoded to ensure consistency within a workflow. We can use trustmebro::inspect_valinvec to check if a value exists in a recode map. The recode map should be a named vector, where the names represent the keys. In our dataset, we want to inspect if all values in gender conform to this recode map:
recode_gender <- c(Male = "M", Female = "F")
The function checks if a value is present as a key. Combine it with dplyr::mutate to add a variable that contains the check results:
sailor_students %>%
  mutate(gender_check = inspect_valinvec(gender, recode_gender)) %>%
  select(gender, gender_check)
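The notion of "present as a key" can be illustrated in base R by comparing values against the names of the recode map. This is a sketch on invented values, not the implementation of inspect_valinvec; in particular, treating a missing value as not present is simply how %in% behaves and may differ from the package function.

```r
recode_gender <- c(Male = "M", Female = "F")

# Invented example values, including one unknown category and one NA
gender_values <- c("Male", "Female", "Other", NA)

# TRUE where the value appears among the keys (names) of the recode map
gender_values %in% names(recode_gender)
#> [1]  TRUE  TRUE FALSE FALSE

# Values that pass the check can then be recoded by name lookup
recoded <- unname(recode_gender[gender_values])
```

Values that fail the check come back as NA from the name lookup, which is why it helps to run the check before recoding.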
So far, we've checked if sgic, school and gender contain plausible values. Finally, we want to ensure that these variables, when used together as identifiers, uniquely identify a single case, that is, that no two entries share the same combination of values. trustmebro::find_dupes checks whether the combination of identifiers is unique by adding a has_dupes variable to the dataset. To find duplicates in your data, use it like this:
sailor_students %>%
  find_dupes(school, sgic, gender) %>%
  select(school, sgic, gender, has_dupes)
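For intuition, a comparable flag can be computed in base R with duplicated. The toy data frame and the column name dupe_flag are invented for this sketch, and flagging every member of a duplicated group (rather than only the later occurrences) is an assumption; has_dupes as produced by find_dupes may be defined differently.

```r
# Toy data, invented for this sketch
toy <- data.frame(
  school = c(12345, 12345, 12345, 54321),
  sgic   = c("UIT3006", "UIT3006", "MOA1710", "UIT3006"),
  gender = c("Female", "Female", "Female", "Female")
)

# Flag every row whose identifier combination occurs more than once
keys <- toy[c("school", "sgic", "gender")]
toy$dupe_flag <- duplicated(keys) | duplicated(keys, fromLast = TRUE)

toy$dupe_flag
#> [1]  TRUE  TRUE FALSE FALSE
```

The fourth row shares its sgic and gender with the first two but has a different school, so the combination is still unique.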