Inspect SGICs"

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Linking survey data with SGICs (Subject Generated Identification-Codes)? Awesome! Just remember, you need to validate those IDs. That's how you get clean data and make sure the link-up goes smoothly.

This vignette shows you:

To check the plausibility of ID-related variables in a dataset, trustmebro provides several functions beginning with the prefix inspect. Every inspect-function returns a boolean value, indicating whether a value has passed or failed the plausibility check.

We`ll start by loading trustmebro and dplyr:

library(trustmebro)
library(dplyr)

Data: sailor_students

The survey data we use is the trustmebro::sailor_students dataset. It contains fictional student assessment data from students of the sailor moon universe.

sailor_students

SGIC Plausibility

The variable sgic stores SGICs created by students. Each SGIC is a seven-character string created according to the following instructions:

Characters 1-3 (letters):

Characters 4-7 (digits):

Check Character IDs

We can use trustmebro::inspect_characterid to check if the provided SGICs adhere to the expected pattern of three letters followed by four digits. The expected structure can be defined using the regular expression "^[A-Za-z]{3}[0-9]{4}$", which we can then pass to the function using the pattern = argument. For seamless integration into your data workflow, this function can be conveniently combined with dplyr::mutate:

sailor_students %>% 
  mutate(structure_check = 
           inspect_characterid(
             sgic, pattern = "^[A-Za-z]{3}[0-9]{4}$")) %>%
  select(sgic, structure_check)

We created trustmebro::inspect_characterid with SGICs in mind, but of course, any other non-SGIC strings can also be checked using a specified regular expression.

Check Birthdate-Components

Since the SGIC should end with a date of birth, you can verify the plausibility of this date of birth using trustmebro::inspect_birthdaymonth. This function checks if a string contains exactly four digits representing a valid date of birth. As before, you can combine trustmebro::inspect_birthdaymonth with dplyr::mutate to generate a plausibility check variable:

sailor_students %>% 
  mutate(birthdate_check = 
           inspect_birthdaymonth(sgic)) %>%
  select(sgic, birthdate_check)

Some SGICs only use the single day or month a person was born. In this case, you can use of trustmebro::inspect_birthday or trustmebro::inspect_birthmonth accordingly.

Non-SGIC variables' plausibility

Besides a SGIC, other variables in a given dataset might be used to identify cases. As mentioned above, trustmebro::inspect_characterid can be used for any string that should follow a specific pattern. Furthermore, this package also provides functions for checking other data types beyond strings.

Check Numbers

We can use trustmebro::inspect_numberid to check if a number matches an expected length. In our dataset, school should be a five-digit number. combined with dplyr::mutate, we can add a plausibility variable for the schoolnumber, just as we did before:

sailor_students %>% 
  mutate(school_check = 
           inspect_numberid(school, 5)) %>%
  select(school, school_check)

Check the presence of a value within the recode map

In the process of using non-SGIC variables as identifiers, categorical data is often recoded to ensure consistency within a workflow. We can use trustmebro::inspect_valinvec to check if a value exists in a recode map. The recode map should be a named vector, where the names represent the keys. In our dataset, we want to inspect if all values in gender conform to this recode map:

recode_gender <- c(Male = "M", Female = "F")

The function checks if a value is present as a key. Combine with dplyr::mutate to add a variable that contains the check results:

sailor_students %>% 
  mutate(gender_check = 
           inspect_valinvec(gender, recode_gender)) %>%
  select(gender, gender_check)

Identify Duplicate Cases

So far, we've checked if SGIC, school and gender contain plausible values. Last, we want to ensure that these variables, when used together as identifiers, uniquely identify a single case and that there are no duplicate entries based on these variables. trustmebro::find_dupes checks whether the combination of identifiers is unique by adding a has_dupes variable to the dataset. To find duplicates in your data, use it like this:

sailor_students %>% find_dupes(school, sgic, gender) %>%
  select(school, sgic, gender, has_dupes)


Try the trustmebro package in your browser

Any scripts or data that you put into this service are public.

trustmebro documentation built on June 8, 2025, 11:01 a.m.