knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
One way to reduce the identifiability of a data set is to convert a categorical variable to a more aggregated taxonomy (i.e. a many-to-one mapping). Here we refer to such a method as a 'blur', as it causes features to be joined together in such a way as to hide the underlying information.
As an example, consider the ShiftsWorked data:
library(deident)
head(ShiftsWorked)
A simple 'blur' might be to change the taxonomy of 'Shift' e.g. combine 'Day' and 'Night' into a new group 'Working' and ignore the 'Rest' shifts. To do this we define the values we wish to change as a vector, build a pipeline and apply it to the data:
shift_blur <- c("Day" = "Working", "Night" = "Working")
blur_pipe <- ShiftsWorked |>
  add_blur(Shift, blur = shift_blur)
apply_deident(ShiftsWorked, blur_pipe)
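To make the many-to-one mapping concrete, the following is a minimal base R sketch of what a named-vector blur amounts to. It does not use deident: the shifts vector and the apply_blur helper are hypothetical names introduced for illustration, and leaving unlisted values (such as 'Rest') unchanged is an assumption about how the blur treats them, not a statement about the package's internals.

```r
# Hypothetical sample data, not the ShiftsWorked data set
shifts <- c("Day", "Night", "Rest", "Day", "Rest")

# Same named vector as above: names are old levels, values are new levels
shift_blur <- c("Day" = "Working", "Night" = "Working")

# Hypothetical helper: replace values found in the lookup, keep the rest
apply_blur <- function(x, blur) {
  hit <- x %in% names(blur)          # which values have a mapping
  x[hit] <- unname(blur[x[hit]])     # look up the aggregated level by name
  x
}

blurred <- apply_blur(shifts, shift_blur)
print(blurred)
```

In this sketch 'Day' and 'Night' collapse into 'Working', so the two original categories can no longer be told apart in the output.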
category_blur utility

Applying the blur is relatively simple, but constructing it can be complex. Consider the starwars data set supplied by dplyr:
starwars <- dplyr::starwars
head(starwars)
And notably the species variable:
table(starwars$species)
Imagine we wanted to reduce identifiability by aggregating the data into Human vs Non-Human. We could code the vector by hand, but with this many categories it is easy to make a mistake. To aid in designing complex blurs we supply the category_blur utility, which uses regular expressions to define the groups.
human_blur <- category_blur(
  starwars$species,
  "NotHuman" = "^(?!Human)" # Doesn't start with "Human"
)
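The pattern "^(?!Human)" is a negative lookahead: it matches any string that does not begin with "Human". As a rough illustration of the idea, here is a base R sketch that builds a comparable named vector by hand. The species sample and the manual_blur name are hypothetical, and the shape of the result is an assumption based on the hand-written blur earlier, not a description of what category_blur returns internally.

```r
# Hypothetical sample of species values, not the full starwars column
species <- c("Human", "Droid", "Wookiee", "Human", "Gungan")

# Negative lookahead needs PCRE, hence perl = TRUE in base R
pattern <- "^(?!Human)"
is_match <- grepl(pattern, species, perl = TRUE)

# Build a named vector in the same shape as a hand-written blur:
# names are the matched original values, values are the new group
matched <- unique(species[is_match])
manual_blur <- setNames(rep("NotHuman", length(matched)), matched)
print(manual_blur)
```

Note that base R's default regex engine does not support lookaheads, so perl = TRUE is required in this sketch.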
And the vector returned can be passed into a new pipeline as before.
species_pipe <- starwars |>
  add_blur(species, blur = human_blur)
new_starwars <- apply_deident(starwars, species_pipe)
table(new_starwars$species)