Shuffle Example"
In deident: Persistent Data Anonymization Pipeline

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(deident)

While individual variables can often make data personally identifiable, we can often tell quickly if a variable has this risk (e.g. names, social security numbers, etc). The less readily considered situation is when a collection of variables render individuals identifiable.

As an example, consider the starwars data set (borrowed from dplyr):

head(starwars)

Inspection of the data shows species can be a unique identifier (e.g. 'Admiral Ackbar' is the only 'Mon Calamari') so we may consider aggregating species:

starwars |> 
  dplyr::filter(species == "Mon Calamari")

However, while knowing someone is 'Human' does not have the same effect, if we also knew they were from 'Coruscant' and had 'blond' hair (each of which is not uniquely identifiable) if used in combination we reduce the data to a single case:

starwars |> 
  dplyr::filter(species == "Human", 
                homeworld == "Coruscant",
                hair_color == "blond"
                )

Hence, individual columns can contain useful information but we may not wish to disclose the inter-variable correlations. To aid with this, we introduce the shuffling method which performs column wise sampling without replacement:

NB: we set a random seed using set.seed here for reproducibility. We recommend users avoid this step when using the package in production code.

set.seed(101)

shuffle_pipe <- starwars |> 
  add_shuffle(species, homeworld, hair_color)

new_starwars <- apply_deident(starwars, shuffle_pipe)

head(new_starwars)

A Shuffle hence preserves the column summaries, e.g. modal values and distributions, but breaks inter-column behaviours which might lead to identification.

new_starwars |> 
  dplyr::filter(species == "Human", 
                homeworld == "Coruscant"
                )

`Grouped Shuffling`

Clearly there will be situations in which inter-variable dependencies are key to our understanding of the data, and we may wish to preserve the column metrics within strata. Such a situation is foreseen, and 'shuffling' can be performed within a grouped data set as easily as on the whole data:

grouped_shuffle_pipe <- starwars |> 
  add_group(gender) |> 
  add_shuffle(species, homeworld, hair_color) |>
  add_ungroup()

apply_deident(starwars, grouped_shuffle_pipe)