library(rmarkdown)
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

The cell-key method (CKM) based random noise approach is a post-tabular Statistical Disclosure Control (SDC) technique. Random noise will be added to each table cell, according to some noise probability distribution - also referred to as perturbation table - and a mechanism to draw from that noise distribution in the so-called lookup step. The perturbation table consists of transition probabilities that describe the probabilities of transitioning from one state (e.g. a given original frequency count) to another (e.g. target frequency count). In other words, they are referred to as the probabilities of the random noises.

The goal of this package is to produce perturbation tables that can be used for applying random noise to statistical tables (also described as "to perturb statistical tables") by SDC tools like the R-package cellKey or the software Tau-Argus.

Features of the package

We now briefly mention the features of the ptable package before we actually show how to create perturbation tables in the following sections.

Main features

As suggested in @marley2011 and illustrated in @giessing2016 the package implements a maximum entropy approach to compute transition probabilities that guarantee certain characteristics for the random noise:

Graphical User interface (GUI)

In the ptable package there is also a shiny app for first time users and visual-style learners. The shiny app provides a visualization mode using the function ptable() and allows users to experiment with different parameter settings while getting direct feedback by means of graphical plots and summaries. In this way, users can visually learn how parameters effect the probability distribution.

Note: The current GUI provides perturbation tables for frequency tables (table = 'cnts') only.

Auxiliary features

The package provides auxiliary functions that allow to process and review the results. Besides the visualization mode of the shiny app, users have also a code-based possibility to create visualization plots directly using the function plot(). The function allows to display a variety of plots (among others the noise probability distribution plot, a perturbation panel, etc.).

Further, pt_export() allows users to write the perturbation table into an external file. The format of the file can be specified such that it can be made available to Tau-Argus or the software SAS.

Frequency Tables {#fretab}

Note: The following examples are illustrated for frequency tables with cell counts. Differences to magnitude tables with continuous/numeric variables are highlighted in the subsequent section.

Examples of the main features

Load the Package

library(ptable)
packageVersion("ptable")

Minimal Parameter Setting and Computation of the Transition Probabilities

Computation of the Transition Probabilities

# note: 
# all parameters except for maximum noise D and variance V have default values
ptab <- create_cnt_ptable(D = 2, V = 1)

The minimum set of parameters that have to be specified are:

The result of the above function is an object of class "ptable" which contains the following slots:

str(ptab)

The most relevant and important slots of the object are:

The Transition Matrix

Let's have a look at the transition matrix (i.e. at the slot @tMatrix) of the object ptab:

# note: to look at a specific slot, just name the object and add the 
# corresponding slot with a leading "@"
ptab@tMatrix

Each row of the transition matrix represents the noise distribution for an original frequency count. The probability that an original frequency count of 1 becomes a 3 (i.e. the 1 is perturbed with a noise value of +2) is r ptab@tMatrix[2,4].

diag(ptab@tMatrix)

The main diagonal shows the preservation probabilities. These are the probabilities that the original frequencies remain unchanged. In this instance, the probability that an original frequency 2 remains unchanged is r diag(ptab@tMatrix)[2+1].

Symmetry - and what does it mean in the context of perturbation tables?

As you may have recognized, the transition matrix has a finite number of rows (that represent original frequency counts) and columns (that represent the target frequency counts).

# let's have a look at the number of different original positive frequency 
# counts that will be treated
params <- slot(ptab, "pParams")
params@ncat
# if this number is added by +1 (for the zero count) we get
params@ncat + 1

The number depends on both, the maximum perturbation D and the threshold value js. The last row of the transition matrix delivers a symmetric distribution. This distribution will be applied to each original frequency larger than this frequency.

# the object @pClasses shows all original frequencies 
# that have their own distribution
ptab@pClasses

# symmetry is achieved for the original frequency count i=...
max(ptab@pClasses)
# or
ptab@pClasses[params@ncat + 1]

In the given example, each frequency count larger than r max(ptab@pClasses) will be perturbed according to the distribution for i=r max(ptab@pClasses). For example, in case of i=3 the distribution reads as follows

a <- ptab@tMatrix[max(ptab@pClasses) + 1, ]
names(a) <- as.character(c(1:5))
a

or in case of i=12326

b <- ptab@tMatrix[max(ptab@pClasses) + 1, ]
names(b) <- as.character(12326 + c(-2:2))
b

Given this symmetry, the transition matrix can be displayed in the reduced form. There is no need to define more rows than up to the case of symmetry.

Output Checking and Troubleshooting

Next, we will check the empirical results that can be used for troubleshooting:

ptab@empResults

The matrix has the following columns:

As can be seen in the example, the preset variance of V=1 does not hold for the original frequency i=1. To fulfill the condition of a fixed variance, we re-run the computation with a different variance parameter. Let's try a variance parameter V=0.9:

ptab_new <- create_cnt_ptable(D = 2, V = 0.9)
ptab_new@empResults

The new computation with the updated variance parameter results in a perturbation table that fulfills all conditions. Of course, the resulting transition matrix now differs from the first one:

ptab_new@tMatrix

The default ptable and the modified ptable

The default ptable

For our SDC tools - the cellKey-package and TauArgus - we use the ptable with noise intervals with result after accumulating the noise probabilities

ptab_new@pTable

The noise intervals in the standard ptable are ordered from -D to D. A modified ptable still has the same properties as the standard ptable but can ensure a higher protection of perturbed frequency tables since the noise probabilities are split and the intervals are rearranged.

The modified ptable

Modify the ptable to have a higher level of protection

mod_ptab_new <-
  modify_cnt_ptable(ptab_new, threshold = 0.2, seed = 123)

The noise probabilities larger than the threshold value will be split. Then, the split noise probabilities are randomly rearranged using a seed. Finally, the intervals of the ptable will be adjusted.

mod_ptab_new@pTable

The goal of the following subsections is to create perturbation tables with extended and afterwards with advanced parameter settings of the function create_cnt_ptable().

Extended Parameter Setting

The remaining parameters (among others js and pstay), are set to default if not specified:

However, let's change and modify them with respect to our needs:

Setting the probability of an original frequency to remain unperturbed

# note:  once again, check the diagonal entries
ptable_e21 <- create_cnt_ptable(D = 4, V = 1, pstay = 0.5)
# note:  once again, check the empirical results or the diagonal entries
ptable_e21@empResults
diag(ptable_e21@tMatrix)

The probability of an original frequency to remain unperturbed (i.e. that the target frequency equals the original frequency) is set to r diag(ptable_e21@tMatrix)[2]*100%. In the advanced parameter settings we will learn how we can assign a specific parameter to each original frequency count.

Important note regarding "false positives": Zeroes will not be perturbed to positive frequency counts since unbiasedness is a main characteristic of the cell-key based method.^[If false positives are an important part of a protection concept, we recommend to use (targeted record swapping)[https://github.com/sdcTools/CensusProtection/] as protection method (alone, or in combination with CKM, as proposed by the SGA "Harmonized protection of census data in the ESS".]

Do not allow for positive cell values equal to or below a specified threshold value

ptable_e22 <- create_cnt_ptable(D = 5, V = 3, js = 2)
ptable_e22@tMatrix

Since the threshold value js is set to 2, we do not allow target frequencies 1 and 2. Hence, the corresponding columns contain zero probabilities.

Important note regarding "false zeroes": Since the possibility of false zeroes is an important strength of a perturbative SDC method, zeroes are always allowed and cannot be excluded from the set of target frequencies.

Further parameters of the function create_cnt_ptable() are explained in the help page.

Advanced Parameter Setting

Several parameters of the function create_cnt_ptable() that we have learned so far, can also be assigned to each original positive frequency count separately using the vector notation in R.

First, we recommend to check how many different entries do we need. Therefore, let's begin with the simple parameter setting

result <- create_cnt_ptable(D = 4, V = 1)
params <- slot(result, "pParams")
# and let's check the number of different positive original frequencies
params@ncat

Let's assume we want to assign different probabilities for pstay, one for each positive original count, we have to specify a vector of length r params@ncat:

# note: so far we have used a scalar for "pstay"", but now we use a vector
result <-
  create_cnt_ptable(D = 4,
                    V = 1,
                    pstay = c(0.5, 0.5, 0.7, 0.8))
params <- slot(result, "pParams")

# let's check the results
result@empResults

The perturbation table we just created is arranged such that 50% of ones and twos, 70% of i=3 and 80% of all frequencies i>3 retain unperturbed.

The same holds for the argument mono.

Examples of the auxiliary features

Analyzing and plotting the results

# note: in the example we are working with the following result
result <-
  create_cnt_ptable(
    D = 2,
    V = 1.08,
    js = 1,
    mono = c(TRUE, TRUE, FALSE, TRUE)
  )

The resulting object result can now be plotted using the function plot(). According to the specification of the argument type, we will receive different plots as shown in the following subsections.

Plotting the Transition Probabilities in a Matrix

# note:  we have to use the argument type for specifying the plot
plot(result, type = "t")

The transition matrix plot is a heat-map plot indicating the probability (the darker the background of a transition probability, the higher the probability). Since the threshold value is set to js=1, we don't allow for ones. That means, the target frequencies 1 are not allowed and, hence, the corresponding column in the transition matrix only consist of zero probabilities.

Plotting the Perturbation Panel

The perturbation table can be graphically represented as a perturbation panel:

plot(result, type = "p")

Remember: For the demonstration we use a perturbation panel with default intervals.

Each bar corresponds to an original frequency count, different colors correspond to different noise values, and the width of the colored sub-bars corresponds to the respective transition probability.

An original frequency count i=1 changes with probability 0.5133 to j = 0. The adjustment equals -1. Accordingly a light blue bar of length 0.5133 is drawn in the panel. With a probability of 0.46 the original i = 1 becomes j = 2, so the perturbation equals +1. A light green bar of length 0.46 is attached. Finally, with a probability of 0.0267 the original i = 1 becomes j = 3, so the perturbation equals +2. Accordingly a dark green bar of length 0.0267 is attached. This way, all probabilities for the row i = 1 of the perturbation table are shown in the panel. The probabilities for i = 2, 3 and 4 are shown in the panel in a similar way.

It can be seen from the perturbation panel where the x-axis shown at the bottom with probability between 0 and 1 corresponds to a cumulative probability distribution.

Plotting the Distribution of the Perturbation Values/Noise

plot(result, type = "d")

The distribution plot of the perturbation/noise values is very helpful for designing the noise and troubleshooting. Each panel of the plot represents an original frequency count. The most important information of the empirical results (see section above) are also given within each panel: M shows the bias, V is the achieved variance (and not the preset variance parameter!). Further, violations are reported in red (e.g. in case of a not fulfilled fixed variance).

Save the plot to a file

The plots shown above can also be saved to a file (e.g. png, pdf):

plot(result, type = "d", file = "graph.pdf")

Exporting the results

To use the perturbation table in SDC tools like Tau-Argus or SAS, users can export the result:

# for Tau-Argus
pt_export(result, file = "ptable.txt", SDCtool = "TauArgus")
# or for SAS
pt_export(result, file = "ptable.txt", SDCtool = "SAS")

Note: For the purpose of production please modify the ptable and don't use the ptable with default intervals. But be aware of the interpretation using the perturbation panel.

Troubleshooting Example: Violation of the fixed variance condition

Let's assume we want to assign different probabilities for pstay as done in an example above, but with very large values:

result <-
  create_cnt_ptable(D = 4,
                    V = 1.5,
                    pstay = c(0.8, 0.9, 0.9, 0.9))
# let's check the results
result@empResults

Since the values for pstay are too large to fulfill the fixed variance condition, the algorithm reduces the values of pstay iteratively. We then receive valid results for i=2,3 and 4. However, in case of i=1 it did not work. The variance differs and we violate the fixed variance condition.

By the way: The same conclusions can be drawn from the distribution plot in a more comfortable way

plot(result, type = "d")

Consequently, we would now have to adjust the variance to V=1.225 and re-run the computation:

result <-
  create_cnt_ptable(D = 4,
                    V = 1.225,
                    pstay = c(0.8, 0.9, 0.9, 0.9))
plot(result, type = "d")

Alternatively, we can change the monotony condition for i=1:

result <- create_cnt_ptable(
  D = 4,
  V = 1.5,
  pstay = c(0.8, 0.9, 0.9, 0.9),
  mono = c(FALSE, TRUE, TRUE, TRUE)
)
plot(result, type = "d")

Magnitude Tables {#magtab}

Simple Setting

Setting the parameters for magnitude tables

If the goal is to create perturbation parameters suitable for numerical variables (magnitude tables), one may use function create_num_ptable() as shown below.

res <- create_num_ptable(D = 2, V = 1, icat = c(1, 2))

However, when creating objects for magnitude tables, users can also set the following arguments:

References



tenderle/ptable documentation built on March 5, 2023, 3:35 a.m.