library(rmarkdown)
knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
The cell-key method (CKM) based random noise approach is a post-tabular Statistical Disclosure Control (SDC) technique. Random noise will be added to each table cell, according to some noise probability distribution - also referred to as perturbation table - and a mechanism to draw from that noise distribution in the so-called lookup step. The perturbation table consists of transition probabilities that describe the probabilities of transitioning from one state (e.g. a given original frequency count) to another (e.g. target frequency count). In other words, they are referred to as the probabilities of the random noises.
The goal of this package is to produce perturbation tables that can be used for applying random noise to statistical tables (also described as "to perturb statistical tables") by SDC tools like the R-package cellKey or the software Tau-Argus.
We now briefly mention the features of the ptable package before we actually show how to create perturbation tables in the following sections.
As suggested in @marley2011 and illustrated in @giessing2016 the package implements a maximum entropy approach to compute transition probabilities that guarantee certain characteristics for the random noise:
The maximum noise (i.e. absolute value of any perturbation) is less than a specified integer value D.
One of the main characteristics is the zero mean or unbiased noise in the sense that the expected bias of each separate original frequency is 0. This important characteristic guarantees unbiased noise independent of the distribution of original frequencies in a hypercube or set of hypercubes.
The perturbations have a fixed noise variance $\sigma ^2$ that determines the spread of the possible noise values.
The perturbations will produce no negative frequency counts.
In the perturbed table will be no positive frequency counts less than a specified threshold value js.
In the ptable package there is also a shiny app for first time users and visual-style learners. The shiny app provides a visualization mode using the function ptable()
and allows users to experiment with different parameter settings while getting direct feedback by means of graphical plots and summaries. In this way, users can visually learn how parameters effect the probability distribution.
Note: The current GUI provides perturbation tables for frequency tables (table = 'cnts'
) only.
The package provides auxiliary functions that allow to process and review the results. Besides the visualization mode of the shiny app, users have also a code-based possibility to create visualization plots directly using the function plot()
. The function allows to display a variety of plots (among others the noise probability distribution plot, a perturbation panel, etc.).
Further, pt_export()
allows users to write the perturbation table into an external file. The format of the file can be specified such that it can be made available to Tau-Argus or the software SAS.
Note: The following examples are illustrated for frequency tables with cell counts. Differences to magnitude tables with continuous/numeric variables are highlighted in the subsequent section.
library(ptable)
packageVersion("ptable")
# note: # all parameters except for maximum noise D and variance V have default values ptab <- create_cnt_ptable(D = 2, V = 1)
The minimum set of parameters that have to be specified are:
D
: the maximum noise/perturbation (a positive integer) andV
: the noise or perturbation variance (a positive integer).The result of the above function is an object of class "ptable" which contains the following slots:
str(ptab)
The most relevant and important slots of the object are:
tMatrix
: A transition matrix that describes the perturbation probabilities from one state (original frequency count) to another (target frequency count) .pTable
: Data table needed for the lookup step of a SDC tool that can apply random noise to statistical tables (e.g. the cellKey package or the software Tau-Argus). However, in the following sections there will be explained, how the tables can be used or exported by auxiliary functions.pParams
: The input parameters that result from the preceding function pt_create_pParams()
.empResults
: A data frame for output checking of the constraints.Let's have a look at the transition matrix (i.e. at the slot @tMatrix
) of the object ptab
:
# note: to look at a specific slot, just name the object and add the # corresponding slot with a leading "@" ptab@tMatrix
Each row of the transition matrix represents the noise distribution for an original frequency count. The probability that an original frequency count of 1 becomes a 3 (i.e. the 1 is perturbed with a noise value of +2) is r ptab@tMatrix[2,4]
.
diag(ptab@tMatrix)
The main diagonal shows the preservation probabilities. These are the probabilities that the original frequencies remain unchanged. In this instance, the probability that an original frequency 2 remains unchanged is r diag(ptab@tMatrix)[2+1]
.
As you may have recognized, the transition matrix has a finite number of rows (that represent original frequency counts) and columns (that represent the target frequency counts).
# let's have a look at the number of different original positive frequency # counts that will be treated params <- slot(ptab, "pParams") params@ncat # if this number is added by +1 (for the zero count) we get params@ncat + 1
The number depends on both, the maximum perturbation D
and the threshold value js
. The last row of the transition matrix delivers a symmetric distribution. This distribution will be applied to each original frequency larger than this frequency.
# the object @pClasses shows all original frequencies # that have their own distribution ptab@pClasses # symmetry is achieved for the original frequency count i=... max(ptab@pClasses) # or ptab@pClasses[params@ncat + 1]
In the given example, each frequency count larger than r max(ptab@pClasses)
will be perturbed according to the distribution
for i=r max(ptab@pClasses)
. For example, in case of i=3 the distribution reads as follows
a <- ptab@tMatrix[max(ptab@pClasses) + 1, ] names(a) <- as.character(c(1:5)) a
or in case of i=12326
b <- ptab@tMatrix[max(ptab@pClasses) + 1, ] names(b) <- as.character(12326 + c(-2:2)) b
Given this symmetry, the transition matrix can be displayed in the reduced form. There is no need to define more rows than up to the case of symmetry.
Next, we will check the empirical results that can be used for troubleshooting:
ptab@empResults
The matrix has the following columns:
i
: indicates the original frequency to which the remaining columns are referred top_mean
: shows the bias of the perturbation of an original frequency (should be zero: unbiasedness)p_var
: shows the noise variance of an original frequency (Note: If p_var differs from the chosen input parameter V
- as it does in the example above - we have a violation of the fixed variance condition. In that case, we have to set a different variance parameter or to change other parameters!)p_sum
: the sum of the transition probabilities for an original frequency count must equal to 1p_stay
: corresponds to the diagonal of the transition matrixiter
: any value other than 1 points out discrepancies (e.g. violation of the fixed variance parameter)^[In case of extended parameter settings, i.e. setting the argument pstay=...
the algorithm tries to reduce the preset pstay-value iteratively in order to fulfill the fixed variance condition.] As can be seen in the example, the preset variance of V=1
does not hold for the original frequency i=1
. To fulfill the condition of a fixed variance, we re-run the computation with a different variance parameter. Let's try a variance parameter V=0.9
:
ptab_new <- create_cnt_ptable(D = 2, V = 0.9) ptab_new@empResults
The new computation with the updated variance parameter results in a perturbation table that fulfills all conditions. Of course, the resulting transition matrix now differs from the first one:
ptab_new@tMatrix
For our SDC tools - the cellKey
-package and TauArgus - we use the ptable with noise intervals with result after accumulating the noise probabilities
ptab_new@pTable
The noise intervals in the standard ptable are ordered from -D to D. A modified ptable still has the same properties as the standard ptable but can ensure a higher protection of perturbed frequency tables since the noise probabilities are split and the intervals are rearranged.
Modify the ptable to have a higher level of protection
mod_ptab_new <- modify_cnt_ptable(ptab_new, threshold = 0.2, seed = 123)
The noise probabilities larger than the threshold value will be split. Then, the split noise probabilities are randomly rearranged using a seed. Finally, the intervals of the ptable will be adjusted.
mod_ptab_new@pTable
The goal of the following subsections is to create perturbation tables with extended and afterwards with advanced parameter settings of the function create_cnt_ptable()
.
The remaining parameters (among others js
and pstay
), are set to default if not specified:
js
: the perturbations will not produce target frequencies equal to or below this threshold value (a non-negative integer; default: js=0
)pstay
: the probability of an original frequency to remain unperturbed (a positive value; default: pstay=NA
no preset probability; this default produces the maximum entropy solution) mono
: monotony condition (logical; default: mono=TRUE
to force monotony). The idea is to produce distributions where transition probabilities decrease monotonously when the distance of the target frequency to the original frequency increases.However, let's change and modify them with respect to our needs:
# note: once again, check the diagonal entries ptable_e21 <- create_cnt_ptable(D = 4, V = 1, pstay = 0.5) # note: once again, check the empirical results or the diagonal entries ptable_e21@empResults diag(ptable_e21@tMatrix)
The probability of an original frequency to remain unperturbed (i.e. that the target frequency equals the original frequency) is set to r diag(ptable_e21@tMatrix)[2]*100
%. In the advanced parameter settings we will learn how we can assign a specific parameter to each original frequency count.
Important note regarding "false positives": Zeroes will not be perturbed to positive frequency counts since unbiasedness is a main characteristic of the cell-key based method.^[If false positives are an important part of a protection concept, we recommend to use (targeted record swapping)[https://github.com/sdcTools/CensusProtection/] as protection method (alone, or in combination with CKM, as proposed by the SGA "Harmonized protection of census data in the ESS".]
ptable_e22 <- create_cnt_ptable(D = 5, V = 3, js = 2) ptable_e22@tMatrix
Since the threshold value js
is set to 2, we do not allow target frequencies 1 and 2. Hence, the corresponding columns contain zero probabilities.
Important note regarding "false zeroes": Since the possibility of false zeroes is an important strength of a perturbative SDC method, zeroes are always allowed and cannot be excluded from the set of target frequencies.
Further parameters of the function create_cnt_ptable()
are explained in the help page.
Several parameters of the function create_cnt_ptable()
that we have learned so far, can also be assigned to each original positive frequency count separately using the vector notation in R.
First, we recommend to check how many different entries do we need. Therefore, let's begin with the simple parameter setting
result <- create_cnt_ptable(D = 4, V = 1) params <- slot(result, "pParams") # and let's check the number of different positive original frequencies params@ncat
Let's assume we want to assign different probabilities for pstay
, one for each positive original count, we have to specify a vector of length r params@ncat
:
# note: so far we have used a scalar for "pstay"", but now we use a vector result <- create_cnt_ptable(D = 4, V = 1, pstay = c(0.5, 0.5, 0.7, 0.8)) params <- slot(result, "pParams") # let's check the results result@empResults
The perturbation table we just created is arranged such that 50% of ones and twos, 70% of i=3 and 80% of all frequencies i>3 retain unperturbed.
The same holds for the argument mono
.
# note: in the example we are working with the following result result <- create_cnt_ptable( D = 2, V = 1.08, js = 1, mono = c(TRUE, TRUE, FALSE, TRUE) )
The resulting object result
can now be plotted using the function plot()
. According to the specification of the argument type
, we will receive different plots as shown in the following subsections.
# note: we have to use the argument type for specifying the plot plot(result, type = "t")
The transition matrix plot is a heat-map plot indicating the probability (the darker the background of a transition probability, the higher the probability). Since the threshold value is set to js=1
, we don't allow for ones. That means, the target frequencies 1 are not allowed and, hence, the corresponding column in the transition matrix only consist of zero probabilities.
The perturbation table can be graphically represented as a perturbation panel:
plot(result, type = "p")
Remember: For the demonstration we use a perturbation panel with default intervals.
Each bar corresponds to an original frequency count, different colors correspond to different noise values, and the width of the colored sub-bars corresponds to the respective transition probability.
An original frequency count i=1 changes with probability 0.5133 to j = 0. The adjustment equals -1. Accordingly a light blue bar of length 0.5133 is drawn in the panel. With a probability of 0.46 the original i = 1 becomes j = 2, so the perturbation equals +1. A light green bar of length 0.46 is attached. Finally, with a probability of 0.0267 the original i = 1 becomes j = 3, so the perturbation equals +2. Accordingly a dark green bar of length 0.0267 is attached. This way, all probabilities for the row i = 1 of the perturbation table are shown in the panel. The probabilities for i = 2, 3 and 4 are shown in the panel in a similar way.
It can be seen from the perturbation panel where the x-axis shown at the bottom with probability between 0 and 1 corresponds to a cumulative probability distribution.
plot(result, type = "d")
The distribution plot of the perturbation/noise values is very helpful for designing the noise and troubleshooting. Each panel of the plot represents an original frequency count. The most important information of the empirical results (see section above) are also given within each panel: M
shows the bias, V
is the achieved variance (and not the preset variance parameter!). Further, violations are reported in red (e.g. in case of a not fulfilled fixed variance).
The plots shown above can also be saved to a file (e.g. png, pdf):
plot(result, type = "d", file = "graph.pdf")
To use the perturbation table in SDC tools like Tau-Argus or SAS, users can export the result:
# for Tau-Argus pt_export(result, file = "ptable.txt", SDCtool = "TauArgus") # or for SAS pt_export(result, file = "ptable.txt", SDCtool = "SAS")
Note: For the purpose of production please modify the ptable and don't use the ptable with default intervals. But be aware of the interpretation using the perturbation panel.
Let's assume we want to assign different probabilities for pstay
as done in an example above, but with very large values:
result <- create_cnt_ptable(D = 4, V = 1.5, pstay = c(0.8, 0.9, 0.9, 0.9)) # let's check the results result@empResults
Since the values for pstay
are too large to fulfill the fixed variance condition, the algorithm reduces the values of pstay
iteratively. We then receive valid results for i=2,3 and 4. However, in case of i=1 it did not work. The variance differs and we violate the fixed variance condition.
By the way: The same conclusions can be drawn from the distribution plot in a more comfortable way
plot(result, type = "d")
Consequently, we would now have to adjust the variance to V=1.225
and re-run the computation:
result <- create_cnt_ptable(D = 4, V = 1.225, pstay = c(0.8, 0.9, 0.9, 0.9)) plot(result, type = "d")
Alternatively, we can change the monotony condition for i=1:
result <- create_cnt_ptable( D = 4, V = 1.5, pstay = c(0.8, 0.9, 0.9, 0.9), mono = c(FALSE, TRUE, TRUE, TRUE) ) plot(result, type = "d")
If the goal is to create perturbation parameters suitable for numerical variables (magnitude tables), one may use function create_num_ptable()
as shown below.
res <- create_num_ptable(D = 2, V = 1, icat = c(1, 2))
However, when creating objects for magnitude tables, users can also set the following arguments:
icat
(mandatory): The categories for the interpolation.type
(optional): Distinction between odd and even numbers (see Giessing/Tent 2019).Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.