Small count frequency table suppression"

options(rmarkdown.html_vignette.check_title = FALSE)
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
htmltables = TRUE
if(htmltables){
  source("GaussKable.R")
  P = function(...) G(fun = SuppressSmallCounts, timevar = "geo", ...)
} else { 
  P = function(...) cat("Formatted table not avalable")
}

Introduction

The GaussSuppression package contains several easy-to-use wrapper functions and in this vignette we will look at the SuppressSmallCounts function. In this function, small frequencies are primary suppressed. Then, as always in this package, secondary suppression is performed using the Gauss method.

We begin by creating datasets to be used below. The first examples are based on dataset_a, which has six rows.

library(GaussSuppression)
dataset <- SSBtoolsData("example1")
dataset_a <- dataset[dataset$year == "2014", -4]
dataset_b <- dataset[dataset$year == "2015", -4]
dataset_a

An initial basic example

In the function description (?SuppressSmallCounts), the only visible parameter is maxN in addition to the parameters considered in the define-tables vignette. In the first example, we use maxN = 1 which means that zeros and ones are primary suppressed.

SuppressSmallCounts(data = dataset_a, 
                    dimVar = c("age", "geo"), 
                    freqVar = "freq", 
                    maxN = 1)

A formatted version of this output is given in Table 1 below. Primary suppressed cells are underlined and labeled in red, while the secondary suppressed cells are labeled in purple.

\

Table 1: dimVar = c("age", "geo"), maxN = 1

P(data = dataset_a, # caption = "Table 1", 
                    dimVar = c("age", "geo"), 
                    freqVar = "freq", 
                    maxN = 1)

\

The same output is obtained if microdata is sent as input as illustrated by de code below.

microdata_a <- SSBtools::MakeMicro(dataset_a, "freq")[-4]
output <- SuppressSmallCounts(data = microdata_a, 
                              dimVar = c("age", "geo"), 
                              maxN = 1)

A related point is that the third row of the table can be omitted (data = dataset_a[-3, ]) since the frequency is zero. When the frequency is zero, there is no underlying microdata. Later in this vignette, we address scenarios where the inclusion of zeros may be important.

An hierarchical table

A more advanced example is obtained by including the variable "eu".

SuppressSmallCounts(data = dataset_a, 
                    dimVar = c("age", "geo", "eu"), 
                    freqVar = "freq", 
                    maxN = 2)

A formatted version of this output:

\

Table 2: dimVar = c("age", "geo", "eu"), maxN = 2

P(data = dataset_a, # caption = "Table 2",  
                    dimVar = c("age", "geo", "eu"), 
                    freqVar = "freq", 
                    maxN = 2)

\

As described in the define-tables vignette hierarchies are here detected automatically. The same output is obtained if we first generate hierarchies by:

dimlists <- SSBtools::FindDimLists(dataset_a[c("age", "geo", "eu")])
dimlists

And thereafter run SuppressSmallCounts with these hierarchies as input:

SuppressSmallCounts(data = dataset_a[c("age", "geo", "freq")], 
                    hierarchies = dimlists, 
                    freqVar = "freq", 
                    maxN = 2)

Using the formula interface

Using the formula interface is one way to achieve fewer cells in the output:

SuppressSmallCounts(data = dataset_a, 
                    formula = ~age:eu + geo, 
                    freqVar = "freq", 
                    maxN = 2)

In the formatted version of this output, blank cells indicate that they are not included in the output.

\

Table 3: formula = ~age:eu + geo, maxN = 2

P(data = dataset_a, # caption = "Table 3", 
                    formula = ~age:eu + geo, 
                    freqVar = "freq", 
                    maxN = 2)

\

About suppression of zeros

By default, zeros are suppressed in order to protect against attribute disclosure in frequency tables. However, there are exceptions. Below are several options for handling exceptions.

Zeros not suppressed

One option is to use protectZeros = FALSE.

SuppressSmallCounts(data = dataset_a, 
                    dimVar = c("age", "geo", "eu"),  
                    freqVar = "freq", 
                    maxN = 4,
                    protectZeros = FALSE)

\

Table 4: dimVar = c("age", "geo", "eu"), maxN = 4, protectZeros = FALSE

P(data = dataset_a, # caption = "Table 4", 
                    dimVar = c("age", "geo", "eu"),  
                    freqVar = "freq", 
                    maxN = 4,
                    protectZeros = FALSE)

\

Another possibility that gives the same output is:

output <- SuppressSmallCounts(data = dataset_a[-3, ], 
                              dimVar = c("age", "geo", "eu"),  
                              freqVar = "freq", 
                              maxN = 4,
                              extend0 = FALSE, 
                              structuralEmpty = TRUE)

Here the zero-frequency row is omitted in the input. By default, the table is automatically extended so that the Gauss algorithm handles zeros correctly. When this is turned off (extend0 = FALSE), a warning with the following text will appear: "Suppressed cells with empty input will not be protected. Extend input data with zeros?". However, with structuralEmpty = TRUE, the "empty zeros" are assumed to represent structural zeros that must not be suppressed. As exemplified a little further below, one can thus handle data with both structural and non-structural zeros.

Secondary suppressed zeros

We can combine protectZeros = FALSE with secondaryZeros = TRUE.

SuppressSmallCounts(data = dataset_a, 
                    dimVar = c("age", "geo", "eu"),  
                    freqVar = "freq", 
                    maxN = 3,
                    protectZeros = FALSE, 
                    secondaryZeros = TRUE)

\

Table 5: dimVar = c("age", "geo", "eu"), maxN = 3,
protectZeros = FALSE, secondaryZeros = TRUE

P(data = dataset_a, # caption = "Table 5",  
                    dimVar = c("age", "geo", "eu"),  
                    freqVar = "freq", 
                    maxN = 3,
                    protectZeros = FALSE, 
                    secondaryZeros = TRUE)

\

Both structural and non-structural zeros

The example below uses dataset_b, which has two zeros.

  dataset_b

Let's assume that the first zero is considered as a structural zero. In order to account for this characteristic, we will exclude this particular zero and retain the other. As a general rule, we will exclude all structural zeros.

SuppressSmallCounts(data = dataset_b[-2,  ], 
                    dimVar = c("age", "geo", "eu"),  
                    freqVar = "freq", 
                    maxN = 2,
                    extend0 = FALSE, 
                    structuralEmpty = TRUE)

\

Table 6: dimVar = c("age", "geo", "eu"), maxN = 2,
extend0 = FALSE, structuralEmpty = TRUE

P(data = dataset_b[-2,  ], # caption = "Table 6", 
                    dimVar = c("age", "geo", "eu"),  
                    freqVar = "freq", 
                    maxN = 2,
                    extend0 = FALSE, 
                    structuralEmpty = TRUE)

\

Now, the data has been processed correctly, the structural zeros will be published while the other zeros are suppressed.

To get the same output with the formula interface, we can use the following code:

SuppressSmallCounts(data = dataset_b[-2,  ], 
                    formula = ~age * (geo + eu),  
                    freqVar = "freq", 
                    maxN = 2,
                    extend0 = FALSE, 
                    structuralEmpty = TRUE,
                    removeEmpty = FALSE)

Please note that in order to include empty cells in the output, you need to set the removeEmpty parameter to FALSE. By default, this parameter is set to TRUE when using the formula interface.

The problem of singletons and zeros

The problem of zeros

When using the standard suppression technique on table dataset_b, many cells are suppressed.

SuppressSmallCounts(data = dataset_b, 
                    dimVar = c("age", "geo", "eu"),  
                    freqVar = "freq", 
                    maxN = 2)

\

Table 7: dimVar = c("age", "geo", "eu"), maxN = 2

P(data = dataset_b, # caption = "Table 7", 
                    dimVar = c("age", "geo", "eu"),  
                    freqVar = "freq", 
                    maxN = 2)

\

The reason for the Spain suppressions is to prevent the disclosure of zeros, which would be easily revealed if young:Spain is not suppressed. In that case the sum of young:Iceland and young:Portugal can easily be calculated to be zero. Since negative frequencies are not possible, the only possibility is two zeros.

The handling of this problem is standard, but it can be turned off by singletonMethod = "none".

The problem of singletons

This problem occurs when protectZeros = FALSE and secondaryZeros = FALSE (default). We now also look at a larger example that uses dataset which has 18 rows.

output <- SuppressSmallCounts(data = dataset, 
                              formula = ~age*geo*year + eu*year,  
                              freqVar = "freq", 
                              maxN = 1, 
                              protectZeros = FALSE)
head(output)

\

Table 8: formula = ~age*geo*year + eu*year, maxN = 1, protectZeros = FALSE

P(data = dataset, # caption = "Table 8",
                    formula = ~age*geo*year + eu*year,
                    freqVar = "freq", 
                    maxN = 1,
                    protectZeros = FALSE)

\

In this output, young:2016:Spain is suppressed due to the standard handling of the singleton problem.

However, by using singletonMethod = "none" in this case, young:2016:Spain will not be suppressed. Then the sum of young:2016:Iceland and young:2016:Portugal can easily be calculated to be two. Since zeros are never suppressed, the only possible values for these two cells are two ones.



Try the GaussSuppression package in your browser

Any scripts or data that you put into this service are public.

GaussSuppression documentation built on Sept. 24, 2024, 5:07 p.m.