Magnitude table suppression"
In GaussSuppression: Tabular Data Suppression using Gaussian Elimination

options(rmarkdown.html_vignette.check_title = FALSE)
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

htmltables <- TRUE
if (htmltables) {
    source("GaussKable.R")
    source("KableMagnitudeTable.R")
    P <- function(...) G(timevar = "geo", ...)
    M <- function(...) KableMagnitudeTable(..., numVar = "value", timevar = "geo", singletonMethod = "none")
} else {
    P <- function(...) cat("Formatted table not avalable")
    M <- P
}

Introduction

The GaussSuppression package contains several easy-to-use wrapper functions and in this vignette we will look at the SuppressFewContributors and SuppressDominantCells functions. In these functions, primary suppression is based on the number of contributors or by dominance rules. Then, as always in this package, secondary suppression is performed using the Gauss method.

We begin by loading a dataset to be used below.

library(GaussSuppression)
dataset <- SSBtoolsData("magnitude1")
dataset

We can imagine the figures in the variable "value" represent sales value to different sectors from different companies. In the first examples, we will not use the "company" variable, but instead assume that each row represents the contribution of a unique company. Our input data can then be reformatted and illustrated like this:

M(caption = '**Table 1**: Input data with the 20 contributions.',
  dataset, formula = ~sector4:geo-1)

Initial basic examples

Using few contributors primary suppression

In the first example, we use SuppressFewContributors with maxN = 1. This means that cells based on a single contributor are primary suppressed.

SuppressFewContributors(data=dataset, 
                        numVar = "value", 
                        dimVar= c("sector4", "geo"), 
                        maxN=1)

In the output, the number of contributors is in columns nRule and nAll. The two columns are equal under normal usage.

A formatted version of this output is given in Table 2 below. Primary suppressed cells are underlined and labeled in red, while the secondary suppressed cells are labeled in purple.

P(caption = '**Table 2**: Output from `SuppressFewContributors` with `maxN = 1` (number of contributors in parenthesis)',
  data=dataset, 
  numVar = "value", 
  dimVar= c("sector4", "geo"), 
  maxN = 1,
  fun = SuppressFewContributors,
  print_expr = 'paste0(value, " (",nAll ,") ")')

Using dominant cell primary suppression

In the second example, we use SuppressDominantCells with n = 1 and k = 80. This means that aggregates are primary suppressed whenever the largest contribution exceeds 80% of the cell total.

SuppressDominantCells(data=dataset, 
                      numVar = "value", 
                      dimVar= c("sector4", "geo"), 
                      n = 1, k = 80, allDominance = TRUE)

To incorporate the percentage of the two largest contributions in the output, the parameter allDominance = TRUE was utilized. A formatted version of this output is given in Table 3 below.

P(caption = '**Table 3**: Output from `SuppressDominantCells` <br> with `n = 1` and `k = 80`  <br> (percentage from largest contribution in parenthesis)',
  data=dataset, 
  numVar = "value", 
  dimVar= c("sector4", "geo"), 
  n=1, k=80, allDominance = TRUE,
  fun = SuppressDominantCells,
  print_expr = 'paste0(value, " (",round(100*`dominant1`) ,"%) ")')

Note that this table, as well as Table 2, is discussed below in the section on the singleton problem. \ \

An hierarchical table and two dominance rules

Here we use SuppressDominantCells with n = 1:2 and k = c(80, 99). This means that aggregates are primary suppressed whenever the largest contribution exceeds 80% of the cell total or when the two largest contributions exceed 99% of the cell total.

In addition, the example below is made even more advanced by including the variables "sector2" and "eu".

output <- SuppressDominantCells(data=dataset, 
                                numVar = "value", 
                                dimVar= c("sector4", "sector2", "geo", "eu"), 
                                n = 1:2, k = c(80, 99))
head(output)

P(caption = '**Table 4**: Output from `SuppressDominantCells` <br> with `n = 1:2` and `k = c(80, 99)`',
  data=dataset, 
  numVar = "value", 
  dimVar= c("sector4", "sector2", "geo", "eu"), 
  n = 1:2, k = c(80, 99), 
  fun = SuppressDominantCells,
  print_expr = 'value')

As described in the define-tables vignette hierarchies are here detected automatically. The same output is obtained if we first generate hierarchies by:

dimlists <- SSBtools::FindDimLists(dataset[c("sector4", "sector2", "geo", "eu")])
dimlists

And thereafter run SuppressDominantCells with these hierarchies as input:

output <- SuppressDominantCells(data=dataset, 
                                numVar = "value", 
                                hierarchies = dimlists,  
                                n = 1:2, k = c(80, 99))

\ \

Using the formula interface

Using the formula interface is one way to achieve fewer cells in the output. Below we use SuppressFewContributors with maxN = 2. This means that table cells based on one or two contributors are primary suppressed.

output <- SuppressFewContributors(data=dataset, 
                                  numVar = "value", 
                                  formula = ~sector2*geo + sector4*eu, 
                                  maxN=2,
                                  removeEmpty = FALSE)
head(output)
tail(output)

In the formatted version of this output, blank cells indicate that they are not included in the output.

P(caption = '**Table 5**: Output from `SuppressFewContributors` with `maxN = 2` <br> (number of contributors in parenthesis)',
                                  data=dataset, 
                                  numVar = "value", 
                                  formula = ~sector2*geo + sector4*eu, 
                                  maxN=2,
                                  removeEmpty = FALSE,
                                  fun = SuppressFewContributors,
                                  print_expr = 'paste0(value, " (",nAll ,") ")')

Please note that in order to include the three empty cells with no contributors, the removeEmpty parameter was set to FALSE. By default, this parameter is set to TRUE when using the formula interface. \ \

Using the contributorVar parameter (`contributorVar = "company"`)

According to the "company" variable in the data set, there are only four contribution companies (A, B, C and D). We specify this using the contributorVar parameter, which corresponds to a variable within the dataset, in this case, "company". In general, this variable refers to the holding information to be used by the suppression method. When this is taken into account, the primary suppression rules will be applied to data that, within each cell, is aggregated within each contributor. Our example data aggregated in this way is shown below.

M(caption = '**Table 6**: The "value" data aggregated according to hierarchy and contributor',
  dataset, dimVar = c("sector4", "sector2", "geo", "eu"),   contributorVar = "company")

Below we take into account contributor IDs when using few contributors primary suppression.

output <- SuppressFewContributors(data=dataset, 
                                  numVar = "value", 
                                  dimVar = c("sector4", "sector2", "geo", "eu"),
                                  maxN=2,
                                  contributorVar = "company")
head(output)

P(caption = '**Table 7**: Output from `SuppressFewContributors` with `maxN = 2` and with `contributorVar = "company"` (number contributors in parenthesis)',
                                  data=dataset, 
                                  numVar = "value", 
                                  dimVar = c("sector4", "sector2", "geo", "eu"), 
                                  maxN=2,
                                  contributorVar = "company",
                                  fun = SuppressFewContributors,
                                  print_expr = 'paste0(value, " (",nAll ,") ")')

Below we take into account contributor IDs when using dominant cell primary suppression.

output <- SuppressDominantCells(data=dataset, 
                                numVar = "value", 
                                formula = ~sector2*geo + sector4*eu,
                                contributorVar = "company",
                                n = 1:2, k = c(80, 99))
head(output)

Here we have also made use of the formula interface.

P(caption = '**Table 8**: Output from `SuppressDominantCells`  with `n = 1:2` and <br> `k = c(80, 99)` and with `contributorVar = "company"`',
  data=dataset, 
  numVar = "value", 
  formula = ~sector2*geo + sector4*eu,
  contributorVar = "company",
  n = 1:2, k = c(80, 99), 
  fun = SuppressDominantCells,
  print_expr = 'value')

\ \

The singleton problem

Below, the data is suppressed in the same way as in Table 7, but with a different formula. \

output <- SuppressDominantCells(data=dataset,
                                numVar = "value", 
                                formula = ~sector4*geo + sector2*eu,
                                contributorVar = "company",
                                n = 1:2, k = c(80, 99))
head(output)

P(caption = '**Table 9**: Output from `SuppressDominantCells`  with `n = 1:2` and `k = c(80, 99)` and with `contributorVar = "company"`',
  data=dataset, 
  numVar = "value", 
  formula = ~sector4*geo + sector2*eu,
  contributorVar = "company",
  n = 1:2, k = c(80, 99), 
  fun = SuppressDominantCells,
  print_expr = 'value')

By using singletonMethod = "none" in this case, Entertainment:Spain will not be suppressed. This cell is suppressed due to the default handling of the singleton problem. The reason is that Entertainment:Iceland has a single contributor. This contributor can reveal Entertainment:Portugal if Entertainment:Spain is not suppressed.

Here it might appear that the table contains another issue, Entertainment:Iceland can reveal Industry:Iceland. However, this can be considered ok in this case. Industry:Iceland is secondary suppressed and the only reason for this is to protect Entertainment:Iceland.

Nevertheless, in most cases, secondary suppressed cells introduce further complexity to the handling of singletons. A part of the singleton handling of magnitude tables is to add virtual primary suppressed cells prior to the secondary suppression algorithm. Secondary suppressed cells cannot, therefore, be treated in this way. However, another part of the singleton handling solves many of the remaining problems. This is done within the suppression algorithm.

Here we can observe the effect of this in Tables 2 and 3. By using singletonMethod = "none" in Table 2, Industry:Portugal will not be suppressed. In that case, Industry:Spain can reveal Industry:Iceland and consequently Entertainment:Iceland. In Table 3, Governmental:Total and Industry:Total are suppressed due to advanced singleton handling. In fact, by using singletonMethod = "none", all tables above will be suppressed differently.

Intervals

Intervals for the primarily suppressed cells are computed whenever the lpPackage parameter is specified. Additionally, if rangePercent and/or rangeMin are provided, further suppression is performed to ensure that the interval width requirements are met. See the documentation for the lpPackage parameter in ?GaussSuppressionFromData for more details.

In the example below, the interval widths for output lines 12, 14, 17, and 20 are 60.7, which is too narrow since a minimum width of 80 is required. Therefore, the cell on line 16 is additionally suppressed

if (!requireNamespace("Rglpk", quietly = TRUE)) {
  cat("Note: The final part of this vignette requires the suggested package 'Rglpk', which is not installed. That part has been skipped.\n")
  knitr::knit_exit()
}

output <- SuppressDominantCells(data = dataset, 
                                numVar = "value", 
                                dimVar = c("geo", "sector2", "sector4"), 
                                pPercent = 30, 
                                rangePercent = 70, rangeMin = 80, 
                                lpPackage = "Rglpk")
output[c(12, 14, 16, 17, 20, 25), ] # some output lines

In the formatted output shown in Table 10 below, the final intervals are given in parentheses.

P(caption = '**Table 10**: Output from `SuppressDominantCells`  with `pPercent = 30`, `rangePercent = 70`, `rangeMin = 80`, `lpPackage = "Rglpk"`',
  data=dataset, 
  numVar = "value", 
  dimVar = c("geo", "sector2", "sector4"), 
  pPercent = 30, 
  rangePercent = 70, rangeMin = 80, 
  lpPackage = "Rglpk",
  fun = SuppressDominantCells, 
  print_expr = 'ifelse(is.na(lo), value, paste0(value, " [", lo, ", ", up, "]"))')