Defining Tables for GaussSuppression

options(rmarkdown.html_vignette.check_title = FALSE)
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

htmltables <- TRUE
if(htmltables){
  source("GaussKable.R")
  P <- function(..., timevar = 2) G(fun = GaussSuppressionFromData, timevar = timevar, 
                       freqVar = "freq", primary = FALSE, protectZeros = FALSE,
                         s = c(LETTERS, "county-1", "county-2", "county-3", "small", "BIG",
                               "other", "wages", "assistance", "pensions"), 
                       ...) 
} else { 
  P <- function(...) cat("Formatted table not available")
}

Introduction and setup

The GaussSuppression package uses a common interface shared by other SDC packages developed at Statistics Norway (see also SmallCountRounding and SSBcellKey). In the background, these packages use a model matrix representation, which connects the input data to the intended output. This functionality is provided by the R package SSBtools. In this vignette, we look at multiple ways of specifying output tables given different forms of input. The vignette only scratches the surface of what the interface can do; it is intended to help users get started with the package.
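
As a rough illustration of what a model matrix representation means, the following base R sketch builds a tiny dummy matrix by hand and uses it to map input rows to output cells. The data values and the object names (toy, x, out) are made up for this example, and the matrix is constructed manually rather than with SSBtools, so this shows only the idea and not the package's internals.

```r
# Toy input: three rows with a frequency each (hypothetical values).
toy <- data.frame(region = c("A", "A", "B"),
                  freq   = c(3, 1, 4),
                  stringsAsFactors = FALSE)

# One column per output cell; a 1 means "this input row contributes to that cell".
x <- cbind(Total = 1,
           A     = as.numeric(toy$region == "A"),
           B     = as.numeric(toy$region == "B"))

# Output cell values are column-wise sums of freq, i.e. t(x) %*% freq.
out <- crossprod(x, toy$freq)
out
```

Each way of specifying a table discussed in this vignette can be thought of as a different way of choosing the columns of such a matrix.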

We begin by importing the necessary dependencies as well as loading a test data set provided in the SSBtools package.

library(SSBtools)
library(GaussSuppression)

dataset <- SSBtools::SSBtoolsData("d2s")

microdata <- SSBtools::MakeMicro(dataset, "freq")

head(dataset)
nrow(dataset)
head(microdata)
nrow(microdata)

The imported data set is fictitious and contains the variables listed by names(dataset), where region, county, and size are different (non-nested) regional hierarchies. GaussSuppression can also take microdata as input, which we will demonstrate in the following sections.
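
The relationship between aggregated data and microdata can be sketched in base R. The toy values below are invented, and the assumption that each microdata record carries a frequency of 1 is ours; this mimics the idea behind MakeMicro rather than reproducing its implementation.

```r
# Toy aggregated data (hypothetical values, not the d2s data set).
toy <- data.frame(region = c("A", "B", "C"),
                  freq   = c(2, 0, 3),
                  stringsAsFactors = FALSE)

# Repeat each row 'freq' times, then let every record count as one unit.
toy_micro <- toy[rep(seq_len(nrow(toy)), times = toy$freq), , drop = FALSE]
toy_micro$freq <- 1
rownames(toy_micro) <- NULL

toy_micro
nrow(toy_micro) == sum(toy$freq)
```

Rows with a frequency of zero disappear in the expansion, which is why the number of microdata rows equals the sum of the original frequencies.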

The table below illustrates this dataset reshaped to wide format with several freq columns created from the main_income variable. However, please note that data that is input and output in the GaussSuppression package is always in long format.

d2ws <- SSBtools::SSBtoolsData("d2ws")
KableTable(caption = '**Table 1**:  `dataset` reshaped to wide format.',
  data = d2ws, nvar = 3, header = c("regional variables", "main_income"))

Defining Table Dimensions

Output tables are mainly specified using the following three parameters: dimVar, hierarchies, and formula.

Creating tables using `dimVar`

The most basic way of defining output tables is by using the dimVar parameter. This generates by default all combinations of the variables provided, including marginals. For example, the following function call creates a one dimensional frequency table over the variable region.

GaussSuppressionFromData(data = dataset,
                         dimVar = "region",
                         freqVar = "freq",
                         primary = FALSE,
                         protectZeros = FALSE)

The same output is shown below as a formatted table.

Table 2: dimVar = "region"

P(caption = NULL, #caption = '**Table 2**: `dimVar = "region"`',
  data=dataset, 
  dimVar = "region")

Note the use of the function GaussSuppressionFromData and the two parameters primary and protectZeros. The functions in GaussSuppression are designed to combine table building and protection in a single function call. To illustrate the table-building features in isolation, we have therefore specified that nothing is to be protected. To learn more about the different ways of protecting tables, see the other vignettes of this package.
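
To make concrete what "all combinations, including marginals" means, here is a minimal base R sketch on invented data. It uses xtabs and addmargins rather than GaussSuppression, so the layout is wide and the margins are labelled Sum, whereas the package returns long-format output.

```r
# Toy input in long format (hypothetical values).
toy <- data.frame(region      = c("A", "A", "B", "B"),
                  main_income = c("wages", "other", "wages", "other"),
                  freq        = c(3, 1, 4, 2),
                  stringsAsFactors = FALSE)

# Inner cells: one count per combination of region and main_income.
inner <- xtabs(freq ~ region + main_income, data = toy)

# Add row totals, column totals and the grand total.
full <- addmargins(inner)
full
```

The inner cells correspond to rows of the input, while the margin cells are the additional aggregates that a dimVar specification produces by default.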

In a similar fashion, we can include multiple variables in the dimVar parameter:

GaussSuppressionFromData(data = dataset,
                         dimVar = c("region", "main_income"),
                         freqVar = "freq",
                         primary = FALSE,
                         protectZeros = FALSE)

The same output is shown below as a formatted and reshaped table. Cells that also occur as input/inner cells have white background.

P(caption = '**Table 3**: `dimVar = c("region", "main_income")`',
  data=dataset, 
  dimVar = c("region", "main_income"))

Note in particular what happens when we provide two regional variables:

GaussSuppressionFromData(data = dataset,
                         dimVar = c("region", "county"),
                         freqVar = "freq",
                         primary = FALSE,
                         protectZeros = FALSE)

Table 4: dimVar = c("region", "county")

P(caption = NULL, # caption = '**Table 4**: `dimVar = c("region", "county")`',
  data=dataset, 
  dimVar = c("region", "county"))

The function detects hierarchies encoded in dimVar columns, and collapses them into a single column (with the name of the most detailed variable). In this way, it is not necessary to specify hierarchies by hand and include them explicitly in the function call. This also works for non-nested hierarchies:

GaussSuppressionFromData(data = dataset,
                         dimVar = c("region", "county", "size"),
                         freqVar = "freq",
                         primary = FALSE,
                         protectZeros = FALSE)

Table 5: dimVar = c("region", "county", "size")

P(caption = NULL, # caption = '**Table 5**: `dimVar = c("region", "county", "size")`',
  data=dataset, 
  dimVar = c("region", "county", "size"))

We can combine all the dimensional variables in our example data:

output <- GaussSuppressionFromData(data = dataset,
                                   dimVar = c("region", "county", "size", "main_income"),
                                   freqVar = "freq",
                                   primary = FALSE,
                                   protectZeros = FALSE)
head(output)

Table 6: dimVar = c("region", "county", "size", "main_income")

P(caption = NULL,  # caption = '**Table 6**: `dimVar = c("region", "county", "size", "main_income")` ',
  data=dataset, 
  dimVar = c("region", "county", "size", "main_income"))

In the background, functions from SSBtools are used to find the hierarchies. There are multiple ways of inspecting which hierarchies can be found; users familiar with DimLists used in other SDC packages can for example use the following:

FindDimLists(dataset[c("region", "county")])
FindDimLists(dataset[c("region", "county", "size")])

Note the last example which contained non-nested hierarchies. Here, a unique DimList is created for each tree-shaped hierarchy in the data set. This avoids the need for specifying non-nested hierarchies as linked tables.
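
The distinction between nested and non-nested variables can be checked by hand. The sketch below uses invented data and a hypothetical helper, is_nested, that is not part of SSBtools; it only illustrates the property that hierarchy detection relies on, not how FindDimLists is implemented.

```r
# Toy data: county and size are two different groupings of region (hypothetical values).
toy <- data.frame(region = c("A", "B", "C", "D"),
                  county = c("county-1", "county-1", "county-2", "county-2"),
                  size   = c("BIG", "small", "BIG", "small"),
                  stringsAsFactors = FALSE)

# TRUE if every code of 'fine' belongs to exactly one code of 'coarse'.
is_nested <- function(fine, coarse) {
  all(tapply(coarse, fine, function(x) length(unique(x))) == 1)
}

is_nested(toy$region, toy$county)  # region within county
is_nested(toy$region, toy$size)    # region within size
is_nested(toy$county, toy$size)    # county within size
```

In this toy data, region is nested in both county and size, but county is not nested in size. The two groupings therefore form two separate tree-shaped hierarchies over the same detailed variable.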

Finally, for illustration purposes, we see that the same function calls work with microdata as input:

GaussSuppressionFromData(data = microdata,
                         dimVar = c("region", "county", "size"),
                         freqVar = "freq",
                         primary = FALSE,
                         protectZeros = FALSE)

This output is the same as illustrated in Table 5 above.

Creating tables using `hierarchies`

The hierarchies parameter allows the explicit specification of which hierarchies should be used when creating the output table. This is more fine-grained than simply using dimVar, since it makes it possible to apply hierarchies not already present in the data set. Hierarchies can be provided in many ways. In this vignette, we will exemplify the following three forms: as a dimlist (as defined in sdcTable), using the hrc format from TauArgus, and finally with a more general hierarchy specification (internally, not surprisingly, simply called hierarchy). Any of these can be provided to the hierarchies parameter, as they are all translated to the internal hierarchy representation. For the purposes of this vignette, we will use dimlists. The following example also shows how a dimlist can be translated to the other two forms using functions from SSBtools. Let us begin by defining two hierarchies by using dimlists:

region_dim <- data.frame(levels = c("@", "@@", rep("@@@", 2), rep("@@", 4)),
                         codes = c("Total", "AB", LETTERS[1:6]))
region_dim

income_dim <- data.frame(levels = c("@", "@@", "@@", "@@@", "@@@", "@@@"),
                         codes = c("Total", "wages", "not_wages", "other", "assistance", "pensions"))
income_dim
SSBtools::DimList2Hrc(income_dim)
SSBtools::DimList2Hierarchy(income_dim)
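
To read a dimlist, it can help to derive each code's parent from the number of @ characters. The base R sketch below does this for income_dim, under the usual dimlist reading that a code's parent is the nearest preceding code with one fewer @. The variable names depth and parent are ours, and this is not how SSBtools converts hierarchies internally.

```r
income_dim <- data.frame(levels = c("@", "@@", "@@", "@@@", "@@@", "@@@"),
                         codes  = c("Total", "wages", "not_wages",
                                    "other", "assistance", "pensions"),
                         stringsAsFactors = FALSE)

# Depth in the tree = number of "@" characters.
depth <- nchar(income_dim$levels)

# Parent = nearest preceding code that is exactly one level higher up.
parent <- vapply(seq_along(depth), function(i) {
  above <- which(depth[seq_len(i - 1)] == depth[i] - 1)
  if (length(above) == 0) NA_character_ else income_dim$codes[max(above)]
}, character(1))

data.frame(code = income_dim$codes, parent = parent)
```

Here wages and not_wages sit directly under Total, while other, assistance and pensions are children of not_wages.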

We can use these hierarchies to specify our output table. We do this by supplying a named list to the hierarchies parameter, where the list names correspond to variables in the data, and the list elements correspond to hierarchies we wish to include.

GaussSuppressionFromData(data = dataset,
                         hierarchies = list(region = region_dim, main_income = income_dim),
                         freqVar = "freq",
                         primary = FALSE,
                         protectZeros = FALSE)

Table 7: hierarchies = list(region = region_dim, main_income = income_dim)

P(caption = NULL, #caption = '**Table 7**: `hierarchies = list(region = region_dim, main_income = income_dim)`',
  data=dataset, 
  hierarchies = list(region = region_dim, main_income = income_dim))

As mentioned previously, the GaussSuppression package supports non-nested hierarchies natively. We achieve this by having multiple elements with the same name in the hierarchies list:

region2_dim <- data.frame(levels = c("@", rep(c("@@", rep("@@@", 2)), 2), rep("@@", 2)),
                          codes = c("Total", "AD", "A", "D",  
                                    "BF", "B", "F", 
                                    "C", "E"))
region2_dim

GaussSuppressionFromData(data = dataset,
                         hierarchies = list(region = region_dim, region = region2_dim),
                         freqVar = "freq",
                         primary = FALSE,
                         protectZeros = FALSE)

Table 8: hierarchies = list(region = region_dim, region = region2_dim)

P(caption = NULL, #caption = '**Table 8**: `hierarchies = list(region = region_dim, region = region2_dim)`',
  data=dataset, 
  hierarchies = list(region = region_dim, region = region2_dim))
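
The effect of two non-nested groupings of the same variable can be sketched with plain sums. The frequencies below are invented; the groupings AB, AD and BF follow region_dim and region2_dim as defined above.

```r
# Hypothetical frequencies for the six regions.
freq <- c(A = 5, B = 3, C = 2, D = 4, E = 1, F = 6)

# Aggregates from the first hierarchy (AB) and the second hierarchy (AD, BF).
groups <- list(AB = c("A", "B"), AD = c("A", "D"), BF = c("B", "F"))
group_sums <- vapply(groups, function(members) sum(freq[members]), numeric(1))

# Both hierarchies share the same detailed codes and the same total.
c(Total = sum(freq), group_sums, freq)
```

Region A contributes to both AB and AD, so neither grouping can be expressed as a refinement of the other. This is what makes the hierarchies non-nested.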

Finally, as before, all of this functionality works with microdata as input as well.

GaussSuppressionFromData(data = microdata,
                         hierarchies = list(region = region_dim, region = region2_dim),
                         freqVar = "freq",
                         primary = FALSE,
                         protectZeros = FALSE)

Creating tables using `formula`

The most flexible method for specifying the output of GaussSuppression is by using the formula interface. This makes use of model formulas in R, and provides a powerful way of specifying multiple different tables. Indeed, all of the above examples, and much more, can be replicated using the formula interface. The formula's predictor variables must be variable names occurring in the data set (the dependent variable is ignored, and thus we leave it empty). In the following, we create a table based on the region and county variables. As before, the hierarchical relationship between these variables is detected automatically:

GaussSuppressionFromData(data = microdata,
                         formula = ~ region + county,
                         freqVar = "freq",
                         primary = FALSE,
                         protectZeros = FALSE)

Table 9: formula = ~ region + county

P(caption =  NULL, # caption = '**Table 9**: `formula = ~ region + county`',
  data=dataset, 
  formula = ~ region + county)

If there is no hierarchical relationship between variables, multiplication in the formula and specification in dimVar yield the same results.

GaussSuppressionFromData(data = microdata,
                         formula = ~ county * main_income,
                         freqVar = "freq",
                         primary = FALSE,
                         protectZeros = FALSE)
GaussSuppressionFromData(data = microdata,
                         dimVar = c("county" , "main_income"),
                         freqVar = "freq",
                         primary = FALSE,
                         protectZeros = FALSE)

Table 10: formula = ~ county * main_income or dimVar = c("county" , "main_income")

P(caption =  NULL, 
  data=dataset, 
  formula = ~ county * main_income)

However, formula lets us specify different shapes for our tables. For example, if we are only interested in marginal values, we can supply this with the use of the addition operator:

GaussSuppressionFromData(data = microdata,
                         formula = ~ county + main_income,
                         freqVar = "freq",
                         primary = FALSE,
                         protectZeros = FALSE)

The same output is shown below as a formatted and reshaped table, where empty cells are cells not included in the output.

Table 11: formula = ~ county + main_income

P(caption =  NULL, 
  data=dataset, 
  formula = ~ county + main_income)

This example demonstrates, in fact, the ability of specifying multiple linked tables: a one-dimensional table for county linked with a one-dimensional table for main_income. Similarly, we can use the colon (":") operator to omit row and column marginals:

GaussSuppressionFromData(data = microdata,
                         formula = ~ county:main_income,
                         freqVar = "freq",
                         primary = FALSE,
                         protectZeros = FALSE)

Table 12: formula = ~ county:main_income

P(caption =  NULL, 
  data=dataset, 
  formula = ~ county:main_income)

Using subtraction, we can omit marginals and other cells from the output. For example, the intercept (the sum over all records) can be omitted by including - 1 in the formula, like this: formula = ~ county:main_income - 1.
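
The way R itself expands these operators can be inspected with terms() from the stats package, without involving GaussSuppression at all. In the sketch below, expand_terms is a small hypothetical helper defined only for this illustration.

```r
# Hypothetical helper: list the terms that R derives from a one-sided formula.
expand_terms <- function(f) attr(terms(f), "term.labels")

expand_terms(~ county * main_income)   # main effects and interaction
expand_terms(~ county + main_income)   # main effects only
expand_terms(~ county:main_income)     # interaction only

# The intercept flag is 1 when the overall total is kept and 0 after "- 1".
attr(terms(~ county:main_income), "intercept")
attr(terms(~ county:main_income - 1), "intercept")
```

Read against the tables above, a main-effect term corresponds to a one-dimensional marginal, an interaction term to the cross-classified cells, and the intercept to the overall total.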

Using these features, we can define more complicated linked tables. To illustrate this, let us assume we wish to publish the following:

all levels of detail of main_income for all counties and sizes, and
at the region level, only whether the main source of income was "wages" or "not_wages".

To do this, we begin by adding a column encoding whether the main source of income was "wages" or "not_wages".

dataset$income2 <- ifelse(dataset$main_income == "wages", "wages", "not_wages")
microdata$income2 <- ifelse(microdata$main_income == "wages", "wages", "not_wages")
head(dataset)

Then we can specify the desired output with the following formula:

GaussSuppressionFromData(data = dataset,
                         formula = ~ region * income2 + (county + size) * main_income,
                         freqVar = "freq",
                         primary = FALSE,
                         protectZeros = FALSE)

Table 13: formula = ~ region * income2 + (county + size) * main_income

P(caption = NULL, #caption = '**Table 13**: `formula = ~ region * income2 + (county + size) * main_income`',
  data=dataset, 
  formula = ~ region * income2 + (county + size) * main_income)

In this manner, we can specify multiple linked tables, each of which can use different non-nested hierarchies. This allows the suppression algorithm to protect all of these tables simultaneously (indeed, they are treated as a single table internally), avoiding the need for a stratified protection paradigm. Furthermore, the fine-grained specification of which cells are to be published allows the secondary suppression algorithm to protect with respect to precisely those cells that will be published. If row and column marginals are not published, for example, the suppression algorithm does not need to secondary suppress with respect to these marginals. See the other vignettes in this package for more details on setting up the protection methods.

Looking at the output data above Table 13, you will see that row 9 is duplicated on row 18. The reason is that the code wages is used both in the main_income variable and in the income2 variable. Currently, the formula interface does not do any special checking for this phenomenon. The recommended practice is to avoid such duplicate codes. When running FindDimLists, you will see that this function performs checking.
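
A simple way to spot such shared codes before building a table is to intersect the code sets of the variables involved. The sketch below uses invented data; the replacement label wages_all is a hypothetical choice of ours, not a convention of the package.

```r
# Toy data mirroring how income2 was derived from main_income.
toy <- data.frame(main_income = c("wages", "other", "assistance", "pensions"),
                  stringsAsFactors = FALSE)
toy$income2 <- ifelse(toy$main_income == "wages", "wages", "not_wages")

# Codes used by both variables will lead to duplicated output rows.
shared <- intersect(unique(toy$main_income), unique(toy$income2))
shared

# One possible remedy (our suggestion): give the coarse variable distinct labels.
toy$income2_distinct <- ifelse(toy$main_income == "wages", "wages_all", "not_wages")
intersect(unique(toy$main_income), unique(toy$income2_distinct))
```

With distinct labels, the intersection is empty, so no output cell can be reached under two different variable names.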

Tabulating continuous variables

In addition to defining the dimensions of the output tables, we need to decide whether they should be frequency tables (where we count contributing records) or magnitude tables (where we add contributing records' numerical values for a given variable). All of the above examples have been frequency tables. However, the process is exactly the same if one wishes to construct magnitude tables; the only difference is that one must specify the numerical variable with the help of the parameter numVar.
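
The difference between the two table types can be seen in a few lines of base R. The microdata below are invented, and tapply is used only to illustrate the two kinds of aggregation, not how the package computes them.

```r
# Toy microdata: one row per unit, with a numerical value (hypothetical values).
toy_micro <- data.frame(region = c("A", "A", "B"),
                        num    = c(10, 30, 5),
                        stringsAsFactors = FALSE)

# Frequency table: count the contributing records in each cell.
freq_table <- tapply(toy_micro$num, toy_micro$region, length)

# Magnitude table: add up the contributing records' values in each cell.
mag_table <- tapply(toy_micro$num, toy_micro$region, sum)

cbind(freq = freq_table, num = mag_table)
```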

Since most magnitude table suppression methods are based on comparing units' contributions, the input data will most likely be supplied as microdata. Therefore, let us add a fake numerical variable to our microdata:

set.seed(12345)
microdata$num <- sample(0:1000, nrow(microdata), replace = TRUE)

Then, to construct a magnitude table where records' contributions to num are aggregated, we supply this as a parameter to GaussSuppressionFromData:

GaussSuppressionFromData(data = microdata,
                         formula = ~ region * income2 + (county + size) * main_income,
                         numVar = "num",
                         primary = FALSE,
                         protectZeros = FALSE)

P(caption = '**Table 14**: `formula = ~ region * income2 + (county + size) * main_income` <br> In each cell: `num` with frequencies in parenthesis.',
  data=microdata, 
  formula = ~ region * income2 + (county + size) * main_income,
  numVar = "num",
  print_expr = 'paste0(num, " (", sprintf("%3d",freq) ,") ")')

Note that there are two empty cells in the wages column, which means that these cells are not included in the output data. One reason is that the removeEmpty parameter of SSBtools::ModelMatrix defaults to TRUE when the formula interface is used. With removeEmpty = FALSE, zeros are included in the output. Another way to achieve this is to use extend0 = TRUE, which adds zeros to the input data after the automatic aggregation from microdata. As you will see in other vignettes in this package, the extend0 parameter can be important for suppression methods.
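
The difference between omitting empty cells and including them as zeros can be mimicked in base R. The sketch below uses invented data and is only a conceptual analogy; it does not reproduce the implementation of removeEmpty or extend0.

```r
# Toy microdata where the combination (B, wages) never occurs (hypothetical values).
toy <- data.frame(region = c("A", "A", "B"),
                  income = c("wages", "other", "other"),
                  num    = c(10, 30, 5),
                  stringsAsFactors = FALSE)

# Aggregating only what is observed: the empty combination is simply absent.
observed <- aggregate(num ~ region + income, data = toy, FUN = sum)

# Extending with zeros: build all combinations and fill the gaps with 0.
grid <- expand.grid(region = unique(toy$region),
                    income = unique(toy$income),
                    stringsAsFactors = FALSE)
full <- merge(grid, observed, all.x = TRUE)
full$num[is.na(full$num)] <- 0

nrow(observed)
full
```

The observed aggregate has one row fewer than the extended one, and the extra row is the zero cell that would otherwise be left out of the output.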

Note also that a new frequency variable is generated with the above call. If a frequency variable is already present in the input data, we can provide it in addition to numVar and the method will use that information instead:

GaussSuppressionFromData(data = microdata,
                         formula = ~ region * income2 + (county + size) * main_income,
                         freqVar = "freq",
                         numVar = "num",
                         primary = FALSE,
                         protectZeros = FALSE)