PLSrounding: PLS inspired rounding
In SmallCountRounding: Small Count Rounding of Tabular Data

PLSrounding

R Documentation

PLS inspired rounding

Description

Small count rounding of necessary inner cells are performed so that all small frequencies of cross-classifications to be published (publishable cells) are rounded. The publishable cells can be defined from a model formula, hierarchies or automatically from data.

Usage

PLSrounding(
  data,
  freqVar = NULL,
  roundBase = 3,
  hierarchies = NULL,
  formula = NULL,
  dimVar = NULL,
  maxRound = roundBase - 1,
  printInc = nrow(data) > 1000,
  output = NULL,
  extend0 = FALSE,
  preAggregate = is.null(freqVar),
  aggregatePackage = "base",
  aggregateNA = TRUE,
  aggregateBaseOrder = FALSE,
  rowGroupsPackage = aggregatePackage,
  ...
)

PLSroundingInner(..., output = "inner")

PLSroundingPublish(..., output = "publish")

Arguments

`data`	Input data (inner cells), typically a data frame, tibble, or data.table. If `data` is not a classic data frame, it will be coerced to one internally unless `preAggregate` is `TRUE` and `aggregatePackage` is `"data.table"`.
`freqVar`	Variable holding counts (inner cells frequencies). When `NULL` (default), microdata is assumed.
`roundBase`	Rounding base
`hierarchies`	List of hierarchies
`formula`	Model formula defining publishable cells
`dimVar`	The main dimensional variables and additional aggregating variables. This parameter can be useful when hierarchies and formula are unspecified.
`maxRound`	Inner cells contributing to original publishable cells equal to or less than maxRound will be rounded
`printInc`	Printing iteration information to console when TRUE
`output`	Possible non-NULL values are `"input"`, `"inner"` and `"publish"`. Then a single data frame is returned.
`extend0`	When `extend0` is set to `TRUE`, the data is automatically extended. This is relevant when `zeroCandidates = TRUE` (see `RoundViaDummy`). Additionally, `extend0` can be specified as a list, representing the `varGroups` parameter in the `Extend0` function. Can also be set to `"all"` which means that input codes in hierarchies are considered in addition to those in data.
`preAggregate`	When `TRUE`, the data will be aggregated beforehand within the function by the dimensional variables.
`aggregatePackage`	Package used to preAggregate. Parameter `pkg` to `aggregate_by_pkg`.
`aggregateNA`	Whether to include NAs in the grouping variables while preAggregate. Parameter `include_na` to `aggregate_by_pkg`.
`aggregateBaseOrder`	Parameter `base_order` to `aggregate_by_pkg`, used when preAggregate. The default is set to `FALSE` to avoid unnecessary sorting operations. When `TRUE`, an attempt is made to return the same result with `data.table` as with base R. This cannot be guaranteed due to potential variations in sorting behavior across different systems.
`rowGroupsPackage`	Parameter `pkg` to `RowGroups`. The parameter is input to `Formula2ModelMatrix` via `ModelMatrix`.
`...`	Further parameters sent to `RoundViaDummy`

Details

This function is a user-friendly wrapper for RoundViaDummy with data frame output and with computed summary of the results. See RoundViaDummy for more details.

Value

Output is a four-element list with class attribute "PLSrounded", which ensures informative printing and enables the use of FormulaSelection on this object.

`inner`	Data frame corresponding to input data with the main dimensional variables and with cell frequencies (original, rounded, difference).
`publish`	Data frame of publishable data with the main dimensional variables and with cell frequencies (original, rounded, difference).
`metrics`	A named character vector of various statistics calculated from the two output data frames ("`inner_`" used to distinguish). See examples below and the function `HDutility`.
`freqTable`	Matrix of frequencies of cell frequencies and absolute differences. For example, row "`rounded`" and column "`inn.4+`" is the number of rounded inner cell frequencies greater than or equal to `4`.

References

Langsrud, Ø. and Heldal, J. (2018): “An Algorithm for Small Count Rounding of Tabular Data”. Presented at: Privacy in statistical databases, Valencia, Spain. September 26-28, 2018. https://www.researchgate.net/publication/327768398_An_Algorithm_for_Small_Count_Rounding_of_Tabular_Data

Examples

# Small example data set
z <- SmallCountData("e6")
print(z)

# Publishable cells by formula interface
a <- PLSrounding(z, "freq", roundBase = 5,  formula = ~geo + eu + year)
print(a)
print(a$inner)
print(a$publish)
print(a$metrics)
print(a$freqTable)

# Using FormulaSelection()
FormulaSelection(a$publish, ~eu + year)
FormulaSelection(a, ~eu + year) # same as above
FormulaSelection(a)             # just a$publish

# Recalculation of maxdiff, HDutility, meanAbsDiff and rootMeanSquare
max(abs(a$publish[, "difference"]))
HDutility(a$publish[, "original"], a$publish[, "rounded"])
mean(abs(a$publish[, "difference"]))
sqrt(mean((a$publish[, "difference"])^2))

# Five lines below produce equivalent results 
# Ordering of rows can be different
PLSrounding(z, "freq", dimVar = c("geo", "eu", "year"))
PLSrounding(z, "freq", formula = ~eu * year + geo * year)
PLSrounding(z[, -2], "freq", hierarchies = SmallCountData("eHrc"))
PLSrounding(z[, -2], "freq", hierarchies = SmallCountData("eDimList"))
PLSrounding(z[, -2], "freq", hierarchies = SmallCountData("eDimList"), formula = ~geo * year)

# Define publishable cells differently by making use of formula interface
PLSrounding(z, "freq", formula = ~eu * year + geo)

# Define publishable cells differently by making use of hierarchy interface
eHrc2 <- list(geo = c("EU", "@Portugal", "@Spain", "Iceland"), year = c("2018", "2019"))
PLSrounding(z, "freq", hierarchies = eHrc2)

# Also possible to combine hierarchies and formula
PLSrounding(z, "freq", hierarchies = SmallCountData("eDimList"), formula = ~geo + year)

# Single data frame output
PLSroundingInner(z, "freq", roundBase = 5, formula = ~geo + eu + year)
PLSroundingPublish(z, roundBase = 5, formula = ~geo + eu + year)

# Microdata input
PLSroundingInner(rbind(z, z), roundBase = 5, formula = ~geo + eu + year)

# Zero perturbed due to both  extend0 = TRUE and zeroCandidates = TRUE 
set.seed(12345)
PLSroundingInner(z[sample.int(5, 12, replace = TRUE), 1:3], 
                 formula = ~geo + eu + year, roundBase = 5, 
                 extend0 = TRUE, zeroCandidates = TRUE, printInc = TRUE)

# Parameter avoidHierarchical (see RoundViaDummy and ModelMatrix) 
PLSroundingPublish(z, roundBase = 5, formula = ~geo + eu + year, avoidHierarchical = TRUE)


# To illustrate hierarchical_extend0 
#    (parameter to underlying function, SSBtools::Extend0fromModelMatrixInput)
PLSroundingInner(z[-c(2:3), ], roundBase = 5, formula = ~geo + eu + year, 
   avoidHierarchical = TRUE, zeroCandidates = TRUE, extend0 = TRUE)
PLSroundingInner(z[-c(2:3), ], roundBase = 5, formula = ~geo + eu + year, 
   avoidHierarchical = TRUE, zeroCandidates = TRUE, extend0 = TRUE, 
   hierarchical_extend0 = TRUE)

# Package sdcHierarchies can be used to create hierarchies. 
# The small example code below works if this package is available. 
if (require(sdcHierarchies)) {
  z2 <- cbind(geo = c("11", "21", "22"), z[, 3:4], stringsAsFactors = FALSE)
  h2 <- list(
    geo = hier_compute(inp = unique(z2$geo), dim_spec = c(1, 1), root = "Tot", as = "df"),
    year = hier_convert(hier_create(root = "Total", nodes = c("2018", "2019")), as = "df"))
  PLSrounding(z2, "freq", hierarchies = h2)
}

# Use PLS2way to produce tables as in Langsrud and Heldal (2018) and to demonstrate 
# parameters maxRound, zeroCandidates and identifyNew (see RoundViaDummy).   
# Parameter rndSeed used to ensure same output as in reference.
exPSD <- SmallCountData("exPSD")
a <- PLSrounding(exPSD, "freq", 5, formula = ~rows + cols, rndSeed=124)
PLS2way(a, "original")  # Table 1
PLS2way(a)  # Table 2
a <- PLSrounding(exPSD, "freq", 5, formula = ~rows + cols, identifyNew = FALSE, rndSeed=124)
PLS2way(a)  # Table 3
a <- PLSrounding(exPSD, "freq", 5, formula = ~rows + cols, maxRound = 7)
PLS2way(a)  # Values in col1 rounded
a <- PLSrounding(exPSD, "freq", 5, formula = ~rows + cols, zeroCandidates = TRUE)
PLS2way(a)  # (row3, col4): original is 0 and rounded is 5

# Using formula followed by FormulaSelection 
output <- PLSrounding(data = SmallCountData("example1"), 
                      formula = ~age * geo * year + eu * year, 
                      freqVar = "freq", 
                      roundBase = 5)
FormulaSelection(output, ~(age + eu) * year)

# Example similar to the one in the documentation of tables_by_formulas,
# but using PLSroundingPublish with roundBase = 4.
tables_by_formulas(SSBtoolsData("magnitude1"),
                   table_fun = PLSroundingPublish, 
                   table_formulas = list(table_1 = ~region * sector2, 
                                         table_2 = ~region1:sector4 - 1, 
                                         table_3 = ~region + sector4 - 1), 
                   substitute_vars = list(region = c("geo", "eu"), region1 = "eu"), 
                   collapse_vars = list(sector = c("sector2", "sector4")), 
                   roundBase = 4)

SmallCountRounding documentation built on April 3, 2025, 6:03 p.m.