PLSrounding: PLS inspired rounding

PLSrounding    R Documentation

Description

Small count rounding of the necessary inner cells is performed so that all small frequencies of the cross-classifications to be published (publishable cells) are rounded. The publishable cells can be defined from a model formula, from hierarchies, or automatically from the data.

Usage

PLSrounding(
  data,
  freqVar = NULL,
  roundBase = 3,
  hierarchies = NULL,
  formula = NULL,
  dimVar = NULL,
  maxRound = roundBase - 1,
  printInc = nrow(data) > 1000,
  output = NULL,
  extend0 = FALSE,
  preAggregate = is.null(freqVar),
  aggregatePackage = "base",
  aggregateNA = TRUE,
  aggregateBaseOrder = FALSE,
  rowGroupsPackage = aggregatePackage,
  ...
)

PLSroundingInner(..., output = "inner")

PLSroundingPublish(..., output = "publish")

Arguments

data

Input data as a data frame (inner cells)

freqVar

Variable holding counts (inner cell frequencies). When NULL (default), microdata is assumed.

roundBase

Rounding base

hierarchies

List of hierarchies

formula

Model formula defining publishable cells

dimVar

The main dimensional variables and additional aggregating variables. This parameter can be useful when hierarchies and formula are unspecified.

maxRound

Inner cells contributing to original publishable cells equal to or less than maxRound will be rounded

printInc

Printing iteration information to console when TRUE

output

Possible non-NULL values are "input", "inner" and "publish". With any of these values, a single data frame is returned instead of the full list.

extend0

When extend0 is set to TRUE, the data is automatically extended. This is relevant when zeroCandidates = TRUE (see RoundViaDummy). Additionally, extend0 can be specified as a list, representing the varGroups parameter in the Extend0 function. It can also be set to "all", which means that input codes in hierarchies are considered in addition to those in data.

preAggregate

When TRUE, the data will be aggregated beforehand within the function by the dimensional variables.

aggregatePackage

Package used for pre-aggregation (when preAggregate is TRUE). Parameter pkg to aggregate_by_pkg.

aggregateNA

Whether to include NAs in the grouping variables during pre-aggregation (when preAggregate is TRUE). Parameter include_na to aggregate_by_pkg.

aggregateBaseOrder

Parameter base_order to aggregate_by_pkg, used when preAggregate is TRUE. The default is set to FALSE to avoid unnecessary sorting operations. When TRUE, an attempt is made to return the same result with data.table as with base R. This cannot be guaranteed due to potential variations in sorting behavior across different systems.

rowGroupsPackage

Parameter pkg to RowGroups. The parameter is input to Formula2ModelMatrix via ModelMatrix.

...

Further parameters sent to RoundViaDummy
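
To make the default maxRound = roundBase - 1 concrete, the base R sketch below flags which publishable counts would fall under the threshold. The vector publish_counts is hypothetical, and treating counts that are already multiples of roundBase as not needing rounding is an assumption made for illustration. The snippet does not call PLSrounding and is not the rounding algorithm itself.

```r
# Hypothetical publishable cell counts (not taken from the package data)
publish_counts <- c(0, 1, 2, 3, 7, 12)

roundBase <- 3
maxRound <- roundBase - 1   # same expression as the documented default

# Assumption for illustration: a count is "small" when it is at or below
# maxRound and is not already a multiple of roundBase
is_small <- publish_counts %% roundBase != 0 & publish_counts <= maxRound

print(data.frame(count = publish_counts, small = is_small))
```

Raising maxRound above roundBase - 1 widens the set of counts that qualify, which is what the maxRound = 7 example at the end of this page demonstrates with the real function.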

Details

This function is a user-friendly wrapper for RoundViaDummy with data frame output and with computed summary of the results. See RoundViaDummy for more details.

Value

Output is a four-element list with class attribute "PLSrounded", which ensures informative printing and enables the use of FormulaSelection on this object.

inner

Data frame corresponding to input data with the main dimensional variables and with cell frequencies (original, rounded, difference).

publish

Data frame of publishable data with the main dimensional variables and with cell frequencies (original, rounded, difference).

metrics

A named character vector of various statistics calculated from the two output data frames (the prefix "inner_" distinguishes statistics for the inner data frame). See examples below and the function HDutility.

freqTable

Matrix of frequencies of cell frequencies and absolute differences. For example, row "rounded" and column "inn.4+" is the number of rounded inner cell frequencies greater than or equal to 4.
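
As a rough illustration of the difference-based statistics and of a freqTable-style count, the base R sketch below works on two hypothetical vectors, original and rounded, rather than on real PLSrounding output. The sign convention rounded - original is an assumption; the statistics shown use absolute or squared differences, so they do not depend on it.

```r
# Hypothetical original and rounded cell frequencies (illustration only)
original <- c(1, 4, 6, 11, 2)
rounded  <- c(0, 5, 5, 10, 5)

# Assumed sign convention for the difference column
difference <- rounded - original

# Same expressions as the recalculation shown in the Examples section
maxdiff        <- max(abs(difference))
meanAbsDiff    <- mean(abs(difference))
rootMeanSquare <- sqrt(mean(difference^2))

# Analogue of a "4+" freqTable entry: number of rounded values >= 4
n_rounded_4plus <- sum(rounded >= 4)

print(c(maxdiff = maxdiff, meanAbsDiff = meanAbsDiff,
        rootMeanSquare = rootMeanSquare, n_rounded_4plus = n_rounded_4plus))
```

With real output, the same expressions can be applied to the "difference" and "rounded" columns of a$publish or a$inner.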

References

Langsrud, Ø. and Heldal, J. (2018): “An Algorithm for Small Count Rounding of Tabular Data”. Presented at: Privacy in statistical databases, Valencia, Spain. September 26-28, 2018. https://www.researchgate.net/publication/327768398_An_Algorithm_for_Small_Count_Rounding_of_Tabular_Data

See Also

RoundViaDummy, PLS2way, ModelMatrix

Examples

# Small example data set
z <- SmallCountData("e6")
print(z)

# Publishable cells by formula interface
a <- PLSrounding(z, "freq", roundBase = 5,  formula = ~geo + eu + year)
print(a)
print(a$inner)
print(a$publish)
print(a$metrics)
print(a$freqTable)

# Using FormulaSelection()
FormulaSelection(a$publish, ~eu + year)
FormulaSelection(a, ~eu + year) # same as above
FormulaSelection(a)             # just a$publish

# Recalculation of maxdiff, HDutility, meanAbsDiff and rootMeanSquare
max(abs(a$publish[, "difference"]))
HDutility(a$publish[, "original"], a$publish[, "rounded"])
mean(abs(a$publish[, "difference"]))
sqrt(mean((a$publish[, "difference"])^2))

# Six lines below produce equivalent results 
# Ordering of rows can be different
PLSrounding(z, "freq") # All variables except "freq" as dimVar  
PLSrounding(z, "freq", dimVar = c("geo", "eu", "year"))
PLSrounding(z, "freq", formula = ~eu * year + geo * year)
PLSrounding(z[, -2], "freq", hierarchies = SmallCountData("eHrc"))
PLSrounding(z[, -2], "freq", hierarchies = SmallCountData("eDimList"))
PLSrounding(z[, -2], "freq", hierarchies = SmallCountData("eDimList"), formula = ~geo * year)

# Define publishable cells differently by making use of formula interface
PLSrounding(z, "freq", formula = ~eu * year + geo)

# Define publishable cells differently by making use of hierarchy interface
eHrc2 <- list(geo = c("EU", "@Portugal", "@Spain", "Iceland"), year = c("2018", "2019"))
PLSrounding(z, "freq", hierarchies = eHrc2)

# Also possible to combine hierarchies and formula
PLSrounding(z, "freq", hierarchies = SmallCountData("eDimList"), formula = ~geo + year)

# Single data frame output
PLSroundingInner(z, "freq", roundBase = 5, formula = ~geo + eu + year)
PLSroundingPublish(z, "freq", roundBase = 5, formula = ~geo + eu + year)

# Microdata input
PLSroundingInner(rbind(z, z)[, -4], roundBase = 5, formula = ~geo + eu + year) # "freq" column dropped

# Zero perturbed due to both  extend0 = TRUE and zeroCandidates = TRUE 
set.seed(12345)
PLSroundingInner(z[sample.int(5, 12, replace = TRUE), 1:3], 
                 formula = ~geo + eu + year, roundBase = 5, 
                 extend0 = TRUE, zeroCandidates = TRUE, printInc = TRUE)

# Parameter avoidHierarchical (see RoundViaDummy and ModelMatrix) 
PLSroundingPublish(z, "freq", roundBase = 5, formula = ~geo + eu + year, avoidHierarchical = TRUE)

# Package sdcHierarchies can be used to create hierarchies. 
# The small example code below works if this package is available. 
if (require(sdcHierarchies)) {
  z2 <- cbind(geo = c("11", "21", "22"), z[, 3:4], stringsAsFactors = FALSE)
  h2 <- list(
    geo = hier_compute(inp = unique(z2$geo), dim_spec = c(1, 1), root = "Tot", as = "df"),
    year = hier_convert(hier_create(root = "Total", nodes = c("2018", "2019")), as = "df"))
  PLSrounding(z2, "freq", hierarchies = h2)
}

# Use PLS2way to produce tables as in Langsrud and Heldal (2018) and to demonstrate 
# parameters maxRound, zeroCandidates and identifyNew (see RoundViaDummy).   
# Parameter rndSeed used to ensure same output as in reference.
exPSD <- SmallCountData("exPSD")
a <- PLSrounding(exPSD, "freq", 5, formula = ~rows + cols, rndSeed=124)
PLS2way(a, "original")  # Table 1
PLS2way(a)  # Table 2
a <- PLSrounding(exPSD, "freq", 5, formula = ~rows + cols, identifyNew = FALSE, rndSeed=124)
PLS2way(a)  # Table 3
a <- PLSrounding(exPSD, "freq", 5, formula = ~rows + cols, maxRound = 7)
PLS2way(a)  # Values in col1 rounded
a <- PLSrounding(exPSD, "freq", 5, formula = ~rows + cols, zeroCandidates = TRUE)
PLS2way(a)  # (row3, col4): original is 0 and rounded is 5

SmallCountRounding documentation built on Oct. 22, 2024, 5:06 p.m.