cellkey_pkg: R6 Class defining statistical tables that can be perturbed

ck_class R Documentation

R6 Class defining statistical tables that can be perturbed

Description

This class allows you to define statistical tables and to perturb both count and numerical variables.

Usage

ck_setup(x, rkey, dims, w = NULL, countvars = NULL, numvars = NULL)

Arguments

x

an object coercible to a data.frame

rkey

either the name of a column in x that contains record keys, or a single integer(ish) number > 5 that refers to the number of digits for record keys that will be generated internally.

dims

a list containing one slot for each variable that should be tabulated. Each slot should be created/modified using sdcHierarchies::hier_create(), sdcHierarchies::hier_add() and other functionality from package sdcHierarchies.

w

(character) a scalar character referring to a variable in x holding sampling weights. If w is NULL (the default), all weights are assumed to be 1.

countvars

(character) an optional vector containing names of binary (0/1 coded) variables within x that should be included in the problem instance. These variables can later be perturbed.

numvars

(character) an optional vector of numerical variables that can later be tabulated.

Details

Such objects are typically generated using ck_setup().

Value

A new cellkey_obj object. Such objects (internally) contain the fully computed statistical tables given the input microdata (x), the hierarchical definitions (dims) and the remaining inputs. Intermediate results are stored internally and can only be modified / accessed via the exported public methods described below.

Methods

Public methods


Method new()

Create a new table instance

Usage
ck_class$new(x, rkey, dims, w = NULL, countvars = NULL, numvars = NULL)
Arguments
x

an object coercible to a data.frame

rkey

either the name of a column in x that contains record keys, or a single integer(ish) number > 5 that refers to the number of digits for record keys that will be generated internally.

dims

a list containing one slot for each variable that should be tabulated. Each slot should be created/modified using sdcHierarchies::hier_create(), sdcHierarchies::hier_add() and other functionality from package sdcHierarchies.

w

(character) a scalar character referring to a variable in x holding sampling weights. If w is NULL (the default), all weights are assumed to be 1.

countvars

(character) an optional vector containing names of binary (0/1 coded) variables within x that should be included in the problem instance. These variables can later be perturbed.

numvars

(character) an optional vector of numerical variables that can later be tabulated.

Returns

A new cellkey_obj object. Such objects (internally) contain the fully computed statistical tables given the input microdata (x), the hierarchical definitions (dims) and the remaining inputs. Intermediate results are stored internally and can only be modified / accessed via the exported public methods described below.


Method perturb()

Perturb a count- or magnitude variable

Usage
ck_class$perturb(v)
Arguments
v

name(s) of count- or magnitude variables that should be perturbed.

Returns

A modified cellkey_obj object in which private slots were updated for side-effects. Updated data can be accessed using other exported methods like ⁠$freqtab()⁠ or ⁠$numtab()⁠.


Method freqtab()

Extract results from already perturbed count variables as a data.table

Usage
ck_class$freqtab(v = NULL, path = NULL)
Arguments
v

a vector of variable names for count variables. If NULL (the default), the results are returned for all available count variables. For variables that have not yet been perturbed, columns puwc and pwc are filled with NA.

path

if not NULL, a scalar character defining a (relative or absolute) path to which the result table should be written. A csv file will be generated and, if specified, path must have ".csv" as file-ending

Returns

This method returns a data.table containing all combinations of the dimensional variables in the first n columns. Additionally, the following columns are shown:

  • vname: name of the perturbed variable

  • uwc: unweighted counts

  • wc: weighted counts

  • puwc: perturbed unweighted counts or NA if vname was not yet perturbed

  • pwc: perturbed weighted counts or NA if vname was not yet perturbed


Method numtab()

Extract results from already perturbed continuous variables as a data.table.

Usage
ck_class$numtab(v = NULL, mean_before_sum = FALSE, path = NULL)
Arguments
v

a vector of variable names of continuous variables. If NULL (the default), the results are returned for all available numeric variables.

mean_before_sum

(logical); if TRUE, the perturbed values are adjusted by a factor (n + p) / n with

  • n: the original weighted cell value

  • p: the perturbed cell value

This makes sense if the accuracy of the variable mean is considered more important than the accuracy of sums of the variable. The default value is FALSE (no adjustment is done).

path

if not NULL, a scalar character defining a (relative or absolute) path to which the result table should be written. A csv file will be generated and, if specified, path must have ".csv" as file-ending

Returns

This method returns a data.table containing all combinations of the dimensional variables in the first n columns. Additionally, the following columns are shown:

  • vname: name of the perturbed variable

  • uws: unweighted sum of the given variable

  • ws: weighted cellsum

  • pws: perturbed weighted sum of the given cell or NA if vname has not been perturbed


Method measures_cnts()

Utility measures for perturbed count variables

Usage
ck_class$measures_cnts(v, exclude_zeros = TRUE)
Arguments
v

name of a count variable for which utility measures should be computed.

exclude_zeros

should empty (zero) cells in the original values be excluded when computing distance measures

Returns

This method returns a list containing a set of utility measures based on some distance functions. For a detailed description of the computed measures, see ck_cnt_measures()
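The effect of exclude_zeros can be illustrated with a small base-R sketch. The two distance functions used here (absolute and relative differences) are common choices picked for illustration only; the measures actually computed are those documented in ck_cnt_measures(). The vectors orig and pert and the helper count_distances() are hypothetical.

```r
# hypothetical original and perturbed cell counts
orig <- c(0, 5, 12, 40)
pert <- c(0, 6, 11, 40)

# illustrative distances between original and perturbed counts
count_distances <- function(orig, pert, exclude_zeros = TRUE) {
  if (exclude_zeros) {
    # drop cells that are empty in the original table
    keep <- orig != 0
    orig <- orig[keep]
    pert <- pert[keep]
  }
  list(
    absolute = abs(pert - orig),
    relative = abs(pert - orig) / orig
  )
}

d_excl <- count_distances(orig, pert, exclude_zeros = TRUE)
d_incl <- count_distances(orig, pert, exclude_zeros = FALSE)
```

With exclude_zeros = FALSE, the relative distance of the empty cell is 0 / 0 and therefore NaN in this sketch, which is one reason why excluding empty cells can be a sensible default.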


Method measures_nums()

Utility measures for continuous variables (not yet implemented)

Usage
ck_class$measures_nums(v)
Arguments
v

name of a continuous variable for which utility measures should be computed.

Returns

for now, an empty list; in future versions of the package, the method will return utility measures for perturbed magnitude tables.


Method allvars()

Names of variables that can be perturbed / tabulated

Usage
ck_class$allvars()
Returns

returns a list with the following two elements:

  • cntvars: character vector with names of available count variables for perturbation

  • numvars: character vector with names of available numerical variables for perturbation


Method cntvars()

Names of count variables that can be perturbed

Usage
ck_class$cntvars()
Returns

a character vector containing variable names


Method numvars()

Names of continuous variables that can be perturbed

Usage
ck_class$numvars()
Returns

a character vector containing variable names


Method hierarchy_info()

Information about hierarchies

Usage
ck_class$hierarchy_info()
Returns

a list (for each dimensional variable) with information on the hierarchies. This may be used to restrict output tables to specific levels or codes. Each list element is a data.table containing the following variables:

  • code: the name of a code within the hierarchy

  • level: number defining the level of the code; the higher the number, the lower the position in the hierarchy, with 1 being the overall total

  • is_leaf: if TRUE, this code is a leaf node which means no other codes contribute to it

  • parent: name of the parent code
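As a rough illustration of how this information might be used to restrict an output table, the following base-R sketch works on a hand-made stand-in for one list element. The objects info_sex and out are hypothetical and use plain data frames instead of data.table; how the parent of the root code is reported is not specified here, so NA is only a placeholder.

```r
# hypothetical stand-in for one element of the list returned by $hierarchy_info()
info_sex <- data.frame(
  code = c("Total", "male", "female"),
  level = c(1, 2, 2),
  is_leaf = c(FALSE, TRUE, TRUE),
  parent = c(NA, "Total", "Total"),  # placeholder for the root's parent
  stringsAsFactors = FALSE
)

# hypothetical output table with one row per code of the variable `sex`
out <- data.frame(
  sex = c("Total", "male", "female"),
  wc = c(100, 48, 52),
  stringsAsFactors = FALSE
)

# keep only leaf codes
leaf_codes <- info_sex$code[info_sex$is_leaf]
out_leaves <- out[out$sex %in% leaf_codes, ]

# keep only codes of a given level (here: the overall total)
out_total <- out[out$sex %in% info_sex$code[info_sex$level == 1], ]
```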


Method mod_cnts()

Modifications applied to count variables

Usage
ck_class$mod_cnts()
Returns

a data.table containing modifications applied to count variables


Method mod_nums()

Modifications applied to numerical variables

Usage
ck_class$mod_nums()
Returns

a data.table containing modifications applied to numerical variables


Method supp_freq()

Identify sensitive cells based on minimum frequency rule

Usage
ck_class$supp_freq(v, n, weighted = TRUE)
Arguments
v

a single variable name of a continuous variable (see method numvars())

n

a number defining the threshold. All cells ⁠<= n⁠ are considered as unsafe.

weighted

if TRUE, the weighted number of contributors to a cell are compared to the threshold specified in n (default); else the unweighted number of contributors is used.

Returns

A modified cellkey_obj object in which private slots were updated for side-effects. These updated values are used by other methods (e.g ⁠$perturb()⁠).
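The threshold comparison behind the minimum frequency rule can be sketched in base R as follows. This only illustrates the comparison described above and is not the package's internal code; the data frame cells, its columns and the helper flag_min_freq() are hypothetical.

```r
# hypothetical cell-level data: unweighted and weighted numbers of contributors
cells <- data.frame(
  cell = c("A", "B", "C", "D"),
  n_unweighted = c(3, 14, 15, 40),
  n_weighted = c(12.5, 60.2, 71.0, 180.4)
)

# flag cells whose number of contributors is <= the threshold `n`
flag_min_freq <- function(cells, n, weighted = TRUE) {
  contributors <- if (weighted) cells$n_weighted else cells$n_unweighted
  contributors <= n
}

unsafe_unweighted <- flag_min_freq(cells, n = 14, weighted = FALSE)
unsafe_weighted <- flag_min_freq(cells, n = 14, weighted = TRUE)
```

Note that a cell with exactly n contributors is flagged, because the comparison is "<= n".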


Method supp_val()

Identify sensitive cells based on weighted or unweighted cell value

Usage
ck_class$supp_val(v, n, weighted = TRUE)
Arguments
v

a single variable name of a continuous variable (see method numvars())

n

a number defining the threshold. All cells ⁠<= n⁠ are considered as unsafe.

weighted

if TRUE, the weighted cell value of variable v is compared to the threshold specified in n (default); else the unweighted number is used.

Returns

A modified cellkey_obj object in which private slots were updated for side-effects. These updated values are used by other methods (e.g ⁠$perturb()⁠).


Method supp_cells()

Identify sensitive cells based on their names

Usage
ck_class$supp_cells(v, inp)
Arguments
v

a single variable name of a continuous variable (see method numvars())

inp

a data.frame where each column represents a dimensional variable. Each row of this input is used to identify the cells that should be marked as sensitive. NA values are allowed and match any code of the respective dimensional variable.

Returns

A modified cellkey_obj object in which private slots were updated for side-effects. These updated values are used by other methods (e.g ⁠$perturb()⁠).
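The matching of rows in inp against table cells, with NA acting as a wildcard, might look roughly like the following base-R sketch. This reflects one reading of the description above and is not the package's implementation; the data frames cells and patterns as well as the helper match_cells() are hypothetical.

```r
# hypothetical table cells defined by two dimensional variables
cells <- data.frame(
  sex = c("Total", "male", "female", "female", "female"),
  age = c("Total", "age_group1", "age_group1", "age_group3", "Total"),
  stringsAsFactors = FALSE
)

# patterns to match; NA means "any code of this variable"
patterns <- data.frame(
  sex = c("female", NA),
  age = c(NA, "Total"),
  stringsAsFactors = FALSE
)

# return TRUE for every cell matched by at least one pattern row
match_cells <- function(cells, patterns) {
  hit <- rep(FALSE, nrow(cells))
  for (i in seq_len(nrow(patterns))) {
    row_hit <- rep(TRUE, nrow(cells))
    for (dv in names(patterns)) {
      val <- patterns[[dv]][i]
      if (!is.na(val)) {
        row_hit <- row_hit & cells[[dv]] == val
      }
    }
    hit <- hit | row_hit
  }
  hit
}

sensitive <- match_cells(cells, patterns)
```

In this sketch the first pattern selects all cells with sex "female" regardless of age, and the second selects all cells with age "Total" regardless of sex.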


Method supp_p()

Identify sensitive cells based on the p%-rule. Please note that this rule can only be applied to positive-only variables.

Usage
ck_class$supp_p(v, p)
Arguments
v

a single variable name of a continuous variable (see method numvars())

p

a number defining a percentage between 1 and 99.

Returns

A modified cellkey_obj object in which private slots were updated for side-effects. These updated values are used by other methods (e.g ⁠$perturb()⁠).
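For orientation, the p%-rule can be sketched in base R. The sketch assumes the common textbook formulation, in which a cell is sensitive if the cell total minus the two largest contributions is smaller than p% of the largest contribution; the package's internal computation (for example, its handling of weights) may differ. The helper is_sensitive_p() is hypothetical.

```r
# p%-rule (textbook formulation, assumed here): a cell is sensitive if
# total - x1 - x2 < (p / 100) * x1, with x1 >= x2 the two largest contributions
is_sensitive_p <- function(contributions, p) {
  stopifnot(all(contributions >= 0), p >= 1, p <= 99)
  # pad with zeros so that cells with fewer than two contributors work
  x <- c(sort(contributions, decreasing = TRUE), 0, 0)
  remainder <- sum(x) - x[1] - x[2]
  remainder < (p / 100) * x[1]
}

dominated <- is_sensitive_p(c(100, 50, 5, 3), p = 10)
balanced <- is_sensitive_p(c(100, 50, 30, 20), p = 10)
```

In the first cell the remaining contributors sum to 8, which is below 10% of the largest contribution (10), so the cell is flagged; in the second cell the remainder is 50 and the cell is not flagged.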


Method supp_pq()

Identify sensitive cells based on the pq-rule. Please note that this rule can only be applied to positive-only variables.

Usage
ck_class$supp_pq(v, p, q)
Arguments
v

a single variable name of a continuous variable (see method numvars())

p

a number defining a percentage between 1 and 99.

q

a number defining a percentage between 1 and 99. This value must be larger than p.

Returns

A modified cellkey_obj object in which private slots were updated for side-effects. These updated values are used by other methods (e.g ⁠$perturb()⁠).
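A base-R sketch of the pq-rule follows. It assumes the common textbook formulation, in which a cell is sensitive if the cell total minus the two largest contributions is smaller than (p / q) times the largest contribution; the package's internal computation may differ. The helper is_sensitive_pq() is hypothetical.

```r
# pq-rule (textbook formulation, assumed here): a cell is sensitive if
# total - x1 - x2 < (p / q) * x1, with x1 >= x2 the two largest contributions
is_sensitive_pq <- function(contributions, p, q) {
  stopifnot(all(contributions >= 0), p >= 1, q <= 99, q > p)
  # pad with zeros so that cells with fewer than two contributors work
  x <- c(sort(contributions, decreasing = TRUE), 0, 0)
  remainder <- sum(x) - x[1] - x[2]
  remainder < (p / q) * x[1]
}

dominated <- is_sensitive_pq(c(100, 50, 5, 3), p = 10, q = 50)
balanced <- is_sensitive_pq(c(100, 50, 30, 20), p = 10, q = 50)
```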


Method supp_nk()

Identify sensitive cells based on the nk-dominance rule. Please note that this rule can only be applied to positive-only variables.

Usage
ck_class$supp_nk(v, n, k)
Arguments
v

a single variable name of a continuous variable (see method numvars())

n

an integerish number ⁠>= 2⁠

k

a number defining a percentage between 1 and 99. All cells to which the top n contributors contribute more than k% are considered unsafe.

Returns

A modified cellkey_obj object in which private slots were updated for side-effects. These updated values are used by other methods (e.g ⁠$perturb()⁠).
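The nk-dominance rule described above can be sketched in base R as follows. This is a minimal, unweighted illustration and not the package's implementation; the helper is_sensitive_nk() is hypothetical.

```r
# nk-dominance rule: a cell is unsafe if the n largest contributions
# together account for more than k% of the cell total
is_sensitive_nk <- function(contributions, n, k) {
  stopifnot(all(contributions >= 0), n >= 2, k >= 1, k <= 99)
  x <- sort(contributions, decreasing = TRUE)
  top_n <- sum(head(x, n))
  top_n > (k / 100) * sum(x)
}

dominated <- is_sensitive_nk(c(100, 50, 5, 3), n = 2, k = 90)
balanced <- is_sensitive_nk(c(100, 50, 30, 20), n = 2, k = 90)
```

In the first cell the two largest contributors hold 150 of 158 (about 95%), so the cell is flagged; in the second they hold 150 of 200 (75%) and the cell is not flagged.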


Method params_cnts_get()

Return perturbation parameters of count variables

Usage
ck_class$params_cnts_get()
Returns

a named list in which each list-element contains the active perturbation parameters for the specific count variable defined by the list-name.


Method params_cnts_set()

Set perturbation parameters for count variables

Usage
ck_class$params_cnts_set(val, v = NULL)
Arguments
val

a perturbation object created with ck_params_cnts()

v

a character vector (or NULL). If NULL (the default), the perturbation parameters provided in val are set for all count variables; otherwise one may specify the names of the count variables for which the parameters should be set.

Returns

A modified cellkey_obj object in which private slots were updated for side-effects. These updated values are used by other methods (e.g ⁠$perturb()⁠).


Method reset_cntvars()

reset results and parameters for already perturbed count variables

Usage
ck_class$reset_cntvars(v = NULL)
Arguments
v

if v equals NULL (the default), the results are reset for all perturbed count variables; otherwise it is possible to specify the names of already perturbed count variables.

Returns

A modified cellkey_obj object in which private slots were updated for side-effects. These updated values are used by other methods (e.g ⁠$perturb()⁠ or ⁠$freqtab()⁠).


Method reset_numvars()

reset results and parameters for already perturbed numerical variables

Usage
ck_class$reset_numvars(v = NULL)
Arguments
v

if v equals NULL (the default), the results are reset for all perturbed numerical variables; otherwise it is possible to specify the names of already perturbed continuous variables.

Returns

A modified cellkey_obj object in which private slots were updated for side-effects. These updated values are used by other methods (e.g ⁠$perturb()⁠ or ⁠$numtab()⁠).


Method reset_allvars()

reset results and parameters for all already perturbed variables.

Usage
ck_class$reset_allvars()
Returns

A modified cellkey_obj object in which private slots were updated for side-effects. These updated values are used by other methods (e.g ⁠$perturb()⁠, ⁠$freqtab()⁠ or ⁠$numtab()⁠).


Method params_nums_get()

Return perturbation parameters of continuous variables

Usage
ck_class$params_nums_get()
Returns

a named list in which each list-element contains the active perturbation parameters for the specific continuous variable defined by the list-name.


Method params_nums_set()

set perturbation parameters for continuous variables.

Usage
ck_class$params_nums_set(val, v = NULL)
Arguments
val

a perturbation object created with ck_params_nums()

v

a character vector (or NULL); if NULL (the default), the perturbation parameters provided in val are set for all continuous variables; otherwise one may specify the names of the numeric variables for which the parameters should be set.

Returns

A modified cellkey_obj object in which private slots were updated for side-effects. These updated values are used by other methods (e.g ⁠$perturb()⁠).


Method summary()

some aggregated summary statistics about perturbed variables

Usage
ck_class$summary()
Returns

invisible NULL


Method print()

prints information about the current table

Usage
ck_class$print()
Returns

invisible NULL

Examples


library(cellKey)

x <- ck_create_testdata()

# create some 0/1 variables that should be perturbed later
x[, cnt_females := ifelse(sex == "male", 0, 1)]
x[, cnt_males := ifelse(sex == "male", 1, 0)]
x[, cnt_highincome := ifelse(income >= 9000, 1, 0)]
# a variable with positive and negative contributions
x[, mixed := sample(-10:10, nrow(x), replace = TRUE)]

# create record keys
x$rkey <- ck_generate_rkeys(dat = x)

# define required inputs

# hierarchy with some bogus codes
d_sex <- hier_create(root = "Total", nodes = c("male", "female"))
d_sex <- hier_add(d_sex, root = "female", "f")
d_sex <- hier_add(d_sex, root = "male", "m")

d_age <- hier_create(root = "Total", nodes = paste0("age_group", 1:6))
d_age <- hier_add(d_age, root = "age_group1", "ag1a")
d_age <- hier_add(d_age, root = "age_group2", "ag2a")

# define the cell key object
countvars <- c("cnt_females", "cnt_males", "cnt_highincome")
numvars <- c("expend", "income", "savings", "mixed")
tab <- ck_setup(
  x = x,
  rkey = "rkey",
  dims = list(sex = d_sex, age = d_age),
  w = "sampling_weight",
  countvars = countvars,
  numvars = numvars)

# show some information about this table instance
tab$print() # identical with print(tab)

# information about the hierarchies
tab$hierarchy_info()

# which variables have been defined?
tab$allvars()

# count variables
tab$cntvars()

# continuous variables
tab$numvars()

# create perturbation parameters for "total" variable and
# write to yaml-file

# create a ptable using functionality from the ptable-pkg
f_yaml <- tempfile(fileext = ".yaml")
p_cnts1 <- ck_params_cnts(
  ptab = ptable::pt_ex_cnts(),
  path = f_yaml)

# read parameters from yaml-file and set them for variable `"total"`
p_cnts1 <- ck_read_yaml(path = f_yaml)

tab$params_cnts_set(val = p_cnts1, v = "total")

# create alternative perturbation parameters by specifying parameters
para2 <- ptable::create_cnt_ptable(
  D = 8, V = 3, js = 2, create = FALSE)

p_cnts2 <- ck_params_cnts(ptab = para2)

# use this ptable for the remaining variables
tab$params_cnts_set(val = p_cnts2, v = countvars)

# perturb a variable
tab$perturb(v = "total")

# multiple variables can be perturbed as well
tab$perturb(v = c("cnt_males", "cnt_highincome"))

# return weighted and unweighted results
tab$freqtab(v = c("total", "cnt_males"))

# numerical variables (positive variables using flex-function)
# we also write the config to a yaml file
f_yaml <- tempfile(fileext = ".yaml")

# create a ptable using functionality from the ptable-pkg
# a single ptable for all cells
ptab1 <- ptable::pt_ex_nums(parity = TRUE, separation = FALSE)

# a single ptab for all cells except for very small ones
ptab2 <- ptable::pt_ex_nums(parity = TRUE, separation = TRUE)

# different ptables for cells with even/odd number of contributors
# and very small cells
ptab3 <- ptable::pt_ex_nums(parity = FALSE, separation = TRUE)

p_nums1 <- ck_params_nums(
  ptab = ptab1,
  type = "top_contr",
  top_k = 3,
  mult_params = ck_flexparams(
    fp = 1000,
    p = c(0.30, 0.03),
    epsilon = c(1, 0.5, 0.2),
    q = 3),
  mu_c = 2,
  same_key = FALSE,
  use_zero_rkeys = FALSE,
  path = f_yaml)

# we read the parameters from the yaml-file
p_nums1 <- ck_read_yaml(path = f_yaml)

# for variables with positive and negative values
p_nums2 <- ck_params_nums(
  ptab = ptab2,
  type = "top_contr",
  top_k = 3,
  mult_params = ck_flexparams(
    fp = 1000,
    p = c(0.15, 0.02),
    epsilon = c(1, 0.4, 0.15),
    q = 3),
  mu_c = 2,
  same_key = FALSE)

# simple perturbation parameters (not using the flex-function approach)
p_nums3 <- ck_params_nums(
  ptab = ptab3,
  type = "mean",
  mult_params = ck_simpleparams(p = 0.25),
  mu_c = 2,
  same_key = FALSE)

# use `p_nums1` for all variables
tab$params_nums_set(p_nums1, c("savings", "income", "expend"))

# use different parameters for variable `mixed`
tab$params_nums_set(p_nums2, v = "mixed")

# identify sensitive cells to which extra protection (`mu_c`) is added.
tab$supp_p(v = "income", p = 85)
tab$supp_pq(v = "income", p = 85, q = 90)
tab$supp_nk(v = "income", n = 2, k = 90)
tab$supp_freq(v = "income", n = 14, weighted = FALSE)
tab$supp_val(v = "income", n = 10000, weighted = TRUE)
tab$supp_cells(
  v = "income",
  inp = data.frame(
    sex = c("female", "female"),
    "age" = c("age_group1", "age_group3")
  )
)

# perturb variables
tab$perturb(v = c("income", "savings"))

# extract results
tab$numtab("income", mean_before_sum = TRUE)
tab$numtab("income", mean_before_sum = FALSE)
tab$numtab("savings")

# results can be reset, too
tab$reset_cntvars(v = "cnt_males")

# we can then set other parameters and perturb again
tab$params_cnts_set(val = p_cnts1, v = "cnt_males")

tab$perturb(v = "cnt_males")

# write results to a .csv file
tab$freqtab(
  v = c("total", "cnt_males"),
  path = file.path(tempdir(), "outtab.csv")
)

# show results containing weighted and unweighted results
tab$freqtab(v = c("total", "cnt_males"))

# utility measures for a count variable
tab$measures_cnts(v = "total", exclude_zeros = TRUE)

# modifications for perturbed count variables
tab$mod_cnts()

# display a summary about utility measures
tab$summary()


sdcTools/cellKey documentation built on Dec. 5, 2023, 1:05 a.m.