compute_risk: Compute identification disclosure risk
In simongrund1/robosynth: Robust Synthetic Data Generation with Data-Augmented Multiple Imputation

compute.risk

R Documentation

Compute identification disclosure risk

Description

Computes the risk of identification disclosure for a list of synthetic data sets.

Usage

compute.risk(data, original, known, synthetic, width = .10, relative = TRUE, tol = 1e-9)

Arguments

`data`	list of synthetic data sets as returned by `extract` (or similar).
`original`	data.frame: original data set.
`known`	character (optional): names of known (unsynthesized) variables to be used in the risk computation.
`synthetic`	character (optional): names of synthetic variables to be used in the risk computation.
`width`	numeric: scalar or a named vector determining the widths of the intervals for matching numeric variables.
`relative`	logical: scalar or a named vector determining the type of intervals for numeric variables. If `TRUE` (the default), then `width` is used to construct relative (percentage) intervals of varying size (`x * (1 +/- width)`). Otherwise, `width` is used to construct intervals of fixed size (`x +/- width`).
`tol`	numeric: numerical tolerance.

Details

This function computes the risk of identification disclosure for a list of synthetic data sets by attempting to match the values on the known and synthetic variables in the synthetic and original data. For each target case, matches are identified by searching for matching cases with similar (if continuous) or identical values (if categorical) on the specified variables. For continuous variables, cases are considered a match, if the true (unsynthesized) value falls into a certain interval around the synthetic value.

The size of this interval around continuous values is determined by width. If relative = TRUE (the default), the interval around a given value is $x_i$ is $[x_i (1-w), x_i (1+w)]$, where $w$ is the specified width. If relative = FALSE, the interval is $[x_i - w, x_i + w]$. Both width and relative can be a named vector to use different intervals for different variables.

The result of the computation can be further summarized with summary.robosynth.risk, and high-risk cases can be protected further with replace.high.risk.

Value

An object of class robosynth.risk.

Author(s)

Simon Grund

Examples

# create masked copies
sociosexuality <- within(sociosexuality, {

  m_sex <- mask.categorical(sex, probability = .80)
  m_sexpref <- mask.categorical(sexpref, probability = .60)
  m_age <- mask.continuous(age, reliability = .90)

})

# combine synthesis and masking models
models <- combine.models(

  synthesis.model(sex ~ 1, type = "binary"),
  synthesis.model(sexpref ~ 1 + sex, type = "categorical"),
  synthesis.model(age ~ 1 + sex + sexpref, type = "continuous"),

  masking.model(m_sex ~ sex, type = "binary"),
  masking.model(m_sexpref ~ sexpref, type = "categorical"),
  masking.model(m_age ~ age, type = "continuous"),

  data = sociosexuality

)

# run synthesis
syn <- synthesize(models = models, m = 5, iter = 5)

# extract list of synthetic data sets
synlist <- extract(syn)

# * Example 1: matching by "age" with percentage intervals (10%)
compute.risk(synlist, original = sociosexuality, synthetic = "age")
# same as:
# compute.risk(synlist, original = sociosexuality, synthetic = "age", width = .10)
# compute.risk(synlist, original = sociosexuality, synthetic = "age", width = c(age = .10))

# * Example 2: matching by "sex", "sexpref", and "age" with fixed-width intervals for "age" (1 year)
compute.risk(synlist, original = sociosexuality, known = c("sex"), synthetic = "age", width = 1.0, relative = FALSE)

simongrund1/robosynth documentation built on March 20, 2022, 6:15 p.m.