compute.risk | R Documentation |
Computes the risk of identification disclosure for a list of synthetic data sets.
compute.risk(data, original, known, synthetic, width = .10, relative = TRUE, tol = 1e-9)
data | list of synthetic data sets as returned by |
original | data.frame: original data set. |
known | character (optional): names of known (unsynthesized) variables to be used in the risk computation. |
synthetic | character (optional): names of synthetic variables to be used in the risk computation. |
width | numeric: scalar or a named vector determining the widths of the intervals for matching numeric variables. |
relative | logical: scalar or a named vector determining the type of intervals for numeric variables. If |
tol | numeric: numerical tolerance. |
This function computes the risk of identification disclosure for a list of synthetic data sets by attempting to match the values on the known and synthetic variables in the synthetic and original data.
For each target case, matches are identified by searching for cases with similar (if continuous) or identical (if categorical) values on the specified variables.
For continuous variables, cases are considered a match if the true (unsynthesized) value falls into a certain interval around the synthetic value.
The size of this interval around continuous values is determined by width.
If relative = TRUE (the default), the interval around a given value $x_i$ is $[x_i (1-w), x_i (1+w)]$, where $w$ is the specified width.
If relative = FALSE, the interval is $[x_i - w, x_i + w]$.
Both width and relative can be a named vector to use different intervals for different variables.
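The matching rule described above can be illustrated with a minimal sketch. The helper in_interval() and the toy data are hypothetical and are not part of the package; the sketch ignores tol, assumes non-negative values for relative intervals, and only shows how the two interval types would decide whether an original case counts as a match for one synthetic record.

```r
# Hypothetical helper (not part of the package): checks whether the true
# value falls into the interval around the synthetic value.
in_interval <- function(true, syn, width = .10, relative = TRUE) {
  if (relative) {
    # relative interval: [x * (1 - w), x * (1 + w)]
    lower <- syn * (1 - width)
    upper <- syn * (1 + width)
  } else {
    # fixed-width interval: [x - w, x + w]
    lower <- syn - width
    upper <- syn + width
  }
  true >= lower & true <= upper
}

# Toy data (made up for illustration): "sex" is treated as known,
# "age" as synthetic.
original  <- data.frame(sex = c("f", "m", "f", "f"),
                        age = c(25, 31, 32, 40),
                        stringsAsFactors = FALSE)
synthetic <- data.frame(sex = c("f", "m", "f", "f"),
                        age = c(30, 30, 30, 30),
                        stringsAsFactors = FALSE)

# Target case 1: identical value on "sex", similar value on "age"
# (relative interval with width = .10 around the synthetic age of 30).
target  <- synthetic[1, ]
matches <- original$sex == target$sex &
  in_interval(original$age, target$age, width = .10, relative = TRUE)
which(matches)
```

With a synthetic age of 30, width = .10 and relative = TRUE give the interval [27, 33], whereas width = 1 and relative = FALSE give [29, 31]. In the toy data, only the third original case has the same sex as the target and an age inside the relative interval.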
The result of the computation can be further summarized with summary.robosynth.risk, and high-risk cases can be protected further with replace.high.risk.
An object of class robosynth.risk.
Simon Grund
extract, summary.robosynth.risk, replace.high.risk
# create masked copies
sociosexuality <- within(sociosexuality, {
  m_sex <- mask.categorical(sex, probability = .80)
  m_sexpref <- mask.categorical(sexpref, probability = .60)
  m_age <- mask.continuous(age, reliability = .90)
})

# combine synthesis and masking models
models <- combine.models(
  synthesis.model(sex ~ 1, type = "binary"),
  synthesis.model(sexpref ~ 1 + sex, type = "categorical"),
  synthesis.model(age ~ 1 + sex + sexpref, type = "continuous"),
  masking.model(m_sex ~ sex, type = "binary"),
  masking.model(m_sexpref ~ sexpref, type = "categorical"),
  masking.model(m_age ~ age, type = "continuous"),
  data = sociosexuality
)

# run synthesis
syn <- synthesize(models = models, m = 5, iter = 5)

# extract list of synthetic data sets
synlist <- extract(syn)

# * Example 1: matching by "age" with percentage intervals (10%)
compute.risk(synlist, original = sociosexuality, synthetic = "age")

# same as:
# compute.risk(synlist, original = sociosexuality, synthetic = "age", width = .10)
# compute.risk(synlist, original = sociosexuality, synthetic = "age", width = c(age = .10))

# * Example 2: matching by "sex", "sexpref", and "age" with fixed-width intervals for "age" (1 year)
compute.risk(synlist, original = sociosexuality, known = c("sex"), synthetic = "age",
             width = 1.0, relative = FALSE)