This function primarily uses the exhaustive tabulation method to quantify disclosure risk. It tabulates cell counts for each combination of the variables supplied by the user and uses these counts to identify the variable categories and records that are at high risk of disclosure. File-level re-identification risk measures are also provided, e.g., Mu-Argus (Polettini 2003) and the risk metrics proposed in El Emam (2011).
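To make the idea of exhaustive tabulation concrete, here is a minimal base-R sketch. It is not the package's implementation: the helper `tabulate_cells`, the toy data, and the rule "a cell is a violation when its count is below the threshold" are assumptions made for illustration only.

```r
# Toy data; the column names are hypothetical.
toy <- data.frame(
  sex    = c("F", "F", "M", "M", "M", "F"),
  region = c("N", "N", "N", "S", "S", "S"),
  educ   = c("hs", "hs", "ba", "ba", "hs", "ba"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: tabulate every combination of `vars` containing
# between mindim and maxdim variables, and flag cells whose count is
# below `threshold` (assumed violation rule).
tabulate_cells <- function(data, vars, mindim = 1, maxdim = 2, threshold = 2) {
  out <- list()
  for (k in mindim:maxdim) {
    for (combo in combn(vars, k, simplify = FALSE)) {
      counts <- as.data.frame(table(data[combo]), stringsAsFactors = FALSE)
      counts <- counts[counts$Freq > 0, , drop = FALSE]  # drop empty cells
      counts$violation <- counts$Freq < threshold
      out[[paste(combo, collapse = " x ")]] <- counts
    }
  }
  out
}

cells <- tabulate_cells(toy, c("sex", "region", "educ"))
names(cells)             # one table per variable combination
cells[["sex x region"]]  # cell counts with violation flags
```

With three variables and dimensions 1 to 2, the sketch builds six tables. The number of tables grows combinatorially with `maxdim`, which is why the dimension range is worth keeping small.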
sdc_extabs(
  data,
  ID = NULL,
  weight = NULL,
  varpool = names(data),
  forcelist = character(0),
  forcenum = 1,
  missingdef = list(),
  mindim = 1,
  maxdim = 2,
  threshold = NULL,
  wgtthreshold = NULL,
  condition = NULL,
  output_filename = NULL,
  tau1 = 0.2,
  tau2 = 0.2,
  include_mu_argus = TRUE
)

## S3 method for class 'sdc_extabs'
print(x, cutoff = 50, summary_outfile = NULL, ...)

## S3 method for class 'sdc_extabs'
plot(x, plotpath = NULL, plotvar1 = character(0), plotvar2 = character(0), ...)
data | Data frame containing the data for which we are to measure disclosure risk. Unexpected behavior may result if any column name begins with a period.
ID | Name of column which identifies records. If NULL (default), an ID column named .ROW_NUMBER is created and used in reports.
weight | Column name for sampling weights. NULL or empty if none.
varpool | Vector of column names over which to form tables.
forcelist | Vector of variable names. Some are included in all tabulations. Optional.
forcenum | Number of variables in
missingdef | A named list specifying missing values. The names correspond to column names in
mindim | Integer specifying the minimum number of
maxdim | Integer specifying the maximum number of
threshold | Threshold to determine the number of violations in terms of cell counts. If the number of cases in a cell is less than
wgtthreshold | Threshold to determine violations in terms of weighted cell counts. If NULL, a weighted threshold will not be used.
condition | Character string describing how weighted and unweighted thresholds are combined when both are used. If used, it must be "and" or "or" (case insensitive). This parameter is ignored if
output_filename | Name of the csv file to save the data set with violation counts and Mu-Argus scores attached. NULL if no output file is to be saved.
tau1 | A threshold to compute the risk measure, pRa. See User Manual for more details.
tau2 | A threshold to compute the risk measure, jRa. This parameter is ignored if
include_mu_argus | Flag indicating whether Mu-Argus and El-Emam metrics should be calculated.
x | An object of class
cutoff | The number of variable categories with the highest percentage of cell violations for each table dimension. Default is 50.
summary_outfile | Name of summary output .txt file. If not NULL, console output is copied to the file. Default is NULL (no logging of output). Errors and warnings are not diverted (consider running in batch mode if logging of errors and warnings is needed).
... | Currently unused. For NextMethod compatibility.
plotpath | Directory to save plots. Plots are saved as jpeg files (quality = 100%). If the directory does not exist, it is first created. If
plotvar1 | A vector of names of discrete variables for boxplots. If none, boxplots are not produced.
plotvar2 | A vector of names of continuous variables for scatterplots. If none, scatterplots are not produced.
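The interplay of threshold, wgtthreshold and condition can be pictured with a small base-R sketch. The helper `is_violation` and its inputs `n` (unweighted cell counts) and `wn` (weighted cell counts) are hypothetical, and the "less than" comparison for the weighted threshold is an assumption. The function's actual rules may differ in detail.

```r
# Hypothetical helper: decide which cells are violations given unweighted
# counts `n`, weighted counts `wn`, and the two optional thresholds.
is_violation <- function(n, wn, threshold = NULL, wgtthreshold = NULL,
                         condition = NULL) {
  below_n <- if (!is.null(threshold)) n < threshold else NULL
  below_w <- if (!is.null(wgtthreshold)) wn < wgtthreshold else NULL
  if (is.null(below_n) && is.null(below_w)) stop("no threshold supplied")
  # With only one threshold in use, `condition` plays no role.
  if (is.null(below_w)) return(below_n)
  if (is.null(below_n)) return(below_w)
  # Both thresholds in use: `condition` must be "and" or "or" (any case).
  condition <- tolower(as.character(condition))
  if (!identical(condition, "and") && !identical(condition, "or")) {
    stop('condition must be "and" or "or"')
  }
  if (condition == "and") below_n & below_w else below_n | below_w
}

n  <- c(1, 5, 2)           # unweighted cell counts
wn <- c(4000, 2500, 1000)  # weighted cell counts

viol_or  <- is_violation(n, wn, threshold = 3, wgtthreshold = 3000, condition = "OR")
viol_and <- is_violation(n, wn, threshold = 3, wgtthreshold = 3000, condition = "and")
```

Under these assumptions, "or" flags a cell that fails either threshold, while "and" flags only cells that fail both, so "and" is the more permissive choice.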
If a specified missing value contains only whitespace, it will match any element with only whitespace. NA values in data are treated as missing regardless of missingdef. If you do not want NA values to be treated as missing, please recode them before passing the data to this function.
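The matching rules above can be sketched in base R as follows. The helper `is_missing_value` is hypothetical and is not how the package implements missingdef; in particular, treating the empty string as whitespace-only is an assumption of this sketch.

```r
# Hypothetical helper: flag the elements of `x` that count as missing,
# given the missing-value codes specified for that column.
is_missing_value <- function(x, codes) {
  x_chr     <- as.character(x)
  codes_chr <- as.character(codes)
  # TRUE for strings made up only of whitespace (assumption: "" included).
  blank <- function(s) !is.na(s) & grepl("^\\s*$", s)
  # Exact match against the non-blank codes.
  hit <- x_chr %in% codes_chr[!blank(codes_chr)]
  # A whitespace-only code matches any whitespace-only element.
  if (any(blank(codes_chr))) hit <- hit | blank(x_chr)
  # NA is always treated as missing, whatever the codes say.
  hit | is.na(x)
}

x <- c("1", " ", "5", NA, "   ")
flags <- is_missing_value(x, codes = c("5", " "))
flags
```

To keep NA values as a real category, one option is to recode them beforehand, for example `x[is.na(x)] <- "unknown"`, before passing the data on.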
Note that if a weight variable is not provided, the number of statistics and plots that are produced is significantly reduced.
An object of type sdc_extabs. Internally, a named list of statistics.
Cell counts and violation flags. Represented as a list with each element corresponding to a varpool combination.
The original data with new columns showing statistics such as violation counts and Mu-Argus score for each record.
Same as data_with_statistics but with missing value recodes.
Summary table of Mu-Argus by cell count. For this summary, all variables in varpool are used to define a cell. If weight is NULL, then this summary is omitted.
List of file-level re-identification risk measures.
Table with percent of records that are in violation for each variable/category.
Table with percent of cells that are in violation for each dimension/variable/category.
Options provided to sdc_extabs by the user, such as missingdef, mindim, etc.
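As a rough illustration of what a file-level re-identification risk measure looks like, the sketch below computes three sample-based (prosecutor) metrics. It assumes the definitions commonly attributed to El Emam: with f the size of a record's equivalence class, pRa is the share of records with 1/f above the threshold tau, pRb is 1/min(f), and pRc is the number of classes divided by the number of records. The helper `prosecutor_risk` and the toy data are hypothetical; how sdc_extabs itself computes its measures is described in the User Manual.

```r
# Toy data; the column names are hypothetical.
toy <- data.frame(
  sex    = c("F", "F", "M", "M", "M", "F"),
  region = c("N", "N", "N", "S", "S", "S"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: prosecutor risk metrics under the assumed definitions.
prosecutor_risk <- function(data, vars, tau = 0.2) {
  # One equivalence class per observed combination of `vars`.
  key <- interaction(data[vars], drop = TRUE)
  # Size of the equivalence class that each record belongs to.
  f <- as.numeric(table(key)[as.character(key)])
  c(pRa = mean(1 / f > tau),         # share of records above the threshold
    pRb = 1 / min(f),                # worst-case record
    pRc = nlevels(key) / nrow(data)) # average risk across records
}

risk <- prosecutor_risk(toy, c("sex", "region"), tau = 0.5)
risk
```

In the toy data, two of the six records are unique on sex and region, so with tau = 0.5 only those two exceed the threshold.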
print: S3 print method for sdc_extabs objects. Prints a nicely formatted version of the percent record violations by variable/category and percent cell violations by dimension/variable/category.
plot: S3 plot method for sdc_extabs objects. Produces boxplots and scatterplots of violation counts and Mu-Argus scores.
El Emam (2011).
Polettini (2003).
data(exampledata)
vars <- c("BIB1201", "BIC0501", "BID0101", "BIE0601", "BORNUSA", "CENREG",
          "DAGE3", "DRACE3", "EDUC3", "GENDER")
results <- sdc_extabs(exampledata,
                      ID = "CASEID",
                      weight = "WEIGHT",
                      varpool = vars,
                      mindim = 2,
                      maxdim = 3,
                      missingdef = list(BIE0601 = 5),
                      wgtthreshold = 3000,
                      condition = "or")
print(results, cutoff = 15)
plot(results, plotvar1 = "BORNUSA", plotvar2 = "WEIGHT")