utility.tab | R Documentation |
Produces tables from observed and synthesised data and calculates utility measures to compare them with their expectation if the synthesising model is correct.
It can also be used with synthetic data not created by syn(), but then an
additional parameter cont.na might need to be provided.
## S3 method for class 'synds'
utility.tab(object, data, vars = NULL, ngroups = 5,
            useNA = TRUE, max.table = 1e6,
            print.tables = length(vars) < 4,
            print.stats = c("pMSE", "S_pMSE", "df"),
            print.zdiff = FALSE, print.flag = TRUE,
            digits = 4, k.syn = FALSE, ...)

## S3 method for class 'data.frame'
utility.tab(object, data, vars = NULL, cont.na = NULL,
            ngroups = 5, useNA = TRUE, max.table = 1e6,
            print.tables = length(vars) < 4,
            print.stats = c("pMSE", "S_pMSE", "df"),
            print.zdiff = FALSE, print.flag = TRUE,
            digits = 4, k.syn = FALSE, ...)

## S3 method for class 'list'
utility.tab(object, data, vars = NULL, cont.na = NULL,
            ngroups = 5, useNA = TRUE, max.table = 1e6,
            print.tables = length(vars) < 4,
            print.stats = c("pMSE", "S_pMSE", "df"),
            print.zdiff = FALSE, print.flag = TRUE,
            digits = 4, k.syn = FALSE, ...)

## S3 method for class 'utility.tab'
print(x, print.tables = NULL, print.zdiff = NULL,
      print.stats = NULL, digits = NULL, ...)
object |
an object of class |
data |
the original (observed) data set. |
vars |
a single string or a vector of strings with the names of variables to be used to form the table. |
cont.na |
a named list of codes for missing values for continuous
variables if different from the |
max.table |
a maximum table size. You could try increasing the default value, but memory problems are likely. |
ngroups |
if numerical (non-factor) variables are included they will be
classified into this number of groups to form tables. Classification is
performed using |
useNA |
determines if NA values are to be included in tables. |
print.tables |
a logical value that determines if tables of observed and synthesised data are to be printed. By default tables are printed if they have up to three dimensions. |
print.stats |
a single string or a vector of strings that determines
which utility measures to print. Must be a selection from:
|
print.zdiff |
a logical value that determines if tables of Z scores for differences between observed and expected are to be printed. |
print.flag |
a logical value that determines if messages are to be printed during computation. |
digits |
an integer indicating the number of decimal places for printing
statistics, |
k.syn |
a logical indicator as to whether the sample size itself has
been synthesised. The default value is |
... |
additional parameters; these can be passed to the classIntervals() function. |
x |
an object of class |
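The grouping controlled by ngroups can be pictured with a short base R sketch. This is only an illustration under stated assumptions: it uses quantile breaks with cut(), whereas the function itself relies on classIntervals(), whose style and defaults are not reproduced here. The variable age and its values are hypothetical.

```r
# Hypothetical numeric variable; values chosen so the result is easy to check.
age <- 1:100
ngroups <- 5

# Quantile-based break points (an assumption for this sketch, not
# necessarily the style used by classIntervals() inside utility.tab()).
breaks <- unique(quantile(age, probs = seq(0, 1, length.out = ngroups + 1)))

# Classify each value into one of the groups; include.lowest keeps the minimum.
age_grp <- cut(age, breaks = breaks, include.lowest = TRUE)

# The grouped variable can now be cross-tabulated like any factor.
table(age_grp, useNA = "ifany")
```

A smaller ngroups gives fewer, larger cells, which keeps the table below max.table at the cost of a coarser comparison.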
Forms tables of observed and synthesised values for the variables specified
in vars. Several utility measures are calculated from the cells of the tables,
as described below. Details of all of these measures can be found in
Raab et al. (2021). If the synthesising model is correct, the measures VW, FT,
G and JSD should have chi-square distributions with df degrees of freedom for
large samples. Standardised versions of each measure are available
(e.g. S_VW for VW, where S_VW = VW/df) that will have an expected value of 1
if the synthesising model is correct.
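The reasoning behind the standardised measures can be checked with a small base R simulation. This is a sketch, not output from utility.tab(): the draws from rchisq() stand in for a measure such as VW under a correct synthesising model, and the values of df and vw_observed are hypothetical.

```r
set.seed(42)

df <- 9                          # hypothetical: a table with 10 non-empty cells
stat <- rchisq(10000, df = df)   # stand-in for VW, FT, G or JSD under a correct model

# Standardised version, following S_VW = VW/df; its mean should be close to 1.
s_stat <- stat / df
mean(s_stat)

# Upper-tail probability for one hypothetical observed value of a measure.
vw_observed <- 21.7
p_value <- pchisq(vw_observed, df = df, lower.tail = FALSE)
p_value
```

A standardised value well above 1, or a small tail probability, suggests that the tables differ by more than would be expected from a correct synthesising model.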
Four other measures are calculated by considering the table as a prediction
model: the propensity score mean-squared error pMSE and, from a comparison of
propensity scores for the synthetic and original data, the Kolmogorov-Smirnov
statistic SPECKS and the Wilcoxon rank-sum statistic U, and also PO50, the
percentage over 50% of the observations correctly predicted in the combined
tables, where an observation counts as correctly predicted when the majority
of observations in its grouping share its category (real or synthetic).
The first of these, pMSE, is identical to VW except for a constant. No expected
values are computed for the last three of these measures, but they can be
obtained by replication from utility.gen().
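The prediction-model view can be illustrated on a toy table in base R. This is a sketch under assumptions: the cell counts are hypothetical, pMSE follows the usual definition as the mean squared deviation of the propensity score from the overall share of synthetic records, and PO50 is computed from one reading of the description above. The package's own code may differ in detail.

```r
# Hypothetical cell counts for one cross-tabulation.
obs <- c(a = 40, b = 30, c = 20, d = 10)   # observed data
syn <- c(a = 35, b = 33, c = 22, d = 10)   # synthetic data

n_cell <- obs + syn        # records per cell in the combined data
N <- sum(n_cell)           # total number of records
c_syn <- sum(syn) / N      # overall share of synthetic records
p_cell <- syn / n_cell     # propensity score shared by all records in a cell

# pMSE: squared deviation of the propensity score from c_syn,
# averaged over all N records (each cell contributes n_cell records).
pMSE <- sum(n_cell * (p_cell - c_syn)^2) / N

# PO50 (assumed reading): percentage of records whose cell majority
# matches their own origin, expressed as the excess over 50%.
PO50 <- 100 * sum(pmax(obs, syn)) / N - 50

c(pMSE = pMSE, PO50 = PO50)
```

Both quantities are zero when the observed and synthetic tables are identical, and grow as the cells become more informative about the origin of a record.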
Three further measures are calculated from the tables. The first is MabsDD,
the mean absolute difference in distributions: the average absolute difference
in the proportions of original and synthetic data over all the cells in the
table. The second is a weighted version of this measure, WMabsDD, where the
weights are proportional to the inverse of the variance of the absolute
differences so that this measure can be standardised by its expected value, df.
The last is the Bhattacharyya distance BhattD, derived from the overlap of the
histograms of the original and synthetic data sets.
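The unweighted measures can be sketched on a toy table in base R. The counts are hypothetical, "average" is taken literally as a mean over cells, and only the Bhattacharyya coefficient (the histogram overlap) is computed, because the transform that turns it into the distance reported by the function is not stated here. Any scaling used by the package may therefore differ.

```r
# Hypothetical cell counts for one cross-tabulation.
obs <- c(a = 40, b = 30, c = 20, d = 10)   # observed data
syn <- c(a = 35, b = 33, c = 22, d = 10)   # synthetic data

# Cell proportions within each data set.
p_obs <- obs / sum(obs)
p_syn <- syn / sum(syn)

# Mean absolute difference in proportions over all cells (assumed scaling).
MabsDD <- mean(abs(p_obs - p_syn))

# Bhattacharyya coefficient: overlap of the two histograms,
# equal to 1 when the distributions coincide.
BC <- sum(sqrt(p_obs * p_syn))

c(MabsDD = MabsDD, BC = BC)
```

A distance based on the overlap decreases as BC approaches 1, so a coefficient close to 1 corresponds to a small Bhattacharyya distance.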
An object of class utility.tab, which is a list with the following components:
m |
number of synthetic data sets in object, i.e. |
VW |
a vector with |
FT |
a vector with |
JSD |
a vector with |
SPECKS |
a vector with |
WMabsDD |
a vector with |
U |
a vector with |
G |
a vector with |
pMSE |
a vector with |
PO50 |
a vector with |
MabsDD |
a vector with |
dBhatt |
a vector with |
S_VW |
|
S_FT |
|
S_JSD |
|
S_WMabsDD |
WMabsDD/df. |
S_G |
|
S_pMSE |
standardised measure from |
df |
a vector of degrees of freedom for the chi-square tests, equal
to the number of cells in the tables with any observed or
synthesised counts minus one when |
dfG |
degrees of freedom used in standardising |
nempty |
a vector of length |
tab.obs |
a table from the observed data. |
tab.syn |
a table or a list of |
tab.zdiff |
a table or a list of |
digits |
an integer indicating the number of decimal places
for printing statistics, |
print.tables |
a logical value that determines if tables of observed and synthesised data are to be printed. |
print.stats |
a single string or a vector of strings with utility measures to be printed out. |
print.zdiff |
a logical value that determines if tables of Z scores for differences between observed and expected are to be printed. |
n |
number of observations in the original dataset. |
k.syn |
a logical indicator as to whether the sample size itself has been synthesised. |
Nowok, B., Raab, G.M. and Dibben, C. (2016). synthpop: Bespoke creation of synthetic data in R. Journal of Statistical Software, 74(11), 1-26. doi: 10.18637/jss.v074.i11.
Raab, G.M., Nowok, B. and Dibben, C. (2021). Assessing, visualizing and improving the utility of synthetic data. Available from https://arxiv.org/abs/2109.12717.
Read, T.R.C. and Cressie, N.A.C. (1988). Goodness-of-Fit Statistics for Discrete Multivariate Data. Springer-Verlag, New York.
Voas, D. and Williamson, P. (2001). Evaluating goodness-of-fit measures for synthetic microdata. Geographical and Environmental Modelling, 5(2), 177-200.
utility.gen
library(synthpop)  # provides syn(), utility.tab() and the SD2011 data set

ods <- SD2011[1:1000, c("sex", "age", "marital", "nofriend")]
s1 <- syn(ods, m = 10, cont.na = list(nofriend = -8))
utility.tab(s1, ods, vars = c("marital", "sex"), print.stats = "all")

s2 <- syn(ods, m = 1, cont.na = list(nofriend = -8))
u2 <- utility.tab(s2, ods, vars = c("marital", "age", "sex"), ngroups = 3)
print(u2, print.tables = TRUE, print.zdiff = TRUE)

### synthetic data provided as 'data.frame'
utility.tab(s2$syn, ods, vars = c("marital", "nofriend"), ngroups = 3,
            print.tables = TRUE, cont.na = list(nofriend = -8), digits = 4)