utility.tab  R Documentation 
Produces tables from observed and synthesised data and calculates utility measures to compare them with their expectation if the synthesising model is correct.
It can also be used with synthetic data not created by syn(), but then an additional parameter cont.na might need to be provided.
## S3 method for class 'synds'
utility.tab(object, data, vars = NULL, ngroups = 5, useNA = TRUE,
  max.table = 1e6, print.tables = length(vars) < 4,
  print.stats = c("pMSE", "S_pMSE", "df"), print.zdiff = FALSE,
  print.flag = TRUE, digits = 4, k.syn = FALSE, ...)

## S3 method for class 'data.frame'
utility.tab(object, data, vars = NULL, cont.na = NULL, ngroups = 5,
  useNA = TRUE, max.table = 1e6, print.tables = length(vars) < 4,
  print.stats = c("pMSE", "S_pMSE", "df"), print.zdiff = FALSE,
  print.flag = TRUE, digits = 4, k.syn = FALSE, ...)

## S3 method for class 'list'
utility.tab(object, data, vars = NULL, cont.na = NULL, ngroups = 5,
  useNA = TRUE, max.table = 1e6, print.tables = length(vars) < 4,
  print.stats = c("pMSE", "S_pMSE", "df"), print.zdiff = FALSE,
  print.flag = TRUE, digits = 4, k.syn = FALSE, ...)

## S3 method for class 'utility.tab'
print(x, print.tables = NULL, print.zdiff = NULL, print.stats = NULL,
  digits = NULL, ...)
object 
an object of class 
data 
the original (observed) data set. 
vars 
a single string or a vector of strings with the names of variables to be used to form the table. 
cont.na 
a named list of codes for missing values for continuous
variables if different from the 
max.table 
a maximum table size. You could try increasing the default value, but memory problems are likely. 
ngroups 
if numerical (non-factor) variables are included they will be
classified into this number of groups to form tables. Classification is
performed using 
useNA 
determines if NA values are to be included in tables. 
print.tables 
a logical value that determines if tables of observed and synthesised data are to be printed. By default tables are printed if they have up to three dimensions. 
print.stats 
a single string or a vector of strings that determines
which utility measures to print. Must be a selection from:

print.zdiff 
a logical value that determines if tables of Z scores for differences between observed and expected are to be printed. 
print.flag 
a logical value that determines if messages are to be printed during computation. 
digits 
an integer indicating the number of decimal places for printing
statistics, 
k.syn 
a logical indicator as to whether the sample size itself has
been synthesised. The default value is 
... 
additional parameters; these can be passed to the classIntervals() function. 
x 
an object of class 
Forms tables of observed and synthesised values for the variables specified in vars. Several utility measures are calculated from the cells of the tables, as described below. Details of all of these measures can be found in Raab et al. (2021). If the synthesising model is correct, the measures VW, FT, G and JSD should have chi-square distributions with df degrees of freedom for large samples. Standardised versions of each measure are available (e.g. S_VW for VW, where S_VW = VW/df) that will have an expected value of 1 if the synthesising model is correct.
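As a rough illustration of the standardisation described above, the following base R sketch starts from a made-up statistic value and made-up degrees of freedom (neither is output from utility.tab()). It computes the standardised value and an upper-tail chi-square probability, and checks by simulation that a chi-square variable divided by its degrees of freedom averages about 1.

```r
# Hypothetical values for illustration only; not produced by utility.tab().
VW <- 12.3   # assumed value of a chi-square-type utility measure
df <- 9      # assumed degrees of freedom

# Standardised measure: expected value 1 if the synthesising model is correct.
S_VW <- VW / df

# Upper-tail probability under a chi-square distribution with df degrees of freedom.
p_value <- pchisq(VW, df = df, lower.tail = FALSE)

# The mean of a chi-square variable equals its degrees of freedom,
# so its ratio to df has expectation 1.
set.seed(1)
sim_ratio <- mean(rchisq(1e5, df = df)) / df
```

A standardised value well above 1 would suggest that the table from the synthetic data differs from the observed one by more than the synthesising model would predict.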
Four other measures are calculated by considering the table as a prediction model. These are the propensity score mean-squared error pMSE; two measures from a comparison of propensity scores for the synthetic and original data, the Kolmogorov-Smirnov statistic SPECKS and the Wilcoxon rank-sum statistic U; and PO50, the percentage of the observations correctly predicted in the combined tables over 50%, where the majority of observations in each grouping are in agreement with the category (real or synthetic) of the observation. The first of these, pMSE, is identical to VW except for a constant. No expected values are computed for the last three of these measures, but they can be obtained by replication from utility.gen().
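To make the "table as a prediction model" idea concrete, here is a minimal base R sketch using made-up cell counts. It assumes the usual definition of pMSE as the mean squared deviation of each record's propensity score from the overall share of synthetic records, with the score of a cell taken as the proportion of its records that are synthetic. The names obs_tab, syn_tab and VW_like are hypothetical, and the formulas are not claimed to match the internals of utility.tab().

```r
# Made-up cell counts for one two-way table (same layout for both data sets).
obs_tab <- matrix(c(40, 60, 55, 45), nrow = 2,
                  dimnames = list(sex = c("F", "M"), grp = c("a", "b")))
syn_tab <- matrix(c(50, 50, 48, 52), nrow = 2, dimnames = dimnames(obs_tab))

N     <- sum(obs_tab) + sum(syn_tab)  # records in the combined data
c_syn <- sum(syn_tab) / N             # overall share of synthetic records

# Propensity score of each cell: share of its records that are synthetic.
keep  <- (obs_tab + syn_tab) > 0      # ignore cells empty in both tables
o     <- obs_tab[keep]
s     <- syn_tab[keep]
score <- s / (o + s)

# pMSE: mean squared deviation of the scores from c_syn over all records.
pMSE <- sum((o + s) * (score - c_syn)^2) / N

# With equal sample sizes, pMSE is a constant multiple of a
# Voas-Williamson-type sum of scaled squared differences (assumed form).
VW_like <- sum((s - o)^2 / ((s + o) / 2))
all.equal(pMSE, VW_like / (8 * N))
```

In this equal-sample-size example the constant is 1/(8N). The sketch does not attempt the general case, where the constant depends on the ratio of the two sample sizes.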
Three further measures are calculated from the tables. The first is the mean absolute difference in distributions, MabsDD, the average absolute difference in the proportions of original and synthetic data over all the cells in the table. The second is a weighted version of this measure, WMabsDD, where the weights are proportional to the inverse of the variance of the absolute differences so that this measure can be standardised by its expected value, df. The last is the Bhattacharyya distance, BhattD, derived from the overlap of the histograms of the original and synthetic data sets.
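The unweighted measure can be sketched in a few lines of base R. The example below uses made-up cell counts and reads the description literally (a plain mean of absolute differences in cell proportions). It also computes the overlap of the two sets of proportions as a Bhattacharyya coefficient. How utility.tab() scales either quantity, or turns the overlap into a distance, is not specified here, so treat the names and formulas as assumptions.

```r
# Made-up cell counts, one entry per cell of the table.
obs_tab <- c(40, 60, 55, 45)
syn_tab <- c(50, 50, 48, 52)

# Cell proportions within each data set.
p_obs <- obs_tab / sum(obs_tab)
p_syn <- syn_tab / sum(syn_tab)

# Mean absolute difference in proportions over all cells (literal reading).
mabsdd <- mean(abs(p_obs - p_syn))

# Bhattacharyya coefficient: overlap of the two distributions,
# equal to 1 when the proportions are identical and smaller otherwise.
bc <- sum(sqrt(p_obs * p_syn))
```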
An object of class utility.tab, which is a list with the following components:
m 
number of synthetic data sets in object, i.e. 
VW 
a vector with 
FT 
a vector with 
JSD 
a vector with 
SPECKS 
a vector with 
WMabsDD 
a vector with 
U 
a vector with 
G 
a vector with 
pMSE 
a vector with 
PO50 
a vector with 
MabsDD 
a vector with 
dBhatt 
a vector with 
S_VW 

S_FT 

S_JSD 

S_WMabsDD 
WMabsDD/df. 
S_G 

S_pMSE 
standardised measure from 
df 
a vector of degrees of freedom for the chi-square tests, which are equal
to the number of cells in the tables with any observed or
synthesised counts minus one when 
dfG 
degrees of freedom used in standardising 
nempty 
a vector of length 
tab.obs 
a table from the observed data. 
tab.syn 
a table or a list of 
tab.zdiff 
a table or a list of 
digits 
an integer indicating the number of decimal places
for printing statistics, 
print.tables 
a logical value that determines if tables of observed and synthesised data are to be printed. 
print.stats 
a single string or a vector of strings with utility measures to be printed out. 
print.zdiff 
a logical value that determines if tables of Z scores for differences between observed and expected are to be printed. 
n 
number of observations in the original data set. 
k.syn 
a logical indicator as to whether the sample size itself has been synthesised. 
Nowok, B., Raab, G.M. and Dibben, C. (2016). synthpop: Bespoke creation of synthetic data in R. Journal of Statistical Software, 74(11), 1-26. doi: 10.18637/jss.v074.i11.
Raab, G.M., Nowok, B. and Dibben, C. (2021). Assessing, visualizing and improving the utility of synthetic data. Available from https://arxiv.org/abs/2109.12717.
Read, T.R.C. and Cressie, N.A.C. (1988). Goodness-of-Fit Statistics for Discrete Multivariate Data. Springer-Verlag, New York.
Voas, D. and Williamson, P. (2001). Evaluating goodness-of-fit measures for synthetic microdata. Geographical and Environmental Modelling, 5(2), 177-200.
utility.gen
library(synthpop)  # provides syn(), utility.tab() and the SD2011 data

ods <- SD2011[1:1000, c("sex", "age", "marital", "nofriend")]

### ten syntheses, all utility measures for a two-way table
s1 <- syn(ods, m = 10, cont.na = list(nofriend = -8))
utility.tab(s1, ods, vars = c("marital", "sex"), print.stats = "all")

### one synthesis, three-way table with age split into 3 groups
s2 <- syn(ods, m = 1, cont.na = list(nofriend = -8))
u2 <- utility.tab(s2, ods, vars = c("marital", "age", "sex"), ngroups = 3)
print(u2, print.tables = TRUE, print.zdiff = TRUE)

### synthetic data provided as 'data.frame'
utility.tab(s2$syn, ods, vars = c("marital", "nofriend"), ngroups = 3,
            print.tables = TRUE, cont.na = list(nofriend = -8), digits = 4)