utility.tab: Tabular utility
In synthpop: Generating Synthetic Versions of Sensitive Microdata for Statistical Disclosure Control

Description

Produces tables from observed and synthesised data and calculates utility measures to compare them with their expectation if the synthesising model is correct.

It can also be used with synthetic data NOT created by syn(), but then an additional parameter cont.na might need to be provided.

Usage

## S3 method for class 'synds'
utility.tab(object, data, vars = NULL, ngroups = 5,
            useNA = TRUE, max.table = 1e6,
            print.tables = length(vars) < 4,
            print.stats = c("pMSE", "S_pMSE", "df"),
            print.zdiff = FALSE, print.flag = TRUE,
            digits = 4, k.syn = FALSE, ...)

## S3 method for class 'data.frame'
utility.tab(object, data, vars = NULL, cont.na = NULL,
            ngroups = 5, useNA = TRUE, max.table = 1e6,
            print.tables = length(vars) < 4,
            print.stats = c("pMSE", "S_pMSE", "df"),
            print.zdiff = FALSE, print.flag = TRUE,
            digits = 4, k.syn = FALSE,
            compare.synorig = TRUE, ...)

## S3 method for class 'list'
utility.tab(object, data, vars = NULL, cont.na = NULL,
            ngroups = 5, useNA = TRUE, max.table = 1e6,
            print.tables = length(vars) < 4,
            print.stats = c("pMSE", "S_pMSE", "df"),
            print.zdiff = FALSE, print.flag = TRUE,
            digits = 4, k.syn = FALSE, 
            compare.synorig = TRUE, ...)


## S3 method for class 'utility.tab'
print(x, print.tables = NULL,
      print.zdiff = NULL, print.stats = NULL,
      digits = NULL, ...)

Arguments

`object`	an object of class `synds`, which stands for 'synthesised data set'. It is typically created by function `syn()` or `syn.strata()`. It includes `object$m`, the number of synthesised data sets, and `object$syn`, which is the synthesised data set if `m = 1` or a list of `m` such data sets otherwise. Alternatively, when data are synthesised without using `syn()`, it can be a data frame with a synthetic data set or a list of data frames with synthetic data sets, all created from the same original data with the same variables and the same method.
`data`	the original (observed) data set.
`vars`	a single string or a vector of strings with the names of variables to be used to form the table.
`cont.na`	a named list of codes for missing values for continuous variables if different from the `R` missing data code `NA`. The names of the list elements must correspond to the names of the variables for which the missing data codes need to be specified.
`max.table`	a maximum table size. You could try increasing the default value, but memory problems are likely.
`ngroups`	if numerical (non-factor) variables are included they will be classified into this number of groups to form tables. Classification is performed using `classIntervals()` function for `n = ngroups`. By default, `style = "quantile"` to get appropriate groups for skewed data. Problems for variables with a small number of unique values are handled by selecting only unique values of breaks. Arguments of `classIntervals()` may be, however, specified in the call to `utility.tab()`.
`useNA`	determines if NA values are to be included in tables.
`print.tables`	a logical value that determines if tables of observed and synthesised data are to be printed. By default tables are printed if they have up to three dimensions.
`print.stats`	a single string or a vector of strings that determines which utility measures to print. Must be a selection from: `"VW"`, `"FT"`, `"JSD"`, `"SPECKS"`, `"WMabsDD"`, `"U"`, `"G"`, `"pMSE"`, `"PO50"`, `"MabsDD"`, `"dBhatt"`, `"S_VW"`, `"S_FT"`, `"S_JSD"`, `"S_WMabsDD"`, `"S_G"`, `"S_pMSE"`, `"df"`, `"dfG"`. If `print.stats = "all"`, all of these will be printed. For more information see the Details section below.
`print.zdiff`	a logical value that determines if tables of Z scores for differences between observed and expected are to be printed.
`print.flag`	a logical value that determines if messages are to be printed during computation.
`digits`	an integer indicating the number of decimal places for printing statistics, `tab.zdiff` and mean results for `m > 1`.
`k.syn`	a logical indicator as to whether the sample size itself has been synthesised. The default value is `FALSE`, which will apply to synthetic data created by synthpop.
`compare.synorig`	a logical value that determines whether the function `synorig.compare()` should be used to check that the data sets can be compared. Used only when the synthetic data are supplied as a data frame or a list; the default is `TRUE`.
`...`	additional parameters; can be passed to the `classIntervals()` function.
`x`	an object of class `utility.tab`.
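
The grouping described for `ngroups` can be pictured with a rough base-R analogue. This is a sketch only, not the package's internal code: it assumes quantile breaks computed with `quantile()` stand in for `classIntervals(style = "quantile")`, and the variable `x` is simulated purely for illustration.

```r
# Hypothetical skewed variable with many tied values at zero
set.seed(1)
x <- c(rexp(200), rep(0, 120))

ngroups <- 5

# Quantile breaks; ties at zero make the two lowest breaks identical
brks <- quantile(x, probs = seq(0, 1, length.out = ngroups + 1))

# Keep only unique breaks so that cut() does not fail on duplicates
brks <- unique(brks)

# Classify the variable; include.lowest keeps the minimum in the first group
groups <- cut(x, breaks = brks, include.lowest = TRUE)
table(groups)
```

Because duplicated breaks are dropped, a variable with few unique values can end up with fewer than `ngroups` groups.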

Details

Forms tables of observed and synthesised values for the variables specified in vars. Several utility measures are calculated from the cells of the tables, as described below. Details of all of these measures can be found in Raab et al. (2021).

If the synthesising model is correct, the measures VW, FT, G and JSD should have chi-square distributions with df degrees of freedom for large samples. Standardised versions of each measure are available (e.g. S_VW for VW, where S_VW = VW/df) that will have an expected value of 1 if the synthesising model is correct.

Four other measures are calculated by treating the table as a prediction model: the propensity score mean-squared error (pMSE); the Kolmogorov-Smirnov statistic (SPECKS) and the Wilcoxon rank-sum statistic (U), both from a comparison of the propensity scores for the synthetic and original data; and the percentage over 50% of observations correctly predicted in the combined tables (PO50), where an observation counts as correctly predicted when its category (real or synthetic) agrees with that of the majority of observations in its grouping. The first of these, pMSE, is identical to VW except for a constant. No expected values are computed for the last three of these measures, but they can be obtained by replication from utility.gen().

Three further measures are calculated from the tables. The first, MabsDD, is the mean absolute difference in distributions: the average absolute difference between the proportions of original and synthetic data over all the cells in the table. The second, WMabsDD, is a weighted version of this measure in which the weights are proportional to the inverse of the variance of the absolute differences, so that the measure can be standardised by its expected value, df. The third, dBhatt, is the Bhattacharyya distance derived from the overlap of the histograms of the original and synthetic data sets.
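
As a rough illustration of how a table can be read as a prediction model, the sketch below computes pMSE from two sets of cell counts in base R. The counts are made up. The per-cell propensity score is taken to be the synthetic share of each cell, and the null expectation `df * cc * (1 - cc)^2 / N` used for standardising is an assumption based on the usual pMSE definition, not code taken from synthpop. Values returned by utility.tab() should be treated as authoritative.

```r
# Hypothetical cell counts for a six-cell table (illustration only)
tab_obs <- c(120, 80, 60, 90, 100, 50)   # original data
tab_syn <- c(110, 90, 55, 95, 105, 45)   # synthetic data

N  <- sum(tab_obs) + sum(tab_syn)        # records in the combined data
cc <- sum(tab_syn) / N                   # overall share of synthetic records

# Only cells with any observed or synthesised counts contribute
used <- (tab_obs + tab_syn) > 0
tot  <- tab_obs[used] + tab_syn[used]

# Propensity score of every record in a cell: synthetic share of that cell
prop <- tab_syn[used] / tot

# pMSE: mean squared deviation of the propensity scores from cc
pMSE <- sum(tot * (prop - cc)^2) / N

# Degrees of freedom for the k.syn = FALSE case
df <- sum(used) - 1

# Standardised pMSE under the assumed null expectation
S_pMSE <- pMSE * N / (df * cc * (1 - cc)^2)

c(pMSE = pMSE, df = df, S_pMSE = S_pMSE)
```

A standardised value close to 1 is what would be expected if the synthesising model were correct. Values well above 1 point to a lack of fit in the tabulated variables.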

Value

An object of class utility.tab which is a list with the following components:

`m`	number of synthetic data sets in object, i.e. `object$m`.
`VW`	a vector with `object$m` values for the Voas-Williamson utility measure; linearly related to `pMSE`.
`FT`	a vector with `object$m` values for the Freeman-Tukey utility measure.
`JSD`	a vector with `object$m` values for the Jensen-Shannon divergence for comparing the tables.
`SPECKS`	a vector with `object$m` values for the Kolmogorov-Smirnov statistic for comparing the propensity scores for the original and synthetic data.
`WMabsDD`	a vector with `object$m` values of the weighted mean absolute difference in distributions for original and synthetic data.
`U`	a vector with `object$m` values of the Wilcoxon statistic comparing the propensity scores for the original and synthetic data.
`G`	a vector with `object$m` values for the adjusted likelihood ratio utility measure.
`pMSE`	a vector with `object$m` values of the propensity score mean-squared error; linearly related to `VW`.
`PO50`	a vector with `object$m` values of the percentage over 50% of observations correctly predicted from the propensity scores; linearly related to `SPECKS` and `MabsDD`.
`MabsDD`	a vector with `object$m` values of the mean absolute difference in distributions for original and synthetic data; linearly related to `SPECKS` and `PO50`.
`dBhatt`	a vector with `object$m` values of the Bhattacharyya distances between the synthetic and original data, linearly related to the square root of `FT`.
`S_VW`	`VW/df`.
`S_FT`	`FT/df`.
`S_JSD`	`JSD/df`.
`S_WMabsDD`	`WMabsDD/df`.
`S_G`	`G/df`.
`S_pMSE`	standardised measure from `pMSE`, identical to `S_VW`.
`df`	a vector of degrees of freedom for the chi-square tests, equal to the number of cells in the tables with any observed or synthesised counts minus one when `k.syn == FALSE`, or equal to the number of cells when `k.syn == TRUE`.
`dfG`	degrees of freedom used in standardising `G`.
`nempty`	a vector of length `object$m` with number of cells not contributing to the statistics.
`tab.obs`	a table from the observed data.
`tab.syn`	a table or a list of `m` tables from the synthetic data.
`tab.zdiff`	a table or a list of `m` tables of Z statistics for differences between observed and synthesised cells of the tables. Large absolute values indicate a large contribution to lack-of-fit.
`digits`	an integer indicating the number of decimal places for printing statistics, `tab.zdiff` and mean results for `m > 1`.
`print.tables`	a logical value that determines if tables of observed and synthesised data are to be printed.
`print.stats`	a single string or a vector of strings with utility measures to be printed out.
`print.zdiff`	a logical value that determines if tables of Z scores for differences between observed and expected are to be printed.
`n`	number of observations in the original data set.
`k.syn`	a logical indicator as to whether the sample size itself has been synthesised.
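
The components listed above can be read directly from the returned object. The following is a minimal sketch, assuming the synthpop package and its SD2011 data are available; the seed value and the choice of `m = 2` are arbitrary and only keep the example quick and repeatable.

```r
library(synthpop)

ods <- SD2011[1:1000, c("sex", "age", "marital", "nofriend")]
s   <- syn(ods, m = 2, cont.na = list(nofriend = -8),
           seed = 2024, print.flag = FALSE)

# Store the result instead of printing it
u <- utility.tab(s, ods, vars = c("marital", "sex"), print.flag = FALSE)

u$m               # number of synthetic data sets
u$df              # degrees of freedom, one value per synthesis
u$S_pMSE          # standardised pMSE, one value per synthesis
mean(u$S_pMSE)    # average over the m syntheses
u$tab.zdiff[[1]]  # Z scores for the first synthesis (a list when m > 1)
```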

References

Nowok, B., Raab, G.M. and Dibben, C. (2016). synthpop: Bespoke creation of synthetic data in R. Journal of Statistical Software, 74(11), 1-26. doi:10.18637/jss.v074.i11.

Raab, G.M., Nowok, B. and Dibben, C. (2021). Assessing, visualizing and improving the utility of synthetic data. Available from https://arxiv.org/abs/2109.12717.

Read, T.R.C. and Cressie, N.A.C. (1988). Goodness-of-Fit Statistics for Discrete Multivariate Data. Springer-Verlag, New York.

Voas, D. and Williamson, P. (2001). Evaluating goodness-of-fit measures for synthetic microdata. Geographical and Environmental Modelling, 5(2), 177-200.

Examples

library(synthpop)

# Subset of the SD2011 survey data shipped with synthpop
ods <- SD2011[1:1000, c("sex", "age", "marital", "nofriend")]

# Ten syntheses; -8 is the missing data code for nofriend
s1 <- syn(ods, m = 10, cont.na = list(nofriend = -8))
utility.tab(s1, ods, vars = c("marital", "sex"), print.stats = "all")

# Single synthesis; age is grouped into three classes for the table
s2 <- syn(ods, m = 1, cont.na = list(nofriend = -8))
u2 <- utility.tab(s2, ods, vars = c("marital", "age", "sex"), ngroups = 3)
print(u2, print.tables = TRUE, print.zdiff = TRUE)

### synthetic data provided as 'data.frame'
utility.tab(s2$syn, ods, vars = c("marital", "nofriend"), ngroups = 3,
            print.tables = TRUE, cont.na = list(nofriend = -8), digits = 4)

synthpop documentation built on June 8, 2025, 1:31 p.m.