utility.gen  R Documentation 
Distributional comparison of synthesised data set with the original (observed) data set using propensity scores.
This function can also be used with synthetic data NOT created by syn(), but then the additional parameters not.synthesised and cont.na might need to be provided.
## S3 method for class 'synds'
utility.gen(object, data, method = "cart", maxorder = 1,
            k.syn = FALSE, tree.method = "rpart", max.params = 400,
            print.stats = c("pMSE", "S_pMSE"), resamp.method = NULL,
            nperms = 50, cp = 1e-3, minbucket = 5, mincriterion = 0,
            vars = NULL, aggregate = FALSE, maxit = 200, ngroups = NULL,
            print.flag = TRUE, print.every = 10, digits = 6,
            print.zscores = FALSE, zthresh = 1.6,
            print.ind.results = FALSE,
            print.variable.importance = FALSE, ...)

## S3 method for class 'data.frame'
utility.gen(object, data, not.synthesised = NULL, cont.na = NULL,
            method = "cart", maxorder = 1,
            k.syn = FALSE, tree.method = "rpart", max.params = 400,
            print.stats = c("pMSE", "S_pMSE"), resamp.method = NULL,
            nperms = 50, cp = 1e-3, minbucket = 5, mincriterion = 0,
            vars = NULL, aggregate = FALSE, maxit = 200, ngroups = NULL,
            print.flag = TRUE, print.every = 10, digits = 6,
            print.zscores = FALSE, zthresh = 1.6,
            print.ind.results = FALSE,
            print.variable.importance = FALSE, ...)

## S3 method for class 'list'
utility.gen(object, data, not.synthesised = NULL, cont.na = NULL,
            method = "cart", maxorder = 1,
            k.syn = FALSE, tree.method = "rpart", max.params = 400,
            print.stats = c("pMSE", "S_pMSE"), resamp.method = NULL,
            nperms = 50, cp = 1e-3, minbucket = 5, mincriterion = 0,
            vars = NULL, aggregate = FALSE, maxit = 200, ngroups = NULL,
            print.flag = TRUE, print.every = 10, digits = 6,
            print.zscores = FALSE, zthresh = 1.6,
            print.ind.results = FALSE,
            print.variable.importance = FALSE, ...)

## S3 method for class 'utility.gen'
print(x, digits = NULL, zthresh = NULL, print.zscores = NULL,
      print.stats = NULL, print.ind.results = NULL,
      print.variable.importance = NULL, ...)
object 
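it can be an object of class synds, as created by syn(). Alternatively, when the data were not synthesised with syn(), it can be a data frame holding a synthetic data set or a list of such data frames.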
data 
the original (observed) data set. 
not.synthesised 
a vector of variable names for any variables that have been left unchanged in the synthetic data. Not required if object is of class synds.
cont.na 
a named list of codes for missing values for continuous variables if different from the R missing-data code NA.
method 
a single string specifying the method for modeling the propensity scores. Method can be selected from "logit" and "cart".
maxorder 
maximum order of interactions to be considered in the "logit" method.
k.syn 
a logical indicator as to whether the sample size itself has been synthesised. 
tree.method 
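implementation of classification and regression trees used when method = "cart". It can be "rpart" (the default) or "ctree".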
max.params 
the maximum number of parameters for a "logit" model.
print.stats 
statistics to be printed must be a selection from

resamp.method 
method used for resampling estimates of standardized
measures can be 
nperms 
number of permutations for the permutation test to obtain the
null distribution of the utility measure when 
cp 
complexity parameter for classification with tree.method = "rpart".
minbucket 
minimum number of observations allowed in a leaf for
classification when 
mincriterion 
criterion between 0 and 1 to use to control

vars 
variables to be included in the utility comparison. It can be a character vector of names of variables or an integer vector of their column indices. If none are specified all the variables in the synthesised data will be included. 
aggregate 
logical flag as to whether the data should be aggregated by
collapsing identical rows before computation. This can lead to much faster
computation when all the variables are categorical. Only works for

maxit 
maximum iterations to use when 
ngroups 
target number of groups for categorisation of each numeric
variable: final number may differ if there are many repeated values. If

print.flag 
TRUE/FALSE to indicate if any messages should be printed during calculations. Change to FALSE for simulations. 
print.every 
controls the printing of progress of resampling when

... 
additional parameters passed to 
x 
an object of class utility.gen.
digits 
number of digits to print in the default output values. 
zthresh 
threshold value to use to suppress the printing of z-scores
under 
print.zscores 
logical value as to whether z-scores for coefficients of the logit model should be printed. 
print.ind.results 
logical value as to whether utility score results from individual syntheses should be printed. 
print.variable.importance 
logical value as to whether the variable
importance measure should be printed when 
This function follows the method for evaluating the utility of masked data as given in Snoke et al. (2018) and originally proposed by Woo et al. (2009). The original and synthetic data are combined into one dataset and propensity scores, as detailed in Rosenbaum and Rubin (1983), are calculated to estimate the probability of membership in the synthetic data set. The utility measure is based on the mean squared difference between these probabilities and the probability expected if the data did not distinguish the synthetic data from the original.
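The calculation described above can be sketched in base R. This is a minimal illustration that assumes a main-effects logistic model fitted with glm(); the helper name pmse_logit and the toy data frames are hypothetical and are not part of synthpop, and utility.gen() additionally handles interactions, factors and missing values.

```r
# Hypothetical helper (not part of synthpop): pMSE from a main-effects logit model.
pmse_logit <- function(orig, synth) {
  combined <- rbind(orig, synth)
  # Indicator t: 0 for original rows, 1 for synthetic rows
  combined$t <- rep(c(0, 1), times = c(nrow(orig), nrow(synth)))
  # Propensity scores: estimated probability of being a synthetic record
  fit <- glm(t ~ ., data = combined, family = binomial)
  p_hat <- fitted(fit)
  # Probability expected if the data cannot be told apart
  c_exp <- nrow(synth) / nrow(combined)
  mean((p_hat - c_exp)^2)
}

set.seed(1)
orig     <- data.frame(x = rnorm(200), y = rnorm(200))
good_syn <- data.frame(x = rnorm(200), y = rnorm(200))            # same distribution
bad_syn  <- data.frame(x = rnorm(200, mean = 1), y = rnorm(200))  # shifted mean

pmse_good <- pmse_logit(orig, good_syn)
pmse_bad  <- pmse_logit(orig, bad_syn)
```

A synthetic data set that is easier to distinguish from the original (here bad_syn) gives a larger pMSE.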
If k.syn = FALSE the expected probability is just the proportion of synthetic data in the combined data set, 0.5 when the original and synthetic data have the same number of records. Setting k.syn = TRUE indicates that the number of observations in the synthetic data was synthesised and not fixed by the synthesiser. In this case the expected probability will be 0.5 in all cases and the model to discriminate between observed and synthetic will include an intercept term. This will usually only apply when the standalone version of this function, utility.gen.sa(), is used.
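As a small illustration of the rule just described, the expected probability can be written as follows. The function name expected_prob is hypothetical and only mirrors the text; it is not a synthpop function.

```r
# Hypothetical helper mirroring the description of k.syn in the text.
expected_prob <- function(n_orig, n_syn, k.syn = FALSE) {
  if (k.syn) {
    0.5                       # sample size itself was synthesised
  } else {
    n_syn / (n_orig + n_syn)  # proportion of synthetic records
  }
}

p_equal   <- expected_prob(1000, 1000)                # equal sizes
p_unequal <- expected_prob(1000, 3000)                # three times as many synthetic rows
p_ksyn    <- expected_prob(1000, 3000, k.syn = TRUE)
```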
Propensity scores can be modeled by logistic regression (method = "logit") or by two different implementations of classification and regression trees (method = "cart"). For logistic regression the predictors are all variables in the data and their interactions up to order maxorder. The default of 1 gives all main effects and first-order interactions. For logistic regression the null distribution of the propensity score is derived and is used to calculate ratios and standardised values.
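As a sketch of how such ratios can be formed: Snoke et al. (2018) give the null expectation of the pMSE for a logit model with k coefficients (including the intercept) as (k - 1)(1 - c)^2 c / N and its standard deviation as sqrt(2(k - 1))(1 - c)^2 c / N, where N is the number of records in the combined data and c the proportion that are synthetic. The helper names below are hypothetical; this is not synthpop's internal code.

```r
# Hypothetical helpers: null moments of pMSE for a logit model (Snoke et al., 2018).
pmse_null_logit <- function(k, n_orig, n_syn) {
  N     <- n_orig + n_syn
  c_exp <- n_syn / N
  scale <- (1 - c_exp)^2 * c_exp / N
  list(expected = (k - 1) * scale,
       sd       = sqrt(2 * (k - 1)) * scale)
}

# Ratio of an observed pMSE to its null expectation
pmse_ratio <- function(pmse, null_moments) pmse / null_moments$expected

# Example: 6 coefficients, 1000 original and 1000 synthetic records
null_moments <- pmse_null_logit(k = 6, n_orig = 1000, n_syn = 1000)
ratio        <- pmse_ratio(6.25e-4, null_moments)
```

A ratio close to 1 is what would be expected if the synthetic data came from the same distribution as the original.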
For method = "cart" the expectation and variance of the null distribution are calculated from a permutation test. Our recent work indicates that this method can sometimes give misleading results.
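One plausible way to build such a permutation null distribution is sketched below with rpart. It assumes that the original/synthetic labels are shuffled and the tree refitted on each permutation; synthpop's internal resampling may differ in detail, and the helper names pmse_cart and perm_null are hypothetical.

```r
library(rpart)

# Hypothetical helper: pMSE from an rpart classification tree.
# 'combined' must contain a factor column t with levels "0" (original), "1" (synthetic).
pmse_cart <- function(combined, cp = 1e-3, minbucket = 5) {
  fit <- rpart(t ~ ., data = combined, method = "class",
               control = rpart.control(cp = cp, minbucket = minbucket))
  p_hat <- predict(fit, type = "prob")[, "1"]   # propensity of being synthetic
  mean((p_hat - mean(combined$t == "1"))^2)
}

# Hypothetical helper: null pMSE values obtained by permuting the labels.
perm_null <- function(combined, nperms = 20) {
  replicate(nperms, {
    permuted   <- combined
    permuted$t <- sample(permuted$t)            # break any link between data and label
    pmse_cart(permuted)
  })
}

set.seed(2)
orig  <- data.frame(x = rnorm(300),
                    g = factor(sample(c("a", "b"), 300, replace = TRUE)))
synth <- data.frame(x = rnorm(300, mean = 0.5),
                    g = factor(sample(c("a", "b"), 300, replace = TRUE)))
combined   <- rbind(orig, synth)
combined$t <- factor(rep(c(0, 1), each = 300))

observed  <- pmse_cart(combined)
null_vals <- perm_null(combined)
null_mean <- mean(null_vals)   # estimated null expectation
null_var  <- var(null_vals)    # estimated null variance
```

If a permuted tree makes no splits its pMSE is exactly 0, which distorts the estimated null expectation.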
If missing values exist, indicator variables are added and included in the model as recommended by Rosenbaum and Rubin (1984). For categorical variables, NA is treated as a new category.
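The treatment of missing values described above might look like the following in base R. This is a sketch only: the helper name, the "_miss" suffix and the choice to fill missing numeric values with 0 are assumptions made for illustration, not synthpop's documented behaviour.

```r
# Hypothetical helper: add missingness indicators before fitting a propensity model.
add_missing_indicators <- function(df) {
  for (v in names(df)) {
    if (is.factor(df[[v]])) {
      # Categorical variable: NA becomes an extra level
      df[[v]] <- addNA(df[[v]], ifany = TRUE)
    } else if (is.numeric(df[[v]]) && anyNA(df[[v]])) {
      # Numeric variable: add a 0/1 indicator and fill the gaps
      # (filling with 0 is an assumption made for this sketch)
      df[[paste0(v, "_miss")]] <- as.integer(is.na(df[[v]]))
      df[[v]][is.na(df[[v]])] <- 0
    }
  }
  df
}

toy      <- data.frame(bmi = c(22.5, NA, 30.1),
                       sex = factor(c("F", NA, "M")))
prepared <- add_missing_indicators(toy)
```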
An object of class utility.gen, which is a list including the utility measures and their expected null values for each synthetic set, with the following components:
call 
the call that produced the result. 
m 
number of synthetic data sets in object. 
method 
method used to fit propensity score. 
tree.method 
cart function used to fit propensity score when method = "cart".
resamp.method 
type of resampling used to get 
maxorder 
see above. 
vars 
see above. 
nfix 
see above. 
aggregate 
see above. 
maxit 
see above. 
ngroups 
see above. 
df 
degrees of freedom for the chi-squared test for logit models
derived from the number of non-aliased coefficients in the logistic model,
minus 
mincriterion 
see above. 
nperms 
see above. 
incomplete 
TRUE/FALSE indicator if any of the variables being compared are not synthesised. 
pMSE 
propensity score mean square error from the utility model or a
vector of these values if 
S_pMSE 
ratio(s) of 
PO50 
percentage over 50% of each synthetic data set where the model used correctly predicts whether real or synthetic. 
S_PO50 
ratio(s) of 
SPECKS 
Kolmogorov-Smirnov statistic to compare the propensity scores for the original and synthetic records. 
S_SPECKS 
ratio(s) of 
print.stats 
see above. 
fit 
the fitted model for the propensity score or a list of fitted
models of length 
nosplits 
for resampling methods and cart models, a list of the number of times from the total each resampled cart model failed to select any splits to classify the indicator. Indicates that this method is not working correctly and results should not be used but a logit model selected instead. 
digits 
see above. 
print.ind.results 
see above. 
print.zscores 
see above. 
zthresh 
see above. 
print.variable.importance 
see above. 
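The pMSE, PO50 and SPECKS components listed above can all be derived from a vector of fitted propensity scores. The sketch below follows the descriptions given in this list; the helper name utility_stats is hypothetical, and the exact threshold and tie handling used inside synthpop may differ.

```r
# Hypothetical helper: utility statistics from propensity scores.
# p_hat: estimated probability that a record is synthetic
# t:     1 for synthetic records, 0 for original records
utility_stats <- function(p_hat, t) {
  c_exp <- mean(t == 1)
  # Records classified correctly when thresholding at the expected probability
  correct <- sum(p_hat[t == 1] > c_exp) + sum(p_hat[t == 0] < c_exp)
  # Kolmogorov-Smirnov distance between the two sets of propensity scores
  specks <- suppressWarnings(ks.test(p_hat[t == 1], p_hat[t == 0])$statistic)
  c(pMSE   = mean((p_hat - c_exp)^2),
    PO50   = 100 * correct / length(t) - 50,
    SPECKS = unname(specks))
}

t_ind <- rep(c(0, 1), each = 4)
p_hat <- c(0.2, 0.3, 0.4, 0.6,    # original records
           0.45, 0.7, 0.8, 0.9)   # synthetic records
res   <- utility_stats(p_hat, t_ind)
```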
Woo, M.-J., Reiter, J.P., Oganian, A. and Karr, A.F. (2009). Global measures of data utility for microdata masked for disclosure limitation. Journal of Privacy and Confidentiality, 1(1), 111-124.
Rosenbaum, P.R. and Rubin, D.B. (1984). Reducing bias in observational studies using subclassification on the propensity score. Journal of the American Statistical Association, 79(387), 516-524.
Snoke, J., Raab, G.M., Nowok, B., Dibben, C. and Slavkovic, A. (2018). General and specific utility measures for synthetic data. Journal of the Royal Statistical Society: Series A, 181, Part 3, 663-688.
utility.tab
## Not run:
ods <- SD2011[1:1000, c("age", "bmi", "depress", "alcabuse", "nofriend")]
s1 <- syn(ods, m = 5, method = "parametric", cont.na = list(nofriend = -8))

### synthetic data provided as a 'synds' object
u1 <- utility.gen(s1, ods)
print(u1, print.zscores = TRUE, zthresh = 1, digits = 6)

u2 <- utility.gen(s1, ods, ngroups = 3, print.flag = FALSE)
print(u2, print.zscores = TRUE)

u3 <- utility.gen(s1, ods, method = "cart", nperms = 20)
print(u3, print.variable.importance = TRUE)

### synthetic data provided as 'list'
utility.gen(s1$syn, ods, cont.na = list(nofriend = -8))
## End(Not run)