compare.synds: Compare univariate distributions of synthesised and observed...
In synthpop: Generating Synthetic Versions of Sensitive Microdata for Statistical Disclosure Control

compare.synds

R Documentation

Compare univariate distributions of synthesised and observed data

Description

Compare synthesised data set with the original (observed) data set using percent frequency tables and histograms. When more than one synthetic data set has been generated (object$m > 1), by default pooled synthetic data are used for comparison.

This function can be also used with synthetic data NOT created by syn(), but then an additional parameter cont.na might need to be provided.

Usage

## S3 method for class 'synds'
compare(object, data, vars = NULL,
        msel = NULL, stat = "percents", breaks = 20, ngroups =5,
        nrow = 2, ncol = 2, rel.size.x = 1,
        utility.stats = c("pMSE", "S_pMSE", "df"),
        utility.for.plot = "S_pMSE",
        cols = c("#1A3C5A","#4187BF"),
        plot = TRUE, table = FALSE, 
        print.flag = TRUE, ...)

## S3 method for class 'data.frame'
compare(object, data, vars = NULL, cont.na = NULL,
        msel = NULL, stat = "percents", breaks = 20,ngroups = 5,
        nrow = 2, ncol = 2, rel.size.x = 1,
        utility.stats = c("pMSE", "S_pMSE", "df"),
        utility.for.plot = "S_pMSE",
        cols = c("#1A3C5A","#4187BF"),
        plot = TRUE, table = FALSE, 
        print.flag = TRUE, compare.synorig = TRUE, ...)

## S3 method for class 'list'
compare(object, data, vars = NULL, cont.na = NULL,
        msel = NULL, stat = "percents", breaks = 20,ngroups = 5,
        nrow = 2, ncol = 2, rel.size.x = 1,
        utility.stats = c("pMSE", "S_pMSE", "df"),
        utility.for.plot = "S_pMSE",
        cols = c("#1A3C5A","#4187BF"),
        plot = TRUE, table = FALSE, 
        print.flag = TRUE, compare.synorig = TRUE, ...)

## S3 method for class 'compare.synds'
print(x, ...)

Arguments

`object`	an object of class `synds`, which stands for 'synthesised data set'. It is typically created by function `syn()` and it includes `object$m` synthesised data set(s) as `object$syn`. Alternatively, when data are synthesised not using `syn()`, it can be a data frame with a synthetic data set or a list of data frames with synthetic data sets, all created from the same original data with the same variables and the same method.
`data`	an original (observed) data set.
`vars`	variables to be compared. If `vars` is `NULL` (the default) all synthesised variables are compared.
`cont.na`	a named list of codes for missing values for continuous variables if different from the `R` missing data code `NA`. The names of the list elements must correspond to the variables names for which the missing data codes need to be specified.
`msel`	index or indices of synthetic data copies for which a comparison is to be made. If `NULL` pooled synthetic data copies are compared with the original data.
`stat`	determines whether tables and plots present percentages `stat = "percents"`, the default, or counts `stat = "counts"`. If `m > 1` and `msel = NULL` average counts for synthetic data are derived.
`breaks`	the number of cells for the histogram.
`ngroups`	the number of groups used to categorise numeric variables when calculating the one-way utility measures.
`nrow`	the number of rows for the plotting area.
`ncol`	the number of columns for the plotting area.
`rel.size.x`	a number representing the relative size of x-axis labels.
`utility.stats`	a single string or a vector of strings that determines which utility measures to print. Must be a selection from: `"VW"`, `"FT"`,`"JSD"`, `"SPECKS"`, `"WMabsDD"`, `"U"`, `"G"`, `"pMSE"`, `"PO50"`, `"MabsDD"`, `"dBhatt"`, `"S_VW"`, `"S_FT"`, `"S_JSD"`, `"S_WMabsDD"`, `"S_G"`, `"S_pMSE"`, `"df"`. If `utility.stats = "all"`, all of these will be printed. For more information see the details section for `utility.tab`.
`utility.for.plot`	a single string that determines which utility measure to print in facet labels of the plot. Set to `NULL` to print variable names only.
`cols`	bar colors.
`plot`	a logical value with default set to `TRUE` indicating whether plots should be produced.
`table`	a logical value with default set to `FALSE` indicating whether tables should be printed.
`print.flag`	a logical value with default set to `TRUE` indicating whether a message should be printed as the statistic for each variable are calculated.
`compare.synorig`	a logical value to determine if the functions `synorig.compare()` should be used to check that data sets can be compared. Used when the synthetic data are supplied as a data.frame or a list when default set to TRUE.
`...`	additional parameters.
`x`	an object of class `compare.synds`.

Details

Missing data categories for numeric variables are plotted on the same plot as non-missing values. They are indicated by miss. suffix.

Numeric variables with fewer than 6 distinct values are changed to factors in order to make plots more readable.

Value

An object of class compare.synds which is a list including a list of comparative frequency tables (tables) and a ggplot object (plots) with bar charts/histograms. If multiple plots are produced they and their corresponding frequency tables are stored as a list.

References

Nowok, B., Raab, G.M and Dibben, C. (2016). synthpop: Bespoke creation of synthetic data in R. Journal of Statistical Software, 74(11), 1-26. \Sexpr[results=rd]{tools:::Rd_expr_doi("10.18637/jss.v074.i11")}.

Examples

ods <- SD2011[ , c("sex", "age", "edu", "marital", "ls", "income")]
s1  <- syn(ods, cont.na = list(income = -8))

### synthetic data provided as a 'synds' object
compare(s1, ods, vars = "ls")
compare(s1, ods, vars = "income", stat = "counts",
        table = TRUE, breaks = 10)

### synthetic data provided as 'data.frame'
compare(s1$syn, ods, vars = "ls")
compare(s1$syn, ods, vars = "income", cont.na = list(income = -8),
        stat = "counts", table = TRUE, breaks = 10)

synthpop documentation built on June 8, 2025, 1:31 p.m.