View source: R/functions.syn.r
syn.catall | R Documentation |
A saturated model is fitted to a table produced by cross-tabulating all the variables.
syn.catall(x, k, proper = FALSE, priorn = 1, structzero = NULL,
           maxtable = 1e8, epsilon = 0, delta = 0.05, rand = TRUE,
           noisetype = "", ...)
x |
a data frame ( |
k |
a number of rows in each synthetic data set - defaults to |
proper |
if |
priorn |
the sum of the parameters of the Dirichlet prior, which can be thought of as a pseudo-count giving the number of observations that inform prior knowledge about the parameters. |
structzero |
a named list of lists that defines which cells in the table
are structural zeros and will remain as zeros in the synthetic data, by
leaving their prior as zeros. Each element of the |
maxtable |
the number of cells in the cross-tabulation of all the variables above which a severe warning is triggered. |
epsilon |
measures the scale of the Laplace, Gaussian or Exponential noise to be added under differential privacy (DP). |
delta |
Parameter delta for Gaussian noise when this method is used to make the synthesis approximately differentially private (DP) |
rand |
for DP versions, determines whether multinomial noise is added to the DP counts. If it is set to FALSE, the DP-adjusted counts are simply rounded to whole numbers in a manner that preserves the desired sample size (k). |
noisetype |
One of "Laplace", "Gaussian" or "Exponential", determining the type of noise added to make the synthesis DP (Laplace, Exponential) or approximately DP (Gaussian). For noisetype "Gaussian" the synthesis will fail if epsilon > 1 or delta is not in the range 0-1. |
... |
additional parameters. |
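The two settings of rand described above can be illustrated with a small base R sketch. This is not the code used by syn.catall: the cell counts are hypothetical, the Laplace scale of 1/epsilon and the truncation of negative counts to zero are assumptions made for illustration, and round_to_total is a hypothetical helper (largest-remainder rounding) standing in for whatever rounding the package applies.

```r
set.seed(3)
counts  <- c(12, 0, 7, 31, 5)   # hypothetical cell counts
k       <- sum(counts)          # desired synthetic sample size
epsilon <- 1

# Laplace(0, b) noise drawn as the difference of two exponentials.
# ASSUMPTION: scale b = 1 / epsilon; the package may scale differently.
b <- 1 / epsilon
noisy <- counts +
  rexp(length(counts), rate = 1 / b) - rexp(length(counts), rate = 1 / b)

# ASSUMPTION: negative noisy counts are set to zero, then rescaled to sum to k.
noisy[noisy < 0] <- 0
target <- noisy / sum(noisy) * k

# rand = TRUE analogue: multinomial draw around the noisy proportions.
dp_rand <- as.vector(rmultinom(1, size = k, prob = target))

# rand = FALSE analogue: round to whole numbers while keeping the total at k.
round_to_total <- function(v, total) {
  base  <- floor(v)
  short <- total - sum(base)    # units still to allocate after flooring
  if (short > 0) {
    idx <- order(v - base, decreasing = TRUE)[seq_len(short)]
    base[idx] <- base[idx] + 1
  }
  base
}
dp_round <- round_to_total(target, k)
```

Both dp_rand and dp_round sum to k; the second involves no additional sampling variation beyond the noise itself.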
When used in the syn function, the group of categorical variables with
method = "catall" must all be together at the start of the visit.sequence.
Subsequent variables in visit.sequence are then synthesised conditional on
the synthesised values of the grouped variables.
A saturated model is fitted to a table produced by cross-tabulating all the
variables. Prior probabilities for the proportions in each cell of the table
are specified from the parameters of a Dirichlet distribution with the same
parameter for every cell in the table that is not a structural zero (see above).
The sum of these parameters is priorn, so that each one is priorn/N, where N
is the number of cells in the table that are not structural zeros.
The default priorn = 1 can be thought of as equivalent to the knowledge
that 1 observation would be equally likely to be in any cell that is not
a structural zero. The posterior expectation, given the observed counts,
for the probability of being in a cell with observed count n_i
is thus (n_i + priorn/N) / (n + priorn), where n is the total number of
observations in the table. The synthetic data are generated
from a multinomial distribution with parameters given by these probabilities.
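The calculation described above can be sketched in base R. This is an illustrative sketch rather than the package's implementation: the toy data frame is hypothetical, there are no structural zeros, and the posterior mean is normalised by the total observed count n plus priorn so that the cell probabilities sum to one.

```r
set.seed(1)
# Hypothetical data: two categorical variables with 3 and 2 levels.
x <- data.frame(
  a = factor(c("u", "u", "v", "v", "v", "w"), levels = c("u", "v", "w")),
  b = factor(c("p", "q", "p", "p", "q", "p"), levels = c("p", "q"))
)

tab    <- table(x)            # cross-tabulation of all the variables
counts <- as.vector(tab)      # observed cell counts n_i
N      <- length(counts)      # number of cells (none are structural zeros)
n      <- sum(counts)         # total number of observations
priorn <- 1

# Posterior mean of each cell probability under a symmetric Dirichlet prior
# with parameter priorn / N in every cell.
post <- (counts + priorn / N) / (n + priorn)

# Synthetic cell counts for k records, drawn from a multinomial distribution.
k <- 10
syn_counts <- as.vector(rmultinom(1, size = k, prob = post))

# Expand the cell counts back into one row per synthetic record.
cells    <- expand.grid(dimnames(tab), stringsAsFactors = TRUE)
syn_data <- cells[rep(seq_len(N), times = syn_counts), , drop = FALSE]
rownames(syn_data) <- NULL
```

Because every cell receives a positive prior weight, a combination of levels that never occurs in the observed data (here a = "w", b = "q") can still appear in syn_data.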
Unlike syn.satcat, which fits saturated conditional models,
the synthesised data can include any combination of variables, except
those defined by the combinations of variables in structzero.
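The difference between a structural zero and a cell that merely has no observations can be sketched as follows. The table and the logical mask struct are hypothetical, and the sketch only mimics the idea of leaving the prior at zero for structural-zero cells; it does not use the package's structzero list format.

```r
# Hypothetical 2 x 3 table of observed counts.
tab <- matrix(c(4, 0, 3, 2, 0, 1), nrow = 2,
              dimnames = list(a = c("a1", "a2"), b = c("b1", "b2", "b3")))

# Mark (a2, b1) as a structural zero; (a1, b3) is only a sampling zero.
struct <- matrix(FALSE, nrow = 2, ncol = 3, dimnames = dimnames(tab))
struct["a2", "b1"] <- TRUE

priorn <- 1
N <- sum(!struct)                         # cells that are not structural zeros
n <- sum(tab)                             # total number of observations
prior <- ifelse(struct, 0, priorn / N)    # prior stays at zero where struct is TRUE
post  <- (tab + prior) / (n + priorn)     # posterior mean cell probabilities

set.seed(2)
syn_counts <- rmultinom(1, size = 1000, prob = as.vector(post))
dim(syn_counts) <- dim(tab)
dimnames(syn_counts) <- dimnames(tab)
```

The structural-zero cell has probability exactly zero and so never appears in syn_counts, whereas the unobserved cell (a1, b3) keeps a small positive probability.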
NOTE that when the function is called by setting elements of method in
syn() to "catall", the parameters priorn, structzero, maxtable, epsilon,
and rand must be supplied to syn as e.g. catall.priorn.
A list with two components:
res |
a data frame of dimension |
fit |
the cross-tabulation of all the original variables used. |
library(synthpop)  # provides syn() and the SD2011 data set

ods <- SD2011[, c(1, 4, 5, 6, 2, 10, 11)]
table(ods[, c("placesize", "region")])

# Each `placesize_region` sublist:
# for each relevant level of `placesize` defined in the first element,
# the second element defines regions (variable `region`) that do not
# have places of that size.
struct.zero <- list(
  placesize_region = list(placesize = "URBAN 500,000 AND OVER",
                          region = c(2, 4, 5, 8:13, 16)),
  placesize_region = list(placesize = "URBAN 200,000-500,000",
                          region = c(3, 4, 10:11, 13)),
  placesize_region = list(placesize = "URBAN 20,000-100,000",
                          region = c(1, 3, 5, 6, 8, 9, 14:15)))

# You could use the object struct.zero in the command below,
# but devtools checking did not like it, so the list is written out instead.
syncatall <- syn(ods, method = c(rep("catall", 4), "ctree", "normrank", "ctree"),
                 catall.priorn = 2, catall.structzero = list(
                   placesize_region = list(placesize = "URBAN 500,000 AND OVER",
                                           region = c(2, 4, 5, 8:13, 16)),
                   placesize_region = list(placesize = "URBAN 200,000-500,000",
                                           region = c(3, 4, 10:11, 13)),
                   placesize_region = list(placesize = "URBAN 20,000-100,000",
                                           region = c(1, 3, 5, 6, 8, 9, 14:15))))