```r
library(simdata)
library(fitdistrplus)
library(dplyr)
library(ggplot2)
library(patchwork)
library(ggcorrplot)

knitr::opts_chunk$set(
  error = TRUE,
  eval = requireNamespace("NHANES", quietly = TRUE)
)
```
This document describes the workflow to define NORmal-To-Anything (NORTA) based simulation designs using the `simdata` package. The method is very useful for re-creating existing datasets through a parametric approximation for use in simulation studies. It is also quite easy to use and allows the definition of presets for sharing simulation setups. General details of the methodology and further references are given in e.g. @NORTA1 and @NORTA2.
In this vignette we prefix all relevant function calls with `::` to show which package implements the function. This is not necessary, but is done here for demonstration purposes.
The goal of the NORTA procedure is to produce independent and identically distributed (iid) samples from random variables with a given correlation structure (Pearson correlation matrix) and given marginal distributions, thereby e.g. approximating existing datasets.
Following @NORTA2, we want to sample iid replicates of the random vector $X = (X_1, X_2, \ldots, X_k)$. Denote by $F_i(s) = P(X_i \leq s)$ the distribution functions (i.e. the marginal distributions) of the components of $X$, and by $\Sigma_X$ the $k \times k$ correlation matrix of $X$. Then NORTA proceeds as follows:

1. Sample a random vector $Z = (Z_1, Z_2, \ldots, Z_k)$ from a multivariate standard normal distribution with correlation matrix $\Sigma_Z$.
2. Transform each component via $X_i = F_i^{-1}(\Phi(Z_i))$, where $\Phi$ denotes the distribution function of the standard normal distribution and $F_i^{-1}$ the quantile function of the $i$-th target marginal.
The resulting vector $X$ then has the desired marginal distributions. To obtain the target correlation structure $\Sigma_X$, the correlation matrix $\Sigma_Z$ for the first step has to be chosen appropriately. This can be achieved by solving a univariable optimisation problem for each pair of variables $X_i$ and $X_j$ in $X$; this functionality is part of the `simdata` package.
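To make the two steps concrete, the following minimal sketch implements the procedure directly, without `simdata`. It assumes $\Sigma_Z$ has already been determined, and the marginals (a gamma and a binomial distribution) are arbitrary illustrative choices:

```r
library(MASS)

set.seed(1)

# step 1: sample from a multivariate standard normal
# distribution with correlation matrix sigma_z
sigma_z = matrix(c(1, 0.6,
                   0.6, 1), nrow = 2)
z = MASS::mvrnorm(1000, mu = c(0, 0), Sigma = sigma_z)

# step 2: transform to uniform marginals via the standard normal
# CDF, then apply the target quantile functions F_i^{-1}
u = pnorm(z)
x = cbind(qgamma(u[, 1], shape = 2, rate = 1),
          qbinom(u[, 2], size = 1, prob = 0.3))

# the achieved correlation is close to, but not exactly, sigma_z,
# which is why sigma_z must be calibrated to hit a given target
cor(x)
```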
The NORTA procedure has some known limitations, which may lead to discrepancies between the target correlation structure and the correlation structure obtained from the sampling process. These are, however, partly alleviated when using existing datasets as templates, or by special techniques within `simdata`. In particular, not every target correlation matrix is attainable for a given set of marginal distributions, and the matrix $\Sigma_Z$ resulting from the optimisation step may not be positive definite. In the latter case, the `simdata` package ensures positive definiteness by using the closest positive definite matrix instead. This may lead to discrepancies between the target and the achieved correlation structure.
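To illustrate the idea of this repair step, the sketch below uses `Matrix::nearPD`, a general-purpose implementation of the closest positive definite matrix; `simdata` performs an analogous adjustment internally, though its exact routine may differ:

```r
library(Matrix)

# pairwise correlations that are mutually inconsistent:
# this "correlation" matrix is not positive definite
sigma = matrix(c(   1, 0.9, -0.9,
                  0.9,   1,  0.9,
                 -0.9, 0.9,    1), nrow = 3)
eigen(sigma)$values # contains a negative eigenvalue

# replace it by the closest positive definite correlation matrix
sigma_pd = as.matrix(Matrix::nearPD(sigma, corr = TRUE)$mat)
eigen(sigma_pd)$values # all eigenvalues positive

# the adjustment changes the entries, which explains possible
# deviations of the achieved from the target correlations
round(sigma_pd - sigma, 3)
```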
NORTA is well suited to re-create existing datasets through an explicit parametric approximation. Similar methods exist that achieve this through other means. A particularly interesting alternative is the generation of synthetic datasets using an approach closely related to multiple imputation, implemented in e.g. the `synthpop` R package (@Synthpop). Its primary aim is to achieve confidentiality by re-creating shareable copies of existing, sensitive datasets.
In comparison, `synthpop` potentially offers more flexible data generation than NORTA, thereby leading to a better approximation of an original dataset. However, it is more of a black-box approach: it requires little user input, but is also less transparent than the explicit, user-defined specification of correlation and marginal distributions in NORTA, and less amenable to manual curation of the simulation setup. Furthermore, NORTA allows easy changes to the design to obtain a wide variety of study designs from a single template dataset, whereas `synthpop` is targeted more at re-creating the original dataset. Both methods therefore have their distinct use cases and complement each other.
Defining NORTA designs with simdata

Given the outline of the method, all the user has to specify to define a NORTA design on $k$ variables are:

- the target $k \times k$ correlation matrix $\Sigma_X$, and
- a list of $k$ marginal distributions, one per variable, given as quantile functions.
These can be estimated from existing datasets of interest. `simdata` offers a helper function to automate this process, but the user can also specify the required input manually. We demonstrate both use cases in the example below.
The required marginal distributions are given as quantile functions. R provides implementations of many standard distributions which can be used directly; see the help on `distributions`. The quantile functions use the prefix "q", as in e.g. `qnorm` or `qbinom`. Further implementations can be found in the packages `extraDistr`, `actuar` and many others (see https://CRAN.R-project.org/view=Distributions).
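As a brief illustration, a quantile function maps probabilities in $[0, 1]$ to values of the variable; the distribution parameters below are arbitrary examples:

```r
# quantile functions map probabilities to values
qnorm(c(0.025, 0.5, 0.975), mean = 0, sd = 1)
qbinom(c(0.1, 0.5, 0.9), size = 1, prob = 0.5)
```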
In this example we will set up a NORTA based simulation design for a dataset extracted from the National Health And Nutrition Examination Survey (NHANES), accessible in R via several packages (we use the `NHANES` package in this demo).
First we will load the dataset and extract several variables of interest, namely gender ('Gender'), age ('Age'), race ('Race'), weight ('Weight'), BMI ('BMI'), and systolic ('BPsys') and diastolic blood pressure ('BPdia'). These variables demonstrate several different kinds of distributions. For a detailed description of the data, please see the documentation at https://www.cdc.gov/nchs/nhanes.htm and the documentation of the `NHANES` R package on CRAN.

Here we are not concerned with the exact codings of the variables, so we will remove the labels from factor variables and work with numeric codes. Further, we will only use data from a single survey round to keep the dataset small.
```r
df = NHANES::NHANES %>%
  filter(SurveyYr == "2011_12") %>%
  select(Gender, Age, Race = Race1, Weight, BMI,
         BPsys = BPSys1, BPdia = BPDia1) %>%
  filter(complete.cases(.)) %>%
  filter(Age > 18) %>%
  mutate(Gender = if_else(Gender == "male", 1, 2),
         Race = as.numeric(Race))

print(head(df))
```
Using this dataset, we first define the target correlation `cor_target` and plot it.
```r
cor_target = cor(df)
ggcorrplot::ggcorrplot(cor_target, lab = TRUE)
```
Further, we define a list of marginal distributions `dist` representing the individual variables. Each entry of the list must be a function in one argument, defining the quantile function of the variable. The order of the entries must correspond to the order of the variables in the target correlation `cor_target`.
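Schematically, such a list looks as follows (a hypothetical fragment with made-up parameters, shown only to illustrate the expected structure; the actual definitions for this example are given further below):

```r
# each entry is a quantile function in a single argument;
# the order of the entries must match cor_target
dist_sketch = list(
  Gender = function(x) qbinom(x, size = 1, prob = 0.5),
  Age    = function(x) qnorm(x, mean = 45, sd = 15)
)
dist_sketch[["Age"]](c(0.25, 0.75)) # evaluate at probabilities
```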
`simdata` offers the helper function `simdata::quantile_functions_from_data()` to automate the estimation of quantile functions from the available data. It does so non-parametrically and implements two approaches, one more suited to categorical data and the other more suited to continuous data. The parameter `n_small` sets the number of unique values a variable must exceed for the continuous approach to be used rather than the categorical one. See the documentation for more details.
```r
dist_auto = simdata::quantile_functions_from_data(df, n_small = 15)
```
We use the `fitdistrplus::fitdist` function to find appropriate distribution candidates and fit their parameters. Decisions regarding the fit of a distribution can be made using e.g. the Akaike information criterion (AIC) or the Bayesian information criterion (BIC) displayed in the summary of the fit object returned by the function (the lower their values, the better the fit).
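For example, candidate distributions for a single variable can be compared side by side; the candidates below are chosen purely for illustration:

```r
# fit two candidate distributions to the same variable
fit_gamma = fitdistrplus::fitdist(as.numeric(df$Weight), "gamma")
fit_lnorm = fitdistrplus::fitdist(as.numeric(df$Weight), "lnorm")

# compare AIC / BIC side by side (lower is better)
fitdistrplus::gofstat(list(fit_gamma, fit_lnorm),
                      fitnames = c("gamma", "lnorm"))
```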
In case a parametric distribution does not fit very well, we instead make use of a density estimate and use this to define the marginal quantile function. For our example dataset, we proceed as follows:

- Gender: binary variable, represented by a binomial distribution via `qbinom`.
- Age: not well captured by a standard parametric distribution; we compute a kernel density estimate using the `stats::density` function (its `cut` parameter defines how to deal with the boundaries of the data), integrate and normalise it to obtain the cumulative distribution function, and use `stats::approxfun` to derive a univariable quantile function.
- Race: categorical variable; class probabilities are estimated using `table()` on the full dataset. The quantile function of a categorical distribution (as e.g. in the package `LaplacesDemon`, implemented via `qcat`) can be expressed using `cut`.
- Weight: gamma distribution fitted via `fitdistrplus::fitdist`.
- BMI: log-normal distribution fitted via `fitdistrplus::fitdist`.
- BPsys: log-normal distribution fitted via `fitdistrplus::fitdist`.
- BPdia: normal distribution fitted via `fitdistrplus::fitdist` after removing zero values from the data.

The code to implement these marginal distributions is shown below.
```r
dist = list()

# gender
dist[["Gender"]] = function(x) qbinom(x, size = 1, prob = 0.5)

# age
dens = stats::density(df$Age, cut = 1) # cut defines how to deal with boundaries
# integrate
int_dens = cbind(Age = dens$x, cdf = cumsum(dens$y))
# normalize to obtain cumulative distribution function
int_dens[, "cdf"] = int_dens[, "cdf"] / max(int_dens[, "cdf"])
# derive quantile function
# outside the defined domain return minimum and maximum age, respectively
dist[["Age"]] = stats::approxfun(int_dens[, "cdf"], int_dens[, "Age"],
                                 yleft = min(int_dens[, "Age"]),
                                 yright = max(int_dens[, "Age"]))

# race
dist[["Race"]] = function(x)
  cut(x, breaks = c(0, 0.112, 0.177, 0.253, 0.919, 1), labels = 1:5)

# weight
fit = fitdistrplus::fitdist(as.numeric(df$Weight), "gamma")
summary(fit)
dist[["Weight"]] = function(x) qgamma(x, shape = 16.5031110, rate = 0.2015375)

# bmi
fit = fitdistrplus::fitdist(as.numeric(df$BMI), "lnorm")
summary(fit)
dist[["BMI"]] = function(x) qlnorm(x, meanlog = 3.3283118, sdlog = 0.2153347)

# systolic blood pressure
fit = fitdistrplus::fitdist(as.numeric(df$BPsys), "lnorm")
summary(fit)
dist[["BPsys"]] = function(x) qlnorm(x, meanlog = 4.796213, sdlog = 0.135271)

# diastolic blood pressure (zero values removed before fitting)
fit = fitdistrplus::fitdist(as.numeric(df %>% filter(BPdia > 0) %>% pull(BPdia)),
                            "norm")
summary(fit)
dist[["BPdia"]] = function(x) qnorm(x, mean = 71.75758, sd = 11.36352)
```
Both the automatic and the manual way of specifying marginals can be useful. The automatic approach works non-parametrically, which is convenient when a real dataset should be re-created as closely as possible, while the manual approach specifies the marginals parametrically, which is useful when the design is derived from purely theoretical specifications.
Now we can use `simdata::simdesign_norta` to obtain designs using both the manual and the automated marginal specifications. After that, we simulate datasets of the same size as the original dataset using `simdata::simulate_data`, and compare the resulting summary statistics and correlation structures.
```r
# use automated specification
dsgn_auto = simdata::simdesign_norta(cor_target_final = cor_target,
                                     dist = dist_auto,
                                     transform_initial = data.frame,
                                     names_final = names(dist),
                                     seed_initial = 1)
simdf_auto = simdata::simulate_data(dsgn_auto, nrow(df), seed = 2)

# use manual specification
dsgn = simdata::simdesign_norta(cor_target_final = cor_target,
                                dist = dist,
                                transform_initial = data.frame,
                                names_final = names(dist),
                                seed_initial = 1)
simdf = simdata::simulate_data(dsgn, nrow(df), seed = 2)
```
Summary statistics of the original and simulated datasets.
```r
summary(df)
summary(simdf_auto)
summary(simdf)
```
Correlation structures of the original and simulated datasets.
```r
ggcorrplot::ggcorrplot(cor(df), title = "Original", lab = TRUE)
ggcorrplot::ggcorrplot(cor(simdf_auto), title = "Simulated (automated)", lab = TRUE)
ggcorrplot::ggcorrplot(cor(simdf), title = "Simulated (manual)", lab = TRUE)
```
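Beyond visual inspection, a simple ad-hoc numerical summary (not a `simdata` function) is the largest absolute deviation from the target correlation:

```r
# largest absolute deviation between target and achieved correlations
max(abs(cor_target - cor(simdf_auto)))
max(abs(cor_target - cor(simdf)))
```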
We may also inspect the continuous variables regarding their univariate and bivariate distributions. The original data is shown in black, the simulated data is shown in red. (Note that we only use the first 1000 observations to speed up the plotting.)
```r
vars = c("Age", "Weight", "BMI", "BPsys", "BPdia")
limits = list(
  Age = c(10, 90),
  Weight = c(0, 225),
  BMI = c(10, 80),
  BPsys = c(65, 240),
  BPdia = c(0, 140)
)

plist = list()
for (i in seq_along(vars)) for (j in seq_along(vars)) {
  vari = vars[i]
  varj = vars[j]
  if (i == j) {
    # diagonal: marginal densities (original black, simulated red)
    p = ggplot(df, aes_string(x = vari)) +
      geom_density() +
      geom_density(data = simdf_auto, color = "red") +
      coord_cartesian(xlim = limits[[vari]])
  } else if (i < j) {
    # upper triangle: original data
    p = ggplot(df[1:1000, ], aes_string(x = vari, y = varj)) +
      geom_point(alpha = 0.04) +
      coord_cartesian(xlim = limits[[vari]], ylim = limits[[varj]])
  } else {
    # lower triangle: simulated data
    p = ggplot(simdf_auto[1:1000, ], aes_string(x = varj, y = vari)) +
      geom_point(color = "red", alpha = 0.04) +
      coord_cartesian(xlim = limits[[varj]], ylim = limits[[vari]])
  }
  plist = append(plist, list(p + theme_bw(base_size = 7)))
}
p = patchwork::wrap_plots(plist) +
  patchwork::plot_annotation(title = "Simulated (automated)")
print(p)

plist = list()
for (i in seq_along(vars)) for (j in seq_along(vars)) {
  vari = vars[i]
  varj = vars[j]
  if (i == j) {
    p = ggplot(df, aes_string(x = vari)) +
      geom_density() +
      geom_density(data = simdf, color = "red") +
      coord_cartesian(xlim = limits[[vari]])
  } else if (i < j) {
    p = ggplot(df[1:1000, ], aes_string(x = vari, y = varj)) +
      geom_point(alpha = 0.04) +
      coord_cartesian(xlim = limits[[vari]], ylim = limits[[varj]])
  } else {
    p = ggplot(simdf[1:1000, ], aes_string(x = varj, y = vari)) +
      geom_point(color = "red", alpha = 0.04) +
      coord_cartesian(xlim = limits[[varj]], ylim = limits[[vari]])
  }
  plist = append(plist, list(p + theme_bw(base_size = 7)))
}
p = patchwork::wrap_plots(plist) +
  patchwork::plot_annotation(title = "Simulated (manual)")
print(p)
```
From this we can observe that the agreement between the original data and the simulated data is generally quite good. Both the automated and the manual specification work equally well for this dataset. Note, however, that e.g. the slightly non-linear relationship between age and diastolic blood pressure cannot be fully captured by the approach, as expected. Furthermore, the original data shows some outliers which are also not reproduced, due to the parametric nature of the NORTA procedure.
```r
sessionInfo()
```