SimMultiCorrData generates continuous (normal or non-normal), binary, ordinal, and count (Poisson or Negative Binomial) variables with a specified correlation matrix. It can also produce a single continuous variable. The package can be used to simulate data sets that mimic real-world situations (e.g. clinical data sets or plasmodes, as in Vaughan et al. [-@Vaughan]). All variables are generated from standard normal variables with an imposed intermediate correlation matrix. Continuous variables are simulated by specifying mean, variance, skewness, standardized kurtosis, and fifth and sixth standardized cumulants using either Fleishman's Third-Order [-@Fleish] or Headrick's Fifth-Order [-@Head2002] Polynomial Transformation. Binary and ordinal variables are simulated using a modification of `GenOrd::ordsample` [@GenOrd]. Count variables are simulated using the inverse cdf method. There are two simulation pathways, which differ primarily in how the intermediate correlation matrix `Sigma` is calculated. In Correlation Method 1, the intercorrelations involving count variables are determined using a simulation-based, logarithmic correlation correction (adapting @YahShm's method). In Correlation Method 2, the count variables are treated as ordinal (adapting @FerrBarb_Pois's modification of GenOrd). An optional error loop corrects the final correlation matrix to be within a user-specified precision value. The package also includes functions to calculate standardized cumulants for theoretical distributions or from real data sets, check whether a target correlation matrix is within the possible correlation bounds (given the distributions of the simulated variables), summarize results (numerically or graphically), verify valid power method pdfs, and calculate lower standardized kurtosis bounds.
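The idea that every variable starts as a standard normal draw can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the package's R code. The helper names (`fleishman_transform`, `poisson_quantile`, `normal_to_poisson`) and the constants `c0` to `c3` are hypothetical; in the package the polynomial constants are solved for from the requested cumulants, a step this sketch does not attempt.

```python
import math
from statistics import NormalDist


def fleishman_transform(z, c0, c1, c2, c3):
    """Third-order power method polynomial applied to a standard normal draw z.

    The constants are assumed to be given; solving for them is not shown.
    """
    return c0 + c1 * z + c2 * z**2 + c3 * z**3


def poisson_quantile(u, lam):
    """Smallest k with P(Y <= k) >= u for Y ~ Poisson(lam) (inverse cdf)."""
    u = min(u, 1.0 - 1e-12)  # guard so the loop always terminates
    k = 0
    pmf = math.exp(-lam)     # P(Y = 0)
    cdf = pmf
    while cdf < u:
        k += 1
        pmf *= lam / k       # P(Y = k) obtained from P(Y = k - 1)
        cdf += pmf
    return k


def normal_to_poisson(z, lam):
    """Map a standard normal draw to a Poisson count through its cdf value."""
    u = NormalDist().cdf(z)
    return poisson_quantile(u, lam)
```

Because both maps are monotone in `z` (the polynomial only when its constants yield a valid pdf), correlation imposed on the normal draws carries over to the transformed variables, although not one-to-one. That gap is why an intermediate correlation matrix has to be calculated.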
The main strengths of SimMultiCorrData are:
The user may generate correlated continuous (normal or non-normal), ordinal (r >= 2 categories), Poisson and/or Negative Binomial variables simultaneously, based on either theoretical distributions or empirical data.
Two distinct methods for generating non-normal continuous variables: Fleishman's third-order or Headrick's fifth-order polynomial transformation.
Two distinct methods for generating count variables (see Comparison of Correlation Method 1 and Correlation Method 2 vignette). The user may test each to see which yields greater simulation accuracy.
Calculation of the precise lower kurtosis boundary using the Lagrangean constraint equations, instead of an approximation (see
Valid power method pdf checks during the calculation of the constants for continuous variables, and optional use of a sixth cumulant correction value to enable the discovery of valid pdf constants.
Computation of feasible correlation bounds based on the data simulation method (see `valid_corr` for correlation method 1 or `valid_corr2` for correlation method 2).
Numerous attempts to reproduce the desired correlation matrix, including correcting for non-positive-definite intermediate correlation matrices and an optional final error loop (see Overview of Error Loop vignette). This error loop enables reproduction of many correlation structures that cannot be achieved through other methods.
Function arguments (e.g. `epsilon`) that allow the user to have greater control over simulation accuracy, speed, and reproducibility.
Detailed simulation results, including the simulation time (in minutes) and descriptions of the generated variables and the correlation structure.
Additional functions to supplement the simulation process:
Calculation of standardized cumulants for theoretical distributions (`calc_theory`), or from a vector of data by the method of moments (`calc_moments`) or based on Fisher's k-statistics (`calc_fisherk`). Additional summary functions compute important statistics for the generated continuous variables.
Graphs are returned as `ggplot2` objects so that the user may save them or further adapt them as necessary.
Several other simulation packages exist, for example Barbiero & Ferrari's [-@GenOrd] GenOrd, Amatya & Demirtas' [-@MultiOrd] MultiOrd, Leisch, Kaiser, & Hornik's [-@Orddata] orddata, and Demirtas, Nordgren, & Allozi's [-@PoisBinOrdNonNor] PoisBinOrdNonNor. The first three generate only binary and ordinal data, while the last generates Poisson, binary, ordinal, and non-normal variables.
GenOrd generates discrete random variables (i.e. binary or ordinal) with a given correlation matrix and marginal distributions. The method used to determine the intermediate multivariate normal (MVN) correlation matrix in `GenOrd::ordcont` has been modified in the `ordnorm` function. It works by initially setting the intermediate correlation equal to the target correlation of the discrete variables. Each intermediate pairwise correlation is then updated until the final pairwise correlation is within a user-specified precision value (`epsilon`) of the target correlation or the maximum number of iterations (`maxit`) has been reached.
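The iteration described above can be sketched for the simplest case of two binary variables. This is an illustrative Python sketch, not the code of `ordnorm` or `GenOrd::ordcont`. The multiplicative update rule (rescaling by target over achieved correlation), the numerical integration, and all function names are assumptions made for the example.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def bvn_upper_prob(t1, t2, rho, steps=2000):
    """P(Z1 > t1, Z2 > t2) for a standard bivariate normal with correlation rho.

    Integrates phi(x) * P(Z2 > t2 | Z1 = x) over x in [t1, 8] (trapezoid rule).
    """
    upper = 8.0
    h = (upper - t1) / steps
    sd = math.sqrt(1.0 - rho * rho)  # conditional sd of Z2 given Z1
    total = 0.0
    for i in range(steps + 1):
        x = t1 + i * h
        f = STD_NORMAL.pdf(x) * (1.0 - STD_NORMAL.cdf((t2 - rho * x) / sd))
        total += f if 0 < i < steps else 0.5 * f
    return total * h


def binary_corr(p1, p2, rho):
    """Correlation of Y1 = 1{Z1 > t1} and Y2 = 1{Z2 > t2}, with P(Yi = 1) = pi."""
    t1 = STD_NORMAL.inv_cdf(1.0 - p1)
    t2 = STD_NORMAL.inv_cdf(1.0 - p2)
    p11 = bvn_upper_prob(t1, t2, rho)
    return (p11 - p1 * p2) / math.sqrt(p1 * (1 - p1) * p2 * (1 - p2))


def intermediate_corr(p1, p2, target, epsilon=0.001, maxit=100):
    """Find an intermediate normal correlation that reproduces `target`."""
    rho = target  # start at the target correlation
    achieved = binary_corr(p1, p2, rho)
    for _ in range(maxit):
        if abs(achieved - target) <= epsilon:
            break
        # Assumed update rule: rescale by target / achieved, kept inside (-1, 1)
        rho = max(-0.999, min(0.999, rho * target / achieved))
        achieved = binary_corr(p1, p2, rho)
    return rho, achieved
```

For example, `intermediate_corr(0.3, 0.6, 0.4)` returns an intermediate correlation larger than 0.4, because discretizing the normal variables attenuates their correlation.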
GenOrd::ordcont has been modified in the following ways:
`Sigma` for all variable types, and if necessary, `Sigma` is converted to the nearest positive-definite matrix using Higham's (2002) algorithm in
`GenOrd::contord` to calculate the ordinal correlation obtained from discretizing the normal variables generated from the intermediate correlation matrix `Sigma`. This function is used because it does not require random generation of the normal variables, which ensures greater reproducibility.
SimMultiCorrData also improves the way ordinal variables are generated, as compared to
The simulation functions (including `rcorrvar2`) allow a user-specified seed, maximum number of iterations, and epsilon value.
`GenOrd::ordsample` stops if the intermediate correlation matrix `Sigma` is not positive-definite. As described above, SimMultiCorrData attempts to correct the problem and warns that it may not be possible to produce the desired correlation matrix.
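Whether a candidate intermediate correlation matrix is positive-definite can be checked by attempting a Cholesky factorization. The following Python sketch shows only that detection step. It is not taken from either package, the function name `is_positive_definite` is hypothetical, and the correction to the nearest positive-definite matrix (Higham's algorithm) is not shown.

```python
import math


def is_positive_definite(matrix, tol=1e-12):
    """Return True if a symmetric matrix admits a Cholesky factorization."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = matrix[i][i] - s
                if pivot <= tol:
                    # A non-positive pivot means the matrix is not positive-definite
                    return False
                lower[i][j] = math.sqrt(pivot)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return True


# A valid correlation matrix and one whose pairwise values are mutually inconsistent
valid = [[1.0, 0.5, 0.3], [0.5, 1.0, 0.4], [0.3, 0.4, 1.0]]
invalid = [[1.0, 0.9, -0.9], [0.9, 1.0, 0.9], [-0.9, 0.9, 1.0]]
```

Here `is_positive_definite(valid)` is `True` and `is_positive_definite(invalid)` is `False`: each pairwise value in `invalid` is a legitimate correlation, yet the three together cannot describe any joint distribution.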
MultiOrd generates multivariate ordinal data with given correlation matrix and marginal distributions via the binary conversion method of Demirtas [-@Dem_Ord]. This method computes the binary marginals by collapsing the marginal distributions of the ordinal variables. The intermediate correlation matrix is also computed through an iterative process based on matching the target matrix. Binary data are then converted to ordinal data through a randomization step. This procedure requires the simulation of large samples of binary data in order to maximize accuracy, which requires greater computational time and resources than the methods used in
orddata generates binary and ordinal data through 4 available methods:
PoisBinOrdNonNor is one in an extensive series of simulation packages created by Demirtas with additional authors. Other packages include
PoisNor [@PoisNor], and
PoisBinOrdNonNor generates Poisson, binary, ordinal, and non-normal variables. It differs from SimMultiCorrData in the following ways:
SimMultiCorrData's simulation functions (including `rcorrvar2`) allow the user either to provide an intermediate matrix or to have the matrix calculated during the simulation.
PoisBinOrdNonNor does not produce Negative Binomial variables.
SimMultiCorrData. However, those for ordinal variables are found using `ordcont`, which, as previously mentioned, will stop if the intermediate matrix is not positive-definite.
SimMultiCorrData contains the functions
`pdf_check`. The function that solves for the constants (`SimMultiCorrData::find_constants`) executes these checks when finding the constants and attempts to produce valid pdf constants. In the case of Headrick's fifth-order method, the user may specify a sixth cumulant correction value to help find these constants.
The lower kurtosis boundary in PoisBinOrdNonNor is a simple approximation: $\text{standardized kurtosis} \ge \text{skew}^2 - 2$. `SimMultiCorrData::calc_lower_skurt` solves the Lagrangean expressions (as described in @Head2002 and @HeadSaw2) that determine the precise lower kurtosis boundary. Examination of the boundaries computed in PoisBinOrdNonNor demonstrates that the approximate boundaries are much lower than the actual Fleishman boundaries, indicating that the guideline is not accurate (see
PoisBinOrdNonNor does not allow the user to specify a seed for random number generation, or an epsilon value and maximum number of iterations to use when determining the intermediate ordinal correlations. These specifications, available in SimMultiCorrData's simulation functions (including `rcorrvar2`), are essential for reproducibility and for controlling accuracy.
SimMultiCorrData's simulation functions produce detailed summaries of the variables, the final correlation matrix, the maximum error between the final and target correlation matrices, and the simulation time.