
SimCorrMix generates continuous (normal, non-normal, or mixture distributions), binary, ordinal, and count
(Poisson or Negative Binomial, regular or zero-inflated) variables with a specified correlation matrix, or one continuous variable
with a mixture distribution. This package can be used to simulate data sets that mimic real-world clinical or genetic data sets
(i.e. plasmodes, as in Vaughan et al., 2009, doi: 10.1016/j.csda.2008.02.032). The methods extend those found in the
SimMultiCorrData package. Standard normal variables with an imposed intermediate correlation matrix are transformed to
generate the desired distributions. Continuous variables are simulated using either Fleishman's third-order
(doi: 10.1007/BF02293811) or Headrick's fifth-order (doi: 10.1016/S0167-9473(02)00072-5) power method transformation (PMT).
Non-mixture distributions require the user to specify mean, variance, skewness, standardized kurtosis, and standardized fifth and
sixth cumulants. Mixture distributions require these inputs for the component distributions plus the mixing probabilities. Simulation
occurs at the component level for continuous mixture distributions. The target correlation matrix is specified in terms of
correlations with components of continuous mixture variables. These components are transformed into
the desired mixture variables using random multinomial variables based on the mixing probabilities. However, the package provides functions to approximate expected
correlations with continuous mixture variables given target correlations with the components. Binary and ordinal variables are simulated using a modification of
the `ordsample` function from the `GenOrd` package. Count variables are simulated using the inverse
CDF method. There are two simulation pathways, which calculate intermediate correlations involving count variables differently.
Correlation Method 1 adapts Yahav and Shmueli's 2012 method (doi: 10.1002/asmb.901) and performs best with large count variable means and
positive correlations or small means and negative correlations. Correlation Method 2 adapts Barbiero and
Ferrari's 2015 modification of the `GenOrd` package (doi: 10.1002/asmb.2072) and performs best under the opposite scenarios.
The optional error loop may be used to improve the accuracy of the final correlation matrix. The package also provides functions to calculate the standardized
cumulants of continuous mixture distributions, check parameter inputs, calculate feasible correlation boundaries, and summarize and plot simulated variables.
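
The transformation idea described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the intermediate correlation `rho_z` and the Poisson means are made-up values, and the sketch does not calculate the intermediate correlation from a target (the step that Correlation Methods 1 and 2 handle differently). It draws correlated standard normal variables and pushes them through the inverse CDF method to obtain count variables.

```r
# Sketch only: base R, not the SimCorrMix implementation.
set.seed(1234)
n <- 10000

# Intermediate correlation imposed on the standard normal variables.
# Assumed value for illustration; the package derives it from the target.
rho_z <- 0.6
Sigma_z <- matrix(c(1, rho_z, rho_z, 1), nrow = 2)

# Correlated standard normals: independent normals times the Cholesky factor.
# chol() returns the upper triangular R with t(R) %*% R equal to Sigma_z.
Z <- matrix(rnorm(n * 2), ncol = 2) %*% chol(Sigma_z)

# Inverse CDF method: normal -> uniform -> Poisson quantile.
U <- pnorm(Z)
Y1 <- qpois(U[, 1], lambda = 2)   # hypothetical mean of 2
Y2 <- qpois(U[, 2], lambda = 10)  # hypothetical mean of 10

# Correlation of the count variables next to the intermediate value.
rho_y <- cor(Y1, Y2)
print(c(intermediate = rho_z, final = rho_y))
```

The correlation of the transformed variables generally differs from the one imposed on the normal variables, which is why an intermediate correlation matrix has to be calculated before simulation.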

There are several vignettes which accompany this package to help the user understand the simulation and analysis methods.

1) **Comparison of Correlation Methods 1 and 2** describes the two simulation pathways that can be followed for generation of
correlated data.

2) **Continuous Mixture Distributions** demonstrates how to simulate one continuous mixture variable using
`contmixvar1` and gives a step-by-step guideline for comparing a simulated distribution to the target
distribution.

3) **Expected Cumulants and Correlations for Continuous Mixture Variables** derives the equations used by the function
`calc_mixmoments` to find the mean, standard deviation, skew, standardized kurtosis, and standardized fifth
and sixth cumulants for a continuous mixture variable. The vignette also explains how the functions
`rho_M1M2` and `rho_M1Y` approximate the expected correlations with continuous mixture
variables based on the target correlations with the components.

4) **Overall Workflow for Generation of Correlated Data** gives a step-by-step guideline to follow with an example containing
continuous non-mixture and mixture, ordinal, zero-inflated Poisson, and zero-inflated Negative Binomial variables. It executes both
correlated data simulation functions with and without the error loop.

5) **Variable Types** describes the different types of variables that can be simulated in SimCorrMix, details the algorithm
involved in the optional error loop that helps to minimize correlation errors, and explains how the feasible correlation boundaries are
calculated for each of the two simulation pathways.
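
As a rough companion to the mixture vignettes, the base R sketch below simulates a two-component normal mixture at the component level, selects a component for each observation with a random multinomial draw, and compares the sample mean and variance with the standard finite-mixture formulas. The variable names (`mix_pis`, `mix_mus`, `mix_sigmas`) and all parameter values are hypothetical, and the sketch covers only the first two moments; it is not the code used by `contmixvar1` or `calc_mixmoments`.

```r
# Sketch only: base R, hypothetical parameter values.
set.seed(2718)
n <- 20000

mix_pis    <- c(0.4, 0.6)  # mixing probabilities (sum to 1)
mix_mus    <- c(-2, 2)     # component means
mix_sigmas <- c(1, 1.5)    # component standard deviations

# Simulate each component separately (n x 2 matrix, one column per component).
comps <- sapply(seq_along(mix_pis),
                function(k) rnorm(n, mean = mix_mus[k], sd = mix_sigmas[k]))

# One multinomial draw of size 1 per observation gives an indicator
# matrix (n x 2) marking which component each observation comes from.
pick <- t(rmultinom(n, size = 1, prob = mix_pis))

# Keep the value from the selected component only.
Y <- rowSums(comps * pick)

# Expected mean and variance of a finite mixture:
#   E[Y]   = sum(pi_k * mu_k)
#   Var[Y] = sum(pi_k * (sigma_k^2 + mu_k^2)) - E[Y]^2
exp_mean <- sum(mix_pis * mix_mus)
exp_var  <- sum(mix_pis * (mix_sigmas^2 + mix_mus^2)) - exp_mean^2

print(c(sample_mean = mean(Y), expected_mean = exp_mean))
print(c(sample_var = var(Y), expected_var = exp_var))
```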

This package contains 3 *simulation* functions:

`contmixvar1`, `corrvar`, and `corrvar2`

4 data description (*summary*) functions:

`calc_mixmoments`, `summary_var`, `rho_M1M2`, `rho_M1Y`

2 *graphing* functions:

`plot_simpdf_theory`, `plot_simtheory`

3 *support* functions:

`validpar`, `validcorr`, `validcorr2`

and 16 *auxiliary* functions (should not normally be called by the user, but are called by other functions):

`corr_error`, `intercorr`, `intercorr2`, `intercorr_cat_nb`, `intercorr_cat_pois`,
`intercorr_cont_nb`, `intercorr_cont_nb2`, `intercorr_cont_pois`, `intercorr_cont_pois2`,
`intercorr_cont`, `intercorr_nb`, `intercorr_pois`, `intercorr_pois_nb`, `maxcount_support`,
`ord_norm`, `norm_ord`
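
The support functions above deal with feasible correlation boundaries. One generic way to approximate such bounds for a pair of marginal distributions is the sorting approach of Demirtas & Hedeker (2011), listed in the references. The base R sketch below illustrates that approach for two made-up marginals; whether `validcorr` and `validcorr2` apply exactly this procedure to every pair of variable types is an assumption not confirmed here.

```r
# Sketch only: sorting-based approximation of correlation bounds.
set.seed(42)
n <- 100000

# Two illustrative marginals (hypothetical choices).
x <- rpois(n, lambda = 3)
y <- rexp(n, rate = 1)

# Sorting both samples in the same direction pairs them as tightly as the
# marginals allow, which approximates the upper correlation bound.
upper <- cor(sort(x), sort(y))

# Sorting in opposite directions approximates the lower bound.
lower <- cor(sort(x), sort(y, decreasing = TRUE))

print(c(lower = lower, upper = upper))
```

A target correlation outside the interval from `lower` to `upper` cannot be attained for these marginals, so checking it before simulation avoids an infeasible specification.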

Amatya A & Demirtas H (2015). Simultaneous generation of multivariate mixed data with Poisson and normal marginals. Journal of Statistical Computation and Simulation, 85(15):3129-39. doi: 10.1080/00949655.2014.953534.

Barbiero A & Ferrari PA (2015). Simulation of correlated Poisson variables. Applied Stochastic Models in Business and Industry, 31:669-80. doi: 10.1002/asmb.2072.

Barbiero A & Ferrari PA (2015). GenOrd: Simulation of Discrete Random Variables with Given Correlation Matrix and Marginal Distributions. R package version 1.4.0. https://CRAN.R-project.org/package=GenOrd

Carnell R (2017). triangle: Provides the Standard Distribution Functions for the Triangle Distribution. R package version 0.11. https://CRAN.R-project.org/package=triangle.

Davenport JW, Bezdek JC, & Hathaway RJ (1988). Parameter Estimation for Finite Mixture Distributions. Computers & Mathematics with Applications, 15(10):819-28.

Demirtas H (2006). A method for multivariate ordinal data generation given marginal distributions and correlations. Journal of Statistical Computation and Simulation, 76(11):1017-1025. doi: 10.1080/10629360600569246.

Demirtas H (2014). Joint Generation of Binary and Nonnormal Continuous Data. Biometrics & Biostatistics, S12.

Demirtas H & Hedeker D (2011). A practical way for computing approximate lower and upper correlation bounds. American Statistician, 65(2):104-109. doi: 10.1198/tast.2011.10090.

Demirtas H, Hedeker D, & Mermelstein RJ (2012). Simulation of massive public health data by power polynomials. Statistics in Medicine, 31(27):3337-3346. doi: 10.1002/sim.5362.

Emrich LJ & Piedmonte MR (1991). A Method for Generating High-Dimensional Multivariate Binary Variables. The American Statistician, 45(4): 302-4. doi: 10.1080/00031305.1991.10475828.

Everitt BS (1996). An Introduction to Finite Mixture Distributions. Statistical Methods in Medical Research, 5(2):107-127. doi: 10.1177/096228029600500202.

Ferrari PA & Barbiero A (2012). Simulating ordinal data. Multivariate Behavioral Research, 47(4): 566-589. doi: 10.1080/00273171.2012.692630.

Fialkowski AC (2018). SimMultiCorrData: Simulation of Correlated Data with Multiple Variable Types. R package version 0.2.2. https://CRAN.R-project.org/package=SimMultiCorrData.

Fleishman AI (1978). A Method for Simulating Non-normal Distributions. Psychometrika, 43:521-532. doi: 10.1007/BF02293811.

Frechet M (1951). Sur les tableaux de correlation dont les marges sont donnees. Ann. l'Univ. Lyon SectA, 14:53-77.

Hasselman B (2018). nleqslv: Solve Systems of Nonlinear Equations. R package version 3.3.2. https://CRAN.R-project.org/package=nleqslv

Headrick TC (2002). Fast Fifth-order Polynomial Transforms for Generating Univariate and Multivariate Non-normal Distributions. Computational Statistics & Data Analysis, 40(4):685-711. doi: 10.1016/S0167-9473(02)00072-5.

Headrick TC, Kowalchuk RK (2007). The Power Method Transformation: Its Probability Density Function, Distribution Function, and Its Further Use for Fitting Data. Journal of Statistical Computation and Simulation, 77:229-249. doi: 10.1080/10629360600605065.

Headrick TC, Sawilowsky SS (1999). Simulating Correlated Non-normal Distributions: Extending the Fleishman Power Method. Psychometrika, 64:25-35. doi: 10.1007/BF02294317.

Headrick TC, Sheng Y, & Hodis FA (2007). Numerical Computing and Graphics for the Power Method Transformation Using Mathematica. Journal of Statistical Software, 19(3):1-17. doi: 10.18637/jss.v019.i03.

Higham N (2002). Computing the nearest correlation matrix - a problem from finance. IMA Journal of Numerical Analysis, 22:329-343.

Hoeffding W. Scale-invariant correlation theory. In: Fisher NI, Sen PK, editors. The collected works of Wassily Hoeffding. New York: Springer-Verlag; 1994. p. 57-107.

Ismail N & Zamani H (2013). Estimation of Claim Count Data Using Negative Binomial, Generalized Poisson, Zero-Inflated Negative Binomial and Zero-Inflated Generalized Poisson Regression Models. Casualty Actuarial Society E-Forum 41(20):1-28.

Kendall M & Stuart A (1977). The Advanced Theory of Statistics, 4th Edition. Macmillan, New York.

Lambert D (1992). Zero-Inflated Poisson Regression, with an Application to Defects in Manufacturing. Technometrics 34(1):1-14.

Olsson U, Drasgow F, & Dorans NJ (1982). The Polyserial Correlation Coefficient. Psychometrika, 47(3):337-47. doi: 10.1007/BF02294164.

Pearson RK (2011). Exploring Data in Engineering, the Sciences, and Medicine. New York: Oxford University Press.

Schork NJ, Allison DB, & Thiel B (1996). Mixture Distributions in Human Genetics Research. Statistical Methods in Medical Research, 5:155-178. doi: 10.1177/096228029600500204.

Vale CD & Maurelli VA (1983). Simulating Multivariate Nonnormal Distributions. Psychometrika, 48:465-471. doi: 10.1007/BF02293687.

Vaughan LK, Divers J, Padilla M, Redden DT, Tiwari HK, Pomp D, Allison DB (2009). The use of plasmodes as a supplement to simulations: A simple example evaluating individual admixture estimation methodologies. Comput Stat Data Anal, 53(5):1755-66. doi: 10.1016/j.csda.2008.02.032.

Yahav I & Shmueli G (2012). On Generating Multivariate Poisson Data in Management Science Applications. Applied Stochastic Models in Business and Industry, 28(1):91-102. doi: 10.1002/asmb.901.

Yee TW (2018). VGAM: Vector Generalized Linear and Additive Models. R package version 1.0-5. https://CRAN.R-project.org/package=VGAM.

Zhang X, Mallick H, & Yi N (2016). Zero-Inflated Negative Binomial Regression for Differential Abundance Testing in Microbiome Studies. Journal of Bioinformatics and Genomics 2(2):1-9. doi: 10.18454/jbg.2016.2.2.1.

Useful links: https://github.com/AFialkowski/SimMultiCorrData, https://github.com/AFialkowski/SimCorrMix
