Description Usage Arguments Value Overview of Method 2 Choice of Fleishman's thirdorder or Headrick's fifthorder method Reasons for Function Errors References See Also Examples
This function simulates k_cat
ordinal (r ≥ 2 categories), k_cont
continuous nonmixture,
k_mix
continuous mixture, k_pois
Poisson (regular and zeroinflated), and/or k_nb
Negative Binomial
(regular and zeroinflated) variables with a specified correlation matrix rho
. The variables are generated from
multivariate normal variables with intermediate correlation matrix Sigma
, calculated by intercorr2
,
and then transformed. The intermediate correlations involving count variables are determined using correlation method 2.
The ordering of the variables in rho
must be 1st ordinal, 2nd continuous nonmixture,
3rd components of the continuous mixture, 4th regular Poisson, 5th zeroinflated Poisson, 6th regular NB, and 7th zeroinflated NB.
Note that it is possible for k_cat
, k_cont
, k_mix
, k_pois
, and/or k_nb
to be 0.
Simulation occurs at the componentlevel for continuous mixture distributions. The target correlation matrix is specified in terms of
correlations with components of continuous mixture variables. There are no parameter input checks
in order to decrease simulation time. All inputs should be checked prior to simulation with validpar
and validcorr2
. Summaries for the simulation results can be obtained with summary_var
.
All continuous variables are simulated using either Fleishman's thirdorder (method
= "Fleishman", doi: 10.1007/BF02293811) or Headrick's fifthorder
(method
= "Polynomial", doi: 10.1016/S01679473(02)000725) power method transformation. It works by matching standardized
cumulants – the first four (mean, variance, skew, and standardized kurtosis) for Fleishman's method, or the first six (mean,
variance, skew, standardized kurtosis, and standardized fifth and sixth cumulants) for Headrick's method. The transformation is
expressed as follows:
Y = c_0 + c_1 * Z + c_2 * Z^2 + c_3 * Z^3 + c_4 * Z^4 + c_5 * Z^5, Z \sim N(0,1),
where c_4 and c_5 both equal 0 for Fleishman's method. The real constants are calculated by
find_constants
. Continuous mixture variables are generated componentwise and then transformed to
the desired mixture variables based on random multinomial variables generated from the mixing probabilities. Ordinal variables
(r ≥ 2 categories) are generated by discretizing the standard normal
variables at quantiles. These quantiles are determined by evaluating the inverse standard normal CDF at the cumulative
probabilities defined by each variable's marginal distribution. Count variables are generated using the inverse CDF method. The
CDF of a standard normal variable has a uniform distribution. The appropriate quantile function (F_Y)^(1) is applied to
this uniform variable with the designated parameters to generate the count variable: Y = (F_Y)^(1)(Phi(Z)). The Negative
Binomial variable represents the number of failures which occur in a sequence of Bernoulli trials before the target number of
successes is achieved. Zeroinflated Poisson or NB variables are obtained by setting the probability of a structural zero to be
greater than 0. The optional error loop attempts to correct the final pairwise correlations to be within a userspecified
precision value (epsilon
) of the target correlations.
The vignette Variable Types discusses how each of the different variables are generated and describes the required parameters.
The vignette Overall Workflow for Generation of Correlated Data provides a detailed example discussing the stepbystep simulation process and comparing correlation methods 1 and 2.
1 2 3 4 5 6 7 8 9 10 11  corrvar2(n = 10000, k_cat = 0, k_cont = 0, k_mix = 0, k_pois = 0,
k_nb = 0, method = c("Fleishman", "Polynomial"), means = NULL,
vars = NULL, skews = NULL, skurts = NULL, fifths = NULL,
sixths = NULL, Six = list(), mix_pis = list(), mix_mus = list(),
mix_sigmas = list(), mix_skews = list(), mix_skurts = list(),
mix_fifths = list(), mix_sixths = list(), mix_Six = list(),
marginal = list(), support = list(), lam = NULL, p_zip = 0,
size = NULL, prob = NULL, mu = NULL, p_zinb = 0, pois_eps = 0.0001,
nb_eps = 0.0001, rho = NULL, seed = 1234, errorloop = FALSE,
epsilon = 0.001, maxit = 1000, use.nearPD = TRUE, Sigma = NULL,
cstart = list(), quiet = FALSE)

n 
the sample size (i.e. the length of each simulated variable; default = 10000) 
k_cat 
the number of ordinal (r >= 2 categories) variables (default = 0) 
k_cont 
the number of continuous nonmixture variables (default = 0) 
k_mix 
the number of continuous mixture variables (default = 0) 
k_pois 
the number of regular Poisson and zeroinflated Poisson variables (default = 0) 
k_nb 
the number of regular Negative Binomial and zeroinflated Negative Binomial variables (default = 0) 
method 
the method used to generate the 
means 
a vector of means for the 
vars 
a vector of variances for the 
skews 
a vector of skewness values for the 
skurts 
a vector of standardized kurtoses (kurtosis  3, so that normal variables have a value of 0)
for the 
fifths 
a vector of standardized fifth cumulants for the 
sixths 
a vector of standardized sixth cumulants for the 
Six 
a list of vectors of sixth cumulant correction values for the 
mix_pis 
a list of length 
mix_mus 
a list of length 
mix_sigmas 
a list of length 
mix_skews 
a list of length 
mix_skurts 
a list of length 
mix_fifths 
a list of length 
mix_sixths 
a list of length 
mix_Six 
a list of length 
marginal 
a list of length equal to 
support 
a list of length equal to 
lam 
a vector of lambda (mean > 0) constants for the Poisson variables (see 
p_zip 
a vector of probabilities of structural zeros (not including zeros from the Poisson distribution) for the
zeroinflated Poisson variables (see 
size 
a vector of size parameters for the Negative Binomial variables (see 
prob 
a vector of success probability parameters for the NB variables; order the same as in 
mu 
a vector of mean parameters for the NB variables (*Note: either 
p_zinb 
a vector of probabilities of structural zeros (not including zeros from the NB distribution) for the zeroinflated NB variables
(see 
pois_eps 
a vector of length 
nb_eps 
a vector of length 
rho 
the target correlation matrix which must be ordered
1st ordinal, 2nd continuous nonmixture, 3rd components of continuous mixtures, 4th regular Poisson, 5th zeroinflated Poisson,
6th regular NB, 7th zeroinflated NB; note that 
seed 
the seed value for random number generation (default = 1234) 
errorloop 
if TRUE, uses 
epsilon 
the maximum acceptable error between the final and target pairwise correlations (default = 0.001)
in the calculation of ordinal intermediate correlations with 
maxit 
the maximum number of iterations to use (default = 1000) in the calculation of ordinal
intermediate correlations with 
use.nearPD 
TRUE to convert the overall intermediate correlation matrix to the nearest positive definite matrix with 
Sigma 
an intermediate correlation matrix to use if the user wants to provide one, else it is calculated within by

cstart 
a list of length equal to 
quiet 
if FALSE prints simulation messages, if TRUE suppresses message printing 
A list whose components vary based on the type of simulated variables.
If ordinal variables are produced: Y_cat
the ordinal variables,
If continuous variables are produced:
constants
a data.frame of the constants,
Y_cont
the continuous nonmixture variables,
Y_comp
the components of the continuous mixture variables,
Y_mix
the continuous mixture variables,
sixth_correction
a list of sixth cumulant correction values,
valid.pdf
a vector where the ith element is "TRUE" if the constants for the ith continuous variable generate a valid PDF, else "FALSE"
If Poisson variables are produced: Y_pois
the regular and zeroinflated Poisson variables,
If Negative Binomial variables are produced: Y_nb
the regular and zeroinflated Negative Binomial variables,
Additionally, the following elements:
Sigma
the intermediate correlation matrix (after the error loop),
Error_Time
the time in minutes required to use the error loop,
Time
the total simulation time in minutes,
niter
a matrix of the number of iterations used for each variable in the error loop,
The intermediate correlations used in method 2 are less simulation based than those in method 1, and no seed is needed. Their calculations involve greater utilization of correction loops which make iterative adjustments until a maximum error has been reached (if possible). In addition, method 2 differs from method 1 in the following ways:
1) The intermediate correlations involving count variables are based on the methods of Barbiero & Ferrari (2012,
doi: 10.1080/00273171.2012.692630, 2015, doi: 10.1002/asmb.2072).
The Poisson or Negative Binomial support is made finite by removing a small userspecified value (i.e. 1e06) from the total
cumulative probability. This truncation factor may differ for each count variable. The count variables are subsequently
treated as ordinal and intermediate correlations are calculated using the correction loop of
ord_norm
.
2) The continuous  count variable correlations are based on an extension of the method of Demirtas et al. (2012, doi: 10.1002/sim.5362), and the count variables are treated as ordinal. The correction factor is the product of the power method correlation between the continuous variable and the normal variable used to generate it (see Headrick & Kowalchuk, 2007, doi: 10.1080/10629360600605065) and the pointpolyserial correlation between the ordinalized count variable and the normal variable used to generate it (see Olsson et al., 1982, doi: 10.1007/BF02294164). The intermediate correlations are the ratio of the target correlations to the correction factor.
Please see the Comparison of Correlation Methods 1 and 2 vignette for more information and a stepbystep overview of the simulation process.
Using the fifthorder approximation allows additional control over the fifth and sixth moments of the generated distribution, improving
accuracy. In addition, the range of feasible standardized kurtosis (γ_2) values, given skew (γ_1) and standardized fifth (γ_3) and sixth
(γ_4) cumulants, is larger than with Fleishman's method (see calc_lower_skurt
).
For example, the Fleishman method can not be used to generate a nonnormal distribution with a ratio of
γ_1^2/γ_2 > 9/14 (see Headrick & Kowalchuk, 2007). This eliminates the Chisquared family of distributions, which has
a constant ratio of γ_1^2/γ_2 = 2/3. The fifthorder method also generates more distributions with valid PDF's.
However, if the fifth and sixth cumulants are unknown or do not exist, the Fleishman approximation should be used.
1) The most likely cause for function errors is that no solutions to fleish
or
poly
converged when using find_constants
. If this happens,
the simulation will stop. It may help to first use find_constants
for each continuous variable to
determine if a sixth cumulant correction value is needed. The solutions can be used as starting values (see cstart
below).
If the standardized cumulants are obtained from calc_theory
, the user may need to use rounded values as inputs (i.e.
skews = round(skews, 8)
). For example, in order to ensure that skew is exactly 0 for symmetric distributions.
2) The kurtosis may be outside the region of possible values. There is an associated lower boundary for kurtosis associated
with a given skew (for Fleishman's method) or skew and fifth and sixth cumulants (for Headrick's method). Use
calc_lower_skurt
to determine the boundary for a given set of cumulants.
3) The feasibility of the final correlation matrix rho
, given the
distribution parameters, should be checked first using validcorr2
. This function either checks
if a given rho
is plausible or returns the lower and upper final correlation limits. It should be noted that even if a target
correlation matrix is within the "plausible range," it still may not be possible to achieve the desired matrix. This happens most
frequently when generating ordinal variables or using negative correlations. The error loop frequently fixes these problems.
Barbiero A & Ferrari PA (2015). Simulation of correlated Poisson variables. Applied Stochastic Models in Business and Industry, 31:66980. doi: 10.1002/asmb.2072.
Barbiero A & Ferrari PA (2015). GenOrd: Simulation of Discrete Random Variables with Given
Correlation Matrix and Marginal Distributions. R package version 1.4.0.
https://CRAN.Rproject.org/package=GenOrd
Davenport JW, Bezder JC, & Hathaway RJ (1988). Parameter Estimation for Finite Mixture Distributions. Computers & Mathematics with Applications, 15(10):81928.
Demirtas H (2006). A method for multivariate ordinal data generation given marginal distributions and correlations. Journal of Statistical
Computation and Simulation, 76(11):10171025.
doi: 10.1080/10629360600569246.
Demirtas H (2014). Joint Generation of Binary and Nonnormal Continuous Data. Biometrics & Biostatistics, S12.
Demirtas H, Hedeker D, & Mermelstein RJ (2012). Simulation of massive public health data by power polynomials. Statistics in Medicine, 31(27):33373346. doi: 10.1002/sim.5362.
Everitt BS (1996). An Introduction to Finite Mixture Distributions. Statistical Methods in Medical Research, 5(2):107127. doi: 10.1177/096228029600500202.
Ferrari PA & Barbiero A (2012). Simulating ordinal data. Multivariate Behavioral Research, 47(4): 566589. doi: 10.1080/00273171.2012.692630.
Fialkowski AC (2018). SimMultiCorrData: Simulation of Correlated Data with Multiple Variable Types. R package version 0.2.2. https://CRAN.Rproject.org/package=SimMultiCorrData.
Fleishman AI (1978). A Method for Simulating Nonnormal Distributions. Psychometrika, 43:521532. doi: 10.1007/BF02293811.
Headrick TC (2002). Fast Fifthorder Polynomial Transforms for Generating Univariate and Multivariate Nonnormal Distributions. Computational Statistics & Data Analysis, 40(4):685711. doi: 10.1016/S01679473(02)000725. (ScienceDirect)
Headrick TC (2004). On Polynomial Transformations for Simulating Multivariate Nonnormal Distributions. Journal of Modern Applied Statistical Methods, 3(1):6571. doi: 10.22237/jmasm/1083370080.
Headrick TC, Kowalchuk RK (2007). The Power Method Transformation: Its Probability Density Function, Distribution Function, and Its Further Use for Fitting Data. Journal of Statistical Computation and Simulation, 77:229249. doi: 10.1080/10629360600605065.
Headrick TC, Sawilowsky SS (1999). Simulating Correlated Nonnormal Distributions: Extending the Fleishman Power Method. Psychometrika, 64:2535. doi: 10.1007/BF02294317.
Headrick TC, Sheng Y, & Hodis FA (2007). Numerical Computing and Graphics for the Power Method Transformation Using
Mathematica. Journal of Statistical Software, 19(3):1  17.
doi: 10.18637/jss.v019.i03.
Higham N (2002). Computing the nearest correlation matrix  a problem from finance; IMA Journal of Numerical Analysis 22:329343.
Ismail N & Zamani H (2013). Estimation of Claim Count Data Using Negative Binomial, Generalized Poisson, ZeroInflated Negative Binomial and ZeroInflated Generalized Poisson Regression Models. Casualty Actuarial Society EForum 41(20):128.
Lambert D (1992). ZeroInflated Poisson Regression, with an Application to Defects in Manufacturing. Technometrics 34(1):114.
Olsson U, Drasgow F, & Dorans NJ (1982). The Polyserial Correlation Coefficient. Psychometrika, 47(3):33747. doi: 10.1007/BF02294164.
Pearson RK (2011). Exploring Data in Engineering, the Sciences, and Medicine. In. New York: Oxford University Press.
Schork NJ, Allison DB, & Thiel B (1996). Mixture Distributions in Human Genetics Research. Statistical Methods in Medical Research, 5:155178. doi: 10.1177/096228029600500204.
Vale CD & Maurelli VA (1983). Simulating Multivariate Nonnormal Distributions. Psychometrika, 48:465471. doi: 10.1007/BF02293687.
Yee TW (2018). VGAM: Vector Generalized Linear and Additive Models. R package version 1.05. https://CRAN.Rproject.org/package=VGAM.
Zhang X, Mallick H, & Yi N (2016). ZeroInflated Negative Binomial Regression for Differential Abundance Testing in Microbiome Studies. Journal of Bioinformatics and Genomics 2(2):19. doi: 10.18454/jbg.2016.2.2.1.
find_constants
, validpar
, validcorr2
,
intercorr2
, corr_error
, summary_var
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90  Sim1 < corrvar2(n = 1000, k_cat = 1, k_cont = 1, method = "Polynomial",
means = 0, vars = 1, skews = 0, skurts = 0, fifths = 0, sixths = 0,
marginal = list(c(1/3, 2/3)), support = list(0:2),
rho = matrix(c(1, 0.4, 0.4, 1), 2, 2), quiet = TRUE)
## Not run:
# 2 continuous mixture, 1 binary, 1 zeroinflated Poisson, and
# 1 zeroinflated NB variable
n < 10000
seed < 1234
# Mixture variables: Normal mixture with 2 components;
# mixture of Logistic(0, 1), Chisq(4), Beta(4, 1.5)
# Find cumulants of components of 2nd mixture variable
L < calc_theory("Logistic", c(0, 1))
C < calc_theory("Chisq", 4)
B < calc_theory("Beta", c(4, 1.5))
skews < skurts < fifths < sixths < NULL
Six < list()
mix_pis < list(c(0.4, 0.6), c(0.3, 0.2, 0.5))
mix_mus < list(c(2, 2), c(L[1], C[1], B[1]))
mix_sigmas < list(c(1, 1), c(L[2], C[2], B[2]))
mix_skews < list(rep(0, 2), c(L[3], C[3], B[3]))
mix_skurts < list(rep(0, 2), c(L[4], C[4], B[4]))
mix_fifths < list(rep(0, 2), c(L[5], C[5], B[5]))
mix_sixths < list(rep(0, 2), c(L[6], C[6], B[6]))
mix_Six < list(list(NULL, NULL), list(1.75, NULL, 0.03))
Nstcum < calc_mixmoments(mix_pis[[1]], mix_mus[[1]], mix_sigmas[[1]],
mix_skews[[1]], mix_skurts[[1]], mix_fifths[[1]], mix_sixths[[1]])
Mstcum < calc_mixmoments(mix_pis[[2]], mix_mus[[2]], mix_sigmas[[2]],
mix_skews[[2]], mix_skurts[[2]], mix_fifths[[2]], mix_sixths[[2]])
means < c(Nstcum[1], Mstcum[1])
vars < c(Nstcum[2]^2, Mstcum[2]^2)
marginal < list(0.3)
support < list(c(0, 1))
lam < 0.5
p_zip < 0.1
pois_eps < 0.0001
size < 2
prob < 0.75
p_zinb < 0.2
nb_eps < 0.0001
k_cat < k_pois < k_nb < 1
k_cont < 0
k_mix < 2
Rey < matrix(0.39, 8, 8)
diag(Rey) < 1
rownames(Rey) < colnames(Rey) < c("O1", "M1_1", "M1_2", "M2_1", "M2_2",
"M2_3", "P1", "NB1")
# set correlation between components of the same mixture variable to 0
Rey["M1_1", "M1_2"] < Rey["M1_2", "M1_1"] < 0
Rey["M2_1", "M2_2"] < Rey["M2_2", "M2_1"] < Rey["M2_1", "M2_3"] < 0
Rey["M2_3", "M2_1"] < Rey["M2_2", "M2_3"] < Rey["M2_3", "M2_2"] < 0
# check parameter inputs
validpar(k_cat, k_cont, k_mix, k_pois, k_nb, "Polynomial", means,
vars, skews, skurts, fifths, sixths, Six, mix_pis, mix_mus, mix_sigmas,
mix_skews, mix_skurts, mix_fifths, mix_sixths, mix_Six, marginal, support,
lam, p_zip, size, prob, mu = NULL, p_zinb, pois_eps, nb_eps, Rey)
# check to make sure Rey is within the feasible correlation boundaries
validcorr2(n, k_cat, k_cont, k_mix, k_pois, k_nb, "Polynomial", means,
vars, skews, skurts, fifths, sixths, Six, mix_pis, mix_mus, mix_sigmas,
mix_skews, mix_skurts, mix_fifths, mix_sixths, mix_Six, marginal,
lam, p_zip, size, prob, mu = NULL, p_zinb, pois_eps, nb_eps, Rey, seed)
# simulate without the error loop
Sim2 < corrvar2(n, k_cat, k_cont, k_mix, k_pois, k_nb, "Polynomial", means,
vars, skews, skurts, fifths, sixths, Six, mix_pis, mix_mus, mix_sigmas,
mix_skews, mix_skurts, mix_fifths, mix_sixths, mix_Six, marginal, support,
lam, p_zip, size, prob, mu = NULL, p_zinb, pois_eps, nb_eps, Rey, seed,
epsilon = 0.01)
names(Sim2)
# simulate with the error loop
Sim2_EL < corrvar2(n, k_cat, k_cont, k_mix, k_pois, k_nb, "Polynomial",
means, vars, skews, skurts, fifths, sixths, Six, mix_pis, mix_mus,
mix_sigmas, mix_skews, mix_skurts, mix_fifths, mix_sixths, mix_Six,
marginal, support, lam, p_zip, size, prob, mu = NULL, p_zinb, pois_eps,
nb_eps, Rey, seed, errorloop = TRUE, epsilon = 0.01)
names(Sim2_EL)
## End(Not run)

Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.