Description Usage Arguments Value Generation of Continuous Non-Mixture and Mixture Variables Choice of Fleishman's third-order or Headrick's fifth-order method Reasons for Function Errors References See Also Examples
This function extends the techniques of Headrick and Beasley (2004, doi: 10.1081/SAC-120028431) to
create correlated systems of statistical equations containing continuous variables with normal,
non-normal, or mixture distributions. The method allows the user to control the distributions
for the stochastic disturbance (error) terms E and independent variables X. The user specifies the correlation structure
between X terms within an outcome and across outcomes. For a given equation, the user also specifies the correlation between
the outcome Y and X terms. These correlations are used to calculate the beta (slope) coefficients for the equations with
calc_betas
. If the system contains mixture variables and corr.yx
is specified in terms of non-mixture
and mixture variables, the betas
will be calculated in terms of non-mixture and mixture independent variables. If corr.yx
Finally, the user specifies the correlations across error terms. The assumptions are that
1) there are at least 2 equations and a total of at least 1 independent variable, 2) the independent variables are uncorrelated
with the error terms, 3) each equation has an error term,
and 4) all error terms have either a non-mixture or mixture distribution. The outcomes Y are calculated as the E terms
added to the products of the beta coefficients and the X terms. There are no
interactions between independent variables or distinction between subject and group-level terms (as in the hierarchical linear models approach
of corrsys
or corrsys2
). However, the user does not have to provide the beta coefficients
(except for the intercepts). Since the equations do not include random slopes (i.e. for the X terms), the effects of the independent
variables are all considered "fixed." However, a random intercept or a "time" effect with a random slope could be added by specifying
them as independent variables. There are no parameter input checks in order to decrease simulation time. All inputs should be checked
prior to simulation with checkpar
. Summaries of the simulation results can be found with summary_sys
.
The functions calc_corr_y
, calc_corr_yx
, and calc_corr_ye
use equations
from Headrick and Beasley (2004) to calculate the expected correlations for the outcomes, among a given outcome and covariates from
the other outcomes, and for the error terms. The vignette Theory and Equations for Correlated Systems of Continuous Variables
gives the equations, and the vignette Correlated Systems of Statistical Equations with Non-Mixture and Mixture Continuous
Variables gives examples. There are also vignettes in SimCorrMix
which provide more details on continuous
non-mixture and mixture variables.
1 2 3 4 5 6 7 8 9 10 | nonnormsys(n = 10000, M = NULL, method = c("Fleishman", "Polynomial"),
error_type = c("non_mix", "mix"), means = list(), vars = list(),
skews = list(), skurts = list(), fifths = list(), sixths = list(),
Six = list(), mix_pis = list(), mix_mus = list(), mix_sigmas = list(),
mix_skews = list(), mix_skurts = list(), mix_fifths = list(),
mix_sixths = list(), mix_Six = list(), same.var = NULL,
betas.0 = NULL, corr.x = list(), corr.yx = list(), corr.e = NULL,
seed = 1234, use.nearPD = TRUE, eigmin = 0, adjgrad = FALSE,
B1 = NULL, tau = 0.5, tol = 0.1, steps = 100, msteps = 10,
errorloop = FALSE, epsilon = 0.001, maxit = 1000, quiet = FALSE)
|
n |
the sample size (i.e. the length of each simulated variable; default = 10000) |
M |
the number of dependent variables Y (outcomes); equivalently, the number of equations in the system |
method |
the PMT method used to generate all continuous variables, including independent variables (covariates) and error terms; "Fleishman" uses Fleishman's third-order polynomial transformation and "Polynomial" uses Headrick's fifth-order transformation |
error_type |
"non_mix" if all error terms have continuous non-mixture distributions, "mix" if all error terms have continuous mixture distributions |
means |
a list of length |
vars |
a list of length |
skews |
a list of length |
skurts |
a list of length |
fifths |
a list of length |
sixths |
a list of length |
Six |
a list of length |
mix_pis |
a list of length |
mix_mus |
a list of length |
mix_sigmas |
a list of length |
mix_skews |
a list of length |
mix_skurts |
a list of length |
mix_fifths |
a list of length |
mix_sixths |
a list of length |
mix_Six |
a list of length |
same.var |
either a vector or a matrix; if a vector, |
betas.0 |
vector of length |
corr.x |
list of length |
corr.yx |
a list of length |
corr.e |
correlation matrix for continuous non-mixture or components of mixture error terms |
seed |
the seed value for random number generation (default = 1234) |
use.nearPD |
TRUE to convert the overall intermediate correlation matrix formed by the X (for all outcomes and independent
variables) or E to the nearest positive definite matrix with |
eigmin |
minimum replacement eigenvalue if overall intermediate correlation matrix is not positive-definite (default = 0) |
adjgrad |
TRUE to use |
B1 |
the initial matrix for algorithm; if NULL, uses a scaled initial matrix with diagonal elements |
tau |
parameter used to calculate theta (default = 0.5) |
tol |
maximum error for Frobenius norm distance between new matrix and original matrix (default = 0.1) |
steps |
maximum number of steps for k (default = 100) |
msteps |
maximum number of steps for m (default = 10) |
errorloop |
if TRUE, uses |
epsilon |
the maximum acceptable error between the final and target correlation matrices (default = 0.001) in the error loop |
maxit |
the maximum number of iterations to use (default = 1000) in the error loop |
quiet |
if FALSE prints messages, if TRUE suppresses messages |
A list with the following components:
Y
matrix with n
rows and M
columns of outcomes
X
list of length M
containing X_{cont(pj)}, X_{comp(pj)}
X_all
list of length M
containing X_{cont(pj)}, X_{mix(pj)}
E
matrix with n
rows containing continuous non-mixture or components of continuous mixture error terms
E_mix
matrix with n
rows containing continuous mixture error terms
betas
a matrix of the slope coefficients calculated with calc_betas
, rows represent the
outcomes
constants
a list of length M
with data.frames of the constants for the X_{cont(pj)},
X_comp(pj) and E_p
SixCorr
a list of length M
of lists of sixth cumulant correction values used to obtain valid
pdf's for the X_{cont(pj)}, X_comp(pj), and E_p
valid.pdf
a list of length M
of vectors where the i-th element is "TRUE" if the constants for the i-th
continuous variable generate a valid pdf, else "FALSE"
Sigma_X0
matrix of intermediate correlations calculated by intercorr
Sigma_X
matrix of intermediate correlations after nearPD
or adj_grad
function has been used;
applied to generate the normal variables transformed to get the desired distributions
Error_Time
the time in minutes required to use the error loop
Time
the total simulation time in minutes
niter
a matrix of the number of iterations used in the error loop
Mixture distributions describe random variables that are drawn from more than one component distribution. For a random variable X_{mix} from a finite continuous mixture distribution with k components, the probability density function (PDF) can be described by:
h_X(x) = ∑_{i=1}^{k} π_i f_{X_{comp_i}}(x), ∑_{i=1}^{k} π_i = 1.
The π_i are mixing parameters which determine the weight of each component distribution f_{X_{comp_i}}(x) in the overall probability distribution. As long as each component has a valid PDF, the overall distribution h_X() has a valid PDF. The main assumption is statistical independence between the process of randomly selecting the component distribution and the distributions themselves. Simulation is done at the component-level for mixture variables.
All continuous variables are simulated using either Fleishman's third-order (method
= "Fleishman",
doi: 10.1007/BF02293811) or Headrick's fifth-order (method
= "Polynomial", doi: 10.1016/S0167-9473(02)00072-5)
power method transformation (PMT). It works by matching standardized cumulants – the first four (mean,
variance, skew, and standardized kurtosis) for Fleishman's method, or the first six (mean, variance,
skew, standardized kurtosis, and standardized fifth and sixth cumulants) for Headrick's method. The transformation is expressed
as follows:
Y = c_{0} + c_{1} * Z + c_{2} * Z^2 + c_{3} * Z^3 + c_{4} * Z^4 + c_{5} * Z^5, Z \sim N(0,1),
where c_{4} and c_{5} both equal 0 for Fleishman's method. The real constants are calculated by
find_constants
for non-mixture and components of mixture variables. Continuous mixture
variables are generated componentwise and then transformed to the desired mixture variables using random multinomial variables
generated based on mixing probabilities. The correlation matrices are specified in terms of correlations with components of the
mixture variables.
Using the fifth-order approximation allows additional control over the fifth and sixth moments of the generated distribution, improving
accuracy. In addition, the range of feasible standardized kurtosis values, given skew and standardized fifth (γ_3) and sixth
(γ_4) cumulants, is larger than with Fleishman's method (see calc_lower_skurt
).
For example, the Fleishman method can not be used to generate a non-normal distribution with a ratio of
γ_3^2/γ_4 > 9/14 (see Headrick & Kowalchuk, 2007). This eliminates the Chi-squared family of distributions, which has
a constant ratio of γ_3^2/γ_4 = 2/3. The fifth-order method also generates more distributions with valid PDF's.
However, if the fifth and sixth cumulants are unknown or do not exist, the Fleishman approximation should be used.
1) The most likely cause for function errors is that the parameter inputs are mispecified. Using checkpar
prior to simulation can help decrease these errors.
2) No solutions to fleish
or
poly
converged when using find_constants
. If this happens,
the simulation will stop. It may help to first use find_constants
for each continuous variable to
determine if a sixth cumulant correction value is needed. If the standardized cumulants are obtained from calc_theory
, the user
may need to use rounded values as inputs (i.e. skews = round(skews, 8)
). For example, in order to ensure that skew is exactly 0 for symmetric distributions.
3) The kurtosis for a continuous variable may be outside the region of possible values. There is an associated lower kurtosis boundary for
associated with a given skew (for Fleishman's method) or skew and fifth and sixth cumulants (for Headrick's method). Use
calc_lower_skurt
to determine the boundary for a given set of cumulants.
4) No solutions to calc_betas
converged when trying to find the beta coefficients. Try different correlation
matrices.
Davenport JW, Bezder JC, & Hathaway RJ (1988). Parameter Estimation for Finite Mixture Distributions. Computers & Mathematics with Applications, 15(10):819-28.
Everitt BS (1996). An Introduction to Finite Mixture Distributions. Statistical Methods in Medical Research, 5(2):107-127. doi: 10.1177/096228029600500202.
Fialkowski AC (2017). SimMultiCorrData: Simulation of Correlated Data with Multiple Variable Types. R package version 0.2.1. https://CRAN.R-project.org/package=SimMultiCorrData.
Fialkowski AC (2018). SimCorrMix: Simulation of Correlated Data of Multiple Variable Types including Continuous and Count Mixture Distributions. R package version 0.1.0. https://github.com/AFialkowski/SimCorrMix
Fleishman AI (1978). A Method for Simulating Non-normal Distributions. Psychometrika, 43:521-532. doi: 10.1007/BF02293811.
Headrick TC (2002). Fast Fifth-order Polynomial Transforms for Generating Univariate and Multivariate Non-normal Distributions. Computational Statistics & Data Analysis, 40(4):685-711. doi: 10.1016/S0167-9473(02)00072-5. (ScienceDirect)
Headrick TC (2004). On Polynomial Transformations for Simulating Multivariate Nonnormal Distributions. Journal of Modern Applied Statistical Methods, 3(1):65-71. doi: 10.22237/jmasm/1083370080.
Headrick TC, Beasley TM (2004). A Method for Simulating Correlated Non-Normal Systems of Linear Statistical Equations. Communications in Statistics - Simulation and Computation, 33(1). doi: 10.1081/SAC-120028431
Headrick TC, Kowalchuk RK (2007). The Power Method Transformation: Its Probability Density Function, Distribution Function, and Its Further Use for Fitting Data. Journal of Statistical Computation and Simulation, 77:229-249. doi: 10.1080/10629360600605065.
Headrick TC, Sawilowsky SS (1999). Simulating Correlated Non-normal Distributions: Extending the Fleishman Power Method. Psychometrika, 64:25-35. doi: 10.1007/BF02294317.
Headrick TC, Sheng Y, & Hodis FA (2007). Numerical Computing and Graphics for the Power Method Transformation Using
Mathematica. Journal of Statistical Software, 19(3):1 - 17.
doi: 10.18637/jss.v019.i03.
Higham N (2002). Computing the nearest correlation matrix - a problem from finance; IMA Journal of Numerical Analysis 22:329-343.
McCulloch CE, Searle SR, Neuhaus JM (2008). Generalized, Linear, and Mixed Models (2nd ed.). Wiley Series in Probability and Statistics. Hoboken, New Jersey: John Wiley & Sons, Inc.
Pearson RK (2011). Exploring Data in Engineering, the Sciences, and Medicine. In. New York: Oxford University Press.
Schork NJ, Allison DB, & Thiel B (1996). Mixture Distributions in Human Genetics Research. Statistical Methods in Medical Research, 5:155-178. doi: 10.1177/096228029600500204.
Vale CD & Maurelli VA (1983). Simulating Multivariate Nonnormal Distributions. Psychometrika, 48:465-471. doi: 10.1007/BF02293687.
calc_betas
, calc_corr_y
, calc_corr_yx
,
calc_corr_ye
, checkpar
, summary_sys
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 | M <- 3
B <- calc_theory("Beta", c(4, 1.5))
skews <- lapply(seq_len(M), function(x) c(0, B[3]))
skurts <- lapply(seq_len(M), function(x) c(0, B[4]))
fifths <- lapply(seq_len(M), function(x) c(0, B[5]))
sixths <- lapply(seq_len(M), function(x) c(0, B[6]))
Six <- lapply(seq_len(M), function(x) list(NULL, 0.03))
corr.e <- matrix(c(1, 0.4, 0.4^2, 0.4, 1, 0.4, 0.4^2, 0.4, 1), M, M,
byrow = TRUE)
means <- lapply(seq_len(M), function(x) c(0, B[1]))
vars <- lapply(seq_len(M), function(x) c(1, B[2]^2))
corr.x <- list(list(matrix(1, 1, 1), matrix(0.4, 1, 1), matrix(0.4, 1, 1)),
list(matrix(0.4, 1, 1), matrix(1, 1, 1), matrix(0.4, 1, 1)),
list(matrix(0.4, 1, 1), matrix(0.4, 1, 1), matrix(1, 1, 1)))
corr.yx <- list(matrix(0.4, 1), matrix(0.5, 1), matrix(0.6, 1))
Sys1 <- nonnormsys(10000, M, "Polynomial", "non_mix", means, vars,
skews, skurts, fifths, sixths, Six, corr.x = corr.x, corr.yx = corr.yx,
corr.e = corr.e)
## Not run:
# Example: system of three equations for 2 independent variables, where each
# error term has unit variance, from Headrick & Beasley (2002)
# Y_1 = beta_10 + beta_11 * X_11 + beta_12 * X_12 + sigma_1 * e_1
# Y_2 = beta_20 + beta_21 * X_21 + beta_22 * X_22 + sigma_2 * e_2
# Y_3 = beta_30 + beta_31 * X_31 + beta_32 * X_32 + sigma_3 * e_3
# X_11 = X_21 = X_31 = Exponential(2)
# X_12 = X_22 = X_32 = Laplace(0, 1)
# e_1 = e_2 = e_3 = Cauchy(0, 1)
seed <- 1234
M <- 3
Stcum1 <- calc_theory("Exponential", 2)
Stcum2 <- calc_theory("Laplace", c(0, 1))
Stcum3 <- c(0, 1, 0, 25, 0, 1500) # taken from paper
means <- lapply(seq_len(M), function(x) c(0, 0, 0))
vars <- lapply(seq_len(M), function(x) c(1, 1, 1))
skews <- lapply(seq_len(M), function(x) c(Stcum1[3], Stcum2[3], Stcum3[3]))
skurts <- lapply(seq_len(M), function(x) c(Stcum1[4], Stcum2[4], Stcum3[4]))
fifths <- lapply(seq_len(M), function(x) c(Stcum1[5], Stcum2[5], Stcum3[5]))
sixths <- lapply(seq_len(M), function(x) c(Stcum1[6], Stcum2[6], Stcum3[6]))
# No sixth cumulant corrections will be used in order to match the results
# from the paper.
corr.yx <- list(matrix(c(0.4, 0.4), 1), matrix(c(0.5, 0.5), 1),
matrix(c(0.6, 0.6), 1))
corr.x <- list()
corr.x[[1]] <- corr.x[[2]] <- corr.x[[3]] <- list()
corr.x[[1]][[1]] <- matrix(c(1, 0.1, 0.1, 1), 2, 2)
corr.x[[1]][[2]] <- matrix(c(0.1974318, 0.1859656, 0.1879483, 0.1858601),
2, 2, byrow = TRUE)
corr.x[[1]][[3]] <- matrix(c(0.2873190, 0.2589830, 0.2682057, 0.2589542),
2, 2, byrow = TRUE)
corr.x[[2]][[1]] <- t(corr.x[[1]][[2]])
corr.x[[2]][[2]] <- matrix(c(1, 0.35, 0.35, 1), 2, 2)
corr.x[[2]][[3]] <- matrix(c(0.5723303, 0.4883054, 0.5004441, 0.4841808),
2, 2, byrow = TRUE)
corr.x[[3]][[1]] <- t(corr.x[[1]][[3]])
corr.x[[3]][[2]] <- t(corr.x[[2]][[3]])
corr.x[[3]][[3]] <- matrix(c(1, 0.7, 0.7, 1), 2, 2)
corr.e <- matrix(0.4, nrow = 3, ncol = 3)
diag(corr.e) <- 1
# Check the parameter inputs
checkpar(M, "Polynomial", "non_mix", means, vars, skews,
skurts, fifths, sixths, corr.x = corr.x, corr.yx = corr.yx,
corr.e = corr.e)
# Generate the system
Sys1 <- nonnormsys(10000, M, "Polynomial", "non_mix", means, vars, skews,
skurts, fifths, sixths, corr.x = corr.x, corr.yx = corr.yx,
corr.e = corr.e, seed = seed)
# Summarize the results
Sum1 <- summary_sys(Sys1$Y, Sys1$E, E_mix = NULL, Sys1$X, X_all = list(), M,
"Polynomial", means, vars, skews, skurts, fifths, sixths, corr.x = corr.x,
corr.e = corr.e)
# Calculate theoretical correlations for comparison to simulated values
calc_corr_y(Sys1$betas, corr.x, corr.e, vars)
Sum1$rho.y
calc_corr_ye(Sys1$betas, corr.x, corr.e, vars)
Sum1$rho.ye
calc_corr_yx(Sys1$betas, corr.x, vars)
Sum1$rho.yx
## End(Not run)
|
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.