R/corrvar2.R
In SimCorrMix: Simulation of Correlated Data with Multiple Variable Types Including Continuous and Count Mixture Distributions

Documented in corrvar2

#' @title Generation of Correlated Ordinal, Continuous (mixture and non-mixture), and/or Count (Poisson and Negative Binomial,
#'     regular and zero-inflated) Variables: Correlation Method 2
#'
#' @description This function simulates \code{k_cat} ordinal (\eqn{r \ge 2} categories), \code{k_cont} continuous non-mixture,
#'     \code{k_mix} continuous mixture, \code{k_pois} Poisson (regular and zero-inflated), and/or \code{k_nb} Negative Binomial
#'     (regular and zero-inflated) variables with a specified correlation matrix \code{rho}.  The variables are generated from
#'     multivariate normal variables with intermediate correlation matrix \code{Sigma}, calculated by \code{\link[SimCorrMix]{intercorr2}},
#'     and then transformed.  The intermediate correlations involving count variables are determined using \strong{correlation method 2}.
#'     The \emph{ordering} of the variables in \code{rho} must be 1st ordinal, 2nd continuous non-mixture,
#'     3rd components of the continuous mixture, 4th regular Poisson, 5th zero-inflated Poisson, 6th regular NB, and 7th zero-inflated NB.
#'     Note that it is possible for \code{k_cat}, \code{k_cont}, \code{k_mix}, \code{k_pois}, and/or \code{k_nb} to be 0.
#'     Simulation occurs at the component-level for continuous mixture distributions.  The target correlation matrix is specified in terms of
#'     correlations with components of continuous mixture variables.  There are no parameter input checks
#'     in order to decrease simulation time.  All inputs should be checked prior to simulation with \code{\link[SimCorrMix]{validpar}}
#'     and \code{\link[SimCorrMix]{validcorr2}}.  Summaries for the simulation results can be obtained with \code{\link[SimCorrMix]{summary_var}}.
#'
#'     All continuous variables are simulated using either Fleishman's third-order (\code{method} = "Fleishman", \doi{10.1007/BF02293811}) or Headrick's fifth-order
#'     (\code{method} = "Polynomial", \doi{10.1016/S0167-9473(02)00072-5}) power method transformation.  It works by matching standardized
#'     cumulants -- the first four (mean, variance, skew, and standardized kurtosis) for Fleishman's method, or the first six (mean,
#'     variance, skew, standardized kurtosis, and standardized fifth and sixth cumulants) for Headrick's method.  The transformation is
#'     expressed as follows:
#'
#'     \deqn{Y = c_0 + c_1 * Z + c_2 * Z^2 + c_3 * Z^3 + c_4 * Z^4 + c_5 * Z^5,  Z \sim N(0,1),}
#'
#'     where \eqn{c_4} and \eqn{c_5} both equal \eqn{0} for Fleishman's method.  The real constants are calculated by \cr
#'     \code{\link[SimMultiCorrData]{find_constants}}.  Continuous mixture variables are generated componentwise and then transformed to
#'     the desired mixture variables based on random multinomial variables generated from the mixing probabilities.  Ordinal variables
#'     (\eqn{r \ge 2} categories) are generated by discretizing the standard normal
#'     variables at quantiles.  These quantiles are determined by evaluating the inverse standard normal CDF at the cumulative
#'     probabilities defined by each variable's marginal distribution.  Count variables are generated using the inverse CDF method.  The
#'     CDF of a standard normal variable has a uniform distribution.  The appropriate quantile function (F_Y)^(-1) is applied to
#'     this uniform variable with the designated parameters to generate the count variable: Y = (F_Y)^(-1)(Phi(Z)).  The Negative
#'     Binomial variable represents the number of failures which occur in a sequence of Bernoulli trials before the target number of
#'     successes is achieved.  Zero-inflated Poisson or NB variables are obtained by setting the probability of a structural zero to be
#'     greater than 0.  The optional error loop attempts to correct the final pairwise correlations to be within a user-specified
#'     precision value (\code{epsilon}) of the target correlations.
#'
#'     The vignette \bold{Variable Types} discusses how each of the different variables are generated and describes the required parameters.
#'
#'     The vignette \bold{Overall Workflow for Generation of Correlated Data} provides a detailed example discussing the step-by-step simulation process and comparing correlation methods 1 and 2.
#'
#' @section Overview of Method 2:
#'     The intermediate correlations used in method 2 are less simulation based than those in method 1, and no seed is needed.
#'     Their calculations involve greater utilization of correction loops which make iterative adjustments until a maximum error
#'     has been reached (if possible).  In addition, method 2 differs from method 1 in the following ways:
#'
#'     1) The intermediate correlations involving \bold{count variables} are based on the methods of Barbiero & Ferrari (2012,
#'     \doi{10.1080/00273171.2012.692630}, 2015, \doi{10.1002/asmb.2072}).
#'     The Poisson or Negative Binomial support is made finite by removing a small user-specified value (i.e. 1e-06) from the total
#'     cumulative probability.  This truncation factor may differ for each count variable.  The count variables are subsequently
#'     treated as ordinal and intermediate correlations are calculated using the correction loop of
#'     \code{\link[SimCorrMix]{ord_norm}}.
#'
#'     2) The \bold{continuous - count variable} correlations are based on an extension of the method of Demirtas et al. (2012,
#'     \doi{10.1002/sim.5362}), and the count
#'     variables are treated as ordinal.  The correction factor is the product of the power method correlation between the
#'     continuous variable and the normal variable used to generate it (see Headrick & Kowalchuk, 2007, \doi{10.1080/10629360600605065})
#'     and the point-polyserial correlation between the ordinalized count variable and the normal variable used to generate it (see Olsson et al., 1982,
#'     \doi{10.1007/BF02294164}).
#'     The intermediate correlations are the ratio of the target correlations to the correction factor.
#'
#'     Please see the \bold{Comparison of Correlation Methods 1 and 2} vignette for more information and a step-by-step overview of the simulation process.
#'
#' @section Choice of Fleishman's third-order or Headrick's fifth-order method:
#'     Using the fifth-order approximation allows additional control over the fifth and sixth moments of the generated distribution, improving
#'     accuracy.  In addition, the range of feasible standardized kurtosis (\eqn{\gamma_2}) values, given skew (\eqn{\gamma_1}) and standardized fifth (\eqn{\gamma_3}) and sixth
#'     (\eqn{\gamma_4}) cumulants, is larger than with Fleishman's method (see \code{\link[SimMultiCorrData]{calc_lower_skurt}}).
#'     For example, the Fleishman method can not be used to generate a non-normal distribution with a ratio of
#'     \eqn{\gamma_1^2/\gamma_2 > 9/14} (see Headrick & Kowalchuk, 2007).  This eliminates the Chi-squared family of distributions, which has
#'     a constant ratio of \eqn{\gamma_1^2/\gamma_2 = 2/3}.  The fifth-order method also generates more distributions with valid PDF's.
#'     However, if the fifth and sixth cumulants are unknown or do not exist, the Fleishman approximation should be used.
#'
#' @section Reasons for Function Errors:
#'     1) The most likely cause for function errors is that no solutions to \code{\link[SimMultiCorrData]{fleish}} or
#'     \code{\link[SimMultiCorrData]{poly}} converged when using \code{\link[SimMultiCorrData]{find_constants}}.  If this happens,
#'     the simulation will stop.  It may help to first use \code{\link[SimMultiCorrData]{find_constants}} for each continuous variable to
#'     determine if a sixth cumulant correction value is needed.  The solutions can be used as starting values (see \code{cstart} below).
#'     If the standardized cumulants are obtained from \code{calc_theory}, the user may need to use rounded values as inputs (i.e.
#'     \code{skews = round(skews, 8)}).  For example, in order to ensure that skew is exactly 0 for symmetric distributions.
#'
#'     2) The kurtosis may be outside the region of possible values.  There is an associated lower boundary for kurtosis associated
#'     with a given skew (for Fleishman's method) or skew and fifth and sixth cumulants (for Headrick's method).  Use
#'     \code{\link[SimMultiCorrData]{calc_lower_skurt}} to determine the boundary for a given set of cumulants.
#'
#'     3) The feasibility of the final correlation matrix \code{rho}, given the
#'     distribution parameters, should be checked first using \code{\link[SimCorrMix]{validcorr2}}.  This function either checks
#'     if a given \code{rho} is plausible or returns the lower and upper final correlation limits.  It should be noted that even if a target
#'     correlation matrix is within the "plausible range," it still may not be possible to achieve the desired matrix.  This happens most
#'     frequently when generating ordinal variables or using negative correlations.  The error loop frequently fixes these problems.
#'
#' @param n the sample size (i.e. the length of each simulated variable; default = 10000)
#' @param k_cat the number of ordinal (r >= 2 categories) variables (default = 0)
#' @param k_cont the number of continuous non-mixture variables (default = 0)
#' @param k_mix the number of continuous mixture variables (default = 0)
#' @param k_pois the number of regular Poisson and zero-inflated Poisson variables (default = 0)
#' @param k_nb the number of regular Negative Binomial and zero-inflated Negative Binomial variables (default = 0)
#' @param method the method used to generate the \code{k_cont} non-mixture and \code{k_mix} mixture continuous variables.
#'     "Fleishman" uses Fleishman's third-order polynomial transformation and "Polynomial" uses Headrick's fifth-order transformation.
#' @param means a vector of means for the \code{k_cont} non-mixture and \code{k_mix} mixture continuous variables
#'     (i.e. \code{rep(0, (k_cont + k_mix))})
#' @param vars a vector of variances for the \code{k_cont} non-mixture and \code{k_mix} mixture continuous variables
#'     (i.e. \code{rep(1, (k_cont + k_mix))})
#' @param skews a vector of skewness values for the \code{k_cont} non-mixture continuous variables
#' @param skurts a vector of standardized kurtoses (kurtosis - 3, so that normal variables have a value of 0)
#'     for the \code{k_cont} non-mixture continuous variables
#' @param fifths a vector of standardized fifth cumulants for the \code{k_cont} non-mixture continuous variables
#'     (not necessary for \code{method} = "Fleishman")
#' @param sixths a vector of standardized sixth cumulants for the \code{k_cont} non-mixture continuous variables
#'     (not necessary for \code{method} = "Fleishman")
#' @param Six a list of vectors of sixth cumulant correction values for the \code{k_cont} non-mixture continuous variables
#'     if no valid PDF constants are found, \cr
#'     ex: \code{Six = list(seq(0.01, 2, 0.01), seq(1, 10, 0.5))}; if no correction is desired for \eqn{Y_{cont_i}}, set the i-th
#'     list component equal to \code{NULL}; if no correction is desired for any of the \eqn{Y_{cont}} keep as \code{Six = list()}
#'     (not necessary for \code{method} = "Fleishman")
#' @param mix_pis a list of length \code{k_mix} with i-th component a vector of mixing probabilities that sum to 1 for component distributions of \eqn{Y_{mix_i}}
#' @param mix_mus a list of length \code{k_mix} with i-th component a vector of means for component distributions of \eqn{Y_{mix_i}}
#' @param mix_sigmas a list of length \code{k_mix} with i-th component a vector of standard deviations for component distributions of \eqn{Y_{mix_i}}
#' @param mix_skews a list of length \code{k_mix} with i-th component a vector of skew values for component distributions of \eqn{Y_{mix_i}}
#' @param mix_skurts a list of length \code{k_mix} with i-th component a vector of standardized kurtoses for component distributions of \eqn{Y_{mix_i}}
#' @param mix_fifths a list of length \code{k_mix} with i-th component a vector of standardized fifth cumulants for component distributions of \eqn{Y_{mix_i}}
#'     (not necessary for \code{method} = "Fleishman")
#' @param mix_sixths a list of length \code{k_mix} with i-th component a vector of standardized sixth cumulants for component distributions of \eqn{Y_{mix_i}}
#'     (not necessary for \code{method} = "Fleishman")
#' @param mix_Six a list of length \code{k_mix} with i-th component a list of vectors of sixth cumulant correction values
#'     for component distributions of \eqn{Y_{mix_i}}; use \code{NULL} if no correction is desired for a given component or
#'     mixture variable; if no correction is desired for any of the \eqn{Y_{mix}} keep as \code{mix_Six = list()}
#'     (not necessary for \code{method} = "Fleishman")
#' @param marginal a list of length equal to \code{k_cat}; the i-th element is a vector of the cumulative
#'     probabilities defining the marginal distribution of the i-th variable;
#'     if the variable can take r values, the vector will contain r - 1 probabilities (the r-th is assumed to be 1);
#'     for binary variables, these should be input the same as for ordinal variables with more than 2 categories (i.e. the user-specified
#'     probability is the probability of the 1st category, which has the smaller support value)
#' @param support a list of length equal to \code{k_cat}; the i-th element is a vector containing the r ordered support values;
#'     if not provided (i.e. \code{support = list()}), the default is for the i-th element to be the vector 1, ..., r
#' @param lam a vector of lambda (mean > 0) constants for the Poisson variables (see \code{stats::dpois}); the order should be
#'     1st regular Poisson variables, 2nd zero-inflated Poisson variables
#' @param p_zip a vector of probabilities of structural zeros (not including zeros from the Poisson distribution) for the
#'     zero-inflated Poisson variables (see \code{VGAM::dzipois}); if \code{p_zip} = 0, \eqn{Y_{pois}} has a regular Poisson
#'     distribution; if \code{p_zip} is in (0, 1), \eqn{Y_{pois}} has a zero-inflated Poisson distribution;
#'     if \code{p_zip} is in \code{(-(exp(lam) - 1)^(-1), 0)}, \eqn{Y_{pois}} has a zero-deflated Poisson distribution and \code{p_zip}
#'     is not a probability; if \code{p_zip = -(exp(lam) - 1)^(-1)}, \eqn{Y_{pois}} has a positive-Poisson distribution
#'     (see \code{VGAM::dpospois}); if \code{length(p_zip) < length(lam)}, the missing values are set to 0 (and ordered 1st)
#' @param size a vector of size parameters for the Negative Binomial variables (see \code{stats::dnbinom}); the order should be
#'     1st regular NB variables, 2nd zero-inflated NB variables
#' @param prob a vector of success probability parameters for the NB variables; order the same as in \code{size}
#' @param mu a vector of mean parameters for the NB variables (*Note: either \code{prob} or \code{mu} should be supplied for all Negative Binomial variables,
#'     not a mixture; default = NULL); order the same as in \code{size}; for zero-inflated NB this refers to
#'     the mean of the NB distribution (see \code{VGAM::dzinegbin})
#' @param p_zinb a vector of probabilities of structural zeros (not including zeros from the NB distribution) for the zero-inflated NB variables
#'     (see \code{VGAM::dzinegbin}); if \code{p_zinb} = 0, \eqn{Y_{nb}} has a regular NB distribution;
#'     if \code{p_zinb} is in \code{(-prob^size/(1 - prob^size),} \code{0)}, \eqn{Y_{nb}} has a zero-deflated NB distribution and \code{p_zinb}
#'     is not a probability; if \code{p_zinb = -prob^size/(1 - prob^size)}, \eqn{Y_{nb}} has a positive-NB distribution (see
#'     \code{VGAM::dposnegbin}); if \code{length(p_zinb) < length(size)}, the missing values are set to 0 (and ordered 1st)
#' @param pois_eps a vector of length \code{k_pois} containing total cumulative probability truncation values; if none are provided,
#'     the default is 0.0001 for each variable
#' @param nb_eps a vector of length \code{k_nb} containing total cumulative probability truncation values; if none are provided,
#'     the default is 0.0001 for each variable
#' @param rho the target correlation matrix which must be ordered
#'     \emph{1st ordinal, 2nd continuous non-mixture, 3rd components of continuous mixtures, 4th regular Poisson, 5th zero-inflated Poisson,
#'     6th regular NB, 7th zero-inflated NB}; note that \code{rho} is specified in terms of the components of \eqn{Y_{mix}}
#' @param seed the seed value for random number generation (default = 1234)
#' @param errorloop if TRUE, uses \code{\link[SimCorrMix]{corr_error}} to attempt to correct final pairwise correlations to be within
#'     \code{epsilon} of target pairwise correlations (default = FALSE)
#' @param epsilon the maximum acceptable error between the final and target pairwise correlations (default = 0.001)
#'     in the calculation of ordinal intermediate correlations with \code{\link[SimCorrMix]{ord_norm}} or in the error loop
#' @param maxit the maximum number of iterations to use (default = 1000) in the calculation of ordinal
#'     intermediate correlations with \code{\link[SimCorrMix]{ord_norm}} or in the error loop
#' @param use.nearPD TRUE to convert the overall intermediate correlation matrix to the nearest positive definite matrix with \code{Matrix::nearPD} if
#'     necessary; if FALSE the negative eigenvalues are replaced with 0 if necessary
#' @param Sigma an intermediate correlation matrix to use if the user wants to provide one, else it is calculated within by
#'     \code{\link[SimCorrMix]{intercorr2}}
#' @param cstart a list of length equal to \code{k_cont} + the total number of mixture components containing initial values for root-solving
#'     algorithm used in \code{\link[SimMultiCorrData]{find_constants}}.  If user specified, each list element must be input as a matrix.
#'     For \code{method} = "Fleishman", each should have 3 columns for \eqn{c_1, c_2, c_3};
#'     for \code{method} = "Polynomial", each should have 5 columns for \eqn{c_1, c_2, c_3, c_4, c_5}.  If no starting values are specified for
#'     a given component, that list element should be \code{NULL}.
#' @param quiet if FALSE prints simulation messages, if TRUE suppresses message printing
#' @import SimMultiCorrData
#' @importFrom stats cor dbeta dbinom dchisq density dexp df dgamma dlnorm dlogis dmultinom dnbinom dnorm dpois dt dunif dweibull ecdf
#'     median pbeta pbinom pchisq pexp pf pgamma plnorm plogis pnbinom pnorm ppois pt punif pweibull qbeta qbinom qchisq qexp qf qgamma
#'     qlnorm qlogis qnbinom qnorm qpois qt quantile qunif qweibull rbeta rbinom rchisq rexp rf rgamma rlnorm rlogis rmultinom rnbinom
#'     rnorm rpois rt runif rweibull sd uniroot var
#' @import utils
#' @importFrom Matrix nearPD
#' @importFrom VGAM qzipois qzinegbin
#' @export
#' @keywords simulation continuous mixture ordinal Poisson NegativeBinomial Fleishman Headrick method2
#' @seealso \code{\link[SimMultiCorrData]{find_constants}}, \code{\link[SimCorrMix]{validpar}}, \code{\link[SimCorrMix]{validcorr2}},
#'     \code{\link[SimCorrMix]{intercorr2}}, \code{\link[SimCorrMix]{corr_error}}, \code{\link[SimCorrMix]{summary_var}}
#' @return A list whose components vary based on the type of simulated variables.
#' @return If \bold{ordinal variables} are produced: \code{Y_cat} the ordinal variables,
#'
#' @return If \bold{continuous variables} are produced:
#'
#'     \code{constants} a data.frame of the constants,
#'
#'     \code{Y_cont} the continuous non-mixture variables,
#'
#'     \code{Y_comp} the components of the continuous mixture variables,
#'
#'     \code{Y_mix} the continuous mixture variables,
#'
#'     \code{sixth_correction} a list of sixth cumulant correction values,
#'
#'     \code{valid.pdf} a vector where the i-th element is "TRUE" if the constants for the i-th continuous variable generate a valid PDF, else "FALSE"
#' @return If \bold{Poisson variables} are produced: \code{Y_pois} the regular and zero-inflated Poisson variables,
#'
#' @return If \bold{Negative Binomial variables} are produced: \code{Y_nb} the regular and zero-inflated Negative Binomial variables,
#'
#' @return Additionally, the following elements:
#'
#'     \code{Sigma} the intermediate correlation matrix (after the error loop),
#'
#'     \code{Error_Time} the time in minutes required to use the error loop,
#'
#'     \code{Time} the total simulation time in minutes,
#'
#'     \code{niter} a matrix of the number of iterations used for each variable in the error loop,
#'
#' @references
#' Barbiero A & Ferrari PA (2015). Simulation of correlated Poisson variables. Applied Stochastic Models in
#'     Business and Industry, 31:669-80. \doi{10.1002/asmb.2072}.
#'
#' Barbiero A & Ferrari PA (2015). GenOrd: Simulation of Discrete Random Variables with Given
#'     Correlation Matrix and Marginal Distributions. R package version 1.4.0. \cr \url{https://CRAN.R-project.org/package=GenOrd}
#'
#' Davenport JW, Bezder JC, & Hathaway RJ (1988). Parameter Estimation for Finite Mixture Distributions.
#'     Computers & Mathematics with Applications, 15(10):819-28.
#'
#' Demirtas H (2006). A method for multivariate ordinal data generation given marginal distributions and correlations. Journal of Statistical
#'     Computation and Simulation, 76(11):1017-1025. \cr \doi{10.1080/10629360600569246}.
#'
#' Demirtas H (2014). Joint Generation of Binary and Nonnormal Continuous Data. Biometrics & Biostatistics, S12.
#'
#' Demirtas H, Hedeker D, & Mermelstein RJ (2012). Simulation of massive public health data by power polynomials.
#'     Statistics in Medicine, 31(27):3337-3346. \doi{10.1002/sim.5362}.
#'
#' Everitt BS (1996). An Introduction to Finite Mixture Distributions. Statistical Methods in Medical Research, 5(2):107-127. \doi{10.1177/096228029600500202}.
#'
#' Ferrari PA & Barbiero A (2012). Simulating ordinal data. Multivariate Behavioral Research, 47(4): 566-589.
#'     \doi{10.1080/00273171.2012.692630}.
#'
#' Fialkowski AC (2018). SimMultiCorrData: Simulation of Correlated Data with Multiple Variable Types. R package version 0.2.2.
#'     \url{https://CRAN.R-project.org/package=SimMultiCorrData}.
#'
#' Fleishman AI (1978). A Method for Simulating Non-normal Distributions. Psychometrika, 43:521-532. \doi{10.1007/BF02293811}.
#'
#' Headrick TC (2002). Fast Fifth-order Polynomial Transforms for Generating Univariate and Multivariate
#'     Non-normal Distributions. Computational Statistics & Data Analysis, 40(4):685-711. \doi{10.1016/S0167-9473(02)00072-5}.
#'     (\href{http://www.sciencedirect.com/science/article/pii/S0167947302000725}{ScienceDirect})
#'
#' Headrick TC (2004). On Polynomial Transformations for Simulating Multivariate Nonnormal Distributions.
#'     Journal of Modern Applied Statistical Methods, 3(1):65-71. \doi{10.22237/jmasm/1083370080}.
#'
#' Headrick TC, Kowalchuk RK (2007). The Power Method Transformation: Its Probability Density Function, Distribution
#'     Function, and Its Further Use for Fitting Data. Journal of Statistical Computation and Simulation, 77:229-249. \doi{10.1080/10629360600605065}.
#'
#' Headrick TC, Sawilowsky SS (1999). Simulating Correlated Non-normal Distributions: Extending the Fleishman Power
#'     Method. Psychometrika, 64:25-35. \doi{10.1007/BF02294317}.
#'
#' Headrick TC, Sheng Y, & Hodis FA (2007). Numerical Computing and Graphics for the Power Method Transformation Using
#'     Mathematica. Journal of Statistical Software, 19(3):1 - 17. \cr \doi{10.18637/jss.v019.i03}.
#'
#' Higham N (2002). Computing the nearest correlation matrix - a problem from finance; IMA Journal of Numerical Analysis 22:329-343.
#'
#' Ismail N & Zamani H (2013). Estimation of Claim Count Data Using Negative Binomial, Generalized Poisson, Zero-Inflated Negative Binomial and
#'     Zero-Inflated Generalized Poisson Regression Models. Casualty Actuarial Society E-Forum 41(20):1-28.
#'
#' Lambert D (1992). Zero-Inflated Poisson Regression, with an Application to Defects in Manufacturing. Technometrics 34(1):1-14.
#'
#' Olsson U, Drasgow F, & Dorans NJ (1982). The Polyserial Correlation Coefficient. Psychometrika, 47(3):337-47.
#'     \doi{10.1007/BF02294164}.
#'
#' Pearson RK (2011). Exploring Data in Engineering, the Sciences, and Medicine. In. New York: Oxford University Press.
#'
#' Schork NJ, Allison DB, & Thiel B (1996). Mixture Distributions in Human Genetics Research. Statistical Methods in Medical Research,
#'     5:155-178. \doi{10.1177/096228029600500204}.
#'
#' Vale CD & Maurelli VA (1983). Simulating Multivariate Nonnormal Distributions. Psychometrika, 48:465-471. \doi{10.1007/BF02293687}.
#'
#' Yee TW (2018). VGAM: Vector Generalized Linear and Additive Models. R package version 1.0-5. \url{https://CRAN.R-project.org/package=VGAM}.
#'
#' Zhang X, Mallick H, & Yi N (2016). Zero-Inflated Negative Binomial Regression for Differential Abundance Testing in Microbiome
#'     Studies. Journal of Bioinformatics and Genomics 2(2):1-9. \doi{10.18454/jbg.2016.2.2.1}.
#'
#' @examples
#' Sim1 <- corrvar2(n = 1000, k_cat = 1, k_cont = 1, method = "Polynomial",
#'   means = 0, vars = 1, skews = 0, skurts = 0, fifths = 0, sixths = 0,
#'   marginal = list(c(1/3, 2/3)), support = list(0:2),
#'   rho = matrix(c(1, 0.4, 0.4, 1), 2, 2), quiet = TRUE)
#'
#' \dontrun{
#'
#' # 2 continuous mixture, 1 binary, 1 zero-inflated Poisson, and
#' # 1 zero-inflated NB variable
#' n <- 10000
#' seed <- 1234
#'
#' # Mixture variables: Normal mixture with 2 components;
#' # mixture of Logistic(0, 1), Chisq(4), Beta(4, 1.5)
#' # Find cumulants of components of 2nd mixture variable
#' L <- calc_theory("Logistic", c(0, 1))
#' C <- calc_theory("Chisq", 4)
#' B <- calc_theory("Beta", c(4, 1.5))
#'
#' skews <- skurts <- fifths <- sixths <- NULL
#' Six <- list()
#' mix_pis <- list(c(0.4, 0.6), c(0.3, 0.2, 0.5))
#' mix_mus <- list(c(-2, 2), c(L[1], C[1], B[1]))
#' mix_sigmas <- list(c(1, 1), c(L[2], C[2], B[2]))
#' mix_skews <- list(rep(0, 2), c(L[3], C[3], B[3]))
#' mix_skurts <- list(rep(0, 2), c(L[4], C[4], B[4]))
#' mix_fifths <- list(rep(0, 2), c(L[5], C[5], B[5]))
#' mix_sixths <- list(rep(0, 2), c(L[6], C[6], B[6]))
#' mix_Six <- list(list(NULL, NULL), list(1.75, NULL, 0.03))
#' Nstcum <- calc_mixmoments(mix_pis[[1]], mix_mus[[1]], mix_sigmas[[1]],
#'   mix_skews[[1]], mix_skurts[[1]], mix_fifths[[1]], mix_sixths[[1]])
#' Mstcum <- calc_mixmoments(mix_pis[[2]], mix_mus[[2]], mix_sigmas[[2]],
#'   mix_skews[[2]], mix_skurts[[2]], mix_fifths[[2]], mix_sixths[[2]])
#' means <- c(Nstcum[1], Mstcum[1])
#' vars <- c(Nstcum[2]^2, Mstcum[2]^2)
#'
#' marginal <- list(0.3)
#' support <- list(c(0, 1))
#' lam <- 0.5
#' p_zip <- 0.1
#' pois_eps <- 0.0001
#' size <- 2
#' prob <- 0.75
#' p_zinb <- 0.2
#' nb_eps <- 0.0001
#'
#' k_cat <- k_pois <- k_nb <- 1
#' k_cont <- 0
#' k_mix <- 2
#' Rey <- matrix(0.39, 8, 8)
#' diag(Rey) <- 1
#' rownames(Rey) <- colnames(Rey) <- c("O1", "M1_1", "M1_2", "M2_1", "M2_2",
#'   "M2_3", "P1", "NB1")
#'
#' # set correlation between components of the same mixture variable to 0
#' Rey["M1_1", "M1_2"] <- Rey["M1_2", "M1_1"] <- 0
#' Rey["M2_1", "M2_2"] <- Rey["M2_2", "M2_1"] <- Rey["M2_1", "M2_3"] <- 0
#' Rey["M2_3", "M2_1"] <- Rey["M2_2", "M2_3"] <- Rey["M2_3", "M2_2"] <- 0
#'
#' # check parameter inputs
#' validpar(k_cat, k_cont, k_mix, k_pois, k_nb, "Polynomial", means,
#'   vars, skews, skurts, fifths, sixths, Six, mix_pis, mix_mus, mix_sigmas,
#'   mix_skews, mix_skurts, mix_fifths, mix_sixths, mix_Six, marginal, support,
#'   lam, p_zip, size, prob, mu = NULL, p_zinb, pois_eps, nb_eps, Rey)
#'
#' # check to make sure Rey is within the feasible correlation boundaries
#' validcorr2(n, k_cat, k_cont, k_mix, k_pois, k_nb, "Polynomial", means,
#'   vars, skews, skurts, fifths, sixths, Six, mix_pis, mix_mus, mix_sigmas,
#'   mix_skews, mix_skurts, mix_fifths, mix_sixths, mix_Six, marginal,
#'   lam, p_zip, size, prob, mu = NULL, p_zinb, pois_eps, nb_eps, Rey, seed)
#'
#' # simulate without the error loop
#' Sim2 <- corrvar2(n, k_cat, k_cont, k_mix, k_pois, k_nb, "Polynomial", means,
#'   vars, skews, skurts, fifths, sixths, Six, mix_pis, mix_mus, mix_sigmas,
#'   mix_skews, mix_skurts, mix_fifths, mix_sixths, mix_Six, marginal, support,
#'   lam, p_zip, size, prob, mu = NULL, p_zinb, pois_eps, nb_eps, Rey, seed,
#'   epsilon = 0.01)
#'
#' names(Sim2)
#'
#' # simulate with the error loop
#' Sim2_EL <- corrvar2(n, k_cat, k_cont, k_mix, k_pois, k_nb, "Polynomial",
#'   means, vars, skews, skurts, fifths, sixths, Six, mix_pis, mix_mus,
#'   mix_sigmas, mix_skews, mix_skurts, mix_fifths, mix_sixths, mix_Six,
#'   marginal, support, lam, p_zip, size, prob, mu = NULL, p_zinb, pois_eps,
#'   nb_eps, Rey, seed, errorloop = TRUE, epsilon = 0.01)
#'
#' names(Sim2_EL)
#' }
corrvar2 <- function(n = 10000, k_cat = 0, k_cont = 0, k_mix = 0, k_pois = 0,
                     k_nb = 0, method = c("Fleishman", "Polynomial"),
                     means =  NULL, vars =  NULL, skews =  NULL,
                     skurts =  NULL, fifths =  NULL, sixths =  NULL,
                     Six = list(), mix_pis = list(), mix_mus = list(),
                     mix_sigmas = list(), mix_skews =  list(),
                     mix_skurts =  list(), mix_fifths =  list(),
                     mix_sixths =  list(), mix_Six = list(), marginal = list(),
                     support = list(), lam  =  NULL, p_zip = 0, size = NULL,
                     prob = NULL, mu = NULL, p_zinb = 0, pois_eps = 0.0001,
                     nb_eps = 0.0001, rho = NULL, seed = 1234,
                     errorloop = FALSE, epsilon = 0.001, maxit = 1000,
                     use.nearPD = TRUE, Sigma = NULL, cstart = list(),
                     quiet = FALSE) {
  start.time <- Sys.time()
  k <- k_cat + k_cont + k_mix + k_pois + k_nb
  if (k_pois > 0) {
    if (length(p_zip) < k_pois)
      p_zip <- c(rep(0, k_pois - length(p_zip)), p_zip)
    if (length(pois_eps) < k_pois)
      pois_eps <- rep(0.0001, k_pois)
  }
  if (k_nb > 0) {
    if (length(prob) > 0)
      mu <- size * (1 - prob)/prob
    if (length(p_zinb) < k_nb)
      p_zinb <- c(rep(0, k_nb - length(p_zinb)), p_zinb)
    if (length(nb_eps) < k_nb)
      nb_eps <- rep(0.0001, k_nb)
  }
  if (is.null(means) & (k_cont + k_mix) > 0) {
    means <- rep(0, k_cont + k_mix)
  }
  if (is.null(vars) & (k_cont + k_mix) > 0) {
    vars <- rep(1, k_cont + k_mix)
  }
  csame.dist <- NULL
  msame.dist <- NULL
  if (length(skews) >= 2) {
    for (i in 2:length(skews)) {
      if (skews[i] %in% skews[1:(i - 1)]) {
        csame <- which(skews[1:(i - 1)] == skews[i])
        for (j in 1:length(csame)) {
          if (method == "Polynomial") {
            if ((skurts[i] == skurts[csame[j]]) &
                (fifths[i] == fifths[csame[j]]) &
                (sixths[i] == sixths[csame[j]])) {
              csame.dist <- rbind(csame.dist, c(csame[j], i))
              break
            }
          }
          if (method == "Fleishman") {
            if (skurts[i] == skurts[csame[j]]) {
              csame.dist <- rbind(csame.dist, c(csame[j], i))
              break
            }
          }
        }
      }
    }
  }
  mix_skews2 <- NULL
  mix_skurts2 <- NULL
  mix_fifths2 <- NULL
  mix_sixths2 <- NULL
  if (length(mix_pis) >= 1) {
    k.comp <- c(0, cumsum(unlist(lapply(mix_pis, length))))
    mix_mus2 <- unlist(mix_mus)
    mix_sigmas2 <- unlist(mix_sigmas)
    mix_skews2 <- unlist(mix_skews)
    mix_skurts2 <- unlist(mix_skurts)
    mix_Six2 <- list()
    if (method == "Polynomial") {
      mix_fifths2 <- unlist(mix_fifths)
      mix_sixths2 <- unlist(mix_sixths)
      if (length(mix_Six) > 0) {
        if (class(mix_Six[[1]]) == "numeric") mix_Six2 <- mix_Six
        if (class(mix_Six[[1]]) == "list") mix_Six2 <- do.call(append, mix_Six)
      }
    }
    for (i in 1:length(mix_skews2)) {
      msame.dist2 <- NULL
      if (length(skews) >= 1) {
        if (mix_skews2[i] %in% skews) {
          msame <- which(skews == mix_skews2[i])
          for (j in 1:length(msame)) {
            if (method == "Polynomial") {
              if ((mix_skurts2[i] == skurts[msame[j]]) &
                  (mix_fifths2[i] == fifths[msame[j]]) &
                  (mix_sixths2[i] == sixths[msame[j]])) {
                msame.dist2 <- c(msame[j], i)
                break
              }
            }
            if (method == "Fleishman") {
              if (mix_skurts2[i] == skurts[msame[j]]) {
                msame.dist2 <- c(msame[j], i)
                break
              }
            }
          }
        }
      }
      if (is.null(msame.dist2) & i >= 2) {
        if (mix_skews2[i] %in% mix_skews2[1:(i-1)]) {
          msame <- which(mix_skews2[1:(i - 1)] == mix_skews2[i])
          for (j in 1:length(msame)) {
            if (method == "Polynomial") {
              if ((mix_skurts2[i] == mix_skurts2[msame[j]]) &
                  (mix_fifths2[i] == mix_fifths2[msame[j]]) &
                  (mix_sixths2[i] == mix_sixths2[msame[j]])) {
                msame.dist2 <- c(k_cont + msame[j], i)
                break
              }
            }
            if (method == "Fleishman") {
              if (mix_skurts2[i] == mix_skurts2[msame[j]]) {
                msame.dist2 <- c(k_cont + msame[j], i)
                break
              }
            }
          }
        }
      }
      msame.dist <- rbind(msame.dist, msame.dist2)
    }
  }
  if ((k_cont + k_mix) >= 1) {
    SixCorr <- numeric(k_cont + length(mix_skews2))
    Valid.PDF <- numeric(k_cont + length(mix_skews2))
    if (method == "Fleishman") {
      constants <- matrix(NA, nrow = (k_cont + length(mix_skews2)), ncol = 4)
      colnames(constants) <- c("c0", "c1", "c2", "c3")
    }
    if (method == "Polynomial") {
      constants <- matrix(NA, nrow = (k_cont + length(mix_skews2)), ncol = 6)
      colnames(constants) <- c("c0", "c1", "c2", "c3", "c4", "c5")
    }
  }
  if (k_cont >= 1) {
    for (i in 1:k_cont) {
      if (!is.null(csame.dist)) {
        rind <- which(csame.dist[, 2] == i)
        if (length(rind) > 0) {
          constants[i, ] <- constants[csame.dist[rind, 1], ]
          SixCorr[i] <- SixCorr[csame.dist[rind, 1]]
          Valid.PDF[i] <- Valid.PDF[csame.dist[rind, 1]]
        }
      }
      if (sum(is.na(constants[i, ])) > 0) {
        if (length(Six) == 0) Six2 <- NULL else
          Six2 <- Six[[i]]
        if (length(cstart) == 0) cstart2 <- NULL else
          cstart2 <- cstart[[i]]
        cons <-
          suppressWarnings(find_constants(method = method, skews = skews[i],
            skurts = skurts[i], fifths = fifths[i], sixths = sixths[i],
            Six = Six2, cstart = cstart2, n = 25, seed = seed))
        if (length(cons) == 1 | is.null(cons)) {
          stop(paste("Constants can not be found for continuous variable ", i,
                     ".", sep = ""))
        }
        con_solution <- cons$constants
        SixCorr[i] <- ifelse(is.null(cons$SixCorr1), NA, cons$SixCorr1)
        Valid.PDF[i] <- cons$valid
        constants[i, ] <- con_solution
      }
    }
  }
  if (k_mix >= 1) {
    for (i in 1:length(mix_skews2)) {
      if (!is.null(msame.dist)) {
        rind <- which(msame.dist[, 2] == i)
        if (length(rind) > 0) {
          constants[(k_cont + i), ] <- constants[msame.dist[rind, 1], ]
          SixCorr[k_cont + i] <- SixCorr[msame.dist[rind, 1]]
          Valid.PDF[k_cont + i] <- Valid.PDF[msame.dist[rind, 1]]
        }
      }
      if (sum(is.na(constants[(k_cont + i), ])) > 0) {
        if (length(mix_Six2) == 0) Six2 <- NULL else
          Six2 <- mix_Six2[[i]]
        if (length(cstart) == 0) cstart2 <- NULL else
          cstart2 <- cstart[[k_cont + i]]
        cons <-
          suppressWarnings(find_constants(method = method,
            skews = mix_skews2[i], skurts = mix_skurts2[i],
            fifths = mix_fifths2[i], sixths = mix_sixths2[i], Six = Six2,
            cstart = cstart2, n = 25, seed = seed))
        if (length(cons) == 1 | is.null(cons)) {
          stop(paste("Constants can not be found for component variable ", i,
                     ".", sep = ""))
        }
        con_solution <- cons$constants
        SixCorr[k_cont + i] <- ifelse(is.null(cons$SixCorr1), NA,
                                      cons$SixCorr1)
        Valid.PDF[k_cont + i] <- cons$valid
        constants[(k_cont + i), ] <- con_solution
      }
    }
  }
  if (k_cat > 0) {
    if (length(support) == 0) {
      support <- lapply(marginal, function(x) 1:(length(x) + 1))
    } else {
      for (i in 1:k_cat) {
        if (length(support[[i]]) != (length(marginal[[i]]) + 1))
          support[[i]] <- 1:(length(marginal[[i]]) + 1)
      }
    }
  }
  if (is.null(Sigma)) {
    Sigma <- intercorr2(k_cat = k_cat, k_cont = k_cont + length(mix_skews2),
      k_pois = k_pois, k_nb = k_nb, method = method, constants = constants,
      marginal = marginal, support = support, lam = lam, p_zip = p_zip,
      size = size, mu = mu, p_zinb = p_zinb, pois_eps = pois_eps,
      nb_eps = nb_eps, rho = rho, epsilon = epsilon, maxit = maxit,
      quiet = quiet)
  }
  if (min(eigen(Sigma, symmetric = TRUE)$values) < 0) {
    if (use.nearPD == TRUE) {
      Sigma <- as.matrix(nearPD(Sigma, corr = T, keepDiag = T)$mat)
      if (quiet == FALSE)
        message("Intermediate correlation matrix is not positive definite.
Nearest positive definite matrix is used.")
    } else if (quiet == FALSE) {
      message("Intermediate correlation matrix is not positive definite.
Negative eigenvalues are replaced with 0.  Set use.nearPD = TRUE to use nearest
positive-definite matrix instead.")
    }
  }
  eig <- eigen(Sigma, symmetric = TRUE)
  sqrteigval <- diag(sqrt(pmax(eig$values, 0)), nrow = nrow(Sigma),
                     ncol = ncol(Sigma))
  eigvec <- eig$vectors
  fry <- eigvec %*% sqrteigval
  set.seed(seed)
  X <- matrix(rnorm(ncol(Sigma) * n), n)
  X <- scale(X, TRUE, FALSE)
  X <- X %*% svd(X, nu = 0)$v
  X <- scale(X, FALSE, TRUE)
  X <- fry %*% t(X)
  X <- t(X)
  if (k_cat > 0) {
    X_cat <- X[, 1:k_cat, drop = FALSE]
  }
  if ((k_cont + length(mix_skews2)) > 0) {
    X_cont <- X[, (k_cat + 1):(k_cat + k_cont + length(mix_skews2)),
                drop = FALSE]
  }
  if (k_pois > 0) {
    X_pois <- X[, (k_cat + k_cont + length(mix_skews2) + 1):(k_cat + k_cont +
                length(mix_skews2) + k_pois), drop = FALSE]
  }
  if (k_nb > 0) {
    X_nb <- X[, (ncol(X) - k_nb + 1):ncol(X), drop = FALSE]
  }
  Y_cat <- NULL
  Y_cont <- NULL
  Y_mix <- NULL
  Y_pois <- NULL
  Y_nb <- NULL
  means2 <- NULL
  vars2 <- NULL
  if (k_cat > 0) {
    Y_cat <- matrix(1, nrow = n, ncol = k_cat)
    for (i in 1:length(marginal)) {
      Y_cat[, i] <- as.integer(cut(X_cat[, i], breaks = c(min(X_cat[, i]) - 1,
                                qnorm(marginal[[i]]), max(X_cat[, i])  +  1)))
      Y_cat[, i] <- support[[i]][Y_cat[, i]]
    }
  }
  if ((k_cont + length(mix_skews2)) > 0) {
    Y <- matrix(1, nrow = n, ncol = k_cont + length(mix_skews2))
    for (i in 1:(k_cont + length(mix_skews2))) {
      if (method == "Fleishman") {
        Y[, i] <- constants[i, 1] + constants[i, 2] * X_cont[, i] +
          constants[i, 3] * X_cont[, i]^2 + constants[i, 4] * X_cont[, i]^3
      }
      if (method == "Polynomial") {
        Y[, i] <- constants[i, 1] + constants[i, 2] * X_cont[, i] +
          constants[i, 3] * X_cont[, i]^2 + constants[i, 4] * X_cont[, i]^3 +
          constants[i, 5] * X_cont[, i]^4 + constants[i, 6] * X_cont[, i]^5
      }
    }
    if (k_cont > 0) {
      Y_cont <- matrix(1, n, k_cont)
      means2 <- means[1:k_cont]
      vars2 <- vars[1:k_cont]
      for (i in 1:k_cont) {
        Y_cont[, i] <- means2[i] + sqrt(vars2[i]) * Y[, i]
      }
    }
    if (length(mix_skews2) > 0) {
      Y_mix <- matrix(1, n, length(mix_skews2))
      means2 <- c(means2, mix_mus2)
      vars2 <- c(vars2, mix_sigmas2^2)
      for (i in 1:length(mix_skews2)) {
        Y_mix[, i] <- mix_mus2[i] + mix_sigmas2[i] * Y[, (k_cont + i)]
      }
    }
  }
  if (k_pois > 0) {
    Y_pois <- matrix(1, nrow = n, ncol = k_pois)
    for (i in 1:k_pois) {
      Y_pois[, i] <- qzipois(pnorm(X_pois[, i]), lam[i], pstr0 = p_zip[i])
    }
  }
  if (k_nb > 0) {
    Y_nb <- matrix(1, nrow = n, ncol = k_nb)
    for (i in 1:k_nb) {
      Y_nb[, i] <- qzinegbin(pnorm(X_nb[, i]), size = size[i], munb = mu[i],
                             pstr0 = p_zinb[i])
    }
  }
  start.time.error <- Sys.time()
  niter <- diag(0, k, k)
  rho_calc <- cor(cbind(Y_cat, Y_cont, Y_mix, Y_pois, Y_nb))
  emax <- max(abs(rho_calc - rho))
  if (emax > epsilon & errorloop == TRUE) {
    EL <- corr_error(n = n, k_cat = k_cat,
      k_cont = k_cont + length(mix_skews2), k_pois = k_pois, k_nb = k_nb,
      method = method, means = means2, vars = vars2, constants = constants,
      marginal = marginal, support = support, lam = lam, p_zip = p_zip,
      size = size, mu = mu, p_zinb = p_zinb, seed = seed,
      epsilon = epsilon, maxit = maxit, rho0 = rho, Sigma = Sigma,
      rho_calc = rho_calc)
    Sigma <- EL$Sigma
    Y_cat <- EL$Y_cat
    Y <- EL$Y
    if (k_cont > 0) Y_cont <- EL$Y_cont[, 1:k_cont, drop = FALSE]
    if (k_mix > 0)
      Y_mix <- EL$Y_cont[, (ncol(EL$Y_cont) - length(mix_skews2) +
                              1):ncol(EL$Y_cont), drop = FALSE]
    Y_pois <- EL$Y_pois
    Y_nb <- EL$Y_nb
    niter <- EL$niter
  }
  stop.time.error <- Sys.time()
  niter <- as.data.frame(niter)
  rownames(niter) <- c(1:ncol(niter))
  colnames(niter) <- c(1:ncol(niter))
  Y_mix2 <- NULL
  if (k_mix > 0) {
    Y_mix2 <- matrix(1, n, k_mix)
    for (i in 1:k_mix) {
      set.seed(seed)
      M <- rmultinom(n, size = 1, prob = mix_pis[[i]])
      Y_mix2[, i] <- apply(t(M) * Y_mix[, (k.comp[i] + 1):k.comp[i + 1]], 1,
                           sum)
      Y_mix2[, i] <- scale(Y_mix2[, i])
      Y_mix2[, i] <- means[k_cont + i] + sqrt(vars[k_cont + i]) * Y_mix2[, i]
      seed <- seed + 1
    }
  }
  result <- list()
  if (k_cat > 0) result <- append(result, list(Y_cat = Y_cat))
  if ((k_cont + k_mix) > 0) {
    result <- append(result, list(constants = as.data.frame(constants),
                                  Y_cont = Y_cont, Y_comp = Y_mix, sixth_correction = SixCorr,
                                  valid.pdf = Valid.PDF))
    if (k_mix > 0) result <- append(result, list(Y_mix = Y_mix2))
  }
  if (k_pois > 0) result <- append(result, list(Y_pois = Y_pois))
  if (k_nb > 0) result <- append(result, list(Y_nb = Y_nb))
  stop.time <- Sys.time()
  Time.error <- round(difftime(stop.time.error, start.time.error,
                               units = "min"), 3)
  Time <- round(difftime(stop.time, start.time, units = "min"), 3)
  if (quiet == FALSE) cat("Total Simulation time:", Time, "minutes \n")
  result <- append(result, list(Sigma = Sigma, Error_Time = Time.error,
                                Time = Time, niter = niter))
  result
}
Any scripts or data that you put into this service are public.
SimCorrMix documentation built on May 2, 2019, 1:24 p.m.
rdrr.io home R language documentation Run R code online
CRAN packages Bioconductor packages R-Forge packages GitHub packages
Note that we can't provide technical support on individual packages. You should contact the package authors for that.
SimCorrMix
Simulation of Correlated Data with Multiple Variable Types Including Continuous and Count Mixture Distributions

R/corrvar2.R
In SimCorrMix: Simulation of Correlated Data with Multiple Variable Types Including Continuous and Count Mixture Distributions

Defines functions corrvar2

Documented in corrvar2

Try the SimCorrMix package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

SimCorrMix Simulation of Correlated Data with Multiple Variable Types Including Continuous and Count Mixture Distributions

R/corrvar2.R In SimCorrMix: Simulation of Correlated Data with Multiple Variable Types Including Continuous and Count Mixture Distributions

Defines functions corrvar2

Documented in corrvar2

Try the SimCorrMix package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

SimCorrMix
Simulation of Correlated Data with Multiple Variable Types Including Continuous and Count Mixture Distributions

R/corrvar2.R
In SimCorrMix: Simulation of Correlated Data with Multiple Variable Types Including Continuous and Count Mixture Distributions