R/corrvar2.R

Defines functions corrvar2

Documented in corrvar2

#' @title Generation of Correlated Ordinal, Continuous (mixture and non-mixture), and/or Count (Poisson and Negative Binomial,
#'     regular and zero-inflated) Variables: Correlation Method 2
#'
#' @description This function simulates \code{k_cat} ordinal (\eqn{r \ge 2} categories), \code{k_cont} continuous non-mixture,
#'     \code{k_mix} continuous mixture, \code{k_pois} Poisson (regular and zero-inflated), and/or \code{k_nb} Negative Binomial
#'     (regular and zero-inflated) variables with a specified correlation matrix \code{rho}.  The variables are generated from
#'     multivariate normal variables with intermediate correlation matrix \code{Sigma}, calculated by \code{\link[SimCorrMix]{intercorr2}},
#'     and then transformed.  The intermediate correlations involving count variables are determined using \strong{correlation method 2}.
#'     The \emph{ordering} of the variables in \code{rho} must be 1st ordinal, 2nd continuous non-mixture,
#'     3rd components of the continuous mixture, 4th regular Poisson, 5th zero-inflated Poisson, 6th regular NB, and 7th zero-inflated NB.
#'     Note that it is possible for \code{k_cat}, \code{k_cont}, \code{k_mix}, \code{k_pois}, and/or \code{k_nb} to be 0.
#'     Simulation occurs at the component-level for continuous mixture distributions.  The target correlation matrix is specified in terms of
#'     correlations with components of continuous mixture variables.  There are no parameter input checks
#'     in order to decrease simulation time.  All inputs should be checked prior to simulation with \code{\link[SimCorrMix]{validpar}}
#'     and \code{\link[SimCorrMix]{validcorr2}}.  Summaries for the simulation results can be obtained with \code{\link[SimCorrMix]{summary_var}}.
#'
#'     All continuous variables are simulated using either Fleishman's third-order (\code{method} = "Fleishman", \doi{10.1007/BF02293811}) or Headrick's fifth-order
#'     (\code{method} = "Polynomial", \doi{10.1016/S0167-9473(02)00072-5}) power method transformation.  It works by matching standardized
#'     cumulants -- the first four (mean, variance, skew, and standardized kurtosis) for Fleishman's method, or the first six (mean,
#'     variance, skew, standardized kurtosis, and standardized fifth and sixth cumulants) for Headrick's method.  The transformation is
#'     expressed as follows:
#'
#'     \deqn{Y = c_0 + c_1 * Z + c_2 * Z^2 + c_3 * Z^3 + c_4 * Z^4 + c_5 * Z^5,  Z \sim N(0,1),}
#'
#'     where \eqn{c_4} and \eqn{c_5} both equal \eqn{0} for Fleishman's method.  The real constants are calculated by \cr
#'     \code{\link[SimMultiCorrData]{find_constants}}.  Continuous mixture variables are generated componentwise and then transformed to
#'     the desired mixture variables based on random multinomial variables generated from the mixing probabilities.  Ordinal variables
#'     (\eqn{r \ge 2} categories) are generated by discretizing the standard normal
#'     variables at quantiles.  These quantiles are determined by evaluating the inverse standard normal CDF at the cumulative
#'     probabilities defined by each variable's marginal distribution.  Count variables are generated using the inverse CDF method.  The
#'     CDF of a standard normal variable has a uniform distribution.  The appropriate quantile function (F_Y)^(-1) is applied to
#'     this uniform variable with the designated parameters to generate the count variable: Y = (F_Y)^(-1)(Phi(Z)).  The Negative
#'     Binomial variable represents the number of failures which occur in a sequence of Bernoulli trials before the target number of
#'     successes is achieved.  Zero-inflated Poisson or NB variables are obtained by setting the probability of a structural zero to be
#'     greater than 0.  The optional error loop attempts to correct the final pairwise correlations to be within a user-specified
#'     precision value (\code{epsilon}) of the target correlations.
#'
#'     The vignette \bold{Variable Types} discusses how each of the different variables are generated and describes the required parameters.
#'
#'     The vignette \bold{Overall Workflow for Generation of Correlated Data} provides a detailed example discussing the step-by-step simulation process and comparing correlation methods 1 and 2.
#'
#' @section Overview of Method 2:
#'     The intermediate correlations used in method 2 are less simulation based than those in method 1, and no seed is needed.
#'     Their calculations involve greater utilization of correction loops which make iterative adjustments until a maximum error
#'     has been reached (if possible).  In addition, method 2 differs from method 1 in the following ways:
#'
#'     1) The intermediate correlations involving \bold{count variables} are based on the methods of Barbiero & Ferrari (2012,
#'     \doi{10.1080/00273171.2012.692630}, 2015, \doi{10.1002/asmb.2072}).
#'     The Poisson or Negative Binomial support is made finite by removing a small user-specified value (i.e. 1e-06) from the total
#'     cumulative probability.  This truncation factor may differ for each count variable.  The count variables are subsequently
#'     treated as ordinal and intermediate correlations are calculated using the correction loop of
#'     \code{\link[SimCorrMix]{ord_norm}}.
#'
#'     2) The \bold{continuous - count variable} correlations are based on an extension of the method of Demirtas et al. (2012,
#'     \doi{10.1002/sim.5362}), and the count
#'     variables are treated as ordinal.  The correction factor is the product of the power method correlation between the
#'     continuous variable and the normal variable used to generate it (see Headrick & Kowalchuk, 2007, \doi{10.1080/10629360600605065})
#'     and the point-polyserial correlation between the ordinalized count variable and the normal variable used to generate it (see Olsson et al., 1982,
#'     \doi{10.1007/BF02294164}).
#'     The intermediate correlations are the ratio of the target correlations to the correction factor.
#'
#'     Please see the \bold{Comparison of Correlation Methods 1 and 2} vignette for more information and a step-by-step overview of the simulation process.
#'
#' @section Choice of Fleishman's third-order or Headrick's fifth-order method:
#'     Using the fifth-order approximation allows additional control over the fifth and sixth moments of the generated distribution, improving
#'     accuracy.  In addition, the range of feasible standardized kurtosis (\eqn{\gamma_2}) values, given skew (\eqn{\gamma_1}) and standardized fifth (\eqn{\gamma_3}) and sixth
#'     (\eqn{\gamma_4}) cumulants, is larger than with Fleishman's method (see \code{\link[SimMultiCorrData]{calc_lower_skurt}}).
#'     For example, the Fleishman method can not be used to generate a non-normal distribution with a ratio of
#'     \eqn{\gamma_1^2/\gamma_2 > 9/14} (see Headrick & Kowalchuk, 2007).  This eliminates the Chi-squared family of distributions, which has
#'     a constant ratio of \eqn{\gamma_1^2/\gamma_2 = 2/3}.  The fifth-order method also generates more distributions with valid PDF's.
#'     However, if the fifth and sixth cumulants are unknown or do not exist, the Fleishman approximation should be used.
#'
#' @section Reasons for Function Errors:
#'     1) The most likely cause for function errors is that no solutions to \code{\link[SimMultiCorrData]{fleish}} or
#'     \code{\link[SimMultiCorrData]{poly}} converged when using \code{\link[SimMultiCorrData]{find_constants}}.  If this happens,
#'     the simulation will stop.  It may help to first use \code{\link[SimMultiCorrData]{find_constants}} for each continuous variable to
#'     determine if a sixth cumulant correction value is needed.  The solutions can be used as starting values (see \code{cstart} below).
#'     If the standardized cumulants are obtained from \code{calc_theory}, the user may need to use rounded values as inputs (i.e.
#'     \code{skews = round(skews, 8)}).  For example, in order to ensure that skew is exactly 0 for symmetric distributions.
#'
#'     2) The kurtosis may be outside the region of possible values.  There is an associated lower boundary for kurtosis associated
#'     with a given skew (for Fleishman's method) or skew and fifth and sixth cumulants (for Headrick's method).  Use
#'     \code{\link[SimMultiCorrData]{calc_lower_skurt}} to determine the boundary for a given set of cumulants.
#'
#'     3) The feasibility of the final correlation matrix \code{rho}, given the
#'     distribution parameters, should be checked first using \code{\link[SimCorrMix]{validcorr2}}.  This function either checks
#'     if a given \code{rho} is plausible or returns the lower and upper final correlation limits.  It should be noted that even if a target
#'     correlation matrix is within the "plausible range," it still may not be possible to achieve the desired matrix.  This happens most
#'     frequently when generating ordinal variables or using negative correlations.  The error loop frequently fixes these problems.
#'
#' @param n the sample size (i.e. the length of each simulated variable; default = 10000)
#' @param k_cat the number of ordinal (r >= 2 categories) variables (default = 0)
#' @param k_cont the number of continuous non-mixture variables (default = 0)
#' @param k_mix the number of continuous mixture variables (default = 0)
#' @param k_pois the number of regular Poisson and zero-inflated Poisson variables (default = 0)
#' @param k_nb the number of regular Negative Binomial and zero-inflated Negative Binomial variables (default = 0)
#' @param method the method used to generate the \code{k_cont} non-mixture and \code{k_mix} mixture continuous variables.
#'     "Fleishman" uses Fleishman's third-order polynomial transformation and "Polynomial" uses Headrick's fifth-order transformation.
#' @param means a vector of means for the \code{k_cont} non-mixture and \code{k_mix} mixture continuous variables
#'     (i.e. \code{rep(0, (k_cont + k_mix))})
#' @param vars a vector of variances for the \code{k_cont} non-mixture and \code{k_mix} mixture continuous variables
#'     (i.e. \code{rep(1, (k_cont + k_mix))})
#' @param skews a vector of skewness values for the \code{k_cont} non-mixture continuous variables
#' @param skurts a vector of standardized kurtoses (kurtosis - 3, so that normal variables have a value of 0)
#'     for the \code{k_cont} non-mixture continuous variables
#' @param fifths a vector of standardized fifth cumulants for the \code{k_cont} non-mixture continuous variables
#'     (not necessary for \code{method} = "Fleishman")
#' @param sixths a vector of standardized sixth cumulants for the \code{k_cont} non-mixture continuous variables
#'     (not necessary for \code{method} = "Fleishman")
#' @param Six a list of vectors of sixth cumulant correction values for the \code{k_cont} non-mixture continuous variables
#'     if no valid PDF constants are found, \cr
#'     ex: \code{Six = list(seq(0.01, 2, 0.01), seq(1, 10, 0.5))}; if no correction is desired for \eqn{Y_{cont_i}}, set the i-th
#'     list component equal to \code{NULL}; if no correction is desired for any of the \eqn{Y_{cont}} keep as \code{Six = list()}
#'     (not necessary for \code{method} = "Fleishman")
#' @param mix_pis a list of length \code{k_mix} with i-th component a vector of mixing probabilities that sum to 1 for component distributions of \eqn{Y_{mix_i}}
#' @param mix_mus a list of length \code{k_mix} with i-th component a vector of means for component distributions of \eqn{Y_{mix_i}}
#' @param mix_sigmas a list of length \code{k_mix} with i-th component a vector of standard deviations for component distributions of \eqn{Y_{mix_i}}
#' @param mix_skews a list of length \code{k_mix} with i-th component a vector of skew values for component distributions of \eqn{Y_{mix_i}}
#' @param mix_skurts a list of length \code{k_mix} with i-th component a vector of standardized kurtoses for component distributions of \eqn{Y_{mix_i}}
#' @param mix_fifths a list of length \code{k_mix} with i-th component a vector of standardized fifth cumulants for component distributions of \eqn{Y_{mix_i}}
#'     (not necessary for \code{method} = "Fleishman")
#' @param mix_sixths a list of length \code{k_mix} with i-th component a vector of standardized sixth cumulants for component distributions of \eqn{Y_{mix_i}}
#'     (not necessary for \code{method} = "Fleishman")
#' @param mix_Six a list of length \code{k_mix} with i-th component a list of vectors of sixth cumulant correction values
#'     for component distributions of \eqn{Y_{mix_i}}; use \code{NULL} if no correction is desired for a given component or
#'     mixture variable; if no correction is desired for any of the \eqn{Y_{mix}} keep as \code{mix_Six = list()}
#'     (not necessary for \code{method} = "Fleishman")
#' @param marginal a list of length equal to \code{k_cat}; the i-th element is a vector of the cumulative
#'     probabilities defining the marginal distribution of the i-th variable;
#'     if the variable can take r values, the vector will contain r - 1 probabilities (the r-th is assumed to be 1);
#'     for binary variables, these should be input the same as for ordinal variables with more than 2 categories (i.e. the user-specified
#'     probability is the probability of the 1st category, which has the smaller support value)
#' @param support a list of length equal to \code{k_cat}; the i-th element is a vector containing the r ordered support values;
#'     if not provided (i.e. \code{support = list()}), the default is for the i-th element to be the vector 1, ..., r
#' @param lam a vector of lambda (mean > 0) constants for the Poisson variables (see \code{stats::dpois}); the order should be
#'     1st regular Poisson variables, 2nd zero-inflated Poisson variables
#' @param p_zip a vector of probabilities of structural zeros (not including zeros from the Poisson distribution) for the
#'     zero-inflated Poisson variables (see \code{VGAM::dzipois}); if \code{p_zip} = 0, \eqn{Y_{pois}} has a regular Poisson
#'     distribution; if \code{p_zip} is in (0, 1), \eqn{Y_{pois}} has a zero-inflated Poisson distribution;
#'     if \code{p_zip} is in \code{(-(exp(lam) - 1)^(-1), 0)}, \eqn{Y_{pois}} has a zero-deflated Poisson distribution and \code{p_zip}
#'     is not a probability; if \code{p_zip = -(exp(lam) - 1)^(-1)}, \eqn{Y_{pois}} has a positive-Poisson distribution
#'     (see \code{VGAM::dpospois}); if \code{length(p_zip) < length(lam)}, the missing values are set to 0 (and ordered 1st)
#' @param size a vector of size parameters for the Negative Binomial variables (see \code{stats::dnbinom}); the order should be
#'     1st regular NB variables, 2nd zero-inflated NB variables
#' @param prob a vector of success probability parameters for the NB variables; order the same as in \code{size}
#' @param mu a vector of mean parameters for the NB variables (*Note: either \code{prob} or \code{mu} should be supplied for all Negative Binomial variables,
#'     not a mixture; default = NULL); order the same as in \code{size}; for zero-inflated NB this refers to
#'     the mean of the NB distribution (see \code{VGAM::dzinegbin})
#' @param p_zinb a vector of probabilities of structural zeros (not including zeros from the NB distribution) for the zero-inflated NB variables
#'     (see \code{VGAM::dzinegbin}); if \code{p_zinb} = 0, \eqn{Y_{nb}} has a regular NB distribution;
#'     if \code{p_zinb} is in \code{(-prob^size/(1 - prob^size),} \code{0)}, \eqn{Y_{nb}} has a zero-deflated NB distribution and \code{p_zinb}
#'     is not a probability; if \code{p_zinb = -prob^size/(1 - prob^size)}, \eqn{Y_{nb}} has a positive-NB distribution (see
#'     \code{VGAM::dposnegbin}); if \code{length(p_zinb) < length(size)}, the missing values are set to 0 (and ordered 1st)
#' @param pois_eps a vector of length \code{k_pois} containing total cumulative probability truncation values; if none are provided,
#'     the default is 0.0001 for each variable
#' @param nb_eps a vector of length \code{k_nb} containing total cumulative probability truncation values; if none are provided,
#'     the default is 0.0001 for each variable
#' @param rho the target correlation matrix which must be ordered
#'     \emph{1st ordinal, 2nd continuous non-mixture, 3rd components of continuous mixtures, 4th regular Poisson, 5th zero-inflated Poisson,
#'     6th regular NB, 7th zero-inflated NB}; note that \code{rho} is specified in terms of the components of \eqn{Y_{mix}}
#' @param seed the seed value for random number generation (default = 1234)
#' @param errorloop if TRUE, uses \code{\link[SimCorrMix]{corr_error}} to attempt to correct final pairwise correlations to be within
#'     \code{epsilon} of target pairwise correlations (default = FALSE)
#' @param epsilon the maximum acceptable error between the final and target pairwise correlations (default = 0.001)
#'     in the calculation of ordinal intermediate correlations with \code{\link[SimCorrMix]{ord_norm}} or in the error loop
#' @param maxit the maximum number of iterations to use (default = 1000) in the calculation of ordinal
#'     intermediate correlations with \code{\link[SimCorrMix]{ord_norm}} or in the error loop
#' @param use.nearPD TRUE to convert the overall intermediate correlation matrix to the nearest positive definite matrix with \code{Matrix::nearPD} if
#'     necessary; if FALSE the negative eigenvalues are replaced with 0 if necessary
#' @param Sigma an intermediate correlation matrix to use if the user wants to provide one, else it is calculated within by
#'     \code{\link[SimCorrMix]{intercorr2}}
#' @param cstart a list of length equal to \code{k_cont} + the total number of mixture components containing initial values for root-solving
#'     algorithm used in \code{\link[SimMultiCorrData]{find_constants}}.  If user specified, each list element must be input as a matrix.
#'     For \code{method} = "Fleishman", each should have 3 columns for \eqn{c_1, c_2, c_3};
#'     for \code{method} = "Polynomial", each should have 5 columns for \eqn{c_1, c_2, c_3, c_4, c_5}.  If no starting values are specified for
#'     a given component, that list element should be \code{NULL}.
#' @param quiet if FALSE prints simulation messages, if TRUE suppresses message printing
#' @import SimMultiCorrData
#' @importFrom stats cor dbeta dbinom dchisq density dexp df dgamma dlnorm dlogis dmultinom dnbinom dnorm dpois dt dunif dweibull ecdf
#'     median pbeta pbinom pchisq pexp pf pgamma plnorm plogis pnbinom pnorm ppois pt punif pweibull qbeta qbinom qchisq qexp qf qgamma
#'     qlnorm qlogis qnbinom qnorm qpois qt quantile qunif qweibull rbeta rbinom rchisq rexp rf rgamma rlnorm rlogis rmultinom rnbinom
#'     rnorm rpois rt runif rweibull sd uniroot var
#' @import utils
#' @importFrom Matrix nearPD
#' @importFrom VGAM qzipois qzinegbin
#' @export
#' @keywords simulation continuous mixture ordinal Poisson NegativeBinomial Fleishman Headrick method2
#' @seealso \code{\link[SimMultiCorrData]{find_constants}}, \code{\link[SimCorrMix]{validpar}}, \code{\link[SimCorrMix]{validcorr2}},
#'     \code{\link[SimCorrMix]{intercorr2}}, \code{\link[SimCorrMix]{corr_error}}, \code{\link[SimCorrMix]{summary_var}}
#' @return A list whose components vary based on the type of simulated variables.
#' @return If \bold{ordinal variables} are produced: \code{Y_cat} the ordinal variables,
#'
#' @return If \bold{continuous variables} are produced:
#'
#'     \code{constants} a data.frame of the constants,
#'
#'     \code{Y_cont} the continuous non-mixture variables,
#'
#'     \code{Y_comp} the components of the continuous mixture variables,
#'
#'     \code{Y_mix} the continuous mixture variables,
#'
#'     \code{sixth_correction} a list of sixth cumulant correction values,
#'
#'     \code{valid.pdf} a vector where the i-th element is "TRUE" if the constants for the i-th continuous variable generate a valid PDF, else "FALSE"
#' @return If \bold{Poisson variables} are produced: \code{Y_pois} the regular and zero-inflated Poisson variables,
#'
#' @return If \bold{Negative Binomial variables} are produced: \code{Y_nb} the regular and zero-inflated Negative Binomial variables,
#'
#' @return Additionally, the following elements:
#'
#'     \code{Sigma} the intermediate correlation matrix (after the error loop),
#'
#'     \code{Error_Time} the time in minutes required to use the error loop,
#'
#'     \code{Time} the total simulation time in minutes,
#'
#'     \code{niter} a matrix of the number of iterations used for each variable in the error loop,
#'
#' @references
#' Barbiero A & Ferrari PA (2015). Simulation of correlated Poisson variables. Applied Stochastic Models in
#'     Business and Industry, 31:669-80. \doi{10.1002/asmb.2072}.
#'
#' Barbiero A & Ferrari PA (2015). GenOrd: Simulation of Discrete Random Variables with Given
#'     Correlation Matrix and Marginal Distributions. R package version 1.4.0. \cr \url{https://CRAN.R-project.org/package=GenOrd}
#'
#' Davenport JW, Bezder JC, & Hathaway RJ (1988). Parameter Estimation for Finite Mixture Distributions.
#'     Computers & Mathematics with Applications, 15(10):819-28.
#'
#' Demirtas H (2006). A method for multivariate ordinal data generation given marginal distributions and correlations. Journal of Statistical
#'     Computation and Simulation, 76(11):1017-1025. \cr \doi{10.1080/10629360600569246}.
#'
#' Demirtas H (2014). Joint Generation of Binary and Nonnormal Continuous Data. Biometrics & Biostatistics, S12.
#'
#' Demirtas H, Hedeker D, & Mermelstein RJ (2012). Simulation of massive public health data by power polynomials.
#'     Statistics in Medicine, 31(27):3337-3346. \doi{10.1002/sim.5362}.
#'
#' Everitt BS (1996). An Introduction to Finite Mixture Distributions. Statistical Methods in Medical Research, 5(2):107-127. \doi{10.1177/096228029600500202}.
#'
#' Ferrari PA & Barbiero A (2012). Simulating ordinal data. Multivariate Behavioral Research, 47(4): 566-589.
#'     \doi{10.1080/00273171.2012.692630}.
#'
#' Fialkowski AC (2018). SimMultiCorrData: Simulation of Correlated Data with Multiple Variable Types. R package version 0.2.2.
#'     \url{https://CRAN.R-project.org/package=SimMultiCorrData}.
#'
#' Fleishman AI (1978). A Method for Simulating Non-normal Distributions. Psychometrika, 43:521-532. \doi{10.1007/BF02293811}.
#'
#' Headrick TC (2002). Fast Fifth-order Polynomial Transforms for Generating Univariate and Multivariate
#'     Non-normal Distributions. Computational Statistics & Data Analysis, 40(4):685-711. \doi{10.1016/S0167-9473(02)00072-5}.
#'     (\href{http://www.sciencedirect.com/science/article/pii/S0167947302000725}{ScienceDirect})
#'
#' Headrick TC (2004). On Polynomial Transformations for Simulating Multivariate Nonnormal Distributions.
#'     Journal of Modern Applied Statistical Methods, 3(1):65-71. \doi{10.22237/jmasm/1083370080}.
#'
#' Headrick TC, Kowalchuk RK (2007). The Power Method Transformation: Its Probability Density Function, Distribution
#'     Function, and Its Further Use for Fitting Data. Journal of Statistical Computation and Simulation, 77:229-249. \doi{10.1080/10629360600605065}.
#'
#' Headrick TC, Sawilowsky SS (1999). Simulating Correlated Non-normal Distributions: Extending the Fleishman Power
#'     Method. Psychometrika, 64:25-35. \doi{10.1007/BF02294317}.
#'
#' Headrick TC, Sheng Y, & Hodis FA (2007). Numerical Computing and Graphics for the Power Method Transformation Using
#'     Mathematica. Journal of Statistical Software, 19(3):1 - 17. \cr \doi{10.18637/jss.v019.i03}.
#'
#' Higham N (2002). Computing the nearest correlation matrix - a problem from finance; IMA Journal of Numerical Analysis 22:329-343.
#'
#' Ismail N & Zamani H (2013). Estimation of Claim Count Data Using Negative Binomial, Generalized Poisson, Zero-Inflated Negative Binomial and
#'     Zero-Inflated Generalized Poisson Regression Models. Casualty Actuarial Society E-Forum 41(20):1-28.
#'
#' Lambert D (1992). Zero-Inflated Poisson Regression, with an Application to Defects in Manufacturing. Technometrics 34(1):1-14.
#'
#' Olsson U, Drasgow F, & Dorans NJ (1982). The Polyserial Correlation Coefficient. Psychometrika, 47(3):337-47.
#'     \doi{10.1007/BF02294164}.
#'
#' Pearson RK (2011). Exploring Data in Engineering, the Sciences, and Medicine. In. New York: Oxford University Press.
#'
#' Schork NJ, Allison DB, & Thiel B (1996). Mixture Distributions in Human Genetics Research. Statistical Methods in Medical Research,
#'     5:155-178. \doi{10.1177/096228029600500204}.
#'
#' Vale CD & Maurelli VA (1983). Simulating Multivariate Nonnormal Distributions. Psychometrika, 48:465-471. \doi{10.1007/BF02293687}.
#'
#' Yee TW (2018). VGAM: Vector Generalized Linear and Additive Models. R package version 1.0-5. \url{https://CRAN.R-project.org/package=VGAM}.
#'
#' Zhang X, Mallick H, & Yi N (2016). Zero-Inflated Negative Binomial Regression for Differential Abundance Testing in Microbiome
#'     Studies. Journal of Bioinformatics and Genomics 2(2):1-9. \doi{10.18454/jbg.2016.2.2.1}.
#'
#' @examples
#' Sim1 <- corrvar2(n = 1000, k_cat = 1, k_cont = 1, method = "Polynomial",
#'   means = 0, vars = 1, skews = 0, skurts = 0, fifths = 0, sixths = 0,
#'   marginal = list(c(1/3, 2/3)), support = list(0:2),
#'   rho = matrix(c(1, 0.4, 0.4, 1), 2, 2), quiet = TRUE)
#'
#' \dontrun{
#'
#' # 2 continuous mixture, 1 binary, 1 zero-inflated Poisson, and
#' # 1 zero-inflated NB variable
#' n <- 10000
#' seed <- 1234
#'
#' # Mixture variables: Normal mixture with 2 components;
#' # mixture of Logistic(0, 1), Chisq(4), Beta(4, 1.5)
#' # Find cumulants of components of 2nd mixture variable
#' L <- calc_theory("Logistic", c(0, 1))
#' C <- calc_theory("Chisq", 4)
#' B <- calc_theory("Beta", c(4, 1.5))
#'
#' skews <- skurts <- fifths <- sixths <- NULL
#' Six <- list()
#' mix_pis <- list(c(0.4, 0.6), c(0.3, 0.2, 0.5))
#' mix_mus <- list(c(-2, 2), c(L[1], C[1], B[1]))
#' mix_sigmas <- list(c(1, 1), c(L[2], C[2], B[2]))
#' mix_skews <- list(rep(0, 2), c(L[3], C[3], B[3]))
#' mix_skurts <- list(rep(0, 2), c(L[4], C[4], B[4]))
#' mix_fifths <- list(rep(0, 2), c(L[5], C[5], B[5]))
#' mix_sixths <- list(rep(0, 2), c(L[6], C[6], B[6]))
#' mix_Six <- list(list(NULL, NULL), list(1.75, NULL, 0.03))
#' Nstcum <- calc_mixmoments(mix_pis[[1]], mix_mus[[1]], mix_sigmas[[1]],
#'   mix_skews[[1]], mix_skurts[[1]], mix_fifths[[1]], mix_sixths[[1]])
#' Mstcum <- calc_mixmoments(mix_pis[[2]], mix_mus[[2]], mix_sigmas[[2]],
#'   mix_skews[[2]], mix_skurts[[2]], mix_fifths[[2]], mix_sixths[[2]])
#' means <- c(Nstcum[1], Mstcum[1])
#' vars <- c(Nstcum[2]^2, Mstcum[2]^2)
#'
#' marginal <- list(0.3)
#' support <- list(c(0, 1))
#' lam <- 0.5
#' p_zip <- 0.1
#' pois_eps <- 0.0001
#' size <- 2
#' prob <- 0.75
#' p_zinb <- 0.2
#' nb_eps <- 0.0001
#'
#' k_cat <- k_pois <- k_nb <- 1
#' k_cont <- 0
#' k_mix <- 2
#' Rey <- matrix(0.39, 8, 8)
#' diag(Rey) <- 1
#' rownames(Rey) <- colnames(Rey) <- c("O1", "M1_1", "M1_2", "M2_1", "M2_2",
#'   "M2_3", "P1", "NB1")
#'
#' # set correlation between components of the same mixture variable to 0
#' Rey["M1_1", "M1_2"] <- Rey["M1_2", "M1_1"] <- 0
#' Rey["M2_1", "M2_2"] <- Rey["M2_2", "M2_1"] <- Rey["M2_1", "M2_3"] <- 0
#' Rey["M2_3", "M2_1"] <- Rey["M2_2", "M2_3"] <- Rey["M2_3", "M2_2"] <- 0
#'
#' # check parameter inputs
#' validpar(k_cat, k_cont, k_mix, k_pois, k_nb, "Polynomial", means,
#'   vars, skews, skurts, fifths, sixths, Six, mix_pis, mix_mus, mix_sigmas,
#'   mix_skews, mix_skurts, mix_fifths, mix_sixths, mix_Six, marginal, support,
#'   lam, p_zip, size, prob, mu = NULL, p_zinb, pois_eps, nb_eps, Rey)
#'
#' # check to make sure Rey is within the feasible correlation boundaries
#' validcorr2(n, k_cat, k_cont, k_mix, k_pois, k_nb, "Polynomial", means,
#'   vars, skews, skurts, fifths, sixths, Six, mix_pis, mix_mus, mix_sigmas,
#'   mix_skews, mix_skurts, mix_fifths, mix_sixths, mix_Six, marginal,
#'   lam, p_zip, size, prob, mu = NULL, p_zinb, pois_eps, nb_eps, Rey, seed)
#'
#' # simulate without the error loop
#' Sim2 <- corrvar2(n, k_cat, k_cont, k_mix, k_pois, k_nb, "Polynomial", means,
#'   vars, skews, skurts, fifths, sixths, Six, mix_pis, mix_mus, mix_sigmas,
#'   mix_skews, mix_skurts, mix_fifths, mix_sixths, mix_Six, marginal, support,
#'   lam, p_zip, size, prob, mu = NULL, p_zinb, pois_eps, nb_eps, Rey, seed,
#'   epsilon = 0.01)
#'
#' names(Sim2)
#'
#' # simulate with the error loop
#' Sim2_EL <- corrvar2(n, k_cat, k_cont, k_mix, k_pois, k_nb, "Polynomial",
#'   means, vars, skews, skurts, fifths, sixths, Six, mix_pis, mix_mus,
#'   mix_sigmas, mix_skews, mix_skurts, mix_fifths, mix_sixths, mix_Six,
#'   marginal, support, lam, p_zip, size, prob, mu = NULL, p_zinb, pois_eps,
#'   nb_eps, Rey, seed, errorloop = TRUE, epsilon = 0.01)
#'
#' names(Sim2_EL)
#' }
corrvar2 <- function(n = 10000, k_cat = 0, k_cont = 0, k_mix = 0, k_pois = 0,
                     k_nb = 0, method = c("Fleishman", "Polynomial"),
                     means =  NULL, vars =  NULL, skews =  NULL,
                     skurts =  NULL, fifths =  NULL, sixths =  NULL,
                     Six = list(), mix_pis = list(), mix_mus = list(),
                     mix_sigmas = list(), mix_skews =  list(),
                     mix_skurts =  list(), mix_fifths =  list(),
                     mix_sixths =  list(), mix_Six = list(), marginal = list(),
                     support = list(), lam  =  NULL, p_zip = 0, size = NULL,
                     prob = NULL, mu = NULL, p_zinb = 0, pois_eps = 0.0001,
                     nb_eps = 0.0001, rho = NULL, seed = 1234,
                     errorloop = FALSE, epsilon = 0.001, maxit = 1000,
                     use.nearPD = TRUE, Sigma = NULL, cstart = list(),
                     quiet = FALSE) {
  start.time <- Sys.time()
  k <- k_cat + k_cont + k_mix + k_pois + k_nb
  if (k_pois > 0) {
    if (length(p_zip) < k_pois)
      p_zip <- c(rep(0, k_pois - length(p_zip)), p_zip)
    if (length(pois_eps) < k_pois)
      pois_eps <- rep(0.0001, k_pois)
  }
  if (k_nb > 0) {
    if (length(prob) > 0)
      mu <- size * (1 - prob)/prob
    if (length(p_zinb) < k_nb)
      p_zinb <- c(rep(0, k_nb - length(p_zinb)), p_zinb)
    if (length(nb_eps) < k_nb)
      nb_eps <- rep(0.0001, k_nb)
  }
  if (is.null(means) & (k_cont + k_mix) > 0) {
    means <- rep(0, k_cont + k_mix)
  }
  if (is.null(vars) & (k_cont + k_mix) > 0) {
    vars <- rep(1, k_cont + k_mix)
  }
  csame.dist <- NULL
  msame.dist <- NULL
  if (length(skews) >= 2) {
    for (i in 2:length(skews)) {
      if (skews[i] %in% skews[1:(i - 1)]) {
        csame <- which(skews[1:(i - 1)] == skews[i])
        for (j in 1:length(csame)) {
          if (method == "Polynomial") {
            if ((skurts[i] == skurts[csame[j]]) &
                (fifths[i] == fifths[csame[j]]) &
                (sixths[i] == sixths[csame[j]])) {
              csame.dist <- rbind(csame.dist, c(csame[j], i))
              break
            }
          }
          if (method == "Fleishman") {
            if (skurts[i] == skurts[csame[j]]) {
              csame.dist <- rbind(csame.dist, c(csame[j], i))
              break
            }
          }
        }
      }
    }
  }
  mix_skews2 <- NULL
  mix_skurts2 <- NULL
  mix_fifths2 <- NULL
  mix_sixths2 <- NULL
  if (length(mix_pis) >= 1) {
    k.comp <- c(0, cumsum(unlist(lapply(mix_pis, length))))
    mix_mus2 <- unlist(mix_mus)
    mix_sigmas2 <- unlist(mix_sigmas)
    mix_skews2 <- unlist(mix_skews)
    mix_skurts2 <- unlist(mix_skurts)
    mix_Six2 <- list()
    if (method == "Polynomial") {
      mix_fifths2 <- unlist(mix_fifths)
      mix_sixths2 <- unlist(mix_sixths)
      if (length(mix_Six) > 0) {
        if (class(mix_Six[[1]]) == "numeric") mix_Six2 <- mix_Six
        if (class(mix_Six[[1]]) == "list") mix_Six2 <- do.call(append, mix_Six)
      }
    }
    for (i in 1:length(mix_skews2)) {
      msame.dist2 <- NULL
      if (length(skews) >= 1) {
        if (mix_skews2[i] %in% skews) {
          msame <- which(skews == mix_skews2[i])
          for (j in 1:length(msame)) {
            if (method == "Polynomial") {
              if ((mix_skurts2[i] == skurts[msame[j]]) &
                  (mix_fifths2[i] == fifths[msame[j]]) &
                  (mix_sixths2[i] == sixths[msame[j]])) {
                msame.dist2 <- c(msame[j], i)
                break
              }
            }
            if (method == "Fleishman") {
              if (mix_skurts2[i] == skurts[msame[j]]) {
                msame.dist2 <- c(msame[j], i)
                break
              }
            }
          }
        }
      }
      if (is.null(msame.dist2) & i >= 2) {
        if (mix_skews2[i] %in% mix_skews2[1:(i-1)]) {
          msame <- which(mix_skews2[1:(i - 1)] == mix_skews2[i])
          for (j in 1:length(msame)) {
            if (method == "Polynomial") {
              if ((mix_skurts2[i] == mix_skurts2[msame[j]]) &
                  (mix_fifths2[i] == mix_fifths2[msame[j]]) &
                  (mix_sixths2[i] == mix_sixths2[msame[j]])) {
                msame.dist2 <- c(k_cont + msame[j], i)
                break
              }
            }
            if (method == "Fleishman") {
              if (mix_skurts2[i] == mix_skurts2[msame[j]]) {
                msame.dist2 <- c(k_cont + msame[j], i)
                break
              }
            }
          }
        }
      }
      msame.dist <- rbind(msame.dist, msame.dist2)
    }
  }
  if ((k_cont + k_mix) >= 1) {
    SixCorr <- numeric(k_cont + length(mix_skews2))
    Valid.PDF <- numeric(k_cont + length(mix_skews2))
    if (method == "Fleishman") {
      constants <- matrix(NA, nrow = (k_cont + length(mix_skews2)), ncol = 4)
      colnames(constants) <- c("c0", "c1", "c2", "c3")
    }
    if (method == "Polynomial") {
      constants <- matrix(NA, nrow = (k_cont + length(mix_skews2)), ncol = 6)
      colnames(constants) <- c("c0", "c1", "c2", "c3", "c4", "c5")
    }
  }
  if (k_cont >= 1) {
    for (i in 1:k_cont) {
      if (!is.null(csame.dist)) {
        rind <- which(csame.dist[, 2] == i)
        if (length(rind) > 0) {
          constants[i, ] <- constants[csame.dist[rind, 1], ]
          SixCorr[i] <- SixCorr[csame.dist[rind, 1]]
          Valid.PDF[i] <- Valid.PDF[csame.dist[rind, 1]]
        }
      }
      if (sum(is.na(constants[i, ])) > 0) {
        if (length(Six) == 0) Six2 <- NULL else
          Six2 <- Six[[i]]
        if (length(cstart) == 0) cstart2 <- NULL else
          cstart2 <- cstart[[i]]
        cons <-
          suppressWarnings(find_constants(method = method, skews = skews[i],
            skurts = skurts[i], fifths = fifths[i], sixths = sixths[i],
            Six = Six2, cstart = cstart2, n = 25, seed = seed))
        if (length(cons) == 1 | is.null(cons)) {
          stop(paste("Constants can not be found for continuous variable ", i,
                     ".", sep = ""))
        }
        con_solution <- cons$constants
        SixCorr[i] <- ifelse(is.null(cons$SixCorr1), NA, cons$SixCorr1)
        Valid.PDF[i] <- cons$valid
        constants[i, ] <- con_solution
      }
    }
  }
  if (k_mix >= 1) {
    for (i in 1:length(mix_skews2)) {
      if (!is.null(msame.dist)) {
        rind <- which(msame.dist[, 2] == i)
        if (length(rind) > 0) {
          constants[(k_cont + i), ] <- constants[msame.dist[rind, 1], ]
          SixCorr[k_cont + i] <- SixCorr[msame.dist[rind, 1]]
          Valid.PDF[k_cont + i] <- Valid.PDF[msame.dist[rind, 1]]
        }
      }
      if (sum(is.na(constants[(k_cont + i), ])) > 0) {
        if (length(mix_Six2) == 0) Six2 <- NULL else
          Six2 <- mix_Six2[[i]]
        if (length(cstart) == 0) cstart2 <- NULL else
          cstart2 <- cstart[[k_cont + i]]
        cons <-
          suppressWarnings(find_constants(method = method,
            skews = mix_skews2[i], skurts = mix_skurts2[i],
            fifths = mix_fifths2[i], sixths = mix_sixths2[i], Six = Six2,
            cstart = cstart2, n = 25, seed = seed))
        if (length(cons) == 1 | is.null(cons)) {
          stop(paste("Constants can not be found for component variable ", i,
                     ".", sep = ""))
        }
        con_solution <- cons$constants
        SixCorr[k_cont + i] <- ifelse(is.null(cons$SixCorr1), NA,
                                      cons$SixCorr1)
        Valid.PDF[k_cont + i] <- cons$valid
        constants[(k_cont + i), ] <- con_solution
      }
    }
  }
  if (k_cat > 0) {
    if (length(support) == 0) {
      support <- lapply(marginal, function(x) 1:(length(x) + 1))
    } else {
      for (i in 1:k_cat) {
        if (length(support[[i]]) != (length(marginal[[i]]) + 1))
          support[[i]] <- 1:(length(marginal[[i]]) + 1)
      }
    }
  }
  if (is.null(Sigma)) {
    Sigma <- intercorr2(k_cat = k_cat, k_cont = k_cont + length(mix_skews2),
      k_pois = k_pois, k_nb = k_nb, method = method, constants = constants,
      marginal = marginal, support = support, lam = lam, p_zip = p_zip,
      size = size, mu = mu, p_zinb = p_zinb, pois_eps = pois_eps,
      nb_eps = nb_eps, rho = rho, epsilon = epsilon, maxit = maxit,
      quiet = quiet)
  }
  if (min(eigen(Sigma, symmetric = TRUE)$values) < 0) {
    if (use.nearPD == TRUE) {
      Sigma <- as.matrix(nearPD(Sigma, corr = T, keepDiag = T)$mat)
      if (quiet == FALSE)
        message("Intermediate correlation matrix is not positive definite.
Nearest positive definite matrix is used.")
    } else if (quiet == FALSE) {
      message("Intermediate correlation matrix is not positive definite.
Negative eigenvalues are replaced with 0.  Set use.nearPD = TRUE to use nearest
positive-definite matrix instead.")
    }
  }
  eig <- eigen(Sigma, symmetric = TRUE)
  sqrteigval <- diag(sqrt(pmax(eig$values, 0)), nrow = nrow(Sigma),
                     ncol = ncol(Sigma))
  eigvec <- eig$vectors
  fry <- eigvec %*% sqrteigval
  set.seed(seed)
  X <- matrix(rnorm(ncol(Sigma) * n), n)
  X <- scale(X, TRUE, FALSE)
  X <- X %*% svd(X, nu = 0)$v
  X <- scale(X, FALSE, TRUE)
  X <- fry %*% t(X)
  X <- t(X)
  if (k_cat > 0) {
    X_cat <- X[, 1:k_cat, drop = FALSE]
  }
  if ((k_cont + length(mix_skews2)) > 0) {
    X_cont <- X[, (k_cat + 1):(k_cat + k_cont + length(mix_skews2)),
                drop = FALSE]
  }
  if (k_pois > 0) {
    X_pois <- X[, (k_cat + k_cont + length(mix_skews2) + 1):(k_cat + k_cont +
                length(mix_skews2) + k_pois), drop = FALSE]
  }
  if (k_nb > 0) {
    X_nb <- X[, (ncol(X) - k_nb + 1):ncol(X), drop = FALSE]
  }
  Y_cat <- NULL
  Y_cont <- NULL
  Y_mix <- NULL
  Y_pois <- NULL
  Y_nb <- NULL
  means2 <- NULL
  vars2 <- NULL
  if (k_cat > 0) {
    Y_cat <- matrix(1, nrow = n, ncol = k_cat)
    for (i in 1:length(marginal)) {
      Y_cat[, i] <- as.integer(cut(X_cat[, i], breaks = c(min(X_cat[, i]) - 1,
                                qnorm(marginal[[i]]), max(X_cat[, i])  +  1)))
      Y_cat[, i] <- support[[i]][Y_cat[, i]]
    }
  }
  if ((k_cont + length(mix_skews2)) > 0) {
    Y <- matrix(1, nrow = n, ncol = k_cont + length(mix_skews2))
    for (i in 1:(k_cont + length(mix_skews2))) {
      if (method == "Fleishman") {
        Y[, i] <- constants[i, 1] + constants[i, 2] * X_cont[, i] +
          constants[i, 3] * X_cont[, i]^2 + constants[i, 4] * X_cont[, i]^3
      }
      if (method == "Polynomial") {
        Y[, i] <- constants[i, 1] + constants[i, 2] * X_cont[, i] +
          constants[i, 3] * X_cont[, i]^2 + constants[i, 4] * X_cont[, i]^3 +
          constants[i, 5] * X_cont[, i]^4 + constants[i, 6] * X_cont[, i]^5
      }
    }
    if (k_cont > 0) {
      Y_cont <- matrix(1, n, k_cont)
      means2 <- means[1:k_cont]
      vars2 <- vars[1:k_cont]
      for (i in 1:k_cont) {
        Y_cont[, i] <- means2[i] + sqrt(vars2[i]) * Y[, i]
      }
    }
    if (length(mix_skews2) > 0) {
      Y_mix <- matrix(1, n, length(mix_skews2))
      means2 <- c(means2, mix_mus2)
      vars2 <- c(vars2, mix_sigmas2^2)
      for (i in 1:length(mix_skews2)) {
        Y_mix[, i] <- mix_mus2[i] + mix_sigmas2[i] * Y[, (k_cont + i)]
      }
    }
  }
  if (k_pois > 0) {
    Y_pois <- matrix(1, nrow = n, ncol = k_pois)
    for (i in 1:k_pois) {
      Y_pois[, i] <- qzipois(pnorm(X_pois[, i]), lam[i], pstr0 = p_zip[i])
    }
  }
  if (k_nb > 0) {
    Y_nb <- matrix(1, nrow = n, ncol = k_nb)
    for (i in 1:k_nb) {
      Y_nb[, i] <- qzinegbin(pnorm(X_nb[, i]), size = size[i], munb = mu[i],
                             pstr0 = p_zinb[i])
    }
  }
  start.time.error <- Sys.time()
  niter <- diag(0, k, k)
  rho_calc <- cor(cbind(Y_cat, Y_cont, Y_mix, Y_pois, Y_nb))
  emax <- max(abs(rho_calc - rho))
  if (emax > epsilon & errorloop == TRUE) {
    EL <- corr_error(n = n, k_cat = k_cat,
      k_cont = k_cont + length(mix_skews2), k_pois = k_pois, k_nb = k_nb,
      method = method, means = means2, vars = vars2, constants = constants,
      marginal = marginal, support = support, lam = lam, p_zip = p_zip,
      size = size, mu = mu, p_zinb = p_zinb, seed = seed,
      epsilon = epsilon, maxit = maxit, rho0 = rho, Sigma = Sigma,
      rho_calc = rho_calc)
    Sigma <- EL$Sigma
    Y_cat <- EL$Y_cat
    Y <- EL$Y
    if (k_cont > 0) Y_cont <- EL$Y_cont[, 1:k_cont, drop = FALSE]
    if (k_mix > 0)
      Y_mix <- EL$Y_cont[, (ncol(EL$Y_cont) - length(mix_skews2) +
                              1):ncol(EL$Y_cont), drop = FALSE]
    Y_pois <- EL$Y_pois
    Y_nb <- EL$Y_nb
    niter <- EL$niter
  }
  stop.time.error <- Sys.time()
  niter <- as.data.frame(niter)
  rownames(niter) <- c(1:ncol(niter))
  colnames(niter) <- c(1:ncol(niter))
  Y_mix2 <- NULL
  if (k_mix > 0) {
    Y_mix2 <- matrix(1, n, k_mix)
    for (i in 1:k_mix) {
      set.seed(seed)
      M <- rmultinom(n, size = 1, prob = mix_pis[[i]])
      Y_mix2[, i] <- apply(t(M) * Y_mix[, (k.comp[i] + 1):k.comp[i + 1]], 1,
                           sum)
      Y_mix2[, i] <- scale(Y_mix2[, i])
      Y_mix2[, i] <- means[k_cont + i] + sqrt(vars[k_cont + i]) * Y_mix2[, i]
      seed <- seed + 1
    }
  }
  result <- list()
  if (k_cat > 0) result <- append(result, list(Y_cat = Y_cat))
  if ((k_cont + k_mix) > 0) {
    result <- append(result, list(constants = as.data.frame(constants),
                                  Y_cont = Y_cont, Y_comp = Y_mix, sixth_correction = SixCorr,
                                  valid.pdf = Valid.PDF))
    if (k_mix > 0) result <- append(result, list(Y_mix = Y_mix2))
  }
  if (k_pois > 0) result <- append(result, list(Y_pois = Y_pois))
  if (k_nb > 0) result <- append(result, list(Y_nb = Y_nb))
  stop.time <- Sys.time()
  Time.error <- round(difftime(stop.time.error, start.time.error,
                               units = "min"), 3)
  Time <- round(difftime(stop.time, start.time, units = "min"), 3)
  if (quiet == FALSE) cat("Total Simulation time:", Time, "minutes \n")
  result <- append(result, list(Sigma = Sigma, Error_Time = Time.error,
                                Time = Time, niter = niter))
  result
}

Try the SimCorrMix package in your browser

Any scripts or data that you put into this service are public.

SimCorrMix documentation built on May 2, 2019, 1:24 p.m.