#' @title Generation of One Non-Normal Continuous Variable Using the Power Method
#'
#' @description This function simulates one non-normal continuous variable using either Fleishman's Third-Order (\code{method} = "Fleishman",
#' \doi{10.1007/BF02293811}) or Headrick's Fifth-Order (\code{method} = "Polynomial", \doi{10.1016/S0167-9473(02)00072-5}) Polynomial
#' Transformation. If only one variable is desired and that variable is continuous, this function should be used. The power method
#' transformation is a computationally efficient algorithm that simulates continuous distributions through the method of moments.
#' It works by matching standardized cumulants -- the first four (mean, variance, skew, and standardized kurtosis) for Fleishman's
#' method, or the first six (mean, variance, skew, standardized kurtosis, and standardized fifth and sixth cumulants) for Headrick's
#' method. The transformation is expressed as follows:
#'
#' \eqn{Y = c_{0} + c_{1} * Z + c_{2} * Z^2 + c_{3} * Z^3 + c_{4} * Z^4 + c_{5} * Z^5},
#'
#' where \eqn{Z ~ N(0,1)}, and \eqn{c_{4}} and \eqn{c_{5}} both equal \eqn{0} for Fleishman's method. The real constants are calculated by
#' \code{\link[SimMultiCorrData]{find_constants}}. All variables are simulated with mean \eqn{0} and variance \eqn{1}, and then
#' transformed to the specified mean and variance at the end.
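#'
#' For illustration, a minimal sketch of applying the transformation, assuming placeholder constants \code{cs}
#' (in practice the constants come from \code{\link[SimMultiCorrData]{find_constants}}, so that \eqn{Y} has mean
#' \eqn{0} and variance \eqn{1} before rescaling):
#'
#' \preformatted{
#' set.seed(1234)
#' Z <- rnorm(10000)                 # standard normal variate
#' cs <- c(0, 0.9, 0, 0.03, 0, 0)    # placeholder constants c0, ..., c5 (illustrative only)
#' Y <- cs[1] + cs[2] * Z + cs[3] * Z^2 + cs[4] * Z^3 + cs[5] * Z^4 + cs[6] * Z^5
#' Yb <- 10 + sqrt(4) * Y            # shift and scale toward a target mean of 10 and variance of 4
#' }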
#'
#' The required parameters for simulating continuous variables include: mean, variance, skewness, standardized kurtosis (kurtosis - 3), and
#' standardized fifth and sixth cumulants (for \code{method} = "Polynomial"). If the goal is to simulate a theoretical distribution
#' (i.e. Gamma, Beta, Logistic, etc.), these values can be obtained using \code{\link[SimMultiCorrData]{calc_theory}}. If the goal is to
#' mimic an empirical data set, these values can be found using \code{\link[SimMultiCorrData]{calc_moments}} (using the method of moments) or
#' \code{\link[SimMultiCorrData]{calc_fisherk}} (using Fisher's k-statistics). If the standardized cumulants
#' are obtained from \code{calc_theory}, the user may need to use rounded values as inputs (i.e. \code{skews = round(skews, 8)}). Due to the nature
#' of the integration involved in \code{calc_theory}, the results are approximations. Greater accuracy can be achieved by increasing the number of
#' subdivisions (\code{sub}) used in the integration process, for example, to ensure that skew is exactly 0 for symmetric distributions.
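#'
#' For example, a short sketch of obtaining and rounding the cumulants for the exponential distribution used in the
#' examples below (the \code{sub} value shown is only illustrative):
#'
#' \preformatted{
#' stcums <- calc_theory(Dist = "Exponential", params = 0.5, sub = 1000)  # rate = 1/mean
#' skews  <- round(stcums[3], 8)
#' skurts <- round(stcums[4], 8)
#' fifths <- round(stcums[5], 8)
#' sixths <- round(stcums[6], 8)
#' }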
#'
#' For some sets of cumulants, it is either not possible to find power method constants or the
#' calculated constants do not generate valid power method pdfs. In these situations, adding a value to the sixth cumulant may
#' provide solutions (see \code{\link[SimMultiCorrData]{find_constants}}). If simulation results indicate that a continuous variable
#' does not generate a valid pdf, the user can try \code{\link[SimMultiCorrData]{find_constants}} with various sixth cumulant correction
#' vectors to determine if a valid pdf can be found.
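#'
#' A minimal sketch of such a check (the cumulant values below are the standardized cumulants of an exponential
#' distribution and are illustrative only):
#'
#' \preformatted{
#' check <- find_constants(method = "Polynomial", skews = 2, skurts = 6,
#'                         fifths = 24, sixths = 120,
#'                         Six = seq(0.01, 2, by = 0.01), n = 25, seed = 1234)
#' check$valid      # "TRUE" if the constants generate a valid power method pdf
#' check$SixCorr1   # the sixth cumulant correction that was applied (if any)
#' }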
#'
#' Headrick & Kowalchuk (2007, \doi{10.1080/10629360600605065}) outlined a general
#' method for comparing a simulated distribution \eqn{Y} to a given theoretical distribution \eqn{Y^*}. These steps can be found in
#' the example and the \bold{Comparison of Simulated Distribution to Theoretical Distribution or Empirical Data} vignette.
#'
#' @section Choice of Fleishman's third-order or Headrick's fifth-order method:
#' Using the fifth-order approximation allows additional control over the fifth and sixth moments of the generated distribution, improving
#' accuracy. In addition, the range of feasible standardized kurtosis values, given skew and standardized fifth (\eqn{\gamma_{3}}) and sixth
#' (\eqn{\gamma_{4}}) cumulants, is larger than with Fleishman's method (see \code{\link[SimMultiCorrData]{calc_lower_skurt}}).
#' For example, the Fleishman method cannot be used to generate a
#' non-normal distribution whose squared skew to standardized kurtosis ratio exceeds \eqn{9/14} (see Headrick & Kowalchuk, 2007). This
#' eliminates the Chi-squared family of distributions, for which this ratio is constant at \eqn{2/3}. However, if the fifth and
#' sixth cumulants do not exist, the Fleishman approximation should be used.
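#'
#' A quick base-R check of this condition for a chi-squared distribution with 4 degrees of freedom, using the
#' standard formulas \eqn{skew = \sqrt{8/df}} and \eqn{skurtosis = 12/df}:
#'
#' \preformatted{
#' df <- 4
#' skew  <- sqrt(8/df)    # standardized third cumulant
#' skurt <- 12/df         # standardized kurtosis (kurtosis - 3)
#' skew^2/skurt > 9/14    # TRUE: Fleishman's method cannot produce this distribution
#' }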
#'
#' @section Overview of Simulation Process:
#' 1) The constants are calculated for the continuous variable using \code{\link[SimMultiCorrData]{find_constants}}. If no
#' solutions are found that generate a valid power method pdf, the function will return constants that produce an invalid pdf
#' (or a stop error if no solutions can be found). Possible solutions include: 1) changing the seed, or 2) using a \code{Six} vector
#' of sixth cumulant correction values (if \code{method} = "Polynomial"). Errors regarding constant
#' calculation are the most probable cause of function failure.
#'
#' 2) An intermediate standard normal variate X of length n is generated.
#'
#' 3) Summary statistics are calculated.
#'
#' @section Reasons for Function Errors:
#' 1) The most likely cause for function errors is that no solutions to \code{\link[SimMultiCorrData]{fleish}} or
#' \code{\link[SimMultiCorrData]{poly}} converged when using \code{\link[SimMultiCorrData]{find_constants}}. If this happens,
#' the simulation will stop. It may help to first use \code{\link[SimMultiCorrData]{find_constants}} for each continuous variable to
#' determine if a vector of sixth cumulant correction values is needed. The solutions can be used as starting values (see \code{cstart} below).
#' If the standardized cumulants are obtained from \code{calc_theory}, the user may need to use rounded values as inputs (i.e.
#' \code{skews = round(skews, 8)}).
#'
#' 2) In addition, the kurtosis may be outside the region of possible values. There is a lower boundary of standardized kurtosis associated
#' with a given skew (for Fleishman's method) or skew and fifth and sixth cumulants (for Headrick's method). Use
#' \code{\link[SimMultiCorrData]{calc_lower_skurt}} to determine the boundary for a given set of cumulants.
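#'
#' A hedged sketch of such a check (the argument names are assumed to mirror this function's cumulant inputs;
#' consult \code{\link[SimMultiCorrData]{calc_lower_skurt}} for the exact interface and return structure):
#'
#' \preformatted{
#' # assumed interface: cumulant inputs named as in nonnormvar1 (illustrative values)
#' lower <- calc_lower_skurt(method = "Polynomial", skews = 2, fifths = 24, sixths = 120)
#' # 'lower' should indicate the minimum valid standardized kurtosis for these cumulants
#' }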
#'
#' @param method the method used to generate the continuous variable. "Fleishman" uses Fleishman's third-order polynomial transformation
#' and "Polynomial" uses Headrick's fifth-order transformation.
#' @param means mean for the continuous variable (default = 0)
#' @param vars variance (default = 1)
#' @param skews skewness value (default = 0)
#' @param skurts standardized kurtosis (kurtosis - 3, so that normal variables have a value of 0; default = 0)
#' @param fifths standardized fifth cumulant (not necessary for \code{method} = "Fleishman"; default = 0)
#' @param sixths standardized sixth cumulant (not necessary for \code{method} = "Fleishman"; default = 0)
#' @param Six a vector of correction values to add to the sixth cumulant if no valid pdf constants are found,
#' ex: \code{Six = seq(0.01, 2, by = 0.01)}; if no correction is desired, set \code{Six = NULL} (default)
#' @param cstart initial values for root-solving algorithm (see \code{\link[BB]{multiStart}} for \code{method} = "Fleishman"
#' or \code{\link[nleqslv]{nleqslv}} for \code{method} = "Polynomial"). If user specified,
#' must be input as a matrix. If NULL and all 4 standardized cumulants (rounded to 3 digits) are within
#' 0.01 of those in Headrick's common distribution table (see \code{\link[SimMultiCorrData]{Headrick.dist}}
#' data), uses his constants as starting values; else, generates n sets of random starting values from
#' uniform distributions.
#' @param n the sample size (i.e. the length of the simulated variable; default = 10000)
#' @param seed the seed value for random number generation (default = 1234)
#' @importFrom psych describe
#' @import stats
#' @import utils
#' @import BB
#' @import nleqslv
#' @export
#' @keywords simulation continuous univariate Fleishman Headrick
#' @seealso \code{\link[SimMultiCorrData]{find_constants}}
#' @return A list with the following components:
#' @return \code{constants} a data.frame of the constants
#' @return \code{continuous_variable} a data.frame of the generated continuous variable
#' @return \code{summary_continuous} a data.frame containing a summary of the variable
#' @return \code{summary_targetcont} a data.frame containing a summary of the target variable
#' @return \code{sixth_correction} the sixth cumulant correction value
#' @return \code{valid.pdf} "TRUE" if constants generate a valid pdf, else "FALSE"
#' @return \code{Constants_Time} the time in minutes required to calculate the constants
#' @return \code{Simulation_Time} the total simulation time in minutes
#' @references
#' Fleishman AI (1978). A Method for Simulating Non-normal Distributions. Psychometrika, 43, 521-532. \doi{10.1007/BF02293811}.
#'
#' Headrick TC (2002). Fast Fifth-order Polynomial Transforms for Generating Univariate and Multivariate
#' Non-normal Distributions. Computational Statistics & Data Analysis, 40(4), 685-711. \doi{10.1016/S0167-9473(02)00072-5}.
#' (\href{http://www.sciencedirect.com/science/article/pii/S0167947302000725}{ScienceDirect})
#'
#' Headrick TC (2004). On Polynomial Transformations for Simulating Multivariate Nonnormal Distributions.
#' Journal of Modern Applied Statistical Methods, 3(1), 65-71. \doi{10.22237/jmasm/1083370080}.
#'
#' Headrick TC, Kowalchuk RK (2007). The Power Method Transformation: Its Probability Density Function, Distribution
#' Function, and Its Further Use for Fitting Data. Journal of Statistical Computation and Simulation, 77, 229-249. \doi{10.1080/10629360600605065}.
#'
#' Headrick TC, Sawilowsky SS (1999). Simulating Correlated Non-normal Distributions: Extending the Fleishman Power
#' Method. Psychometrika, 64, 25-35. \doi{10.1007/BF02294317}.
#'
#' Headrick TC, Sheng Y, Hodis FA (2007). Numerical Computing and Graphics for the Power Method Transformation Using
#' Mathematica. Journal of Statistical Software, 19(3), 1-17. \doi{10.18637/jss.v019.i03}.
#'
#' @examples
#' # Normal distribution with Headrick's fifth-order PMT:
#' N <- nonnormvar1("Polynomial", 0, 1, 0, 0, 0, 0)
#'
#' \dontrun{
#' # Use Headrick & Kowalchuk's (2007) steps to compare a simulated exponential
#' # (mean = 2) variable to the theoretical exponential(mean = 2) density:
#'
#' # 1) Obtain the standardized cumulants:
#' stcums <- calc_theory(Dist = "Exponential", params = 0.5) # rate = 1/mean
#'
#' # 2) Simulate the variable:
#' H_exp <- nonnormvar1("Polynomial", means = 2, vars = 2, skews = stcums[3],
#' skurts = stcums[4], fifths = stcums[5],
#' sixths = stcums[6], n = 10000, seed = 1234)
#'
#' H_exp$constants
#' # c0 c1 c2 c3 c4 c5
#' # 1 -0.3077396 0.8005605 0.318764 0.03350012 -0.00367481 0.0001587076
#'
#' # 3) Determine whether the constants produce a valid power method pdf:
#'
#' H_exp$valid.pdf
#' # [1] "TRUE"
#'
#' # 4) Select a critical value:
#'
#' # Let alpha = 0.05.
#' y_star <- qexp(1 - 0.05, rate = 0.5) # note that rate = 1/mean
#' y_star
#' # [1] 5.991465
#'
#' # 5) Solve m_{2}^{1/2} * p(z') + m_{1} - y* = 0 for z', where m_{1} and
#' # m_{2} are the 1st and 2nd moments of Y*:
#'
#' # The exponential(2) distribution has a mean and standard deviation equal
#' # to 2.
#' # Solve 2 * p(z') + 2 - y_star = 0 for z'
#' # p(z') = c0 + c1 * z' + c2 * z'^2 + c3 * z'^3 + c4 * z'^4 + c5 * z'^5
#'
#' f_exp <- function(z, c, y) {
#' return(2 * (c[1] + c[2] * z + c[3] * z^2 + c[4] * z^3 + c[5] * z^4 +
#' c[6] * z^5) + 2 - y)
#' }
#'
#' z_prime <- uniroot(f_exp, interval = c(-1e06, 1e06),
#' c = as.numeric(H_exp$constants), y = y_star)$root
#' z_prime
#' # [1] 1.644926
#'
#' # 6) Calculate 1 - Phi(z'), the corresponding probability for the
#' # approximation Y to Y* (i.e. 1 - Phi(z') = 0.05), and compare to target
#' # value alpha:
#'
#' 1 - pnorm(z_prime)
#' # [1] 0.04999249
#'
#' # 7) Plot a parametric graph of Y* and Y:
#'
#' plot_sim_pdf_theory(sim_y = as.numeric(H_exp$continuous_variable[, 1]),
#' Dist = "Exponential", params = 0.5)
#'
#' # Note we can also plot the empirical cdf and show the cumulative
#' # probability up to y_star:
#'
#' plot_sim_cdf(sim_y = as.numeric(H_exp$continuous_variable[, 1]),
#' calc_cprob = TRUE, delta = y_star)
#'
#' }
nonnormvar1 <- function(method = c("Fleishman", "Polynomial"), means = 0,
vars = 1, skews = 0, skurts = 0,
fifths = 0, sixths = 0, Six = NULL,
cstart = NULL, n = 10000, seed = 1234) {
start.time <- Sys.time()
start.time.constants <- Sys.time()
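# Step 1: calculate the power method constants for the target cumulants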
cons <- suppressWarnings(find_constants(method = method, skews = skews,
skurts = skurts, fifths = fifths,
sixths = sixths, Six = Six,
cstart = cstart, n = 25,
seed = seed))
if (length(cons) == 1 | is.null(cons))
stop(paste("\nConstants can not be found for continuous variable 1.",
sep = ""))
con_solution <- cons$constants
SixCorr <- cons$SixCorr1
constants <- matrix(con_solution, nrow = 1)
cat("Constants: Distribution ", 1, " \n")
stop.time.constants <- Sys.time()
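# Step 2: generate the intermediate standard normal variate and apply the transformation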
set.seed(seed)
X_cont <- matrix(scale(rnorm(n, 0, 1)), nrow = n, ncol = 1)
Y <- matrix(1, nrow = n, ncol = 1)
Yb <- matrix(1, nrow = n, ncol = 1)
if (method == "Fleishman") {
Y[, 1] <- constants[1, 1] + constants[1, 2] * X_cont[, 1] +
constants[1, 3] * X_cont[, 1]^2 + constants[1, 4] * X_cont[, 1]^3
colnames(constants) <- c("c0", "c1", "c2", "c3")
}
if (method == "Polynomial") {
Y[, 1] <- constants[1, 1] + constants[1, 2] * X_cont[, 1] +
constants[1, 3] * X_cont[, 1]^2 + constants[1, 4] * X_cont[, 1]^3 +
constants[1, 5] * X_cont[, 1]^4 + constants[1, 6] * X_cont[, 1]^5
colnames(constants) <- c("c0", "c1", "c2", "c3", "c4", "c5")
}
Yb[, 1] <- means + sqrt(vars) * Y[, 1]
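# Step 3: calculate summary statistics for the simulated and target distributions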
cont_sum <- describe(Yb, type = 1)
sim_fifths <- matrix(calc_moments(Yb[, 1])[5], nrow = 1, ncol = 1)
sim_sixths <- matrix(calc_moments(Yb[, 1])[6], nrow = 1, ncol = 1)
cont_sum <- as.data.frame(cbind(1, cont_sum[, -c(1, 6, 7, 10, 13)],
sim_fifths, sim_sixths))
colnames(cont_sum) <- c("Distribution", "n", "mean", "sd", "median", "min",
"max", "skew", "skurtosis", "fifth", "sixth")
if (method == "Fleishman") {
target_sum <- as.data.frame(cbind(1, means, sqrt(vars), skews, skurts))
colnames(target_sum) <- c("Distribution", "mean", "sd", "skew", "skurtosis")
} else {
target_sum <- as.data.frame(cbind(1, means, sqrt(vars), skews, skurts,
fifths, sixths))
colnames(target_sum) <- c("Distribution", "mean", "sd", "skew", "skurtosis",
"fifth", "sixth")
}
stop.time <- Sys.time()
Time.constants <- round(difftime(stop.time.constants, start.time.constants,
units = "min"), 3)
cat("\nConstants calculation time:", Time.constants, "minutes \n")
Time <- round(difftime(stop.time, start.time, units = "min"), 3)
cat("Total Simulation time:", Time, "minutes \n")
result <- list(constants = as.data.frame(constants),
continuous_variable = as.data.frame(Yb),
summary_continuous = cont_sum,
summary_targetcont = target_sum,
sixth_correction = SixCorr,
valid.pdf = cons$valid, Constants_Time = Time.constants,
Simulation_Time = Time)
result
}