nonnormvar1: Generation of One Non-Normal Continuous Variable Using the...

Description Usage Arguments Value Choice of Fleishman's third-order or Headrick's fifth-order method Overview of Simulation Process Reasons for Function Errors References See Also Examples

View source: R/nonnormvar1.R

Description

This function simulates one non-normal continuous variable using either Fleishman's Third-Order (method = "Fleishman", doi: 10.1007/BF02293811) or Headrick's Fifth-Order (method = "Polynomial", doi: 10.1016/S0167-9473(02)00072-5) Polynomial Transformation. If only one variable is desired and that variable is continuous, this function should be used. The power method transformation is a computationally efficient algorithm that simulates continuous distributions through the method of moments. It works by matching standardized cumulants – the first four (mean, variance, skew, and standardized kurtosis) for Fleishman's method, or the first six (mean, variance, skew, standardized kurtosis, and standardized fifth and sixth cumulants) for Headrick's method. The transformation is expressed as follows:

Y = c_{0} + c_{1} * Z + c_{2} * Z^2 + c_{3} * Z^3 + c_{4} * Z^4 + c_{5} * Z^5,

where Z ~ N(0,1), and c_{4} and c_{5} both equal 0 for Fleishman's method. The real constants are calculated by find_constants. All variables are simulated with mean 0 and variance 1, and then transformed to the specified mean and variance at the end.

The required parameters for simulating continuous variables include: mean, variance, skewness, standardized kurtosis (kurtosis - 3), and standardized fifth and sixth cumulants (for method = "Polynomial"). If the goal is to simulate a theoretical distribution (i.e. Gamma, Beta, Logistic, etc.), these values can be obtained using calc_theory. If the goal is to mimic an empirical data set, these values can be found using calc_moments (using the method of moments) or calc_fisherk (using Fisher's k-statistics). If the standardized cumulants are obtained from calc_theory, the user may need to use rounded values as inputs (i.e. skews = round(skews, 8)). Due to the nature of the integration involved in calc_theory, the results are approximations. Greater accuracy can be achieved by increasing the number of subdivisions (sub) used in the integration process. For example, in order to ensure that skew is exactly 0 for symmetric distributions.

For some sets of cumulants, it is either not possible to find power method constants or the calculated constants do not generate valid power method pdfs. In these situations, adding a value to the sixth cumulant may provide solutions (see find_constants). If simulation results indicate that a continuous variable does not generate a valid pdf, the user can try find_constants with various sixth cumulant correction vectors to determine if a valid pdf can be found.

Headrick & Kowalchuk (2007, doi: 10.1080/10629360600605065) outlined a general method for comparing a simulated distribution Y to a given theoretical distribution Y^*. These steps can be found in the example and the Comparison of Simulated Distribution to Theoretical Distribution or Empirical Data vignette.

Usage

1
2
3
nonnormvar1(method = c("Fleishman", "Polynomial"), means = 0, vars = 1,
  skews = 0, skurts = 0, fifths = 0, sixths = 0, Six = NULL,
  cstart = NULL, n = 10000, seed = 1234)

Arguments

method

the method used to generate the continuous variable. "Fleishman" uses Fleishman's third-order polynomial transformation and "Polynomial" uses Headrick's fifth-order transformation.

means

mean for the continuous variable (default = 0)

vars

variance (default = 1)

skews

skewness value (default = 0)

skurts

standardized kurtosis (kurtosis - 3, so that normal variables have a value of 0; default = 0)

fifths

standardized fifth cumulant (not necessary for method = "Fleishman"; default = 0)

sixths

standardized sixth cumulant (not necessary for method = "Fleishman"; default = 0)

Six

a vector of correction values to add to the sixth cumulant if no valid pdf constants are found, ex: Six = seq(0.01, 2, by = 0.01); if no correction is desired, set Six = NULL (default)

cstart

initial values for root-solving algorithm (see multiStart for method = "Fleishman" or nleqslv for method = "Polynomial"). If user specified, must be input as a matrix. If NULL and all 4 standardized cumulants (rounded to 3 digits) are within 0.01 of those in Headrick's common distribution table (see Headrick.dist data), uses his constants as starting values; else, generates n sets of random starting values from uniform distributions.

n

the sample size (i.e. the length of the simulated variable; default = 10000)

seed

the seed value for random number generation (default = 1234)

Value

A list with the following components:

constants a data.frame of the constants

continuous_variable a data.frame of the generated continuous variable

summary_continuous a data.frame containing a summary of the variable

summary_targetcont a data.frame containing a summary of the target variable

sixth_correction the sixth cumulant correction value

valid.pdf "TRUE" if constants generate a valid pdf, else "FALSE"

Constants_Time the time in minutes required to calculate the constants

Simulation_Time the total simulation time in minutes

Choice of Fleishman's third-order or Headrick's fifth-order method

Using the fifth-order approximation allows additional control over the fifth and sixth moments of the generated distribution, improving accuracy. In addition, the range of feasible standardized kurtosis values, given skew and standardized fifth (γ_{3}) and sixth (γ_{4}) cumulants, is larger than with Fleishman's method (see calc_lower_skurt). For example, the Fleishman method can not be used to generate a non-normal distribution with a ratio of γ_{3}^2/γ_{4} > 9/14 (see Headrick & Kowalchuk, 2007). This eliminates the Chi-squared family of distributions, which has a constant ratio of γ_{3}^2/γ_{4} = 2/3. However, if the fifth and sixth cumulants do not exist, the Fleishman approximation should be used.

Overview of Simulation Process

1) The constants are calculated for the continuous variable using find_constants. If no solutions are found that generate a valid power method pdf, the function will return constants that produce an invalid pdf (or a stop error if no solutions can be found). Possible solutions include: 1) changing the seed, or 2) using a Six vector of sixth cumulant correction values (if method = "Polynomial"). Errors regarding constant calculation are the most probable cause of function failure.

2) An intermediate standard normal variate X of length n is generated.

3) Summary statistics are calculated.

Reasons for Function Errors

1) The most likely cause for function errors is that no solutions to fleish or poly converged when using find_constants. If this happens, the simulation will stop. It may help to first use find_constants for each continuous variable to determine if a vector of sixth cumulant correction values is needed. The solutions can be used as starting values (see cstart below). If the standardized cumulants are obtained from calc_theory, the user may need to use rounded values as inputs (i.e. skews = round(skews, 8)).

2) In addition, the kurtosis may be outside the region of possible values. There is an associated lower boundary for kurtosis associated with a given skew (for Fleishman's method) or skew and fifth and sixth cumulants (for Headrick's method). Use calc_lower_skurt to determine the boundary for a given set of cumulants.

References

Fleishman AI (1978). A Method for Simulating Non-normal Distributions. Psychometrika, 43, 521-532. doi: 10.1007/BF02293811.

Headrick TC (2002). Fast Fifth-order Polynomial Transforms for Generating Univariate and Multivariate Non-normal Distributions. Computational Statistics & Data Analysis, 40(4):685-711. doi: 10.1016/S0167-9473(02)00072-5. (ScienceDirect)

Headrick TC (2004). On Polynomial Transformations for Simulating Multivariate Nonnormal Distributions. Journal of Modern Applied Statistical Methods, 3(1), 65-71. doi: 10.22237/jmasm/1083370080.

Headrick TC, Kowalchuk RK (2007). The Power Method Transformation: Its Probability Density Function, Distribution Function, and Its Further Use for Fitting Data. Journal of Statistical Computation and Simulation, 77, 229-249. doi: 10.1080/10629360600605065.

Headrick TC, Sawilowsky SS (1999). Simulating Correlated Non-normal Distributions: Extending the Fleishman Power Method. Psychometrika, 64, 25-35. doi: 10.1007/BF02294317.

Headrick TC, Sheng Y, & Hodis FA (2007). Numerical Computing and Graphics for the Power Method Transformation Using Mathematica. Journal of Statistical Software, 19(3), 1 - 17. doi: 10.18637/jss.v019.i03.

See Also

find_constants

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
# Normal distribution with Headrick's fifth-order PMT:
N <- nonnormvar1("Polynomial", 0, 1, 0, 0, 0, 0)

## Not run: 
# Use Headrick & Kowalchuk's (2007) steps to compare a simulated exponential
# (mean = 2) variable to the theoretical exponential(mean = 2) density:

# 1) Obtain the standardized cumulants:
stcums <- calc_theory(Dist = "Exponential", params = 0.5) # rate = 1/mean

# 2) Simulate the variable:
H_exp <- nonnormvar1("Polynomial", means = 2, vars = 2, skews = stcums[3],
                    skurts = stcums[4], fifths = stcums[5],
                    sixths = stcums[6], n = 10000, seed = 1234)

H_exp$constants
#           c0        c1       c2         c3          c4           c5
# 1 -0.3077396 0.8005605 0.318764 0.03350012 -0.00367481 0.0001587076

# 3) Determine whether the constants produce a valid power method pdf:

H_exp$valid.pdf
# [1] "TRUE"

# 4) Select a critical value:

# Let alpha = 0.05.
y_star <- qexp(1 - 0.05, rate = 0.5) # note that rate = 1/mean
y_star
# [1] 5.991465

# 5) Solve m_{2}^{1/2} * p(z') + m_{1} - y* = 0 for z', where m_{1} and
# m_{2} are the 1st and 2nd moments of Y*:

# The exponential(2) distribution has a mean and standard deviation equal
# to 2.
# Solve 2 * p(z') + 2 - y_star = 0 for z'
# p(z') = c0 + c1 * z' + c2 * z'^2 + c3 * z'^3 + c4 * z'^4 + c5 * z'^5

f_exp <- function(z, c, y) {
  return(2 * (c[1] + c[2] * z + c[3] * z^2 + c[4] * z^3 + c[5] * z^4 +
              c[6] * z^5) + 2 - y)
}

z_prime <- uniroot(f_exp, interval = c(-1e06, 1e06),
                   c = as.numeric(H_exp$constants), y = y_star)$root
z_prime
# [1] 1.644926

# 6) Calculate 1 - Phi(z'), the corresponding probability for the
# approximation Y to Y* (i.e. 1 - Phi(z') = 0.05), and compare to target
# value alpha:

1 - pnorm(z_prime)
# [1] 0.04999249

# 7) Plot a parametric graph of Y* and Y:

plot_sim_pdf_theory(sim_y = as.numeric(H_exp$continuous_variable[, 1]),
                    Dist = "Exponential", params = 0.5)

# Note we can also plot the empirical cdf and show the cumulative
# probability up to y_star:

plot_sim_cdf(sim_y = as.numeric(H_exp$continuous_variable[, 1]),
             calc_cprob = TRUE, delta = y_star)


## End(Not run)

SimMultiCorrData documentation built on May 2, 2019, 9:50 a.m.