Description Usage Arguments Value Reasons for Function Errors The Generate, Sort, and Correlate (GSC, Demirtas & Hedeker, 2011, doi: 10.1198/tast.2011.10090) Algorithm Ordinal Variables Continuous Variables Poisson Variables Negative Binomial Variables Continuous - Ordinal Pairs Ordinal - Poisson Pairs Ordinal - Negative Binomial Pairs Continuous - Poisson Pairs Continuous - Negative Binomial Pairs Poisson - Negative Binomial Pairs References See Also Examples
This function calculates the lower and upper correlation bounds for the given distributions and
checks if a given target correlation matrix rho is within the bounds. It should be used before simulation with
rcorrvar2
. However, even if all pairwise correlations fall within the bounds, it is still possible
that the desired correlation matrix is not feasible. This is particularly true when ordinal variables (r >= 2 categories) are
generated or negative correlations are desired. Therefore, this function should be used as a general check to eliminate pairwise correlations that are obviously
not reproducible. It will help prevent errors when executing the simulation.
Note: Some pieces of the function code have been adapted from Demirtas, Hu, & Allozi's (2017) validation_specs
.
This function (valid_corr2
) extends the methods to:
1) non-normal continuous variables generated by Fleishman's third-order or Headrick's fifth-order polynomial transformation method,
2) Negative Binomial variables (including all pairwise correlations involving them), and
3) Count variables are treated as ordinal when calculating the bounds since that is the intermediate correlation calculation method.
Please see the Comparison of Method 1 and Method 2 vignette for more information regarding method 2.
1 2 3 4 5 6 | valid_corr2(k_cat = 0, k_cont = 0, k_pois = 0, k_nb = 0,
method = c("Fleishman", "Polynomial"), means = NULL, vars = NULL,
skews = NULL, skurts = NULL, fifths = NULL, sixths = NULL,
Six = list(), marginal = list(), lam = NULL, pois_eps = NULL,
size = NULL, prob = NULL, mu = NULL, nb_eps = NULL, rho = NULL,
n = 100000, seed = 1234)
|
k_cat |
the number of ordinal (r >= 2 categories) variables (default = 0) |
k_cont |
the number of continuous variables (default = 0) |
k_pois |
the number of Poisson variables (default = 0) |
k_nb |
the number of Negative Binomial variables (default = 0) |
method |
the method used to generate the k_cont continuous variables. "Fleishman" uses a third-order polynomial transformation and "Polynomial" uses Headrick's fifth-order transformation. |
means |
a vector of means for the k_cont continuous variables (i.e. = rep(0, k_cont)) |
vars |
a vector of variances (i.e. = rep(1, k_cont)) |
skews |
a vector of skewness values (i.e. = rep(0, k_cont)) |
skurts |
a vector of standardized kurtoses (kurtosis - 3, so that normal variables have a value of 0; i.e. = rep(0, k_cont)) |
fifths |
a vector of standardized fifth cumulants (not necessary for |
sixths |
a vector of standardized sixth cumulants (not necessary for |
Six |
a list of vectors of correction values to add to the sixth cumulants if no valid pdf constants are found,
ex: |
marginal |
a list of length equal to |
lam |
a vector of lambda (> 0) constants for the Poisson variables (see |
pois_eps |
a vector of length |
size |
a vector of size parameters for the Negative Binomial variables (see |
prob |
a vector of success probability parameters |
mu |
a vector of mean parameters (*Note: either |
nb_eps |
a vector of length |
rho |
the target correlation matrix (must be ordered ordinal, continuous, Poisson, Negative Binomial; default = NULL) |
n |
the sample size (i.e. the length of each simulated variable; default = 100000) |
seed |
the seed value for random number generation (default = 1234) |
A list with components:
L_rho
the lower correlation bound
U_rho
the upper correlation bound
If continuous variables are desired, additional components are:
constants
the calculated constants
sixth_correction
a vector of the sixth cumulant correction values
valid.pdf
a vector with i-th component equal to "TRUE" if variable Y_i has a valid power method pdf, else "FALSE"
If a target correlation matrix rho is provided, each pairwise correlation is checked to see if it is within the lower and upper bounds. If the correlation is outside the bounds, the indices of the variable pair are given.
1) The most likely cause for function errors is that no solutions to fleish
or
poly
converged when using find_constants
. If this happens,
the simulation will stop. It may help to first use find_constants
for each continuous variable to
determine if a vector of sixth cumulant correction values is needed. If the standardized cumulants are obtained from calc_theory
,
the user may need to use rounded values as inputs (i.e.
skews = round(skews, 8)
). Due to the nature of the integration involved in calc_theory
, the results are
approximations. Greater accuracy can be achieved by increasing the number of subdivisions (sub
) used in the integration
process. For example, in order to ensure that skew is exactly 0 for symmetric distributions.
2) In addition, the kurtosis may be outside the region of possible values. There is an associated lower boundary for kurtosis associated
with a given skew (for Fleishman's method) or skew and fifth and sixth cumulants (for Headrick's method). Use
calc_lower_skurt
to determine the boundary for a given set of cumulants.
The GSC algorithm is a flexible method for determining empirical correlation bounds when the theoretical bounds are unknown. The steps are as follows:
1) Generate independent random samples from the desired distributions using a large number of observations (i.e. N = 100,000).
2) Lower Bound: Sort the two variables in opposite directions (i.e., one increasing and one decreasing) and find the sample correlation.
3) Upper Bound: Sort the two variables in the same direction and find the sample correlation.
Demirtas & Hedeker showed that the empirical bounds computed from the GSC method are similar to the theoretical bounds (when they are known).
The processes used to find the correlation bounds for each variable type are described below:
Binary pairs: The correlation bounds are determined as in Demirtas et al. (2012, doi: 10.1002/sim.5362), who used the method of Emrich & Piedmonte (1991, doi: 10.1080/00031305.1991.10475828). The joint distribution is determined by "borrowing" the moments of a multivariate normal distribution. For two binary variables Y_{i} and Y_{j}, with success probabilities p_{i} and p_{j}, the lower correlation bound is given by
max(-√{(p_{i}p_{j})/(q_{i}q_{j})},\ -√{(q_{i}q_{j})/(p_{i}p_{j})})
and the upper bound by
min(√{(p_{i}q_{j})/(q_{i}p_{j})},\ √{(q_{i}p_{j})/(p_{i}q_{j})})
Here, q_{i} = 1 - p_{i} and q_{j} = 1 - p_{j}.
Binary-Ordinal or Ordinal-Ordinal pairs: Randomly generated variables with the given marginal distributions are used in the GSC algorithm to find the correlation bounds.
Continuous variables are randomly generated using constants from find_constants
and a vector of sixth
cumulant correction values (if provided.) The GSC algorithm is used to find the lower and upper bounds.
The maximum support values, given the vector of cumulative probability truncation values (pois_eps) and vector of means (lam), are calculated using
max_count_support
. The finite supports are used to determine marginal distributions for each Poisson variable.
Randomly generated variables with the given marginal distributions are used in the GSC algorithm to find the correlation bounds.
The maximum support values, given the vector of cumulative probability truncation values (nb_eps) and vectors of
sizes and success probabilities (prob) or means (mu), are calculated using max_count_support
.
The finite supports are used to determine marginal distributions for each Negative Binomial variable.
Randomly generated variables with the given marginal distributions are used in the GSC algorithm to find the correlation bounds.
Randomly generated ordinal variables with the given marginal distributions and the previously generated continuous variables are used in the GSC algorithm to find the correlation bounds.
Randomly generated ordinal and Poisson variables with the given marginal distributions are used in the GSC algorithm to find the correlation bounds.
Randomly generated ordinal and Negative Binomial variables with the given marginal distributions are used in the GSC algorithm to find the correlation bounds.
The previously generated continuous variables and randomly generated Poisson variables with the given marginal distributions are used in the GSC algorithm to find the correlation bounds.
The previously generated continuous variables and randomly generated Negative Binomial variables with the given marginal distributions are used in the GSC algorithm to find the correlation bounds.
Randomly generated variables with the given marginal distributions are used in the GSC algorithm to find the correlation bounds.
Please see rcorrvar2
for additional references.
Demirtas H & Hedeker D (2011). A practical way for computing approximate lower and upper correlation bounds. American Statistician, 65(2): 104-109. doi: 10.1198/tast.2011.10090.
Demirtas H, Hedeker D, & Mermelstein RJ (2012). Simulation of massive public health data by power polynomials. Statistics in Medicine, 31(27): 3337-3346. doi: 10.1002/sim.5362.
Emrich LJ & Piedmonte MR (1991). A Method for Generating High-Dimensional Multivariate Binary Variables. The American Statistician, 45(4): 302-4. doi: 10.1080/00031305.1991.10475828.
Frechet M. Sur les tableaux de correlation dont les marges sont donnees. Ann. l'Univ. Lyon SectA. 1951;14:53-77.
Hoeffding W. Scale-invariant correlation theory. In: Fisher NI, Sen PK, editors. The collected works of Wassily Hoeffding. New York: Springer-Verlag; 1994. p. 57-107.
Hakan Demirtas, Yiran Hu and Rawan Allozi (2017). PoisBinOrdNor: Data Generation with Poisson, Binary, Ordinal and Normal Components. R package version 1.4. https://CRAN.R-project.org/package=PoisBinOrdNor
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 | valid_corr2(n = 1000, k_cat = 1, k_cont = 1, method = "Polynomial",
means = 0, vars = 1, skews = 0, skurts = 0, fifths = 0, sixths = 0,
marginal = list(c(1/3, 2/3)), rho = matrix(c(1, 0.4, 0.4, 1), 2, 2))
## Not run:
# Binary, Ordinal, Continuous, Poisson, and Negative Binomial Variables
options(scipen = 999)
seed <- 1234
n <- 10000
# Continuous Distributions: Normal, t (df = 10), Chisq (df = 4),
# Beta (a = 4, b = 2), Gamma (a = 4, b = 4)
Dist <- c("Gaussian", "t", "Chisq", "Beta", "Gamma")
# calculate standardized cumulants
# those for the normal and t distributions are rounded to ensure the
# correct values (i.e. skew = 0)
M1 <- round(calc_theory(Dist = "Gaussian", params = c(0, 1)), 8)
M2 <- round(calc_theory(Dist = "t", params = 10), 8)
M3 <- calc_theory(Dist = "Chisq", params = 4)
M4 <- calc_theory(Dist = "Beta", params = c(4, 2))
M5 <- calc_theory(Dist = "Gamma", params = c(4, 4))
M <- cbind(M1, M2, M3, M4, M5)
M <- round(M[-c(1:2),], digits = 6)
colnames(M) <- Dist
rownames(M) <- c("skew", "skurtosis", "fifth", "sixth")
means <- rep(0, length(Dist))
vars <- rep(1, length(Dist))
# Binary and Ordinal Distributions
marginal <- list(0.3, 0.4, c(0.1, 0.5), c(0.3, 0.6, 0.9),
c(0.2, 0.4, 0.7, 0.8))
support <- list()
# Poisson Distributions
lam <- c(1, 5, 10)
# Negative Binomial Distributions
size <- c(3, 6)
prob <- c(0.2, 0.8)
ncat <- length(marginal)
ncont <- ncol(M)
npois <- length(lam)
nnb <- length(size)
# Create correlation matrix from a uniform distribution (-0.8, 0.8)
set.seed(seed)
Rey <- diag(1, nrow = (ncat + ncont + npois + nnb))
for (i in 1:nrow(Rey)) {
for (j in 1:ncol(Rey)) {
if (i > j) Rey[i, j] <- runif(1, -0.8, 0.8)
Rey[j, i] <- Rey[i, j]
}
}
# Test for positive-definiteness
library(Matrix)
if(min(eigen(Rey, symmetric = TRUE)$values) < 0) {
Rey <- as.matrix(nearPD(Rey, corr = T, keepDiag = T)$mat)
}
# Make sure Rey is within upper and lower correlation limits
valid <- valid_corr2(k_cat = ncat, k_cont = ncont, k_pois = npois,
k_nb = nnb, method = "Polynomial", means = means,
vars = vars, skews = M[1, ], skurts = M[2, ],
fifths = M[3, ], sixths = M[4, ], marginal = marginal,
lam = lam, pois_eps = rep(0.0001, npois),
size = size, prob = prob, nb_eps = rep(0.0001, nnb),
rho = Rey, seed = seed)
## End(Not run)
|
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.