valid_corr: Determine Correlation Bounds for Ordinal, Continuous, Poisson, and Negative Binomial Variables


Description

This function calculates the lower and upper correlation bounds for the given distributions and checks if a given target correlation matrix rho is within the bounds. It should be used before simulation with rcorrvar. However, even if all pairwise correlations fall within the bounds, it is still possible that the desired correlation matrix is not feasible. This is particularly true when ordinal variables (r >= 2 categories) are generated or negative correlations are desired. Therefore, this function should be used as a general check to eliminate pairwise correlations that are obviously not reproducible. It will help prevent errors when executing the simulation.

Note: Some pieces of the function code have been adapted from Demirtas, Hu, & Allozi's (2017) validation_specs. This function (valid_corr) extends the methods to:

1) non-normal continuous variables generated by Fleishman's third-order or Headrick's fifth-order polynomial transformation method, and

2) Negative Binomial variables (including all pairwise correlations involving them).

Please see the Comparison of Method 1 and Method 2 vignette for more information regarding method 1.

Usage

valid_corr(k_cat = 0, k_cont = 0, k_pois = 0, k_nb = 0,
  method = c("Fleishman", "Polynomial"), means = NULL, vars = NULL,
  skews = NULL, skurts = NULL, fifths = NULL, sixths = NULL,
  Six = list(), marginal = list(), lam = NULL, size = NULL,
  prob = NULL, mu = NULL, rho = NULL, n = 100000, seed = 1234)

Arguments

k_cat

the number of ordinal (r >= 2 categories) variables (default = 0)

k_cont

the number of continuous variables (default = 0)

k_pois

the number of Poisson variables (default = 0)

k_nb

the number of Negative Binomial variables (default = 0)

method

the method used to generate the k_cont continuous variables. "Fleishman" uses a third-order polynomial transformation and "Polynomial" uses Headrick's fifth-order transformation.

means

a vector of means for the k_cont continuous variables (i.e. = rep(0, k_cont))

vars

a vector of variances (i.e. = rep(1, k_cont))

skews

a vector of skewness values (i.e. = rep(0, k_cont))

skurts

a vector of standardized kurtoses (kurtosis - 3, so that normal variables have a value of 0; i.e. = rep(0, k_cont))

fifths

a vector of standardized fifth cumulants (not necessary for method = "Fleishman"; i.e. = rep(0, k_cont))

sixths

a vector of standardized sixth cumulants (not necessary for method = "Fleishman"; i.e. = rep(0, k_cont))

Six

a list of vectors of correction values to add to the sixth cumulants if no valid pdf constants are found, ex: Six = list(seq(0.01, 2, by = 0.01), seq(1, 10, by = 0.5)); if no correction is desired for variable Y_i, set the i-th list component equal to NULL

marginal

a list of length equal to k_cat; the i-th element is a vector of the cumulative probabilities defining the marginal distribution of the i-th variable; if the variable can take r values, the vector will contain r - 1 probabilities (the r-th is assumed to be 1; default = list())

lam

a vector of lambda (> 0) constants for the Poisson variables (see Poisson)

size

a vector of size parameters for the Negative Binomial variables (see NegBinomial)

prob

a vector of success probability parameters for the Negative Binomial variables (see NegBinomial)

mu

a vector of mean parameters (*Note: either prob or mu should be supplied for all Negative Binomial variables, not a mixture; default = NULL)

rho

the target correlation matrix (must be ordered ordinal, continuous, Poisson, Negative Binomial; default = NULL)

n

the sample size (i.e. the length of each simulated variable; default = 100000)

seed

the seed value for random number generation (default = 1234)

Value

A list with components:

L_rho the lower correlation bound

U_rho the upper correlation bound

If continuous variables are desired, additional components are:

constants the calculated constants

sixth_correction a vector of the sixth cumulant correction values

valid.pdf a vector with i-th component equal to "TRUE" if variable Y_i has a valid power method pdf, else "FALSE"

If a target correlation matrix rho is provided, each pairwise correlation is checked to see if it is within the lower and upper bounds. If the correlation is outside the bounds, the indices of the variable pair are given.
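
For example, the bounds returned by the first call in the Examples section can be inspected directly (a minimal sketch; the component names match those listed above):

valid <- valid_corr(n = 1000, k_cat = 1, k_cont = 1, method = "Polynomial",
                    means = 0, vars = 1, skews = 0, skurts = 0, fifths = 0,
                    sixths = 0, marginal = list(c(1/3, 2/3)),
                    rho = matrix(c(1, 0.4, 0.4, 1), 2, 2))
valid$L_rho      # lower correlation bounds
valid$U_rho      # upper correlation bounds
valid$valid.pdf  # "TRUE" if the continuous variable has a valid power method pdf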

Reasons for Function Errors

1) The most likely cause of function errors is that no solutions to fleish or poly converged when using find_constants. If this happens, the simulation will stop. It may help to first use find_constants for each continuous variable to determine whether a vector of sixth cumulant correction values is needed. If the standardized cumulants are obtained from calc_theory, the user may need to use rounded values as inputs (i.e. skews = round(skews, 8)), for example to ensure that the skew is exactly 0 for symmetric distributions. Due to the nature of the integration involved in calc_theory, the results are approximations; greater accuracy can be achieved by increasing the number of subdivisions (sub) used in the integration process.

2) In addition, the kurtosis may be outside the region of possible values. There is a lower boundary on the standardized kurtosis associated with a given skew (for Fleishman's method) or with given skew, fifth, and sixth cumulants (for Headrick's method). Use calc_lower_skurt to determine this boundary for a given set of cumulants; a sketch of both checks is given below.
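
A rough sketch of these two checks for a single continuous variable follows; the argument names for find_constants and calc_lower_skurt are assumed from their help pages and should be verified there.

# sketch: pre-check one continuous variable (t distribution with df = 10)
# before running valid_corr / rcorrvar
stcum <- round(calc_theory(Dist = "t", params = 10), 8)
# elements 3 to 6 are assumed to be the skew, standardized kurtosis,
# fifth, and sixth cumulants (as in the Examples section below)

# 1) do valid power method constants exist for these cumulants?
con <- find_constants(method = "Polynomial", skews = stcum[3],
                      skurts = stcum[4], fifths = stcum[5], sixths = stcum[6])

# 2) is the standardized kurtosis above its lower boundary?
calc_lower_skurt(method = "Polynomial", skews = stcum[3],
                 fifths = stcum[5], sixths = stcum[6])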

The Generate, Sort, and Correlate (GSC, Demirtas & Hedeker, 2011, doi: 10.1198/tast.2011.10090) Algorithm

The GSC algorithm is a flexible method for determining empirical correlation bounds when the theoretical bounds are unknown. The steps are as follows:

1) Generate independent random samples from the desired distributions using a large number of observations (i.e. N = 100,000).

2) Lower Bound: Sort the two variables in opposite directions (i.e., one increasing and one decreasing) and find the sample correlation.

3) Upper Bound: Sort the two variables in the same direction and find the sample correlation.

Demirtas & Hedeker showed that the empirical bounds computed from the GSC method are similar to the theoretical bounds (when they are known).
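
A minimal R sketch of these three steps for one pair of variables (here a 4-category ordinal variable and a Poisson variable, chosen only for illustration):

set.seed(1234)
N <- 100000

# 1) independent random samples from the desired distributions
y1 <- as.numeric(cut(runif(N), breaks = c(0, 0.3, 0.6, 0.9, 1)))  # ordinal
y2 <- rpois(N, lambda = 5)                                        # Poisson

# 2) lower bound: sort the variables in opposite directions
lower <- cor(sort(y1), sort(y2, decreasing = TRUE))

# 3) upper bound: sort the variables in the same direction
upper <- cor(sort(y1), sort(y2))

c(lower = lower, upper = upper)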

The Frechet-Hoeffding Correlation Bounds

Suppose two random variables Y_{i} and Y_{j} have cumulative distribution functions F_{i} and F_{j}, and let U be a Uniform(0, 1) random variable. Then Hoeffding (1940) and Frechet (1951) showed that bounds for the correlation between Y_{i} and Y_{j} are given by

(corr(F_{i}^{-1}(U), F_{j}^{-1}(1-U)), corr(F_{i}^{-1}(U), F_{j}^{-1}(U)))
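
For example, for two Poisson variables with means 1 and 5 (Poisson and Negative Binomial quantile functions are used this way in the sections below), these bounds can be approximated by applying the inverse cdfs to a common uniform sample (a minimal sketch):

set.seed(1234)
u <- runif(100000)                   # U ~ Uniform(0, 1)

y1     <- qpois(u, lambda = 1)       # F_i^(-1)(U)
y2     <- qpois(u, lambda = 5)       # F_j^(-1)(U)
y2.rev <- qpois(1 - u, lambda = 5)   # F_j^(-1)(1 - U)

c(lower = cor(y1, y2.rev), upper = cor(y1, y2))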

The processes used to find the correlation bounds for each variable type are described below:

Ordinal Variables

Binary pairs: The correlation bounds are determined as in Demirtas et al. (2012, doi: 10.1002/sim.5362), who used the method of Emrich & Piedmonte (1991, doi: 10.1080/00031305.1991.10475828). The joint distribution is determined by "borrowing" the moments of a multivariate normal distribution. For two binary variables Y_{i} and Y_{j}, with success probabilities p_{i} and p_{j}, the lower correlation bound is given by

max(-√{(p_{i}p_{j})/(q_{i}q_{j})}, -√{(q_{i}q_{j})/(p_{i}p_{j})})

and the upper bound by

min(√{(p_{i}q_{j})/(q_{i}p_{j})}, √{(q_{i}p_{j})/(p_{i}q_{j})})

Here, q_{i} = 1 - p_{i} and q_{j} = 1 - p_{j}.
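
Written as a small R helper (for illustration only, not part of the package), these bounds are:

# Emrich & Piedmonte (1991) correlation bounds for two binary variables
# with success probabilities p1 and p2
binary_bounds <- function(p1, p2) {
  q1 <- 1 - p1
  q2 <- 1 - p2
  c(lower = max(-sqrt((p1 * p2)/(q1 * q2)), -sqrt((q1 * q2)/(p1 * p2))),
    upper = min(sqrt((p1 * q2)/(q1 * p2)), sqrt((q1 * p2)/(p1 * q2))))
}

binary_bounds(0.3, 0.4)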

Binary-Ordinal or Ordinal-Ordinal pairs: Randomly generated variables with the given marginal distributions are used in the GSC algorithm to find the correlation bounds.

Continuous Variables

Continuous variables are randomly generated using constants from find_constants and a vector of sixth cumulant correction values (if provided). The GSC algorithm is used to find the lower and upper bounds.
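
For instance, with Fleishman (third-order) constants c0, c1, c2, c3, the continuous variable is a polynomial in a standard normal variable; the constants below are illustrative only, since in practice they come from find_constants:

set.seed(1234)
z <- rnorm(100000)
con <- c(-0.1, 0.95, 0.1, 0.01)   # illustrative c0, c1, c2, c3
y.cont <- con[1] + con[2]*z + con[3]*z^2 + con[4]*z^3
# y.cont is then sorted against the other variable, as in the GSC steps above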

Poisson Variables

Poisson variables with the given means (lam) are randomly generated using the inverse cdf method. The Frechet-Hoeffding bounds are used for the correlation bounds.

Negative Binomial Variables

Negative Binomial variables with the given sizes and success probabilities (prob) or means (mu) are randomly generated using the inverse cdf method. The Frechet-Hoeffding bounds are used for the correlation bounds.

Continuous - Ordinal Pairs

Randomly generated ordinal variables with the given marginal distributions and the previously generated continuous variables are used in the GSC algorithm to find the correlation bounds.

Ordinal - Poisson Pairs

Randomly generated ordinal variables with the given marginal distributions and randomly generated Poisson variables with the given means (lam) are used in the GSC algorithm to find the correlation bounds.

Ordinal - Negative Binomial Pairs

Randomly generated ordinal variables with the given marginal distributions and randomly generated Negative Binomial variables with the given sizes and success probabilities (prob) or means (mu) are used in the GSC algorithm to find the correlation bounds.

Continuous - Poisson Pairs

The previously generated continuous variables and randomly generated Poisson variables with the given means (lam) are used in the GSC algorithm to find the correlation bounds.

Continuous - Negative Binomial Pairs

The previously generated continuous variables and randomly generated Negative Binomial variables with the given sizes and success probabilities (prob) or means (mu) are used in the GSC algorithm to find the correlation bounds.

Poisson - Negative Binomial Pairs

Poisson variables with the given means (lam) and Negative Binomial variables with the given sizes and success probabilities (prob) or means (mu) are randomly generated using the inverse cdf method. The Frechet-Hoeffding bounds are used for the correlation bounds.

References

Please see rcorrvar for additional references.

Demirtas H & Hedeker D (2011). A practical way for computing approximate lower and upper correlation bounds. American Statistician, 65(2): 104-109. doi: 10.1198/tast.2011.10090.

Demirtas H, Hedeker D, & Mermelstein RJ (2012). Simulation of massive public health data by power polynomials. Statistics in Medicine, 31(27): 3337-3346. doi: 10.1002/sim.5362.

Emrich LJ & Piedmonte MR (1991). A Method for Generating High-Dimensional Multivariate Binary Variables. The American Statistician, 45(4): 302-304. doi: 10.1080/00031305.1991.10475828.

Frechet M (1951). Sur les tableaux de correlation dont les marges sont donnees. Annales de l'Universite de Lyon, Section A, 14: 53-77.

Hoeffding W (1994). Scale-invariant correlation theory. In: Fisher NI & Sen PK (eds.), The Collected Works of Wassily Hoeffding, 57-107. New York: Springer-Verlag.

Hakan Demirtas, Yiran Hu and Rawan Allozi (2017). PoisBinOrdNor: Data Generation with Poisson, Binary, Ordinal and Normal Components. R package version 1.4. https://CRAN.R-project.org/package=PoisBinOrdNor

See Also

find_constants, rcorrvar

Examples

valid_corr(n = 1000, k_cat = 1, k_cont = 1, method = "Polynomial",
  means = 0, vars = 1, skews = 0, skurts = 0, fifths = 0, sixths = 0,
  marginal = list(c(1/3, 2/3)), rho = matrix(c(1, 0.4, 0.4, 1), 2, 2))

## Not run: 

# Binary, Ordinal, Continuous, Poisson, and Negative Binomial Variables

options(scipen = 999)
seed <- 1234
n <- 10000

# Continuous Distributions: Normal, t (df = 10), Chisq (df = 4),
#                           Beta (a = 4, b = 2), Gamma (a = 4, b = 4)
Dist <- c("Gaussian", "t", "Chisq", "Beta", "Gamma")

# calculate standardized cumulants
# those for the normal and t distributions are rounded to ensure the
# correct values (i.e. skew = 0)

M1 <- round(calc_theory(Dist = "Gaussian", params = c(0, 1)), 8)
M2 <- round(calc_theory(Dist = "t", params = 10), 8)
M3 <- calc_theory(Dist = "Chisq", params = 4)
M4 <- calc_theory(Dist = "Beta", params = c(4, 2))
M5 <- calc_theory(Dist = "Gamma", params = c(4, 4))
M <- cbind(M1, M2, M3, M4, M5)
M <- round(M[-c(1:2),], digits = 6)
colnames(M) <- Dist
rownames(M) <- c("skew", "skurtosis", "fifth", "sixth")
means <- rep(0, length(Dist))
vars <- rep(1, length(Dist))

# Binary and Ordinal Distributions
marginal <- list(0.3, 0.4, c(0.1, 0.5), c(0.3, 0.6, 0.9),
                 c(0.2, 0.4, 0.7, 0.8))
support <- list()

# Poisson Distributions
lam <- c(1, 5, 10)

# Negative Binomial Distributions
size <- c(3, 6)
prob <- c(0.2, 0.8)

ncat <- length(marginal)
ncont <- ncol(M)
npois <- length(lam)
nnb <- length(size)

# Create correlation matrix from a uniform distribution (-0.8, 0.8)
set.seed(seed)
Rey <- diag(1, nrow = (ncat + ncont + npois + nnb))
for (i in 1:nrow(Rey)) {
  for (j in 1:ncol(Rey)) {
    if (i > j) Rey[i, j] <- runif(1, -0.8, 0.8)
    Rey[j, i] <- Rey[i, j]
  }
}

# Test for positive-definiteness
library(Matrix)
if (min(eigen(Rey, symmetric = TRUE)$values) < 0) {
  Rey <- as.matrix(nearPD(Rey, corr = TRUE, keepDiag = TRUE)$mat)
}

# Make sure Rey is within upper and lower correlation limits
valid <- valid_corr(k_cat = ncat, k_cont = ncont, k_pois = npois,
                    k_nb = nnb, method = "Polynomial", means = means,
                    vars = vars, skews = M[1, ], skurts = M[2, ],
                    fifths = M[3, ], sixths = M[4, ], marginal = marginal,
                    lam = lam, size = size, prob = prob, rho = Rey,
                    seed = seed)

## End(Not run)
