valid_corr2: Determine Correlation Bounds for Ordinal, Continuous,...
In SimMultiCorrData: Simulation of Correlated Data with Multiple Variable Types

Description Usage Arguments Value Reasons for Function Errors The Generate, Sort, and Correlate (GSC, Demirtas & Hedeker, 2011, doi: 10.1198/tast.2011.10090) Algorithm Ordinal Variables Continuous Variables Poisson Variables Negative Binomial Variables Continuous - Ordinal Pairs Ordinal - Poisson Pairs Ordinal - Negative Binomial Pairs Continuous - Poisson Pairs Continuous - Negative Binomial Pairs Poisson - Negative Binomial Pairs References See Also Examples

View source: R/valid_corr2.R

This function calculates the lower and upper correlation bounds for the given distributions and checks if a given target correlation matrix rho is within the bounds. It should be used before simulation with rcorrvar2. However, even if all pairwise correlations fall within the bounds, it is still possible that the desired correlation matrix is not feasible. This is particularly true when ordinal variables (r >= 2 categories) are generated or negative correlations are desired. Therefore, this function should be used as a general check to eliminate pairwise correlations that are obviously not reproducible. It will help prevent errors when executing the simulation.

Note: Some pieces of the function code have been adapted from Demirtas, Hu, & Allozi's (2017) validation_specs. This function (valid_corr2) extends the methods to:

1) non-normal continuous variables generated by Fleishman's third-order or Headrick's fifth-order polynomial transformation method,

2) Negative Binomial variables (including all pairwise correlations involving them), and

3) Count variables are treated as ordinal when calculating the bounds since that is the intermediate correlation calculation method.

Please see the Comparison of Method 1 and Method 2 vignette for more information regarding method 2.

valid_corr2(k_cat = 0, k_cont = 0, k_pois = 0, k_nb = 0,
  method = c("Fleishman", "Polynomial"), means = NULL, vars = NULL,
  skews = NULL, skurts = NULL, fifths = NULL, sixths = NULL,
  Six = list(), marginal = list(), lam = NULL, pois_eps = NULL,
  size = NULL, prob = NULL, mu = NULL, nb_eps = NULL, rho = NULL,
  n = 100000, seed = 1234)

`k_cat`	the number of ordinal (r >= 2 categories) variables (default = 0)
`k_cont`	the number of continuous variables (default = 0)
`k_pois`	the number of Poisson variables (default = 0)
`k_nb`	the number of Negative Binomial variables (default = 0)
`method`	the method used to generate the k_cont continuous variables. "Fleishman" uses a third-order polynomial transformation and "Polynomial" uses Headrick's fifth-order transformation.
`means`	a vector of means for the k_cont continuous variables (i.e. = rep(0, k_cont))
`vars`	a vector of variances (i.e. = rep(1, k_cont))
`skews`	a vector of skewness values (i.e. = rep(0, k_cont))
`skurts`	a vector of standardized kurtoses (kurtosis - 3, so that normal variables have a value of 0; i.e. = rep(0, k_cont))
`fifths`	a vector of standardized fifth cumulants (not necessary for `method` = "Fleishman"; i.e. = rep(0, k_cont))
`sixths`	a vector of standardized sixth cumulants (not necessary for `method` = "Fleishman"; i.e. = rep(0, k_cont))
`Six`	a list of vectors of correction values to add to the sixth cumulants if no valid pdf constants are found, ex: `Six = list(seq(0.01, 2,by = 0.01), seq(1, 10,by = 0.5))`; if no correction is desired for variable Y_i, set the i-th list component equal to NULL
`marginal`	a list of length equal to `k_cat`; the i-th element is a vector of the cumulative probabilities defining the marginal distribution of the i-th variable; if the variable can take r values, the vector will contain r - 1 probabilities (the r-th is assumed to be 1; default = list())
`lam`	a vector of lambda (> 0) constants for the Poisson variables (see `Poisson`)
`pois_eps`	a vector of length `k_pois` containing the truncation values (i.e. = rep(0.0001, k_pois); default = NULL)
`size`	a vector of size parameters for the Negative Binomial variables (see `NegBinomial`)
`prob`	a vector of success probability parameters
`mu`	a vector of mean parameters (*Note: either `prob` or `mu` should be supplied for all Negative Binomial variables, not a mixture; default = NULL)
`nb_eps`	a vector of length `k_nb` containing the truncation values (i.e. = rep(0.0001, k_nb); default = NULL)
`rho`	the target correlation matrix (must be ordered ordinal, continuous, Poisson, Negative Binomial; default = NULL)
`n`	the sample size (i.e. the length of each simulated variable; default = 100000)
`seed`	the seed value for random number generation (default = 1234)

A list with components:

L_rho the lower correlation bound

U_rho the upper correlation bound

If continuous variables are desired, additional components are:

constants the calculated constants

sixth_correction a vector of the sixth cumulant correction values

valid.pdf a vector with i-th component equal to "TRUE" if variable Y_i has a valid power method pdf, else "FALSE"

If a target correlation matrix rho is provided, each pairwise correlation is checked to see if it is within the lower and upper bounds. If the correlation is outside the bounds, the indices of the variable pair are given.

1) The most likely cause for function errors is that no solutions to fleish or poly converged when using find_constants. If this happens, the simulation will stop. It may help to first use find_constants for each continuous variable to determine if a vector of sixth cumulant correction values is needed. If the standardized cumulants are obtained from calc_theory, the user may need to use rounded values as inputs (i.e. skews = round(skews, 8)). Due to the nature of the integration involved in calc_theory, the results are approximations. Greater accuracy can be achieved by increasing the number of subdivisions (sub) used in the integration process. For example, in order to ensure that skew is exactly 0 for symmetric distributions.

2) In addition, the kurtosis may be outside the region of possible values. There is an associated lower boundary for kurtosis associated with a given skew (for Fleishman's method) or skew and fifth and sixth cumulants (for Headrick's method). Use calc_lower_skurt to determine the boundary for a given set of cumulants.

The GSC algorithm is a flexible method for determining empirical correlation bounds when the theoretical bounds are unknown. The steps are as follows:

1) Generate independent random samples from the desired distributions using a large number of observations (i.e. N = 100,000).

2) Lower Bound: Sort the two variables in opposite directions (i.e., one increasing and one decreasing) and find the sample correlation.

3) Upper Bound: Sort the two variables in the same direction and find the sample correlation.

Demirtas & Hedeker showed that the empirical bounds computed from the GSC method are similar to the theoretical bounds (when they are known).

The processes used to find the correlation bounds for each variable type are described below:

Binary pairs: The correlation bounds are determined as in Demirtas et al. (2012, doi: 10.1002/sim.5362), who used the method of Emrich & Piedmonte (1991, doi: 10.1080/00031305.1991.10475828). The joint distribution is determined by "borrowing" the moments of a multivariate normal distribution. For two binary variables Y_{i} and Y_{j}, with success probabilities p_{i} and p_{j}, the lower correlation bound is given by

max(-√{(p_{i}p_{j})/(q_{i}q_{j})},\ -√{(q_{i}q_{j})/(p_{i}p_{j})})

and the upper bound by

min(√{(p_{i}q_{j})/(q_{i}p_{j})},\ √{(q_{i}p_{j})/(p_{i}q_{j})})

Here, q_{i} = 1 - p_{i} and q_{j} = 1 - p_{j}.

Binary-Ordinal or Ordinal-Ordinal pairs: Randomly generated variables with the given marginal distributions are used in the GSC algorithm to find the correlation bounds.

Continuous variables are randomly generated using constants from find_constants and a vector of sixth cumulant correction values (if provided.) The GSC algorithm is used to find the lower and upper bounds.

The maximum support values, given the vector of cumulative probability truncation values (pois_eps) and vector of means (lam), are calculated using max_count_support. The finite supports are used to determine marginal distributions for each Poisson variable. Randomly generated variables with the given marginal distributions are used in the GSC algorithm to find the correlation bounds.

The maximum support values, given the vector of cumulative probability truncation values (nb_eps) and vectors of sizes and success probabilities (prob) or means (mu), are calculated using max_count_support. The finite supports are used to determine marginal distributions for each Negative Binomial variable. Randomly generated variables with the given marginal distributions are used in the GSC algorithm to find the correlation bounds.

Randomly generated ordinal variables with the given marginal distributions and the previously generated continuous variables are used in the GSC algorithm to find the correlation bounds.

Randomly generated ordinal and Poisson variables with the given marginal distributions are used in the GSC algorithm to find the correlation bounds.

Randomly generated ordinal and Negative Binomial variables with the given marginal distributions are used in the GSC algorithm to find the correlation bounds.

The previously generated continuous variables and randomly generated Poisson variables with the given marginal distributions are used in the GSC algorithm to find the correlation bounds.

The previously generated continuous variables and randomly generated Negative Binomial variables with the given marginal distributions are used in the GSC algorithm to find the correlation bounds.

Randomly generated variables with the given marginal distributions are used in the GSC algorithm to find the correlation bounds.

Please see rcorrvar2 for additional references.

Demirtas H & Hedeker D (2011). A practical way for computing approximate lower and upper correlation bounds. American Statistician, 65(2): 104-109. doi: 10.1198/tast.2011.10090.

Demirtas H, Hedeker D, & Mermelstein RJ (2012). Simulation of massive public health data by power polynomials. Statistics in Medicine, 31(27): 3337-3346. doi: 10.1002/sim.5362.

Emrich LJ & Piedmonte MR (1991). A Method for Generating High-Dimensional Multivariate Binary Variables. The American Statistician, 45(4): 302-4. doi: 10.1080/00031305.1991.10475828.

Frechet M. Sur les tableaux de correlation dont les marges sont donnees. Ann. l'Univ. Lyon SectA. 1951;14:53-77.

Hoeffding W. Scale-invariant correlation theory. In: Fisher NI, Sen PK, editors. The collected works of Wassily Hoeffding. New York: Springer-Verlag; 1994. p. 57-107.

Hakan Demirtas, Yiran Hu and Rawan Allozi (2017). PoisBinOrdNor: Data Generation with Poisson, Binary, Ordinal and Normal Components. R package version 1.4. https://CRAN.R-project.org/package=PoisBinOrdNor

find_constants, rcorrvar2

valid_corr2(n = 1000, k_cat = 1, k_cont = 1, method = "Polynomial",
  means = 0, vars = 1, skews = 0, skurts = 0, fifths = 0, sixths = 0,
  marginal = list(c(1/3, 2/3)), rho = matrix(c(1, 0.4, 0.4, 1), 2, 2))

## Not run: 

# Binary, Ordinal, Continuous, Poisson, and Negative Binomial Variables

options(scipen = 999)
seed <- 1234
n <- 10000

# Continuous Distributions: Normal, t (df = 10), Chisq (df = 4),
#                           Beta (a = 4, b = 2), Gamma (a = 4, b = 4)
Dist <- c("Gaussian", "t", "Chisq", "Beta", "Gamma")

# calculate standardized cumulants
# those for the normal and t distributions are rounded to ensure the
# correct values (i.e. skew = 0)

M1 <- round(calc_theory(Dist = "Gaussian", params = c(0, 1)), 8)
M2 <- round(calc_theory(Dist = "t", params = 10), 8)
M3 <- calc_theory(Dist = "Chisq", params = 4)
M4 <- calc_theory(Dist = "Beta", params = c(4, 2))
M5 <- calc_theory(Dist = "Gamma", params = c(4, 4))
M <- cbind(M1, M2, M3, M4, M5)
M <- round(M[-c(1:2),], digits = 6)
colnames(M) <- Dist
rownames(M) <- c("skew", "skurtosis", "fifth", "sixth")
means <- rep(0, length(Dist))
vars <- rep(1, length(Dist))

# Binary and Ordinal Distributions
marginal <- list(0.3, 0.4, c(0.1, 0.5), c(0.3, 0.6, 0.9),
                 c(0.2, 0.4, 0.7, 0.8))
support <- list()

# Poisson Distributions
lam <- c(1, 5, 10)

# Negative Binomial Distributions
size <- c(3, 6)
prob <- c(0.2, 0.8)

ncat <- length(marginal)
ncont <- ncol(M)
npois <- length(lam)
nnb <- length(size)

# Create correlation matrix from a uniform distribution (-0.8, 0.8)
set.seed(seed)
Rey <- diag(1, nrow = (ncat + ncont + npois + nnb))
for (i in 1:nrow(Rey)) {
  for (j in 1:ncol(Rey)) {
    if (i > j) Rey[i, j] <- runif(1, -0.8, 0.8)
    Rey[j, i] <- Rey[i, j]
  }
}

# Test for positive-definiteness
library(Matrix)
if(min(eigen(Rey, symmetric = TRUE)$values) < 0) {
  Rey <- as.matrix(nearPD(Rey, corr = T, keepDiag = T)$mat)
}

# Make sure Rey is within upper and lower correlation limits
valid <- valid_corr2(k_cat = ncat, k_cont = ncont, k_pois = npois,
                     k_nb = nnb, method = "Polynomial", means = means,
                     vars = vars, skews = M[1, ], skurts = M[2, ],
                     fifths = M[3, ], sixths = M[4, ], marginal = marginal,
                     lam = lam, pois_eps = rep(0.0001, npois),
                     size = size, prob = prob, nb_eps = rep(0.0001, nnb),
                     rho = Rey, seed = seed)


## End(Not run)

SimMultiCorrData documentation built on May 2, 2019, 9:50 a.m.

SimMultiCorrData index

README.md Benefits of SimMultiCorrData and Comparison to Other Packages" Comparison of Correlation Method 1 and Correlation Method 2" Comparison of Simulated Distribution to Theoretical Distribution or Empirical Data" Functions by Topic" Overall Workflow for Data Simulation" Overview of Error Loop" Using the Sixth Cumulant Correction to Find Valid Power Method Pdfs" Variable Types"

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

SimMultiCorrData
Simulation of Correlated Data with Multiple Variable Types

valid_corr2: Determine Correlation Bounds for Ordinal, Continuous,...
In SimMultiCorrData: Simulation of Correlated Data with Multiple Variable Types

Description

Usage

Arguments

Value

Reasons for Function Errors

The Generate, Sort, and Correlate (GSC, Demirtas & Hedeker, 2011, doi: 10.1198/tast.2011.10090) Algorithm

Ordinal Variables

Continuous Variables

Poisson Variables

Negative Binomial Variables

Continuous - Ordinal Pairs

Ordinal - Poisson Pairs

Ordinal - Negative Binomial Pairs

Continuous - Poisson Pairs

Continuous - Negative Binomial Pairs

Poisson - Negative Binomial Pairs

References

See Also

Examples

Related to valid_corr2 in SimMultiCorrData...

R Package Documentation

Browse R Packages

We want your feedback!

SimMultiCorrData Simulation of Correlated Data with Multiple Variable Types

valid_corr2: Determine Correlation Bounds for Ordinal, Continuous,... In SimMultiCorrData: Simulation of Correlated Data with Multiple Variable Types

Description

Usage

Arguments

Value

Reasons for Function Errors

The Generate, Sort, and Correlate (GSC, Demirtas & Hedeker, 2011, doi: 10.1198/tast.2011.10090) Algorithm

Ordinal Variables

Continuous Variables

Poisson Variables

Negative Binomial Variables

Continuous - Ordinal Pairs

Ordinal - Poisson Pairs

Ordinal - Negative Binomial Pairs

Continuous - Poisson Pairs

Continuous - Negative Binomial Pairs

Poisson - Negative Binomial Pairs

References

See Also

Examples

Related to valid_corr2 in SimMultiCorrData...

R Package Documentation

Browse R Packages

We want your feedback!

SimMultiCorrData
Simulation of Correlated Data with Multiple Variable Types

valid_corr2: Determine Correlation Bounds for Ordinal, Continuous,...
In SimMultiCorrData: Simulation of Correlated Data with Multiple Variable Types