simdata_guo: Generates data from 'K' multivariate normal data populations...
In ramhiser/sortinghat: sortinghat


We generate n_k observations (k = 1, …, K) from each of K multivariate normal distributions. Let the kth population have a p-dimensional multivariate normal distribution, N_p(μ_k, Σ_k) with mean vector μ_k and positive-definite covariance matrix Σ_k. Each covariance matrix Σ_k consists of block-diagonal autocorrelation matrices.

simdata_guo(n, mean, block_size, num_blocks, rho, sigma2 = 1, seed = NULL)

`n`	a vector (of length K) of the sample sizes for each population
`mean`	a vector or a list (of length K) of mean vectors
`block_size`	a vector (of length K) of the sizes of the square block matrices for each population. See details.
`num_blocks`	a vector (of length K) giving the number of block matrices for each population. See details.
`rho`	a vector (of length K) of the values of the autocorrelation parameter for each class covariance matrix
`sigma2`	a vector (of length K) of the variance coefficients for each class covariance matrix
`seed`	seed for random number generation (If `NULL`, does not set seed)

The kth class covariance matrix is defined as

Σ_k = Σ^{(ρ)} \oplus Σ^{(-ρ)} \oplus … \oplus Σ^{(ρ)},

where \oplus denotes the direct sum and the (i,j)th entry of Σ^{(ρ)} is

Σ_{ij}^{(ρ)} = ρ^{|i - j|}.

The matrix Σ^{(ρ)} is referred to as a block. Its dimensions are given by the block_size argument, and the number of blocks is given by the num_blocks argument.

Each matrix Σ_k is generated by the cov_block_autocorrelation function.
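
The structure defined above can be sketched in base R. This is only an illustration of the formula as written, with the sign of ρ alternating from block to block; the helpers autocorr_block and direct_sum are hypothetical names introduced here, not functions from the package, and cov_block_autocorrelation may build the matrix differently (for example, by also scaling with sigma2).

```r
# Hypothetical helper: one autocorrelation block with entries rho^|i - j|.
autocorr_block <- function(size, rho) {
  idx <- seq_len(size)
  rho^abs(outer(idx, idx, "-"))
}

# Hypothetical helper: direct sum of square matrices, i.e. a block-diagonal
# matrix with the given blocks on the diagonal and zeros elsewhere.
direct_sum <- function(blocks) {
  sizes <- vapply(blocks, nrow, integer(1))
  out <- matrix(0, nrow = sum(sizes), ncol = sum(sizes))
  offset <- 0
  for (b in blocks) {
    idx <- offset + seq_len(nrow(b))
    out[idx, idx] <- b
    offset <- offset + nrow(b)
  }
  out
}

block_size <- 3
num_blocks <- 3
rho <- 0.9

# Assumption: signs alternate (rho, -rho, rho, ...), following the formula above.
signs <- rep_len(c(1, -1), num_blocks)
blocks <- lapply(signs, function(s) autocorr_block(block_size, s * rho))
Sigma <- direct_sum(blocks)

dim(Sigma)  # 9 x 9, since p = block_size * num_blocks
```

Entries outside the diagonal blocks are zero, so features in different blocks are uncorrelated in this sketch.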

The number of populations, K, is determined from the length of the vector of sample sizes, n. The mean vectors can be given in a list of length K. If one mean is given (as a vector or a list having 1 element), then all populations share this common mean.

The block sizes can be given as a numeric vector or a single value, in which case the value is replicated K times. The same logic applies to num_blocks, rho, and sigma2.

For each class, the number of features, p, is computed as block_size * num_blocks. The values for p must agree for each class.
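
The recycling and dimension check described above might look roughly like the following. This is a sketch under assumptions: recycle_to_K is a hypothetical helper, and the package's own validation and error messages may differ.

```r
n <- c(15, 15, 15)
K <- length(n)  # number of populations, taken from the sample sizes

# Hypothetical helper: repeat a single value K times, otherwise require length K.
recycle_to_K <- function(x, K) {
  if (length(x) == 1) {
    return(rep(x, K))
  }
  stopifnot(length(x) == K)
  x
}

block_size <- recycle_to_K(c(2, 4, 8), K)
num_blocks <- recycle_to_K(c(8, 4, 2), K)
rho        <- recycle_to_K(0.9, K)  # a single value shared by all classes

# The number of features per class must be the same for every class.
p <- block_size * num_blocks
if (length(unique(p)) != 1) {
  stop("The number of features must agree across classes.")
}
p
```

Here each class has 16 features even though the block layouts differ, so the check passes.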

The block-diagonal covariance matrix with autocorrelated blocks was popularized by Guo et al. (2007) for studying classification of high-dimensional data.

A named list containing:

x: A matrix whose rows are the observations generated and whose columns are the p features (variables)
y: A vector denoting the population from which the observation in each row was generated.
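
To make the shape of the return value concrete, here is a minimal base R sketch that draws multivariate normal samples through a Cholesky factor and assembles x and y. It is an assumption-laden illustration: draw_mvn is a hypothetical helper, the package may use a different sampler, and y is coded here as integer labels, which may not match the package's coding.

```r
set.seed(42)

n <- c(10, 10)                        # sample sizes for K = 2 populations
p <- 4
means <- list(rep(0, p), rep(2, p))

# A single autocorrelation matrix with rho = 0.5, shared by both classes
# purely to keep the sketch short.
idx <- seq_len(p)
Sigma <- 0.5^abs(outer(idx, idx, "-"))

# Hypothetical helper: n_k draws from N_p(mu, Sigma) via the Cholesky factor.
draw_mvn <- function(n_k, mu, Sigma) {
  R <- chol(Sigma)                    # upper triangular, Sigma = t(R) %*% R
  z <- matrix(rnorm(n_k * length(mu)), nrow = n_k)
  sweep(z %*% R, 2, mu, "+")          # add the mean to every row
}

samples <- lapply(seq_along(n), function(k) draw_mvn(n[k], means[[k]], Sigma))

sim <- list(
  x = do.call(rbind, samples),        # one row per observation, p columns
  y = rep(seq_along(n), times = n)    # population label for each row
)

dim(sim$x)
table(sim$y)
```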

Guo, Y., Hastie, T., & Tibshirani, R. (2007). "Regularized linear discriminant analysis and its application in microarrays," Biostatistics, 8, 1, 86-100.

# Generates 10 observations from each of two multivariate normal populations
# having a block-diagonal autocorrelation structure.
block_size <- 3
num_blocks <- 3
p <- block_size * num_blocks
means_list <- list(seq_len(p), -seq_len(p))
data <- simdata_guo(n = c(10, 10), mean = means_list, block_size = block_size,
                    num_blocks = num_blocks, rho = 0.9, seed = 42)
dim(data$x)
table(data$y)

# Generates 15 observations from each of three multivariate normal
# populations having block-diagonal autocorrelation structures. The
# covariance matrices are unequal.
p <- 16
block_size <- c(2, 4, 8)
num_blocks <- p / block_size
rho <- c(0.1, 0.5, 0.9)
sigma2 <- 1:3
mean_list <- list(rep.int(-5, p), rep.int(0, p), rep.int(5, p))

set.seed(42)
data2 <- simdata_guo(n = c(15, 15, 15), mean = mean_list,
                    block_size = block_size, num_blocks = num_blocks,
                    rho = rho, sigma2 = sigma2)
dim(data2$x)
table(data2$y)