GenerateData: Mixed type simulation data generator for sparse CCA
In mixedCCA: Sparse Canonical Correlation Analysis for High-Dimensional Mixed Data


GenerateData

R Documentation


Description

GenerateData generates two datasets of mixed types for sparse CCA under the Gaussian copula model.
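To make the Gaussian copula model concrete, the base-R sketch below draws one latent normal variable, applies a monotone transformation, and then derives a truncated and a binary version of it. The helper `observe_latent`, its argument `thresh`, and the exact thresholding rule (values at or below the threshold become 0) are illustrative assumptions, not the package's internal code.

```r
# Hypothetical helper (not part of mixedCCA): map latent draws to observed data.
observe_latent <- function(z, copula = "no", type = "continuous", thresh = 0) {
  # Monotone copula transformation U = f(Z)
  u <- switch(copula,
              no   = z,
              exp  = exp(z),
              cube = z^3,
              stop("unknown copula"))
  # Observation mechanism, assumed here to act on the transformed variable
  switch(type,
         continuous = u,
         trunc      = ifelse(u > thresh, u, 0),
         binary     = as.numeric(u > thresh),
         stop("unknown type"))
}

set.seed(1)
z <- rnorm(1000)
x_trunc  <- observe_latent(z, copula = "exp",  type = "trunc",  thresh = 1)
x_binary <- observe_latent(z, copula = "cube", type = "binary", thresh = 0)

mean(x_trunc == 0)  # proportion of observations truncated to zero
mean(x_binary)      # proportion of ones
```

Because exp(z) > 1 exactly when z > 0, roughly half of the truncated variable is zero in this toy setting; changing `thresh` moves that proportion.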

Usage

GenerateData(
  n,
  trueidx1,
  trueidx2,
  Sigma1,
  Sigma2,
  maxcancor,
  copula1 = "no",
  copula2 = "no",
  type1 = "continuous",
  type2 = "continuous",
  muZ = NULL,
  c1 = NULL,
  c2 = NULL
)

Arguments

`n`	Sample size
`trueidx1`	True canonical direction of length p1 for `X1`. It will be automatically normalized such that `w_1^T \Sigma_1 w_1 = 1`.
`trueidx2`	True canonical direction of length p2 for `X2`. It will be automatically normalized such that `w_2^T \Sigma_2 w_2 = 1`.
`Sigma1`	True correlation matrix of latent variable `Z1` (p1 by p1).
`Sigma2`	True correlation matrix of latent variable `Z2` (p2 by p2).
`maxcancor`	True canonical correlation between `Z1` and `Z2`.
`copula1`	Copula type for the first dataset. U1 = f(Z1), where f can be "exp" or "cube". The default is "no" (no transformation).
`copula2`	Copula type for the second dataset. U2 = f(Z2), where f can be "exp" or "cube". The default is "no" (no transformation).
`type1`	Type of the first dataset `X1`. Could be "continuous", "trunc" or "binary".
`type2`	Type of the second dataset `X2`. Could be "continuous", "trunc" or "binary".
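`muZ`	Mean vector of the latent multivariate normal distribution (length p1 + p2 in the example below) - the default is NULL.
`c1`	Vector of constant thresholds for `X1` (length p1 in the example below), needed when `type1` is "trunc" or "binary" - the default is NULL.
`c2`	Vector of constant thresholds for `X2` (length p2 in the example below), needed when `type2` is "trunc" or "binary" - the default is NULL.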
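The normalization applied to `trueidx1` and `trueidx2` can be illustrated in base R. This is a minimal sketch on a toy 4 by 4 correlation matrix; the names `Sigma`, `w_raw` and `w` are chosen for illustration and the values are not taken from the package.

```r
# Toy correlation matrix and an unnormalized 0/1 direction (illustrative values)
p <- 4
Sigma <- 0.5^abs(outer(1:p, 1:p, "-"))  # AR(1)-type correlation matrix
w_raw <- c(1, 1, 0, 0)

# Rescale so that t(w) %*% Sigma %*% w equals 1
scale_const <- sqrt(drop(t(w_raw) %*% Sigma %*% w_raw))
w <- w_raw / scale_const

drop(t(w) %*% Sigma %*% w)  # 1 up to floating-point error
```

Only the scale of the direction changes; the zero pattern of the input, which encodes the truly relevant variables, is preserved.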

Value

GenerateData returns a list containing:

Z1: latent numeric data matrix (n by p1).
Z2: latent numeric data matrix (n by p2).
X1: observed numeric data matrix (n by p1).
X2: observed numeric data matrix (n by p2).
true_w1: normalized true canonical direction of length p1 for X1.
true_w2: normalized true canonical direction of length p2 for X2.
type: a vector containing types of two datasets.
maxcancor: true canonical correlation between Z1 and Z2.
c1: constant threshold for X1 for "trunc" and "binary" data type.
c2: constant threshold for X2 for "trunc" and "binary" data type.
Sigma: true latent correlation matrix of Z1 and Z2 ((p1+p2) by (p1+p2)).
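One standard way to build a joint latent correlation matrix with a single prescribed canonical correlation is to set the cross-block to `rho * Sigma1 %*% w1 %*% t(w2) %*% Sigma2`, with `w1` and `w2` normalized as described under Arguments. The base-R sketch below follows that construction on toy inputs; it is an assumption about how `Sigma` relates to the inputs, not a transcription of the package source, and `normalize` is a hypothetical helper.

```r
# Toy within-set correlation matrices (illustrative values)
p1 <- 3; p2 <- 2
Sigma1 <- 0.7^abs(outer(1:p1, 1:p1, "-"))
Sigma2 <- matrix(0.3, p2, p2); diag(Sigma2) <- 1
rho <- 0.9  # target canonical correlation

# Hypothetical helper: scale w so that t(w) %*% S %*% w equals 1
normalize <- function(w, S) w / sqrt(drop(t(w) %*% S %*% w))
w1 <- normalize(c(1, 1, 0), Sigma1)
w2 <- normalize(c(1, 0), Sigma2)

# Cross-covariance block and the joint (p1 + p2) by (p1 + p2) matrix
Sigma12 <- rho * Sigma1 %*% w1 %*% t(w2) %*% Sigma2
Sigma <- rbind(cbind(Sigma1, Sigma12), cbind(t(Sigma12), Sigma2))

dim(Sigma)
drop(t(w1) %*% Sigma12 %*% w2)                   # matches rho by construction
min(eigen(Sigma, symmetric = TRUE)$values) > 0   # joint matrix is positive definite
```

With this choice, the correlation between the latent projections along `w1` and `w2` equals `rho`, which corresponds to the role of `maxcancor` here.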

Examples

### Simple example

library(mixedCCA) # provides GenerateData, autocor and blockcor

# Data setting
n <- 100; p1 <- 15; p2 <- 10 # sample size and dimensions for two datasets.
maxcancor <- 0.9 # true canonical correlation

# Correlation structure within each data set
set.seed(0)
perm1 <- sample(1:p1, size = p1)
Sigma1 <- autocor(p1, 0.7)[perm1, perm1]
blockind <- sample(1:3, size = p2, replace = TRUE)
Sigma2 <- blockcor(blockind, 0.7)
mu <- rbinom(p1 + p2, 1, 0.5) # latent means, one per variable (0 or 1)

# true variable indices for each dataset
trueidx1 <- c(rep(1, 3), rep(0, p1-3))
trueidx2 <- c(rep(1, 2), rep(0, p2-2))

# Data generation
simdata <- GenerateData(n = n, trueidx1 = trueidx1, trueidx2 = trueidx2,
                        maxcancor = maxcancor,
                        Sigma1 = Sigma1, Sigma2 = Sigma2,
                        copula1 = "exp", copula2 = "cube",
                        muZ = mu,
                        type1 = "trunc", type2 = "trunc",
                        c1 = rep(1, p1), c2 = rep(0, p2)
)
X1 <- simdata$X1
X2 <- simdata$X2

# Check the range, across variables, of the truncation level (proportion of zeros)
range(colMeans(X1 == 0))
range(colMeans(X2 == 0))

mixedCCA documentation built on Nov. 18, 2025, 9:06 a.m.