GenerateData: Mixed type simulation data generator for sparse CCA

GenerateData    R Documentation

Description

GenerateData generates two datasets of mixed types for sparse canonical correlation analysis (CCA) under the Gaussian copula model.

Usage

GenerateData(
  n,
  trueidx1,
  trueidx2,
  Sigma1,
  Sigma2,
  maxcancor,
  copula1 = "no",
  copula2 = "no",
  type1 = "continuous",
  type2 = "continuous",
  muZ = NULL,
  c1 = NULL,
  c2 = NULL
)

Arguments

n

Sample size

trueidx1

True canonical direction of length p1 for X1. It will be automatically normalized such that w_1^T Σ_1 w_1 = 1.

trueidx2

True canonical direction of length p2 for X2. It will be automatically normalized such that w_2^T Σ_2 w_2 = 1.

Sigma1

True correlation matrix of latent variable Z1 (p1 by p1).

Sigma2

True correlation matrix of latent variable Z2 (p2 by p2).

maxcancor

True canonical correlation between Z1 and Z2.

copula1

Copula type for the first dataset, U1 = f(Z1). One of "no" (the default), "exp" or "cube".

copula2

Copula type for the second dataset, U2 = f(Z2). One of "no" (the default), "exp" or "cube".

type1

Type of the first dataset X1. One of "continuous" (the default), "trunc" or "binary".

type2

Type of the second dataset X2. One of "continuous" (the default), "trunc" or "binary".

muZ

Mean vector of the latent multivariate normal distribution. The default is NULL.

c1

Constant threshold for X1, needed when type1 is "trunc" or "binary". The default is NULL.

c2

Constant threshold for X2, needed when type2 is "trunc" or "binary". The default is NULL.
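
How the copula, type and threshold arguments might combine can be sketched in base R as a two-step pipeline, from latent Z to U = f(Z) to observed X. This is only an illustration under assumptions: it takes "exp" to mean f(z) = exp(z), "cube" to mean f(z) = z^3, "trunc" to set values at or below the threshold to zero, and "binary" to record whether a value exceeds the threshold. None of these details is stated on this page, and apply_copula and apply_type are hypothetical helpers, not mixedCCA functions.

```r
# Hypothetical helper: monotone transformation U = f(Z) (assumed forms).
apply_copula <- function(Z, copula = c("no", "exp", "cube")) {
  copula <- match.arg(copula)
  switch(copula,
         no   = Z,
         exp  = exp(Z),
         cube = Z^3)
}

# Hypothetical helper: turn U into observed data of the requested type.
# thresh holds one threshold per column of U.
apply_type <- function(U, type = c("continuous", "trunc", "binary"),
                       thresh = NULL) {
  type <- match.arg(type)
  if (type == "continuous") return(U)
  # Compare each column j of U against its own threshold thresh[j]
  above <- sweep(U, 2, thresh, FUN = ">")
  if (type == "trunc") U * above else above * 1
}

set.seed(2)
Z <- matrix(rnorm(200 * 3), nrow = 200, ncol = 3)
U <- apply_copula(Z, "exp")
thresh <- c(0.5, 1, 1.5)

X_trunc <- apply_type(U, "trunc", thresh)
X_bin   <- apply_type(U, "binary", thresh)

# Proportion of zeros per column; larger thresholds give more zeros
colMeans(X_trunc == 0)
```

Because both assumed transformations are strictly increasing, the ranks of Z are preserved in U, which is the property a Gaussian copula model relies on.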

Value

GenerateData returns a list containing:

  • Z1: latent numeric data matrix (n by p1).

  • Z2: latent numeric data matrix (n by p2).

  • X1: observed numeric data matrix (n by p1).

  • X2: observed numeric data matrix (n by p2).

  • true_w1: normalized true canonical direction of length p1 for X1.

  • true_w2: normalized true canonical direction of length p2 for X2.

  • type: a vector containing the types of the two datasets.

  • maxcancor: true canonical correlation between Z1 and Z2.

  • c1: constant threshold for X1 for "trunc" and "binary" data type.

  • c2: constant threshold for X2 for "trunc" and "binary" data type.

  • Sigma: true latent correlation matrix of Z1 and Z2 ((p1+p2) by (p1+p2)).
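
The relationship between the normalized directions, maxcancor and the joint matrix Sigma can be sketched in base R. The sketch assumes the cross-correlation block is built as maxcancor * Sigma1 w1 w2^T Sigma2, a common construction for a single canonical pair; this page does not state how the package builds it. The names normalize_direction and inv_sqrt are hypothetical helpers.

```r
# Hypothetical helper: rescale w so that t(w) %*% Sigma %*% w equals 1.
normalize_direction <- function(w, Sigma) {
  w / sqrt(drop(crossprod(w, Sigma %*% w)))
}

p1 <- 4; p2 <- 3
# AR(1)-style correlation matrices, built by hand for illustration
Sigma1 <- 0.7^abs(outer(1:p1, 1:p1, "-"))
Sigma2 <- 0.5^abs(outer(1:p2, 1:p2, "-"))

w1 <- normalize_direction(c(1, 1, 0, 0), Sigma1)
w2 <- normalize_direction(c(1, 0, 0), Sigma2)
maxcancor <- 0.9

# Assumed cross-correlation block and joint (p1+p2) by (p1+p2) matrix
Sigma12 <- maxcancor * Sigma1 %*% w1 %*% t(w2) %*% Sigma2
Sigma <- rbind(cbind(Sigma1, Sigma12),
               cbind(t(Sigma12), Sigma2))

# Hypothetical helper: inverse symmetric square root via eigendecomposition
inv_sqrt <- function(S) {
  e <- eigen(S, symmetric = TRUE)
  e$vectors %*% diag(1 / sqrt(e$values)) %*% t(e$vectors)
}

# The leading canonical correlation is the largest singular value of
# Sigma1^{-1/2} Sigma12 Sigma2^{-1/2}
rho <- svd(inv_sqrt(Sigma1) %*% Sigma12 %*% inv_sqrt(Sigma2))$d[1]
rho  # recovers maxcancor up to floating-point error
```

Under this construction the normalization w^T Σ w = 1 is what makes the leading canonical correlation equal maxcancor exactly, and the joint matrix stays positive definite whenever maxcancor is below 1.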

Examples

### Simple example

# Load the package that provides GenerateData, autocor and blockcor
library(mixedCCA)

# Data setting
n <- 100; p1 <- 15; p2 <- 10 # sample size and dimensions of the two datasets
maxcancor <- 0.9 # true canonical correlation

# Correlation structure within each dataset
set.seed(0)
perm1 <- sample(1:p1, size = p1)
Sigma1 <- autocor(p1, 0.7)[perm1, perm1]
blockind <- sample(1:3, size = p2, replace = TRUE)
Sigma2 <- blockcor(blockind, 0.7)
mu <- rbinom(p1 + p2, 1, 0.5) # latent mean vector of length p1 + p2

# True variable indices for each dataset
trueidx1 <- c(rep(1, 3), rep(0, p1 - 3))
trueidx2 <- c(rep(1, 2), rep(0, p2 - 2))

# Data generation
simdata <- GenerateData(n = n, trueidx1 = trueidx1, trueidx2 = trueidx2,
                        maxcancor = maxcancor,
                        Sigma1 = Sigma1, Sigma2 = Sigma2,
                        copula1 = "exp", copula2 = "cube",
                        muZ = mu,
                        type1 = "trunc", type2 = "trunc",
                        c1 = rep(1, p1), c2 = rep(0, p2))
X1 <- simdata$X1
X2 <- simdata$X2

# Check the range of truncation levels (proportion of zeros) across variables
range(colMeans(X1 == 0))
range(colMeans(X2 == 0))

mixedCCA documentation built on Sept. 10, 2022, 1:06 a.m.