Generate simulated mixed-type data with cluster structure.

Description

This function simulates mixed-type data sets with a latent cluster structure, with continuous and nominal variables.

Usage

genMixedData(sampSize, nConVar, nCatVar, nCatLevels, nConWithErr, nCatWithErr,
  popProportions, conErrLev, catErrLev)

Arguments

sampSize

Integer: Size of the simulated data set.

nConVar

Integer: The number of continuous variables.

nCatVar

Integer: The number of categorical variables.

nCatLevels

Integer: The number of categories per categorical variable. Currently must be a multiple of the number of populations specified in popProportions.

nConWithErr

Integer: The number of continuous variables with error.

nCatWithErr

Integer: The number of categorical variables with error.

popProportions

A vector of scalars that sums to one. The length gives the number of populations (clusters), with values denoting the prior probability of observing a member of the corresponding population. NOTE: currently only two populations are supported.

conErrLev

A scalar between 0.01 and 1 denoting the univariate overlap between clusters on the continuous variables specified to have error.

catErrLev

A scalar between 0.01 and 1 denoting the univariate overlap between clusters on the categorical variables specified to have error.

Details

This function simulates mixed-type data sets with a latent cluster structure. Continuous variables follow a normal mixture model, and categorical variables follow a multinomial mixture model. The user can control the overlap between clusters (i.e. how clear the cluster structure is) for both the continuous and the categorical variables. By default, every variable has an overlap of 1 percent (i.e. almost perfect separation). A user-specified number of continuous and categorical variables can be designated as measured with error; for these variables the overlap can be set anywhere between 1 and 100 percent, where 100 percent corresponds to complete overlap.
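To give a feel for what an overlap level means for a continuous variable, the sketch below computes the area shared by two normal densities with a common standard deviation, and the mean separation needed to reach a target overlap. This is an illustration only: it assumes overlap is the overlapping coefficient of two equal-variance normals and ignores mixing proportions, which may not match how the function defines overlap internally. The helper names normalOverlap and sepForOverlap are hypothetical and are not part of the package.

```r
# Hypothetical helpers (not part of the package).
# Assumption: "overlap" is the area shared by two normal densities
# that have the same standard deviation sigma.

# Overlapping coefficient of N(mu1, sigma^2) and N(mu2, sigma^2).
normalOverlap <- function(mu1, mu2, sigma = 1) {
  2 * pnorm(-abs(mu1 - mu2) / (2 * sigma))
}

# Distance between the two means that yields a given overlap in (0, 1].
sepForOverlap <- function(overlap, sigma = 1) {
  stopifnot(overlap > 0, overlap <= 1)
  -2 * sigma * qnorm(overlap / 2)
}

# Identical means give complete overlap; small overlaps need distant means.
normalOverlap(0, 0)
sepForOverlap(0.01)
sepForOverlap(0.30)
```

Under these assumptions, an overlap of 1 (100 percent) corresponds to identical cluster means, and the required separation grows as the overlap shrinks.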

NOTE: Currently, only two populations (clusters) are supported.
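For two populations, the generative model described above can be sketched in base R as follows. This is a simplified illustration with one continuous and one categorical variable, not the package's implementation: the function simMixedSketch, its arguments mus and catProbs, and the chosen default values are all hypothetical.

```r
# Hypothetical sketch (not the package's code): a two-cluster generator with
# one continuous variable (normal mixture) and one categorical variable
# (multinomial mixture).
simMixedSketch <- function(n, popProportions = c(0.5, 0.5),
                           mus = c(-2, 2),
                           catProbs = rbind(c(0.7, 0.1, 0.1, 0.1),
                                            c(0.1, 0.1, 0.1, 0.7))) {
  stopifnot(length(popProportions) == 2,
            abs(sum(popProportions) - 1) < 1e-8)

  # Latent cluster membership, drawn with the prior proportions.
  id <- sample(1:2, size = n, replace = TRUE, prob = popProportions)

  # Continuous variable: unit-variance normal centered at the cluster mean.
  conVar <- rnorm(n, mean = mus[id], sd = 1)

  # Categorical variable: one draw per observation from the probability
  # vector of that observation's cluster.
  catDraw <- vapply(id, function(k) {
    sample(ncol(catProbs), size = 1, prob = catProbs[k, ])
  }, integer(1))

  list(trueID = id,
       conVar = conVar,
       catVar = factor(catDraw, levels = seq_len(ncol(catProbs))))
}

set.seed(42)
sim <- simMixedSketch(200, popProportions = c(0.3, 0.7))
table(sim$trueID)              # cluster sizes
table(sim$catVar, sim$trueID)  # category counts by cluster
```

Moving the two rows of catProbs closer together, or the two entries of mus, makes the clusters harder to distinguish, which is the effect the overlap arguments are meant to control.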

Value

A list with the following elements:

trueID

Integer vector giving population (cluster) membership of each observation

trueMus

Mean parameters used for population (cluster) centers in the continuous variables

conVars

The continuous variables

errVariance

Variance parameter used for continuous error distribution

popProbsNoErr

Multinomial probability vectors for categorical variables without measurement error

popProbsWithErr

Multinomial probability vectors for categorical variables with measurement error

catVars

The categorical variables

Examples

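# Simulate 100 observations with 2 continuous and 2 categorical variables;
# one variable of each type is measured with error.
dat <- genMixedData(100, 2, 2, nCatLevels = 4, nConWithErr = 1, nCatWithErr = 1,
  popProportions = c(0.3, 0.7), conErrLev = 0.3, catErrLev = 0.2)
# Plot the continuous variables, colored by true cluster membership.
with(dat, plot(conVars, col = trueID))
# Cross-tabulate the categorical variables against true cluster membership.
with(dat, table(data.frame(catVars[, 1:2], trueID)))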