genMixedData: Generate simulated mixed-type data with cluster structure.


View source: R/gen_mixed_data.R

Description

This function simulates mixed-type data sets with a latent cluster structure, with continuous and nominal variables.

Usage

genMixedData(
  sampSize,
  nConVar,
  nCatVar,
  nCatLevels,
  nConWithErr,
  nCatWithErr,
  popProportions,
  conErrLev,
  catErrLev
)

Arguments

sampSize

Integer: Size of the simulated data set.

nConVar

The number of continuous variables.

nCatVar

The number of categorical variables.

nCatLevels

Integer: The number of categories per categorical variable. Currently must be a multiple of the number of populations specified in popProportions.

nConWithErr

Integer: The number of continuous variables with error.

nCatWithErr

Integer: The number of categorical variables with error.

popProportions

A vector of scalars that sums to one. The length gives the number of populations (clusters), with values denoting the prior probability of observing a member of the corresponding population. NOTE: currently only two populations are supported.

conErrLev

A scalar between 0.01 and 1 denoting the univariate overlap between clusters on the continuous variables specified to have error.

catErrLev

Univariate overlap level for the categorical variables with error.
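The constraints listed for these arguments can be collected into a quick check before calling the function. The helper below, check_gen_args, is hypothetical and not part of kamila; it is a minimal sketch that only restates the documented constraints, and it assumes catErrLev shares the 0.01 to 1 range given for conErrLev.

```r
# Hypothetical helper (not part of kamila): restates the documented
# argument constraints as assertions.
check_gen_args <- function(nCatLevels, popProportions, conErrLev, catErrLev) {
  stopifnot(
    # Only two populations are currently supported.
    length(popProportions) == 2,
    # Prior probabilities must sum to one (up to floating-point tolerance).
    isTRUE(all.equal(sum(popProportions), 1)),
    # Number of categorical levels must be a multiple of the number of populations.
    nCatLevels %% length(popProportions) == 0,
    # Overlap levels are assumed to lie in [0.01, 1] for both variable types.
    conErrLev >= 0.01, conErrLev <= 1,
    catErrLev >= 0.01, catErrLev <= 1
  )
  invisible(TRUE)
}

# Passes silently for the settings used in the Examples section.
ok <- check_gen_args(nCatLevels = 4, popProportions = c(0.3, 0.7),
                     conErrLev = 0.3, catErrLev = 0.2)
```

A call such as check_gen_args(3, c(0.3, 0.7), 0.3, 0.2) stops with an error, because 3 is not a multiple of the two populations.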

Details

This function simulates mixed-type data sets with a latent cluster structure. Continuous variables follow a normal mixture model, and categorical variables follow a multinomial mixture model. The overlap of the continuous and categorical variables (i.e. how clear the cluster structure is) can be manipulated by the user. Overlap between two clusters is the area of the overlapping region defined by their densities (or, for categorical variables, the summed height of overlapping segments defined by their point masses). The default overlap level is 0.01 (i.e. almost perfect separation). A user-specified number of continuous and categorical variables can be designated as "error variables" with an arbitrary overlap between 0.01 and 1.00 (where 1.00 corresponds to complete overlap).

NOTE: Currently, only two populations (clusters) are supported. While exact control of overlap between two clusters is straightforward, controlling the overlap between the K choose 2 pairwise combinations of clusters is a more difficult task.
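The overlap definition above can be illustrated in base R. This is a sketch of the definition only, not the code kamila uses internally: the function names are hypothetical, and the continuous case assumes two equal-variance normal densities that are not weighted by the mixing proportions.

```r
# Continuous case: area under the pointwise minimum of two normal densities.
# Assumption: equal standard deviations, densities not weighted by popProportions.
overlap_normal <- function(mu1, mu2, sd = 1) {
  f <- function(x) pmin(dnorm(x, mu1, sd), dnorm(x, mu2, sd))
  lower <- min(mu1, mu2) - 10 * sd
  upper <- max(mu1, mu2) + 10 * sd
  integrate(f, lower, upper)$value
}

# For equal variances the same area has a closed form.
overlap_normal_exact <- function(mu1, mu2, sd = 1) {
  2 * pnorm(-abs(mu1 - mu2) / (2 * sd))
}

# Categorical case: summed height of the point masses shared by both clusters.
overlap_categorical <- function(p1, p2) sum(pmin(p1, p2))

ov_con <- overlap_normal(0, 2)        # numerical integration
ov_con_exact <- overlap_normal_exact(0, 2)
ov_cat <- overlap_categorical(c(0.4, 0.4, 0.1, 0.1), c(0.1, 0.1, 0.4, 0.4))
```

Moving the two means closer together pushes the continuous overlap toward 1, and identical probability vectors give a categorical overlap of exactly 1, matching the statement that 1.00 corresponds to complete overlap.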

Value

A list with the following elements:

trueID

Integer vector giving population (cluster) membership of each observation

trueMus

Mean parameters used for population (cluster) centers in the continuous variables

conVars

The continuous variables

errVariance

Variance parameter used for continuous error distribution

popProbsNoErr

Multinomial probability vectors for categorical variables without measurement error

popProbsWithErr

Multinomial probability vectors for categorical variables with measurement error

catVars

The categorical variables

Examples

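library(kamila)
# Simulate 100 observations from two clusters with one error variable of each type.
dat <- genMixedData(100, 2, 2, nCatLevels = 4, nConWithErr = 1, nCatWithErr = 1,
  popProportions = c(0.3, 0.7), conErrLev = 0.3, catErrLev = 0.2)
# Plot the continuous variables, colored by true cluster membership.
with(dat, plot(conVars, col = trueID))
# Cross-tabulate the two categorical variables by true cluster membership.
with(dat, table(data.frame(catVars[, 1:2], trueID, stringsAsFactors = TRUE)))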

Example output

, , trueID = 1

   X2
X1   1  2  3  4
  1 10  6  2  0
  2  6  6  0  0
  3  0  0  0  0
  4  0  0  0  0

, , trueID = 2

   X2
X1   1  2  3  4
  1  0  0  0  0
  2  0  0  0  0
  3  1  2 12 21
  4  3  3 14 14

kamila documentation built on March 13, 2020, 9:08 a.m.