genMixedData: Generate simulated mixed-type data with cluster structure.
In kamila: Methods for Clustering Mixed-Type Data


View source: R/gen_mixed_data.R

This function simulates mixed-type data sets with a latent cluster structure, with continuous and nominal variables.

genMixedData(
  sampSize,
  nConVar,
  nCatVar,
  nCatLevels,
  nConWithErr,
  nCatWithErr,
  popProportions,
  conErrLev,
  catErrLev
)

`sampSize`	Integer: Size of the simulated data set.
`nConVar`	The number of continuous variables.
`nCatVar`	The number of categorical variables.
`nCatLevels`	Integer: The number of categories per categorical variable. Currently must be a multiple of the number of populations specified in `popProportions`.
`nConWithErr`	Integer: The number of continuous variables with error.
`nCatWithErr`	Integer: The number of categorical variables with error.
`popProportions`	A vector of scalars that sums to one. The length gives the number of populations (clusters), with values denoting the prior probability of observing a member of the corresponding population. NOTE: currently only two populations are supported.
`conErrLev`	A scalar between 0.01 and 1 denoting the univariate overlap between clusters on the continuous variables specified to have error.
`catErrLev`	Univariate overlap level for the categorical variables with error.
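
The constraints listed above can be checked before calling `genMixedData`. The following is a minimal sketch in base R; `checkGenArgs` is a hypothetical helper, not part of kamila, and the 0.01 to 1 range applied to `catErrLev` is an assumption taken from the overlap range described for error variables. The function itself may perform different or additional checks.

```r
# Hypothetical helper (not part of kamila): collect violations of the
# documented argument constraints and return them as a character vector.
checkGenArgs <- function(nCatLevels, popProportions, conErrLev, catErrLev) {
  nPop <- length(popProportions)
  problems <- character(0)
  # Proportions are prior probabilities, so they must sum to one.
  if (!isTRUE(all.equal(sum(popProportions), 1))) {
    problems <- c(problems, "popProportions must sum to one")
  }
  # Only two populations (clusters) are currently supported.
  if (nPop != 2) {
    problems <- c(problems, "only two populations are currently supported")
  }
  # The number of levels must be a multiple of the number of populations.
  if (nPop > 0 && nCatLevels %% nPop != 0) {
    problems <- c(problems,
                  "nCatLevels must be a multiple of length(popProportions)")
  }
  # Overlap levels are assumed to lie between 0.01 and 1.
  if (conErrLev < 0.01 || conErrLev > 1) {
    problems <- c(problems, "conErrLev must lie between 0.01 and 1")
  }
  if (catErrLev < 0.01 || catErrLev > 1) {
    problems <- c(problems, "catErrLev must lie between 0.01 and 1")
  }
  problems
}

# An empty result means no violations were found.
checkGenArgs(nCatLevels = 4, popProportions = c(0.3, 0.7),
             conErrLev = 0.3, catErrLev = 0.2)
```

Returning the list of problems rather than stopping at the first one makes it easier to fix several arguments in one pass.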

This function simulates mixed-type data sets with a latent cluster structure. Continuous variables follow a normal mixture model, and categorical variables follow a multinomial mixture model. The user can control the overlap of the continuous and categorical variables (i.e. how clear the cluster structure is). Overlap between two clusters is the area of the overlapping region defined by their densities (or, for categorical variables, the summed height of overlapping segments defined by their point masses). The default overlap level is 0.01 (i.e. almost perfect separation). The user can designate a number of continuous and categorical variables as "error variables", which can be given any overlap between 0.01 and 1.00 (where 1.00 corresponds to complete overlap). NOTE: Currently, only two populations (clusters) are supported. While exact control of overlap between two clusters is straightforward, controlling the overlap between the K choose 2 pairwise combinations of clusters is a more difficult task.
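
The overlap definition above can be illustrated in base R. This is a sketch under stated assumptions, not kamila's internal code: the two normal components are assumed to share a common standard deviation, the densities are not weighted by the population proportions, and the names `normalOverlap`, `numericOverlap`, `separationForOverlap`, and `categoricalOverlap` are hypothetical.

```r
# Two normal densities with a common sd cross midway between their means,
# so the shared area is 2 * pnorm(-|mu1 - mu2| / (2 * sigma)).
normalOverlap <- function(mu1, mu2, sigma) {
  2 * pnorm(-abs(mu1 - mu2) / (2 * sigma))
}

# Numerical cross-check: integrate the pointwise minimum of the two densities
# over a range wide enough to cover both components.
numericOverlap <- function(mu1, mu2, sigma) {
  f <- function(x) pmin(dnorm(x, mu1, sigma), dnorm(x, mu2, sigma))
  integrate(f, lower = min(mu1, mu2) - 10 * sigma,
            upper = max(mu1, mu2) + 10 * sigma)$value
}

# Inverse of normalOverlap: distance between means that yields a target overlap.
separationForOverlap <- function(overlap, sigma = 1) {
  -2 * sigma * qnorm(overlap / 2)
}

# Categorical case: sum, over levels, of the smaller of the two point masses.
categoricalOverlap <- function(p, q) {
  sum(pmin(p, q))
}

normalOverlap(0, 0, 1)                         # identical components: overlap 1
delta <- separationForOverlap(0.3)             # separation for overlap 0.3
c(closedForm = normalOverlap(0, delta, 1),
  numeric    = numericOverlap(0, delta, 1))    # the two values should agree
categoricalOverlap(c(0.4, 0.4, 0.1, 0.1),
                   c(0.1, 0.1, 0.4, 0.4))      # 0.1 shared at each of 4 levels
```

Under these assumptions, an overlap of 1 means the two components coincide, while an overlap near 0.01 needs means several standard deviations apart.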

A list with the following elements:

`trueID`	Integer vector giving population (cluster) membership of each observation
`trueMus`	Mean parameters used for population (cluster) centers in the continuous variables
`conVars`	The continuous variables
`errVariance`	Variance parameter used for continuous error distribution
`popProbsNoErr`	Multinomial probability vectors for categorical variables without measurement error
`popProbsWithErr`	Multinomial probability vectors for categorical variables with measurement error
`catVars`	The categorical variables

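library(kamila)

# Simulate 100 observations from two clusters (proportions 0.3 and 0.7) with
# 2 continuous and 2 categorical variables, one of each having error.
dat <- genMixedData(100, 2, 2, nCatLevels = 4, nConWithErr = 1, nCatWithErr = 1,
  popProportions = c(0.3, 0.7), conErrLev = 0.3, catErrLev = 0.2)
# Plot the continuous variables, colored by true cluster membership.
with(dat, plot(conVars, col = trueID))
# Cross-tabulate the two categorical variables within each true cluster.
with(dat, table(data.frame(catVars[, 1:2], trueID, stringsAsFactors = TRUE)))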

, , trueID = 1

   X2
X1   1  2  3  4
  1 10  6  2  0
  2  6  6  0  0
  3  0  0  0  0
  4  0  0  0  0

, , trueID = 2

   X2
X1   1  2  3  4
  1  0  0  0  0
  2  0  0  0  0
  3  1  2 12 21
  4  3  3 14 14