confactord: Mixed-type Data Generation with True Membership Labels
In kdml: Kernel Distance Metric Learning for Mixed-Type Data

confactord

R Documentation

Mixed-type Data Generation with True Membership Labels

Description

This function generates a mixed-type data frame with a combination of continuous (numeric), nominal (factor), and ordinal (ordered) variables with prespecified cluster overlap for each variable type. confactord allows the user to specify the number of each variable type, the amount of variables per variable type that have cluster overlap, the amount of cluster overlap for each variable type, the number of levels for the nominal and ordinal variables, and proportion of observations per class membership. Within and across-type variables are generated independently from one another. Currently, only two classes are may be generated.

Usage

confactord(n = 200, 
            popProb = c(0.5,0.5), 
            numMixVar = c(1,1,1), 
            numMixVarOl = c(1,1,1),  
            olVarType = c(0.1,0.1,0.1), 
            catLevels = c(2,4))

Arguments

`n`	integer number of observations to be generated. Defaults to `n = 200`
`popProb`	numeric vector of length two specifying the proportion of observations allocated to each class membership, which must sum to one. Defaults to `popProb = c(0.5, 0.5)`.
`numMixVar`	numeric vector of integers of length three specifying (in order) the total number of continuous (numeric), nominal (factor), and ordinal (ordered) variables to be generated. If a specific variable type is not required, set the appropriate vector indice to zero. Defaults to `numMixVar = c(1,1,1)`.
`numMixVarOl`	numeric vector of integers of length three specifying (in order) the total number of continuous (numeric), nominal (factor), and ordinal (ordered) variables that will have class membership overlap. If all variables are to be well-separated by class membership, set all indices to zero. No indice of this vector may be greater than the corresponding indice in `numMixVar`. Defaults to `numMixVarOl = c(1,1,1)`.
`olVarType`	numeric vector of length three specifying (in order) the percentage of class membership overlap to be applied to the continuous (numeric), nominal (factor), and ordinal (ordered) No argument required if `numMixVarOl = c(0,0,0)`. Permissible class membership overlap per variable type is between 0.01 and 0.99. Defaults to ten percent overlap per variable type, `olVarType = c(0.1,0.1,0.1)`.
`catLevels`	numeric vector of length two specifying (in order) the number of levels (integer values) for each of the nominal (factor) and ordinal (ordered) variable types. Defaults to `catLevels = c(2,4)`.

Details

Continuous variables are generated independently from normal distributions, with means determined by true class membership. If overlap is specified, additional variance is introduced to simulate cluster overlap. Nominal variables are generated using Dirichlet distributions representing different population proportions. Ordinal variables are initially simulated as continuous variables and then discretized into ordered categories based on quantile distributions, similar to a latent class model where ordinal categories are inferred based on underlying continuous distributions and adjusted for cluster overlap parameters.

Value

confactord returns a list object, with the following components:

`data`	a `data.frame` of mixed variable types based on user- specified parameters
`class`	a numeric vector of integers specifying the true class memberships for the returned `data` data frame

Author(s)

John R. J. Thompson john.thompson@ubc.ca, Jesse S. Ghashti jesse.ghashti@ubc.ca

Examples

# EXAMPLE1: Default implementation generates the following
# 200 observations split into two clusters of equal size (100 observations each) 
# Three variables-- one of each numeric, factor, and ordered
# Each variable has ten percent cluster overlap
# Nominal variable is binary
# Ordinal variable has four levels

df1 <- confactord()


# EXAMPLE2: 
# 500 observations; 100 observations in cluster one and 400 in cluster two 
# Three continuous variables, two nominal, one ordinal
# Only one continuous variable has cluster overlap
# All nominal and ordinal variables have cluster overlap
# Cluster overlap for continuous variable is twenty percent
# Cluster overlap for nominal variables are thirty percent
# Cluster overlap for ordinal variable is fourty percent
# Nominal variable has three levels, while ordinal has 5

df2 <- confactord(n = 500, 
                    popProb = c(0.2,0.8), 
                    numMixVar = c(3,2,1), 
                    numMixVarOl = c(1,2,1),  
                    olVarType = c(0.2,0.3,0.4), 
                    catLevels = c(3,5))

kdml documentation built on Sept. 21, 2024, 9:06 a.m.