confactord: Mixed-type Data Generation with True Membership Labels


confactord R Documentation

Mixed-type Data Generation with True Membership Labels

Description

This function generates a mixed-type data frame containing a combination of continuous (numeric), nominal (factor), and ordinal (ordered) variables, with prespecified cluster overlap for each variable type. confactord allows the user to specify the number of variables of each type, the number of variables per type that have cluster overlap, the amount of cluster overlap for each variable type, the number of levels for the nominal and ordinal variables, and the proportion of observations in each class. Variables are generated independently of one another, both within and across types. Currently, only two classes may be generated.

Usage

confactord(n = 200, 
            popProb = c(0.5,0.5), 
            numMixVar = c(1,1,1), 
            numMixVarOl = c(1,1,1),  
            olVarType = c(0.1,0.1,0.1), 
            catLevels = c(2,4))

Arguments

n

integer number of observations to be generated. Defaults to n = 200.

popProb

numeric vector of length two specifying the proportion of observations allocated to each class membership, which must sum to one. Defaults to popProb = c(0.5, 0.5).

numMixVar

numeric vector of integers of length three specifying (in order) the total number of continuous (numeric), nominal (factor), and ordinal (ordered) variables to be generated. If a specific variable type is not required, set the corresponding vector element to zero. Defaults to numMixVar = c(1,1,1).

numMixVarOl

numeric vector of integers of length three specifying (in order) the number of continuous (numeric), nominal (factor), and ordinal (ordered) variables that will have class membership overlap. If all variables are to be well-separated by class membership, set all elements to zero. No element of this vector may be greater than the corresponding element of numMixVar. Defaults to numMixVarOl = c(1,1,1).

olVarType

numeric vector of length three specifying (in order) the proportion of class membership overlap to be applied to the continuous (numeric), nominal (factor), and ordinal (ordered) variables. No argument is required if numMixVarOl = c(0,0,0). Permissible class membership overlap per variable type is between 0.01 and 0.99. Defaults to ten percent overlap per variable type, olVarType = c(0.1,0.1,0.1).

catLevels

numeric vector of length two specifying (in order) the number of levels (integer values) for each of the nominal (factor) and ordinal (ordered) variable types. Defaults to catLevels = c(2,4).
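The constraints described for these arguments can be checked before calling the function. The following is a minimal sketch in base R; check_confactord_args is a hypothetical helper written for illustration and is not part of kdml, and confactord may perform its own (possibly different) validation.

```r
# Hypothetical helper (not part of kdml): collects violations of the
# argument constraints stated in this help page. Returns character(0)
# when no problem is found.
check_confactord_args <- function(n = 200,
                                  popProb = c(0.5, 0.5),
                                  numMixVar = c(1, 1, 1),
                                  numMixVarOl = c(1, 1, 1),
                                  olVarType = c(0.1, 0.1, 0.1),
                                  catLevels = c(2, 4)) {
  problems <- character(0)

  if (length(n) != 1 || n <= 0 || n != round(n)) {
    problems <- c(problems, "n must be a single positive integer")
  }
  if (length(popProb) != 2 || abs(sum(popProb) - 1) > 1e-8) {
    problems <- c(problems, "popProb must have length two and sum to one")
  }
  if (length(numMixVar) != 3 || length(numMixVarOl) != 3) {
    problems <- c(problems, "numMixVar and numMixVarOl must have length three")
  } else if (any(numMixVarOl > numMixVar)) {
    problems <- c(problems, "numMixVarOl may not exceed numMixVar in any position")
  }
  # olVarType only matters when at least one variable has overlap
  if (any(numMixVarOl > 0) &&
      (length(olVarType) != 3 || any(olVarType < 0.01 | olVarType > 0.99))) {
    problems <- c(problems, "olVarType must have three values between 0.01 and 0.99")
  }
  if (length(catLevels) != 2) {
    problems <- c(problems, "catLevels must have length two")
  }
  problems
}

# Requests more overlapping nominal variables (3) than nominal variables (2)
check_confactord_args(numMixVar = c(3, 2, 1), numMixVarOl = c(1, 3, 1))
```

Returning the messages instead of stopping at the first one makes it easier to report every problem with a parameter grid in a simulation study.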

Details

Continuous variables are generated independently from normal distributions, with means determined by true class membership. If overlap is specified, additional variance is introduced to simulate cluster overlap. Nominal variables are generated using Dirichlet distributions representing different population proportions. Ordinal variables are initially simulated as continuous variables and then discretized into ordered categories based on quantiles. This is similar to a latent class model, in which ordinal categories are inferred from underlying continuous distributions, and the discretization is adjusted for the cluster overlap parameters.
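The three mechanisms described above can be illustrated with a short base R sketch. This is not the package's implementation: the class means (0 and 4), the way overlap inflates the standard deviation, and the Dirichlet concentration parameters are all assumptions chosen only to make the idea concrete.

```r
set.seed(1)                        # for reproducibility of the sketch only
n       <- 200
popProb <- c(0.5, 0.5)
overlap <- 0.1                     # assumed to inflate the spread of each class

# True class labels, allocated by the requested proportions
cls <- rep(1:2, times = round(n * popProb))

# Continuous: normal draws with a class-dependent mean (assumed means 0 and 4)
x_cont <- rnorm(n, mean = c(0, 4)[cls], sd = 1 + overlap)

# Ordinal: simulate a latent continuous variable, then cut it at its
# empirical quantiles into four ordered levels
latent <- rnorm(n, mean = c(0, 4)[cls], sd = 1 + overlap)
breaks <- quantile(latent, probs = seq(0, 1, length.out = 5))
x_ord  <- cut(latent, breaks = breaks, include.lowest = TRUE,
              labels = paste0("L", 1:4), ordered_result = TRUE)

# Nominal: one Dirichlet draw of level probabilities per class
# (a Dirichlet draw is a vector of gamma draws divided by their sum)
rdirichlet1 <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha)
  g / sum(g)
}
levels_nom <- c("a", "b")
p_by_class <- rbind(rdirichlet1(c(5, 1)), rdirichlet1(c(1, 5)))
x_nom <- factor(
  vapply(cls, function(k) sample(levels_nom, 1, prob = p_by_class[k, ]),
         character(1)),
  levels = levels_nom
)

sim <- data.frame(x_cont, x_nom, x_ord)
str(sim)
```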

Value

confactord returns a list object, with the following components:

data

a data.frame of mixed variable types based on user-specified parameters

class

a numeric vector of integers specifying the true class membership of each row of the returned data component
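As a brief illustration of working with the returned list, the sketch below inspects both components of a default call. It assumes the kdml package is installed and is skipped otherwise; the consistency check assumes class holds one label per row of data, as the descriptions above suggest.

```r
# Runs only if kdml is available
if (requireNamespace("kdml", quietly = TRUE)) {
  out <- kdml::confactord()

  str(out$data)       # column types of the generated variables
  table(out$class)    # number of observations per true class

  # Assumed from the component descriptions: one class label per row
  stopifnot(length(out$class) == nrow(out$data))
}
```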

Author(s)

John R. J. Thompson john.thompson@ubc.ca, Jesse S. Ghashti jesse.ghashti@ubc.ca

See Also

mscv.dkss, mscv.dkps, dkss, dkps

Examples

# EXAMPLE1: Default implementation generates the following
# 200 observations split into two clusters of equal size (100 observations each) 
# Three variables: one each of numeric, factor, and ordered
# Each variable has ten percent cluster overlap
# Nominal variable is binary
# Ordinal variable has four levels

df1 <- confactord()


# EXAMPLE2: 
# 500 observations; 100 observations in cluster one and 400 in cluster two 
# Three continuous variables, two nominal, one ordinal
# Only one continuous variable has cluster overlap
# All nominal and ordinal variables have cluster overlap
# Cluster overlap for continuous variable is twenty percent
# Cluster overlap for nominal variables is thirty percent
# Cluster overlap for ordinal variable is forty percent
# Nominal variables have three levels, while the ordinal variable has five

df2 <- confactord(n = 500, 
                    popProb = c(0.2,0.8), 
                    numMixVar = c(3,2,1), 
                    numMixVarOl = c(1,2,1),  
                    olVarType = c(0.2,0.3,0.4), 
                    catLevels = c(3,5))

kdml documentation built on Sept. 21, 2024, 9:06 a.m.