KAMILA is an iterative clustering method that equitably balances the contribution of continuous and categorical variables.
conVar 
A data frame of continuous variables. 
catFactor 
A data frame of factors. 
numClust 
The number of clusters returned by the algorithm. 
numInit 
The number of initializations used. 
conWeights 
A vector of continuous weights for the continuous variables. 
catWeights 
A vector of continuous weights for the categorical variables. 
maxIter 
The maximum number of iterations in each run. 
conInitMethod 
Character: The method used to initialize each run. 
catBw 
The bandwidth used for the categorical kernel. 
verbose 
Logical: Whether detailed results should be printed and returned. 
calcNumClust 
Character: Method for selecting the number of clusters. 
numPredStrCvRun 
Numeric: Number of cross-validation runs for the prediction strength method. Ignored unless calcNumClust == 'ps'.
predStrThresh 
Numeric: Threshold for the prediction strength method. Ignored unless calcNumClust == 'ps'.
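The three prediction strength arguments are meant to be used together. The following is a minimal sketch of such a call, not a definitive recipe: it assumes that numClust accepts a vector of candidate cluster counts when calcNumClust == 'ps', and the candidate range 2:4 and the threshold 0.5 are arbitrary illustration values. The whole block is guarded so that it does nothing when the kamila package is not installed.

```r
# Sketch only: passing a vector of candidates to numClust is an assumption,
# and 2:4 and 0.5 are arbitrary values chosen for illustration.
if (requireNamespace("kamila", quietly = TRUE)) {
  library(kamila)
  set.seed(1)
  # Same toy data as in the Examples section of this page.
  dat <- genMixedData(200, nConVar = 2, nCatVar = 2, nCatLevels = 4,
    nConWithErr = 2, nCatWithErr = 2, popProportions = c(.5, .5),
    conErrLev = 0.3, catErrLev = 0.8)
  catDf <- data.frame(apply(dat$catVars, 2, factor), stringsAsFactors = TRUE)
  conDf <- data.frame(scale(dat$conVars), stringsAsFactors = TRUE)
  # Request prediction strength selection over the candidate cluster counts.
  psRes <- kamila(conDf, catDf, numClust = 2:4, numInit = 10,
    calcNumClust = "ps", numPredStrCvRun = 10, predStrThresh = 0.5)
  # nClust holds the results of selecting the number of clusters.
  str(psRes$nClust)
}
```

Inspecting the nClust element with str() avoids relying on any particular field names inside it.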
KAMILA (KAy-means for MIxed LArge data sets) is an iterative clustering method that equitably balances the contribution of the continuous and categorical variables. It uses a kernel density estimation technique to flexibly model spherical clusters in the continuous domain, and uses a multinomial model in the categorical domain.
Weighting scheme: If no weights are desired, set all weights to 1 (the default setting). Let a_1, ..., a_p denote the weights for p continuous variables, and let b_1, ..., b_q denote the weights for q categorical variables. Currently, continuous weights are applied during the calculation of Euclidean distance. Categorical weights are applied to the log-likelihoods obtained from the level probabilities given cluster membership, yielding logLikCat_k for the kth cluster. The total log-likelihood for the kth cluster is obtained by weighting the single continuous log-likelihood by the mean of all continuous weights and adding logLikCat_k.
Weights between 0 and 1 are admissible: a weight of zero completely removes a variable's influence on the clustering, and a weight of 1 leaves a variable's contribution unchanged. Weights between 0 and 1 may not be comparable across continuous and categorical variables.
Estimating the number of clusters: The default is no estimation method. Setting calcNumClust to 'ps' uses the prediction strength method of Tibshirani & Walther (J. of Comp. and Graphical Stats. 14(3), 2005). There is no perfect method for estimating the number of clusters; prediction strength tends to give a smaller number than, say, BIC-based methods for large sample sizes. The user must specify the number of cross-validation runs and the threshold for determining the number of clusters. The smaller the threshold, the larger the number of clusters selected.
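To make the weighting scheme concrete, here is a small base R sketch of one plausible reading of it. The helper names weightedDist, weightedLogLikCat and totalLogLik are hypothetical and are not part of the kamila package. The sketch assumes that each continuous weight a_i scales its coordinate difference before squaring; the package internals may differ in detail.

```r
# Hypothetical helpers illustrating the weighting scheme; not kamila functions.

# Weighted Euclidean distance (assumption: a_i scales each coordinate
# difference before squaring).
weightedDist <- function(x, y, a) {
  sqrt(sum((a * (x - y))^2))
}

# Weighted categorical log-likelihood for one cluster. probs[i] is the
# probability of the observed level of categorical variable i given the
# cluster, and b[i] is that variable's weight.
weightedLogLikCat <- function(probs, b) {
  sum(b * log(probs))
}

# Total log-likelihood for one cluster: the single continuous log-likelihood
# is scaled by the mean continuous weight, then logLikCat_k is added.
totalLogLik <- function(logLikCon, a, logLikCat) {
  mean(a) * logLikCon + logLikCat
}

x <- c(1, 2)
y <- c(4, 6)
weightedDist(x, y, a = c(1, 1))  # unit weights: ordinary Euclidean distance, 5
weightedDist(x, y, a = c(1, 0))  # zero weight drops the second variable, 3
```

With all weights equal to 1, the three helpers reduce to the unweighted distance and the plain sum of log-likelihoods, which matches the statement that a weight of 1 leaves a variable's contribution unchanged.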
A list with the following results objects:
finalMemb 
A numeric vector with cluster assignment indicated by integer. 
numIter 

finalLogLik 
The pseudo log-likelihood of the returned clustering. 
finalObj 

finalCenters 

finalProbs 

input 
Object with the given input parameter values. 
nClust 
An object describing the results of selecting the number of clusters, empty if calcNumClust == 'none'. 
verbose 
An optionally returned object with more detailed information. 
Foss A, Markatou M (2018). kamila: Clustering Mixed-Type Data in R and Hadoop. Journal of Statistical Software, 83(13). doi: 10.18637/jss.v083.i13
# Generate toy data set with poor quality categorical variables and good
# quality continuous variables.
library(kamila)
set.seed(1)
dat <- genMixedData(200, nConVar = 2, nCatVar = 2, nCatLevels = 4,
  nConWithErr = 2, nCatWithErr = 2, popProportions = c(.5, .5),
  conErrLev = 0.3, catErrLev = 0.8)
catDf <- data.frame(apply(dat$catVars, 2, factor), stringsAsFactors = TRUE)
conDf <- data.frame(scale(dat$conVars), stringsAsFactors = TRUE)
# Cluster into two groups using ten random initializations.
kamRes <- kamila(conDf, catDf, numClust = 2, numInit = 10)
# Cross-tabulate the estimated memberships against the true cluster labels.
table(kamRes$finalMemb, dat$trueID)

     1  2
  1 10 91
  2 92  7
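Cluster labels are arbitrary, so the large off-diagonal counts in the table above indicate good agreement rather than poor agreement. As a small illustration, the following base R sketch re-enters the printed counts by hand and computes the proportion of observations grouped consistently with the true labels under the better of the two possible label matchings. The variable names are introduced here for illustration only.

```r
# Counts copied by hand from the table above; rows are estimated clusters
# (kamRes$finalMemb) and columns are true labels (dat$trueID).
tab <- matrix(c(10, 92, 91, 7), nrow = 2,
  dimnames = list(cluster = 1:2, truth = 1:2))

# Agreement if cluster 1 is matched to true group 1 and cluster 2 to group 2.
agreeDirect <- sum(diag(tab))
# Agreement if the two cluster labels are swapped.
agreeSwapped <- sum(tab[cbind(1:2, 2:1)])

# Proportion correctly grouped under the better matching.
max(agreeDirect, agreeSwapped) / sum(tab)  # 183 / 200 = 0.915
```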