KAMILA is an iterative clustering method that equitably balances the contribution of continuous and categorical variables.
conVar | A data frame of continuous variables.
catFactor | A data frame of factors.
numClust | The number of clusters returned by the algorithm.
numInit | The number of initializations used.
conWeights | A vector of continuous weights for the continuous variables.
catWeights | A vector of continuous weights for the categorical variables.
maxIter | The maximum number of iterations in each run.
conInitMethod | Character: The method used to initialize each run.
catBw | The bandwidth used for the categorical kernel.
verbose | Logical: Whether detailed results should be printed and returned.
calcNumClust | Character: Method for selecting the number of clusters.
numPredStrCvRun | Numeric: Number of CV runs for prediction strength method. Ignored unless calcNumClust == 'ps'.
predStrThresh | Numeric: Threshold for prediction strength method. Ignored unless calcNumClust == 'ps'.
KAMILA (KAy-means for MIxed LArge data sets) is an iterative clustering method that equitably balances the contribution of the continuous and categorical variables. It uses a kernel density estimation technique to flexibly model spherical clusters in the continuous domain, and uses a multinomial model in the categorical domain.
Weighting scheme: If no weights are desired, set all weights to 1 (the default setting). Let a_1, ..., a_p denote the weights for the p continuous variables, and let b_1, ..., b_q denote the weights for the q categorical variables. Currently, continuous weights are applied during the calculation of Euclidean distance. Categorical weights are applied to the log-likelihoods obtained from the level probabilities given cluster membership; the resulting weighted categorical log-likelihood for the kth cluster is denoted logLikCat_k. The total log-likelihood for the kth cluster is obtained by weighting the single continuous log-likelihood by the mean of all continuous weights and adding logLikCat_k.
Weights between 0 and 1 are admissible: a weight of 0 completely removes a variable's influence on the clustering, and a weight of 1 leaves its contribution unchanged. Weights strictly between 0 and 1 may not be comparable across continuous and categorical variables.
Estimating the number of clusters: By default, no estimation method is used. Setting calcNumClust to 'ps' uses the prediction strength method of Tibshirani & Walther (J. of Comp. and Graphical Stats. 14(3), 2005). There is no perfect method for estimating the number of clusters; prediction strength (PS) tends to give a smaller number than, say, BIC-based methods for large sample sizes. The user must specify the number of cross-validation runs (numPredStrCvRun) and the threshold for determining the number of clusters (predStrThresh). The smaller the threshold, the larger the number of clusters selected.
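The weighting scheme described above can be sketched in a few lines of base R. This is only an illustration of the prose, not the package's internal code: the helper names weightedDist, weightedCatLogLik and totalLogLik are hypothetical, and the sketch assumes that each continuous weight scales its coordinate difference before squaring.

```r
# Hypothetical helpers that mirror the prose; not taken from the kamila source.

# Weighted Euclidean distance between a point x and a centroid y.
# Assumption: each weight a_j scales the coordinate difference before squaring.
weightedDist <- function(x, y, a) {
  sqrt(sum((a * (x - y))^2))
}

# Weighted categorical log-likelihood for one cluster.
# levProbs[j] is P(observed level of variable j | cluster k).
weightedCatLogLik <- function(levProbs, b) {
  sum(b * log(levProbs))
}

# Total log-likelihood for cluster k: the single continuous log-likelihood is
# scaled by the mean continuous weight, then the categorical part is added.
totalLogLik <- function(logLikCon, logLikCat, a) {
  mean(a) * logLikCon + logLikCat
}

a <- c(1, 0.5)   # continuous weights
b <- c(1, 0)     # a weight of 0 switches the second categorical variable off
x <- c(1, 2)
y <- c(0, 0)

d     <- weightedDist(x, y, a)                # sqrt(1^2 + (0.5 * 2)^2)
llCat <- weightedCatLogLik(c(0.5, 0.1), b)    # only the first variable contributes
ll    <- totalLogLik(logLikCon = -2, logLikCat = llCat, a = a)
```

With b = c(1, 0), the second level probability has no effect on llCat, which matches the statement that a zero weight removes a variable's influence.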
A list with the following results objects:
finalMemb | A numeric vector with cluster assignment indicated by integer.
numIter |
finalLogLik | The pseudo log-likelihood of the returned clustering.
finalObj |
finalCenters |
finalProbs |
input | Object with the given input parameter values.
nClust | An object describing the results of selecting the number of clusters, empty if calcNumClust == 'none'.
verbose | An optionally returned object with more detailed information.
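As a rough illustration of the returned list, the sketch below runs kamila with the prediction strength method and inspects a few of the components listed above. It reuses the toy data from the Examples. Passing a vector of candidate values to numClust when calcNumClust = 'ps' is an assumption here, as are the particular values chosen for numPredStrCvRun and predStrThresh; str() is used so that no field names inside nClust need to be assumed.

```r
library(kamila)

set.seed(1)
dat <- genMixedData(200, nConVar = 2, nCatVar = 2, nCatLevels = 4,
  nConWithErr = 2, nCatWithErr = 2, popProportions = c(.5, .5),
  conErrLev = 0.3, catErrLev = 0.8)
catDf <- data.frame(apply(dat$catVars, 2, factor), stringsAsFactors = TRUE)
conDf <- data.frame(scale(dat$conVars))

# Assumption: numClust may hold several candidate values when calcNumClust = "ps".
kamPs <- kamila(conDf, catDf, numClust = 2:4, numInit = 10,
  calcNumClust = "ps", numPredStrCvRun = 5, predStrThresh = 0.8)

names(kamPs)            # component names of the returned list
table(kamPs$finalMemb)  # cluster sizes from the integer membership vector
kamPs$finalLogLik       # pseudo log-likelihood of the returned clustering
str(kamPs$nClust)       # results of the cluster-number selection
```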
Foss A, Markatou M; kamila: Clustering Mixed-Type Data in R and Hadoop. Journal of Statistical Software, 83(13). 2018. doi: 10.18637/jss.v083.i13
# Generate toy data set with poor quality categorical variables and good
# quality continuous variables.
set.seed(1)
dat <- genMixedData(200, nConVar = 2, nCatVar = 2, nCatLevels = 4,
  nConWithErr = 2, nCatWithErr = 2, popProportions = c(.5, .5),
  conErrLev = 0.3, catErrLev = 0.8)
catDf <- data.frame(apply(dat$catVars, 2, factor), stringsAsFactors = TRUE)
conDf <- data.frame(scale(dat$conVars), stringsAsFactors = TRUE)
kamRes <- kamila(conDf, catDf, numClust = 2, numInit = 10)
table(kamRes$finalMemb, dat$trueID)

     1  2
  1 10 91
  2 92  7
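Because the categorical variables in this toy data set are of poor quality, one might try down-weighting them through catWeights. The following is a minimal sketch of that idea; the weight value 0.2 is an arbitrary choice for illustration, and whether it improves agreement with dat$trueID is not claimed here.

```r
library(kamila)

set.seed(1)
dat <- genMixedData(200, nConVar = 2, nCatVar = 2, nCatLevels = 4,
  nConWithErr = 2, nCatWithErr = 2, popProportions = c(.5, .5),
  conErrLev = 0.3, catErrLev = 0.8)
catDf <- data.frame(apply(dat$catVars, 2, factor), stringsAsFactors = TRUE)
conDf <- data.frame(scale(dat$conVars))

# One weight per categorical variable, each within [0, 1].
# The value 0.2 is illustrative only.
kamW <- kamila(conDf, catDf, numClust = 2, numInit = 10,
  catWeights = rep(0.2, ncol(catDf)))

# Cross-tabulate the weighted clustering against the true group labels.
tabW <- table(kamW$finalMemb, dat$trueID)
tabW
```

Comparing tabW with the unweighted table above gives a quick sense of how much the categorical variables were driving the original assignment.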