kamila: KAMILA clustering of mixed-type data.

Description

KAMILA is an iterative clustering method that equitably balances the contribution of continuous and categorical variables.

Usage

kamila(
  conVar,
  catFactor,
  numClust,
  numInit,
  conWeights = rep(1, ncol(conVar)),
  catWeights = rep(1, ncol(catFactor)),
  maxIter = 25,
  conInitMethod = "runif",
  catBw = 0.025,
  verbose = FALSE,
  calcNumClust = "none",
  numPredStrCvRun = 10,
  predStrThresh = 0.8
)

Arguments

conVar

A data frame of continuous variables.

catFactor

A data frame of factors.

numClust

The number of clusters returned by the algorithm.

numInit

The number of initializations used.

conWeights

A vector of continuous weights for the continuous variables.

catWeights

A vector of continuous weights for the categorical variables.

maxIter

The maximum number of iterations in each run.

conInitMethod

Character: The method used to initialize each run.

catBw

The bandwidth used for the categorical kernel.

verbose

Logical: Whether detailed results should be printed and returned.

calcNumClust

Character: Method for selecting the number of clusters.

numPredStrCvRun

Numeric: Number of cross-validation runs for the prediction strength method. Ignored unless calcNumClust == 'ps'.

predStrThresh

Numeric: Threshold for the prediction strength method. Ignored unless calcNumClust == 'ps'.
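The following is a minimal sketch of how the conVar and catFactor inputs might be prepared from a single mixed-type data frame using base R only. The data frame `mixed` and its column names are hypothetical, and standardizing the continuous columns simply mirrors what the Examples section does; it is not stated here as a requirement.

```r
# Toy mixed-type data frame (hypothetical columns, for illustration only).
set.seed(42)
mixed <- data.frame(
  height = rnorm(20, mean = 170, sd = 10),
  weight = rnorm(20, mean = 70, sd = 12),
  group  = sample(c("a", "b", "c"), 20, replace = TRUE),
  smoker = sample(c("yes", "no"), 20, replace = TRUE),
  stringsAsFactors = FALSE
)

# Continuous columns: keep the numeric ones and standardize them.
isCon <- vapply(mixed, is.numeric, logical(1))
conDf <- data.frame(scale(mixed[, isCon, drop = FALSE]))

# Categorical columns: convert every remaining column to a factor.
catDf <- data.frame(lapply(mixed[, !isCon, drop = FALSE], factor))

# Default weights: one weight of 1 per column, as in the Usage defaults.
conWeights <- rep(1, ncol(conDf))
catWeights <- rep(1, ncol(catDf))
```

The resulting conDf and catDf have one row per observation and can be passed as the first two arguments of kamila().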

Details

KAMILA (KAy-means for MIxed LArge data sets) is an iterative clustering method that equitably balances the contribution of the continuous and categorical variables. It uses a kernel density estimation technique to flexibly model spherical clusters in the continuous domain, and uses a multinomial model in the categorical domain.

Weighting scheme: If no weights are desired, set all weights to 1 (the default setting). Let a_1, ..., a_p denote the weights for the p continuous variables, and let b_1, ..., b_q denote the weights for the q categorical variables. Currently, continuous weights are applied during the calculation of Euclidean distance. Categorical weights are applied to the log-likelihoods obtained from the level probabilities given cluster membership, giving a weighted categorical log-likelihood logLikCat_k for the kth cluster. The total log-likelihood for the kth cluster is obtained by weighting the single continuous log-likelihood by the mean of all continuous weights and adding logLikCat_k.

Weights between 0 and 1 are admissible: a weight of zero completely removes a variable's influence on the clustering, and a weight of 1 leaves a variable's contribution unchanged. Weights between 0 and 1 may not be comparable across continuous and categorical variables.

Estimating the number of clusters: The default is no estimation method. Setting calcNumClust to 'ps' uses the prediction strength method of Tibshirani & Walther (J. of Comp. and Graphical Stats. 14(3), 2005). There is no perfect method for estimating the number of clusters; PS tends to give a smaller number than, say, BIC-based methods for large sample sizes. The user must specify the number of cross-validation runs (numPredStrCvRun) and the threshold for determining the number of clusters (predStrThresh). The smaller the threshold, the larger the number of clusters selected.
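The weighting scheme described above can be illustrated with a small base-R sketch. The helper names (weightedDist, weightedCatLogLik, totalLogLik) are hypothetical and are not part of the package. The exact form of the weighted distance is an assumption: here each weight a_j scales the coordinate difference before squaring. This is one reading of the description, not the package's internal code.

```r
# Weighted Euclidean distance between a point x and a centroid mu.
# ASSUMPTION: each weight a_j multiplies the coordinate difference
# before squaring; the package may apply the weights differently.
weightedDist <- function(x, mu, a) {
  sqrt(sum((a * (x - mu))^2))
}

# Weighted categorical log-likelihood for one observation in cluster k.
# levProbs holds P(observed level | cluster k) for each of the q
# categorical variables; b holds the categorical weights b_1, ..., b_q.
weightedCatLogLik <- function(levProbs, b) {
  sum(b * log(levProbs))
}

# Total log-likelihood for cluster k: the single continuous
# log-likelihood weighted by the mean continuous weight, plus the
# weighted categorical log-likelihood.
totalLogLik <- function(logLikCon, a, logLikCat) {
  mean(a) * logLikCon + logLikCat
}

a <- c(1, 0.5)  # continuous weights
b <- c(1, 0)    # categorical weights; the zero removes the second variable

d     <- weightedDist(c(1, 2), c(0, 0), a)
llCat <- weightedCatLogLik(c(0.5, 0.1), b)
llTot <- totalLogLik(-1.2, a, llCat)
```

With b_2 = 0, the second categorical variable contributes nothing to llCat, which matches the statement that a zero weight removes a variable's influence.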

Value

A list with the following result objects:

finalMemb

A numeric vector of cluster assignments, with each cluster indicated by an integer.

numIter

finalLogLik

The pseudo log-likelihood of the returned clustering.

finalObj

finalCenters

finalProbs

input

Object with the given input parameter values.

nClust

An object describing the results of selecting the number of clusters, empty if calcNumClust == 'none'.

verbose

An optionally returned object with more detailed information.

References

Foss A, Markatou M. kamila: Clustering Mixed-Type Data in R and Hadoop. Journal of Statistical Software, 83(13), 2018. doi: 10.18637/jss.v083.i13

Examples

# Generate toy data set with poor quality categorical variables and good
# quality continuous variables.
set.seed(1)
dat <- genMixedData(200, nConVar = 2, nCatVar = 2, nCatLevels = 4,
  nConWithErr = 2, nCatWithErr = 2, popProportions = c(.5, .5),
  conErrLev = 0.3, catErrLev = 0.8)
catDf <- data.frame(apply(dat$catVars, 2, factor), stringsAsFactors = TRUE)
conDf <- data.frame(scale(dat$conVars), stringsAsFactors = TRUE)

kamRes <- kamila(conDf, catDf, numClust = 2, numInit = 10)

table(kamRes$finalMemb, dat$trueID)

Example output

   
     1  2
  1 10 91
  2 92  7

kamila documentation built on March 13, 2020, 9:08 a.m.