gmsClust: A general implementation of Modha-Spangler clustering for...


View source: R/modha_spangler.R

Description

Modha-Spangler clustering estimates the optimal weighting for continuous vs categorical variables using a brute-force search strategy.

Usage

gmsClust(
  conData,
  catData,
  nclust,
  searchDensity = 10,
  clustFun = wkmeans,
  conDist = squaredEuc,
  catDist = squaredEuc,
  ...
)

Arguments

conData

A data frame of continuous variables.

catData

A data frame of categorical variables; the allowable variable types depend on the specific clustering function used.

nclust

An integer specifying the number of clusters.

searchDensity

An integer determining the number of distinct cluster weightings evaluated in the brute-force search.

clustFun

The clustering function to be applied.

conDist

The continuous distance function used to construct the objective function.

catDist

The categorical distance function used to construct the objective function.

...

Arguments to be passed to the clustFun.

Details

Modha-Spangler clustering uses a brute-force search strategy to estimate the optimal weighting for continuous vs categorical variables. This implementation admits an arbitrary clustering function and arbitrary objective functions for continuous and categorical variables.
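The search described above can be sketched in base R. This is an illustration under stated assumptions, not the package's internal code: toyClust, distortion, and sketchSearch are hypothetical names, the evenly spaced interior weight grid is assumed, and the objective is assumed to be the product of the per-type ratios of within-cluster to between-cluster distortion, minimized over the grid, in the spirit of Modha and Spangler (2003).

```r
# Hypothetical sketch of a Modha-Spangler style weight search (base R only).
sqEuc <- function(a, b) sum((unlist(a) - unlist(b))^2)

# Toy clustering function: weighted k-means on numeric (dummy-coded) data.
# Assumes the categorical weight is 1 - conWeight.
toyClust <- function(conData, catData, conWeight, nclust, ...) {
  con <- as.matrix(conData)
  cat <- as.matrix(catData)
  x <- cbind(sqrt(conWeight) * con, sqrt(1 - conWeight) * cat)
  cl <- stats::kmeans(x, centers = nclust, nstart = 5)$cluster
  n <- as.vector(table(cl))  # cluster sizes, ordered by label
  list(cluster = cl,
       conCenters = as.data.frame(rowsum(con, cl) / n),
       catCenters = as.data.frame(rowsum(cat, cl) / n))
}

# Sum of distances from each row to the centroid of its assigned cluster.
distortion <- function(data, centers, cluster, distFun) {
  sum(vapply(seq_len(nrow(data)), function(i) {
    distFun(data[i, , drop = FALSE], centers[cluster[i], , drop = FALSE])
  }, numeric(1)))
}

sketchSearch <- function(conData, catData, nclust, searchDensity = 10,
                         clustFun = toyClust, conDist = sqEuc, catDist = sqEuc) {
  weights <- seq_len(searchDensity) / (searchDensity + 1)  # assumed grid
  # Total distortion comes from a single-cluster fit (hence nclust = 1).
  one <- clustFun(conData, catData, conWeight = 0.5, nclust = 1)
  totCon <- distortion(conData, one$conCenters, one$cluster, conDist)
  totCat <- distortion(catData, one$catCenters, one$cluster, catDist)
  # One clustering run per candidate weight.
  runs <- lapply(weights, function(w) clustFun(conData, catData, w, nclust))
  withinDist <- function(r, data, centers, d) {
    distortion(data, r[[centers]], r$cluster, d)
  }
  wCon <- vapply(runs, withinDist, numeric(1), conData, "conCenters", conDist)
  wCat <- vapply(runs, withinDist, numeric(1), catData, "catCenters", catDist)
  Qcon <- wCon / (totCon - wCon)  # within / between, continuous part
  Qcat <- wCat / (totCat - wCat)  # within / between, categorical part
  objFun <- Qcon * Qcat
  bestInd <- which.min(objFun)
  list(results = runs[[bestInd]], objFun = objFun, Qcon = Qcon, Qcat = Qcat,
       bestInd = bestInd, weights = weights)
}

# Toy data: two groups separated on the continuous variables.
set.seed(1)
grp <- rep(1:2, each = 50)
conDf <- data.frame(x = rnorm(100, mean = 3 * grp), y = rnorm(100, mean = -3 * grp))
catDf <- data.frame(a = rbinom(100, 1, ifelse(grp == 1, 0.2, 0.8)),
                    b = rbinom(100, 1, 0.5))
out <- sketchSearch(conDf, catDf, nclust = 2, searchDensity = 5)
out$weights[out$bestInd]  # weight with the smallest objective value
```

The returned list mirrors the element names documented under Value, so the selected weight can be read off as weights[bestInd]. The cost grows linearly with searchDensity, since each candidate weight requires a full clustering run.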

The input parameter clustFun must be a function accepting the inputs (conData, catData, conWeight, nclust, ...) and returning a list containing (at least) the elements cluster, conCenters, and catCenters. The element cluster contains cluster memberships denoted by the integers 1:nclust. The elements conCenters and catCenters must be data frames whose rows denote cluster centroids. The function clustFun must allow nclust = 1, in which case conCenters and catCenters are each data frames with a single row.

The input parameters conDist and catDist are functions that must each take two data frame rows as input and return a scalar distance measure.
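As an illustration of the conDist and catDist contract, the sketch below defines a custom distance function and a small check that it returns a single number. Both manhattanDist and isScalarDist are hypothetical names introduced here, not part of kamila.

```r
# Hypothetical distance function: two data frame rows in, one scalar out.
manhattanDist <- function(row1, row2) {
  sum(abs(unlist(row1) - unlist(row2)))
}

# Hypothetical helper: does distFun return a single finite number
# when applied to two rows of the given data frame?
isScalarDist <- function(distFun, data) {
  d <- distFun(data[1, , drop = FALSE], data[2, , drop = FALSE])
  is.numeric(d) && length(d) == 1 && is.finite(d)
}

df <- data.frame(a = c(0, 1), b = c(1, 1), c = c(2, 5))
manhattanDist(df[1, ], df[2, ])  # |0 - 1| + |1 - 1| + |2 - 5| = 4
isScalarDist(manhattanDist, df)  # TRUE
```

A function of this shape could then be supplied as conDist = manhattanDist; whether a non-Euclidean distance is appropriate depends on the clustering function in use.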

Value

A list containing the following results objects:

results

A results object corresponding to the base clustering algorithm

objFun

A numeric vector of length searchDensity containing the values of the objective function for each weight used

Qcon

A numeric vector of length searchDensity containing the values of the continuous component of the objective function

Qcat

A numeric vector of length searchDensity containing the values of the categorical component of the objective function

bestInd

The index of the most successful run

weights

A numeric vector of length searchDensity containing the continuous weights used

References

Foss A, Markatou M; kamila: Clustering Mixed-Type Data in R and Hadoop. Journal of Statistical Software, 83(13). 2018. doi: 10.18637/jss.v083.i13

Modha DS, Spangler WS; Feature Weighting in k-Means Clustering. Machine Learning, 52(3). 2003. doi: 10.1023/a:1024016609528

Examples

## Not run: 
# Generate toy data set with poor quality categorical variables and good
# quality continuous variables.
set.seed(1)
dat <- genMixedData(200, nConVar=2, nCatVar=2, nCatLevels=4, nConWithErr=2,
  nCatWithErr=2, popProportions=c(.5,.5), conErrLev=0.3, catErrLev=0.8)
catDf <- dummyCodeFactorDf(data.frame(apply(dat$catVars, 2, factor), stringsAsFactors = TRUE))
conDf <- data.frame(scale(dat$conVars), stringsAsFactors = TRUE)

msRes <- gmsClust(conDf, catDf, nclust=2)

table(msRes$results$cluster, dat$trueID)

## End(Not run)

Example output

   
     1  2
  1 89 23
  2 13 75
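
The agreement implied by the table above can be checked by hand. The sketch below re-enters the four printed counts and, assuming cluster labels 1 and 2 correspond to true groups 1 and 2, computes the proportion of observations assigned to the matching group.

```r
# Counts copied from the printed table (rows: cluster, columns: true ID).
tab <- matrix(c(89, 13, 23, 75), nrow = 2,
              dimnames = list(cluster = 1:2, trueID = 1:2))
agreement <- sum(diag(tab)) / sum(tab)
agreement  # (89 + 75) / 200 = 0.82
```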

kamila documentation built on March 13, 2020, 9:08 a.m.