gmsClust: A general implementation of Modha-Spangler clustering for...
In kamila: Methods for Clustering Mixed-Type Data


View source: R/modha_spangler.R

Modha-Spangler clustering estimates the optimal weighting for continuous vs categorical variables using a brute-force search strategy.

gmsClust(
  conData,
  catData,
  nclust,
  searchDensity = 10,
  clustFun = wkmeans,
  conDist = squaredEuc,
  catDist = squaredEuc,
  ...
)

`conData`	A data frame of continuous variables.
`catData`	A data frame of categorical variables; the allowable variable types depend on the specific clustering function used.
`nclust`	An integer specifying the number of clusters.
`searchDensity`	An integer determining the number of distinct cluster weightings evaluated in the brute-force search.
`clustFun`	The clustering function to be applied.
`conDist`	The continuous distance function used to construct the objective function.
`catDist`	The categorical distance function used to construct the objective function.
`...`	Arguments to be passed to the `clustFun`.

Modha-Spangler clustering uses a brute-force search strategy to estimate the optimal weighting for continuous vs categorical variables. This implementation admits an arbitrary clustering function and arbitrary objective functions for continuous and categorical variables.
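The brute-force search can be illustrated with a minimal base-R sketch. This is not the kamila source: the function names (`sqEuc`, `distortion`, `weightedKmeans`, `searchWeights`) are hypothetical, the evenly spaced weight grid is an assumption, and the objective (the product of within-to-between distortion ratios for the two variable types, minimized over the grid) follows the criterion of Modha and Spangler as understood here.

```r
# Illustrative sketch only; all function names below are hypothetical.

# Squared Euclidean distance between two data frame rows.
sqEuc <- function(a, b) sum((unlist(a) - unlist(b))^2)

# Summed distance from each row of `dat` to its assigned row of `centers`.
distortion <- function(dat, centers, cluster, distFun) {
  sum(vapply(seq_len(nrow(dat)), function(i) {
    distFun(dat[i, ], centers[cluster[i], ])
  }, numeric(1)))
}

# Stand-in base clustering: k-means on columns scaled by sqrt(weight), so that
# squared Euclidean distance equals conWeight * d_con + (1 - conWeight) * d_cat.
weightedKmeans <- function(conData, catData, conWeight, nclust) {
  x <- cbind(sqrt(conWeight) * as.matrix(conData),
             sqrt(1 - conWeight) * as.matrix(catData))
  cl <- stats::kmeans(x, centers = nclust, nstart = 5)$cluster
  centersOf <- function(df) {
    as.data.frame(do.call(rbind, lapply(seq_len(nclust), function(k) {
      colMeans(df[cl == k, , drop = FALSE])
    })))
  }
  list(cluster = cl, conCenters = centersOf(conData),
       catCenters = centersOf(catData))
}

searchWeights <- function(conData, catData, nclust, searchDensity = 10) {
  # Assumed grid: evenly spaced weights strictly inside (0, 1).
  weights <- seq_len(searchDensity) / (searchDensity + 1)

  # Total distortion around the global centroid (a one-cluster solution).
  allOne <- rep(1L, nrow(conData))
  totCon <- distortion(conData, as.data.frame(t(colMeans(conData))), allOne, sqEuc)
  totCat <- distortion(catData, as.data.frame(t(colMeans(catData))), allOne, sqEuc)

  # One clustering per candidate weight.
  fits <- lapply(weights, function(w) weightedKmeans(conData, catData, w, nclust))

  # Within / between distortion ratio for each variable type.
  ratio <- function(within, total) within / (total - within)
  Qcon <- vapply(fits, function(f) {
    ratio(distortion(conData, f$conCenters, f$cluster, sqEuc), totCon)
  }, numeric(1))
  Qcat <- vapply(fits, function(f) {
    ratio(distortion(catData, f$catCenters, f$cluster, sqEuc), totCat)
  }, numeric(1))

  objFun <- Qcon * Qcat
  bestInd <- which.min(objFun)
  list(results = fits[[bestInd]], objFun = objFun, Qcon = Qcon, Qcat = Qcat,
       bestInd = bestInd, weights = weights)
}

# Toy data: one informative continuous variable, one noisy dummy-coded factor.
set.seed(1)
n <- 40
grp <- rep(1:2, each = n / 2)
conToy <- data.frame(x = rnorm(n, mean = 3 * grp), y = rnorm(n))
a <- rbinom(n, 1, 0.5)
catToy <- data.frame(a = a, b = 1 - a)

res <- searchWeights(conToy, catToy, nclust = 2, searchDensity = 5)
res$weights[res$bestInd]  # continuous weight selected by the search
```

The cost of the search grows linearly with `searchDensity`, since each candidate weight requires a full run of the base clustering algorithm.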

The input parameter clustFun must be a function that accepts the inputs (conData, catData, conWeight, nclust, ...) and returns a list containing (at least) the elements cluster, conCenters, and catCenters. The element cluster contains the cluster memberships, denoted by the integers 1:nclust. The elements conCenters and catCenters must be data frames whose rows are the cluster centroids. clustFun must also allow nclust = 1, in which case conCenters and catCenters are each a data frame with a single row. The input parameters conDist and catDist are functions that must each take two data frame rows as input and return a scalar distance.
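The interface requirements above can be sketched with a small user-supplied clustering function and distance function. Both `myClustFun` and `manhattanDist` are hypothetical names invented for this illustration, and the use of `stats::kmeans` on weight-scaled columns is an assumption made for the example, not a description of `wkmeans`.

```r
# Hypothetical examples; neither function name comes from kamila.

# A distance function: two data frame rows in, one number out.
manhattanDist <- function(row1, row2) sum(abs(unlist(row1) - unlist(row2)))

# A clustFun with the required signature and return elements.
myClustFun <- function(conData, catData, conWeight, nclust, ...) {
  stopifnot(conWeight >= 0, conWeight <= 1)
  # Scale each block of columns so conWeight controls its influence.
  x <- cbind(sqrt(conWeight) * as.matrix(conData),
             sqrt(1 - conWeight) * as.matrix(catData))
  fit <- stats::kmeans(x, centers = nclust, ...)
  # Centroids on the original scale, one row per cluster.
  centersOf <- function(df) {
    as.data.frame(do.call(rbind, lapply(seq_len(nclust), function(k) {
      colMeans(df[fit$cluster == k, , drop = FALSE])
    })))
  }
  list(cluster = fit$cluster,            # integers in 1:nclust
       conCenters = centersOf(conData),  # data frame, nclust rows
       catCenters = centersOf(catData))  # data frame, nclust rows
}

# Toy numeric data; catToy stands in for dummy-coded categorical variables.
set.seed(1)
conToy <- data.frame(x = c(rnorm(10), rnorm(10, mean = 5)), y = rnorm(20))
catToy <- data.frame(a = rep(c(0, 1), each = 10), b = rep(c(1, 0), each = 10))

fit2 <- myClustFun(conToy, catToy, conWeight = 0.5, nclust = 2, nstart = 5)
fit1 <- myClustFun(conToy, catToy, conWeight = 0.5, nclust = 1)  # single centroid
d <- manhattanDist(conToy[1, ], fit1$conCenters[1, ])
```

Functions of this shape could then be supplied through the `clustFun`, `conDist`, and `catDist` arguments, with extra arguments such as `nstart` forwarded via `...`.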

A list containing the following elements:

`results`	A results object corresponding to the base clustering algorithm
`objFun`	A numeric vector of length `searchDensity` containing the values of the objective function for each weight used
`Qcon`	A numeric vector of length `searchDensity` containing the values of the continuous component of the objective function
`Qcat`	A numeric vector of length `searchDensity` containing the values of the categorical component of the objective function
`bestInd`	The index of the most successful run
`weights`	A numeric vector of length `searchDensity` containing the continuous weights used

Foss A, Markatou M; kamila: Clustering Mixed-Type Data in R and Hadoop. Journal of Statistical Software, 83(13). 2018. doi: 10.18637/jss.v083.i13

Modha DS, Spangler WS; Feature Weighting in k-Means Clustering. Machine Learning, 52(3). 2003. doi: 10.1023/a:1024016609528

## Not run: 
# Generate toy data set with poor quality categorical variables and good
# quality continuous variables.
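library(kamila)
set.seed(1)
dat <- genMixedData(200, nConVar=2, nCatVar=2, nCatLevels=4, nConWithErr=2,
  nCatWithErr=2, popProportions=c(.5,.5), conErrLev=0.3, catErrLev=0.8)
# Dummy-code the categorical variables and standardize the continuous ones.
catDf <- dummyCodeFactorDf(data.frame(apply(dat$catVars, 2, factor), stringsAsFactors = TRUE))
conDf <- data.frame(scale(dat$conVars), stringsAsFactors = TRUE)

# Run the weight search with the default clustering and distance functions.
msRes <- gmsClust(conDf, catDf, nclust=2)

# Cross-tabulate the estimated clusters against the true group labels.
table(msRes$results$cluster, dat$trueID)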

## End(Not run)