wKModes: Weighted K-Modes Clustering with Tie-Breaking

wKModesR Documentation

Weighted K-Modes Clustering with Tie-Breaking

Description

Perform k-modes clustering on categorical data with observation-specific sampling weights and tie-breaking adjustments.

Usage

wKModes(data,
        modes,
        weights = NULL,
        iter.max = .Machine$integer.max,
        freq.weighted = FALSE,
        fast = TRUE,
        random = TRUE,
        ...)

Arguments

data

A matrix or data frame of categorical data. Objects have to be in rows, variables in columns.

modes

Either the number of modes or a set of initial (distinct) cluster modes (where each mode is a row and modes has the same number of columns as data). If a number, a random set of (distinct) rows in data is chosen as the initial modes. Note, this randomness is always present, and is not governed by random below.

weights

Optional numeric vector containing non-negative observation-specific case weights.

iter.max

The maximum number of iterations allowed. Defaults to .Machine$integer.max. The algorithm terminates when iter.max is reached or when the partition ceases to change between iterations.

freq.weighted

A logical indicating whether the usual simple-matching (Hamming) distance between objects is used, or a frequency weighted version of this distance. Defaults to FALSE; when TRUE, the frequency weights are computed within the algorithm and are not user-specified. Distinct from the observation-level weights above, the frequency weights are assigned on a per-feature basis and derived from the categories represented in each column of data. For convenience, the function dist_freqwH is provided for calculating the corresponding pairwise dissimilarity matrix for subsequent use.

fast

A logical indicating whether a fast version of the algorithm should be applied. Defaults to TRUE.

random

A logical indicating whether ties for the modal values &/or assignments are broken at random. Defaults to TRUE: the implied default had been FALSE prior to version 1.3.2 of this package, as per klaR::kmodes prior to version 1.7-1 (see Note). Note that when modes is specified as the number of modes, the algorithm is always randomly initialised, regardless of the specification of random.

Regarding the modes, ties are broken at random when TRUE and the first candidate state is always chosen for the mode when FALSE. Regarding assignments, tie-breaking is always first biased in favour of the observation's most recent cluster: regarding ties thereafter, these are broken at random when TRUE or the first other candidate cluster is always chosen when FALSE.

...

Catches unused arguments.

Details

The k-modes algorithm (Huang, 1997) is an extension of the k-means algorithm by MacQueen (1967).

The data given by data is clustered by the k-modes method (Huang, 1997) which aims to partition the objects into k groups such that the distance from objects to the assigned cluster modes is minimised.

By default, the simple-matching (Hamming) distance is used to determine the dissimilarity of two objects. It is computed by counting the number of mismatches in all variables. Alternatively, this distance can be weighted by the frequencies of the categories in data, using the freq.weighted argument (see Huang, 1997, for details). For convenience, the function dist_freqwH is provided for calculating the corresponding pairwise dissimilarity matrix for subsequent use.

If an initial matrix of modes is supplied, it is possible that no object will be closest to one or more modes. In this case, fewer clusters than the number of supplied modes will be returned and a warning will be printed.

If called using fast = TRUE, the reassignment of the data to clusters is done for the entire data set before recomputation of the modes is done. For computational reasons, this option should be chosen for all but the most moderate of data sizes.

Value

An object of class "wKModes" which is a list with the following components:

cluster

A vector of integers indicating the cluster to which each object is allocated.

size

The number of objects in each cluster.

modes

A matrix of cluster modes.

withindiff

The within-cluster (weighted) simple-matching distance for each cluster.

tot.withindiff

The total within-cluster (weighted) distance over all clusters. tot.withindiff can be used to guide the choice of the number of clusters, but beware of inherent randomness in the algorithm, which is liable to yield a jagged elbow plot (see examples).

iterations

The number of iterations the algorithm reached.

weighted

A logical indicating whether observation-level weights were used or not throughout the algorithm.

freq.weighted

A logical indicating whether feature-level freq.weights were used or not in the computation of the distances. For convenience, the function dist_freqwH is provided for calculating the corresponding pairwise dissimilarity matrix for subsequent use.

random

A logical indicating whether ties were broken at random or not throughout the algorithm.

Note

This code is adapted from the kmodes function in the klaR package. Specifically, modifications were made to allow for random tie-breaking for the modes and assignments (see random above) and the incorporation of observation-specific sampling weights, with a view to using this function as a means to initialise the allocations for MEDseq models (see the MEDseq_control argument init.z and the related options "kmodes" and "kmodes2").

Notably, the wKModes function, when invoked inside MEDseq_fit, is used regardless of whether the weights are true sampling weights, or the weights are merely aggregation weights, or there are no weights at all. Furthermore, the MEDseq_control argument random is also passed to wKModes when it is invoked inside MEDseq_fit.

Update: as of version 1.7-1 of klaR, klaR::kmodes now breaks assignment ties at random only when fast=TRUE. It still breaks assignment ties when fast=FALSE and all ties for modal values in the non-random manner described above. Thus, the old behaviour of klaR::kmodes can be recovered by specifying random=FALSE here, but random=TRUE allows random tie-breaking for both types of ties in all situations.

Author(s)

Keefe Murphy - <keefe.murphy@mu.ie> (adapted from klaR::kmodes)

References

Huang, Z. (1997). A fast clustering algorithm to cluster very large categorical data sets in data mining. In H. Lu, H. Motoda, and H. Luu (Eds.), KDD: Techniques and Applications, pp. 21-34. Singapore: World Scientific.

MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In L. M. L. Cam and J. Neyman (Eds.), Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1, pp. 281-297. Berkeley, CA, USA: University of California Press.

See Also

MEDseq_control, MEDseq_fit, dist_freqwH

Examples

suppressMessages(require(WeightedCluster))
set.seed(99)
# Load the MVAD data & aggregate the state sequences
data(mvad)
agg      <- wcAggregateCases(mvad[,17:86], weights=mvad$weight)

# Create a state sequence object without the first two (summer) time points
states   <- c("EM", "FE", "HE", "JL", "SC", "TR")
labels   <- c("Employment", "Further Education", "Higher Education", 
              "Joblessness", "School", "Training")
mvad.seq <- seqdef(mvad[agg$aggIndex, 17:86], 
                   states=states, labels=labels, 
                   weights=agg$aggWeights)

# Run k-modes without the weights
resX     <- wKModes(mvad.seq, 2)

# Run k-modes with the weights
resW     <- wKModes(mvad.seq, 2, weights=agg$aggWeights)

# Examine the modal sequences of both solutions
seqformat(seqdef(resX$modes), from="STS", to="SPS", compress=TRUE)
seqformat(seqdef(resW$modes), from="STS", to="SPS", compress=TRUE)

# Using tot.withindiff to choose the number of clusters

TWdiffs   <- sapply(1:5, function(k) wKModes(mvad.seq, k, weights=agg$aggWeights)$tot.withindiff)
plot(TWdiffs, type="b", xlab="K")

# Use multiple random starts to account for inherent randomness
TWDiff    <- sapply(1:5, function(k) min(replicate(10, 
                    wKModes(mvad.seq, k, weights=agg$aggWeights)$tot.withindiff)))
plot(TWDiff, type="b", xlab="K")

MEDseq documentation built on Dec. 28, 2022, 2:35 a.m.