kproto | R Documentation |
Computes k-prototypes clustering for mixed-type data.
kproto(x, ...)

## Default S3 method:
kproto(
  x,
  k,
  lambda = NULL,
  type = "standard",
  iter.max = 100,
  nstart = 1,
  na.rm = "yes",
  keep.data = TRUE,
  verbose = TRUE,
  ...
)
x |
Data frame with both numerics and factors. |
... |
Currently not used. |
k |
Either the number of clusters, a vector specifying indices of initial prototypes, or a data frame of prototypes with the same columns as the data. |
lambda |
Parameter > 0 to trade off between the Euclidean distance of numeric variables and the simple matching coefficient between categorical variables. A vector of variable-specific factors is also possible, where the order must correspond to the order of the variables in the data. In this case all variables' distances are multiplied by their corresponding lambda value. |
type |
Character, to specify the distance for clustering. Either "standard" or "gower". |
iter.max |
Maximum number of iterations if the algorithm has not converged before. |
nstart |
If > 1, repeated computations with random initializations are run and the result with minimum tot.dist is returned. |
na.rm |
Character; either "yes" to strip NA values for complete case analysis, "no" to keep and ignore NA values, "imp.internal" to impute the NAs within the algorithm, or "imp.onestep" to apply the algorithm ignoring the NAs and impute them after the partition is determined. |
keep.data |
Logical; whether the original data should be included in the returned object. |
verbose |
Logical; whether additional information about the process should be printed.
Caution: For |
Like k-means, the algorithm iteratively recomputes cluster prototypes and reassigns observations to clusters.
For type = "standard", clusters are assigned using d(x,y) = d_{euclid}(x,y) + λ d_{simple\,matching}(x,y). Cluster prototypes are computed as cluster means for numeric variables and modes for factors (cf. Huang, 1998).
Ordered factor variables are treated as categorical variables.
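The distance and prototype rules above can be sketched in base R. This is an illustration only, not the package's internal code: the helper names mixed_dist, stat_mode and compute_prototype are hypothetical, and the use of the squared Euclidean distance for the numeric part is an assumption (following Huang, 1998).

```r
# Sketch only: helper names are hypothetical, and the squared Euclidean
# distance on the numeric variables is an assumption (cf. Huang, 1998).

# Distance between a one-row data frame `obs` and a one-row prototype `proto`
mixed_dist <- function(obs, proto, lambda) {
  is_num <- vapply(obs, is.numeric, logical(1))
  # numeric part: squared differences summed over numeric variables
  d_num <- sum((unlist(obs[is_num]) - unlist(proto[is_num]))^2)
  # categorical part: number of mismatching factor levels (simple matching)
  d_cat <- sum(vapply(names(obs)[!is_num], function(v) {
    as.character(obs[[v]]) != as.character(proto[[v]])
  }, logical(1)))
  d_num + lambda * d_cat
}

# Most frequent level of a factor (first one in case of ties)
stat_mode <- function(f) {
  tab <- table(f)
  names(tab)[which.max(tab)]
}

# Prototype of a cluster: means for numerics, modes for factors
compute_prototype <- function(df) {
  as.data.frame(lapply(df, function(col) {
    if (is.numeric(col)) mean(col) else factor(stat_mode(col), levels = levels(col))
  }))
}

toy <- data.frame(
  g = factor(c("A", "A", "B")),
  v = c(1, 3, 5)
)
proto <- compute_prototype(toy)                  # mode of g, mean of v
d <- mixed_dist(toy[3, ], proto, lambda = 2)     # (5 - 3)^2 + 2 * 1
print(proto)
print(d)
```

In this toy case the third observation differs from the prototype in both variables, so lambda directly controls how much the single factor mismatch contributes relative to the numeric difference.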
In case of na.rm = "no": for each observation, variables with missing values are ignored (i.e. only the remaining variables are considered for distance computation). Consequently, for observations with missing values the weighting of the variables may differ from the one specified by lambda. For these observations, distances to the prototypes will typically be smaller, as they are based on fewer variables.
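The effect of ignoring missing variables can be illustrated with a small base R sketch. The function mixed_dist_na is a hypothetical helper, not part of the package, and it again assumes a squared Euclidean distance for numeric variables.

```r
# Sketch only: hypothetical helper, assuming squared Euclidean distance on
# numerics plus lambda times the number of factor mismatches.
mixed_dist_na <- function(obs, proto, lambda) {
  total <- 0
  for (v in names(obs)) {
    value <- obs[[v]]
    if (is.na(value)) next                 # missing variable: skipped entirely
    if (is.numeric(value)) {
      total <- total + (value - proto[[v]])^2
    } else {
      total <- total + lambda * (as.character(value) != as.character(proto[[v]]))
    }
  }
  total
}

proto    <- data.frame(g = factor("A", levels = c("A", "B")), v = 3)
complete <- data.frame(g = factor("B", levels = c("A", "B")), v = 5)
partial  <- complete
partial$v <- NA_real_                      # same observation, numeric value missing

d_complete <- mixed_dist_na(complete, proto, lambda = 2)  # numeric + factor part
d_partial  <- mixed_dist_na(partial,  proto, lambda = 2)  # factor part only
print(c(complete = d_complete, partial = d_partial))
```

Because the missing numeric variable contributes nothing, the distance of the partial observation consists of the factor part alone, so it cannot exceed the distance of the complete observation.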
For type = "gower", cf. kproto_gower.
kmeans-like object of class kproto:
cluster |
Vector of cluster memberships. |
centers |
Data frame of cluster prototypes. |
lambda |
Distance parameter lambda. |
type |
Type argument of the function call. |
size |
Vector of cluster sizes. |
withinss |
Vector of within cluster distances for each cluster, i.e. summed distances of all observations belonging to a cluster to their respective prototype. |
tot.withinss |
Target function: sum of all observations' distances to their corresponding cluster prototype. |
dists |
Matrix with distances of observations to all cluster prototypes. |
iter |
Prespecified maximum number of iterations. |
stdization |
Only returned for |
trace |
List with two elements (vectors) tracing the iteration process:
|
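How size, withinss and tot.withinss relate to a distance matrix such as dists can be sketched in base R. The matrix below is made-up illustration data, and assigning each observation to its nearest prototype is an assumption about the final partition, not the package's code.

```r
# Sketch only: hypothetical distances of 4 observations (rows)
# to 2 cluster prototypes (columns).
dists <- matrix(c(0.5, 2.0,
                  1.0, 3.0,
                  4.0, 0.2,
                  2.5, 0.3), ncol = 2, byrow = TRUE)

# assumption: each observation belongs to its nearest prototype
cluster <- apply(dists, 1, which.min)

# distance of every observation to its own prototype
own_dist <- dists[cbind(seq_len(nrow(dists)), cluster)]

size         <- as.vector(table(cluster))               # cluster sizes
withinss     <- as.vector(tapply(own_dist, cluster, sum))  # per-cluster sums
tot_withinss <- sum(withinss)                           # target function value

print(list(cluster = cluster, size = size,
           withinss = withinss, tot.withinss = tot_withinss))
```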
Szepannek, G. (2018): clustMixType: User-Friendly Clustering of Mixed-Type Data in R, The R Journal 10/2, 200-208, doi: 10.32614/RJ-2018-048.
Aschenbruck, R., Szepannek, G., Wilhelm, A. (2022): Imputation Strategies for Clustering Mixed‑Type Data with Missing Values, Journal of Classification, doi: 10.1007/s00357-022-09422-y.
Huang, Z. (1998): Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Variables, Data Mining and Knowledge Discovery 2, 283-304.
library(clustMixType)

# generate toy data with factors and numerics
n   <- 100
prb <- 0.9
muk <- 1.5
clusid <- rep(1:4, each = n)

x1 <- sample(c("A","B"), 2*n, replace = TRUE, prob = c(prb, 1-prb))
x1 <- c(x1, sample(c("A","B"), 2*n, replace = TRUE, prob = c(1-prb, prb)))
x1 <- as.factor(x1)

x2 <- sample(c("A","B"), 2*n, replace = TRUE, prob = c(prb, 1-prb))
x2 <- c(x2, sample(c("A","B"), 2*n, replace = TRUE, prob = c(1-prb, prb)))
x2 <- as.factor(x2)

x3 <- c(rnorm(n, mean = -muk), rnorm(n, mean = muk), rnorm(n, mean = -muk), rnorm(n, mean = muk))
x4 <- c(rnorm(n, mean = -muk), rnorm(n, mean = muk), rnorm(n, mean = -muk), rnorm(n, mean = muk))

x <- data.frame(x1,x2,x3,x4)

# apply k-prototypes
kpres <- kproto(x, 4)
clprofiles(kpres, x)

# in real world clusters are often not as clear cut
# by variation of lambda the emphasis is shifted towards factor / numeric variables
kpres <- kproto(x, 2)
clprofiles(kpres, x)

kpres <- kproto(x, 2, lambda = 0.1)
clprofiles(kpres, x)

kpres <- kproto(x, 2, lambda = 25)
clprofiles(kpres, x)