# cluster.Sim: Determination of optimal clustering procedure for a data set

In clusterSim: Searching for Optimal Clustering Procedure for a Data Set

## Determination of optimal clustering procedure for a data set

### Description

Determines the optimal clustering procedure for a data set by evaluating all combinations of normalization formulas, distance measures, and clustering methods.

### Usage

```r
cluster.Sim(x, p, minClusterNo, maxClusterNo, icq="S", outputHtml="",
            outputCsv="", outputCsv2="", normalizations=NULL,
            distances=NULL, methods=NULL)
```

### Arguments

| Argument | Description |
|---|---|
| `x` | matrix or dataset |
| `p` | path of simulation: 1 - ratio data, 2 - interval or mixed (ratio & interval) data, 3 - ordinal data, 4 - nominal data, 5 - binary data, 6 - ratio data without normalization, 7 - interval or mixed (ratio & interval) data without normalization, 8 - ratio data with k-means, 9 - interval or mixed (ratio & interval) data with k-means |
| `minClusterNo` | minimal number of clusters, between 2 and no. of objects - 1 (for G3 or C: no. of objects - 2) |
| `maxClusterNo` | maximal number of clusters, between 2 and no. of objects - 1 (for G3 or C: no. of objects - 2; for KL: no. of objects - 3), greater than or equal to `minClusterNo` |
| `icq` | internal cluster quality index: "S" - Silhouette, "G1" - Calinski & Harabasz index, "G2" - Baker & Hubert index, "G3" - G3 index, "C" - C index, "KL" - Krzanowski & Lai index |
| `outputHtml` | optional, name of html file with results |
| `outputCsv` | optional, name of csv file with results |
| `outputCsv2` | optional, name of csv (comma as decimal point sign) file with results |
| `normalizations` | optional, vector of normalization formulas that should be used in the procedure |
| `distances` | optional, vector of distance measures that should be used in the procedure |
| `methods` | optional, vector of classification methods that should be used in the procedure |
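The bounds on `minClusterNo` and `maxClusterNo` depend on the chosen `icq`, which is easy to get wrong for small data sets. The following is a minimal sketch of those documented bounds in base R. The helper `check_cluster_range` and its argument `n_objects` are hypothetical names introduced for illustration; they are not part of clusterSim, and the package may validate its inputs differently.

```r
# Hypothetical helper (not part of clusterSim): checks the documented bounds
# on minClusterNo and maxClusterNo for a data set with n_objects objects.
check_cluster_range <- function(n_objects, minClusterNo, maxClusterNo, icq = "S") {
  # Upper bound for minClusterNo: n - 2 for G3 and C, otherwise n - 1
  upper_min <- if (icq %in% c("G3", "C")) n_objects - 2 else n_objects - 1
  # Upper bound for maxClusterNo: n - 3 for KL, n - 2 for G3 and C, otherwise n - 1
  upper_max <- if (icq == "KL") {
    n_objects - 3
  } else if (icq %in% c("G3", "C")) {
    n_objects - 2
  } else {
    n_objects - 1
  }
  minClusterNo >= 2 && minClusterNo <= upper_min &&
    maxClusterNo >= 2 && maxClusterNo <= upper_max &&
    maxClusterNo >= minClusterNo
}

check_cluster_range(10, 2, 9, icq = "S")   # TRUE: within the documented bounds
check_cluster_range(10, 2, 9, icq = "G3")  # FALSE: maxClusterNo exceeds n - 2
check_cluster_range(10, 2, 8, icq = "KL")  # FALSE: maxClusterNo exceeds n - 3
```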

### Details

For each path, the parameter `normalizations` may be a subset of the following values:

path 1: "n6" to "n11" (if the variables are measured on a ratio scale and the transformed variables remain on a ratio scale) or "n1" to "n5" (if the variables are measured on a ratio scale and the transformed variables are on an interval scale)

path 2: "n1" to "n5"

paths 3 to 7: "n0"

path 8: "n1" to "n11"

path 9: "n1" to "n5"

For each path, the parameter `distances` may be a subset of the following values:

path 1: "d1" to "d7" (if the variables are measured on a ratio scale and the transformed variables remain on a ratio scale) or "d1" to "d5" (if the variables are measured on a ratio scale and the transformed variables are on an interval scale)

path 2: "d1" to "d5"

path 3: "d8"

path 4: "d9"

path 5: "b1" to "b10"

path 6: "d1" to "d7"

path 7: "d1" to "d5"

paths 8 and 9: not applicable

For each path, the parameter `methods` may be a subset of the following values:

paths 1 to 7: "m1" to "m8"

path 8: "m9"

path 9: "m9"

See the file ../doc/clusterSim_details.pdf for further details.
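To get a feel for how large the search is, the option codes listed above can be expanded and crossed in base R. The sketch below uses the values documented for path 2 and assumes that every combination of normalization, distance, and method is evaluated for every cluster number in the requested range. The helper `code_range` is a hypothetical name introduced here, not a clusterSim function.

```r
# Hypothetical helper: expand a documented range such as "n1" to "n5"
# into the character vector c("n1", ..., "n5").
code_range <- function(prefix, from, to) paste0(prefix, from:to)

# Option sets documented above for path 2 (interval or mixed data)
normalizations <- code_range("n", 1, 5)
distances      <- code_range("d", 1, 5)
methods        <- code_range("m", 1, 8)

# One row per candidate procedure, assuming all combinations are evaluated
grid <- expand.grid(normalization = normalizations,
                    distance = distances,
                    method = methods,
                    stringsAsFactors = FALSE)
nrow(grid)  # 5 * 5 * 8 = 200 combinations

# With minClusterNo = 2 and maxClusterNo = 4 there are three cluster numbers,
# so under the assumption above 200 * 3 = 600 classifications are scored
nrow(grid) * length(2:4)
```

Passing shorter vectors through `normalizations`, `distances`, or `methods` shrinks this grid accordingly, which is the main lever for reducing run time.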

### Value

| Component | Description |
|---|---|
| `result` | optimal value of icq for all classifications |
| `normalization` | normalization used to obtain optimal value of icq |
| `distance` | distance measure used to obtain optimal value of icq |
| `method` | clustering method used to obtain optimal value of icq |
| `classes` | number of clusters for optimal value of icq |
| `time` | time of all calculations for path |

### Author(s)

Marek Walesiak marek.walesiak@ue.wroc.pl, Andrzej Dudek andrzej.dudek@ue.wroc.pl

Department of Econometrics and Computer Science, University of Economics, Wroclaw, Poland http://keii.ue.wroc.pl/clusterSim/

### References

Everitt, B.S., Landau, E., Leese, M. (2001), Cluster analysis, Arnold, London. ISBN 9780340761199.

Gatnar, E., Walesiak, M. (Eds.) (2004), Metody statystycznej analizy wielowymiarowej w badaniach marketingowych [Multivariate statistical analysis methods in marketing research], Wydawnictwo AE, Wroclaw, p. 338. Available at: http://keii.ue.wroc.pl/pracownicy/mw/2004_Gatnar_Walesiak_Metody_SAW_w_badaniach_marketingowych.pdf.

Gordon, A.D. (1999), Classification, Chapman & Hall/CRC, London. ISBN 9781584880134.

Milligan, G.W., Cooper, M.C. (1985), An examination of procedures for determining the number of clusters in a data set, "Psychometrika", vol. 50, no. 1, 159-179. Available at: doi: 10.1007/BF02294245.

Milligan, G.W., Cooper, M.C. (1988), A study of standardization of variables in cluster analysis, "Journal of Classification", vol. 5, 181-204. Available at: doi: 10.1007/BF01897163.

Walesiak, M., Dudek, A. (2006), Symulacyjna optymalizacja wyboru procedury klasyfikacyjnej dla danego typu danych - oprogramowanie komputerowe i wyniki badan, Prace Naukowe AE we Wroclawiu, 1126, 120-129. Available at: http://keii.ue.wroc.pl/pracownicy/mw/2006_Walesiak_Dudek_Taksonomia_13_PN_AE_1126.pdf.

Walesiak, M., Dudek, A. (2007), Symulacyjna optymalizacja wyboru procedury klasyfikacyjnej dla danego typu danych - charakterystyka problemu, Zeszyty Naukowe Uniwersytetu Szczecinskiego nr 450, 635-646. Available at: http://keii.ue.wroc.pl/pracownicy/mw/2007_Walesiak_Dudek_Symulacyjna_optymalizacja_wyboru.pdf.

### See Also

`data.Normalization`, `dist.GDM`, `dist.BC`, `dist.SM`, `index.G1`, `index.G2`, `index.G3`, `index.C`, `index.S`, `index.KL`, `hclust`, `dist`

### Examples

```r
library(clusterSim)
# Commented due to long execution time
#data(data_ratio)
#cluster.Sim(data_ratio, 1, 2, 3, "G1", outputCsv="results1")
#data(data_interval)
#cluster.Sim(data_interval, 2, 2, 4, "G1", outputHtml="results2")
#data(data_ordinal)
#cluster.Sim(data_ordinal, 3, 2, 4,"G2", outputCsv2="results3")
#data(data_nominal)
#cluster.Sim(data_nominal, p=4, 2, 4, icq="G3", outputHtml="results4", methods=c("m2","m3","m5"))
#data(data_binary)
#cluster.Sim(data_binary, p=5, 2, 4, icq="S", outputHtml="results5", distances=c("b1","b3","b6"))
#data(data_ratio)
#cluster.Sim(data_ratio, 1, 2, 4,"G1", outputCsv="results6",normalizations=c("n1","n3"),
#distances=c("d2","d5"),methods=c("m5","m3","m1"))
```
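Because the package examples are commented out, the idea of an exhaustive search over normalizations, distance measures, clustering methods, and cluster numbers can be illustrated with a small base-R sketch. This is not the clusterSim implementation: the data are simulated, the two normalizations, two distances, and three linkage methods are arbitrary choices made for illustration, and `calinski_harabasz` is a hypothetical helper that computes the Calinski & Harabasz index directly from sums of squares.

```r
# Base-R sketch of an exhaustive search; this is NOT the clusterSim code.
set.seed(1)
# Simulated data: two groups of 15 objects described by 2 variables
x <- rbind(matrix(rnorm(30, mean = 0), ncol = 2),
           matrix(rnorm(30, mean = 6), ncol = 2))

# Hypothetical helper: Calinski & Harabasz index for data z and labels cl
calinski_harabasz <- function(z, cl) {
  n <- nrow(z)
  k <- length(unique(cl))
  total <- sum(scale(z, scale = FALSE)^2)  # total sum of squares
  within <- sum(sapply(split(as.data.frame(z), cl), function(g) {
    sum(scale(as.matrix(g), scale = FALSE)^2)  # within-cluster sum of squares
  }))
  ((total - within) / (k - 1)) / (within / (n - k))
}

# Candidate normalizations, distances, and methods (illustrative choices)
normalizations <- list(
  standardization = function(x) scale(x),
  unitization = function(x) scale(x, scale = apply(x, 2, function(v) diff(range(v))))
)
distances <- c("euclidean", "manhattan")
methods <- c("single", "complete", "average")

# Score every combination and keep the one with the highest index value
best <- NULL
for (nm in names(normalizations)) {
  z <- normalizations[[nm]](x)
  for (d in distances) {
    dmat <- dist(z, method = d)
    for (m in methods) {
      tree <- hclust(dmat, method = m)
      for (k in 2:4) {
        score <- calinski_harabasz(z, cutree(tree, k = k))
        if (is.null(best) || score > best$result) {
          best <- list(result = score, normalization = nm,
                       distance = d, method = m, classes = k)
        }
      }
    }
  }
}
str(best)
```

The component names of `best` mirror those listed under Value, so the structure of the printed result resembles what `cluster.Sim` reports, although the values come from this simplified search.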

clusterSim documentation built on May 25, 2022, 9:09 a.m.