cluster.Gen | R Documentation |
Random cluster generation with known structure of clusters (optionally with noisy variables and outliers)
cluster.Gen(numObjects=50, means=NULL, cov=NULL, fixedCov=TRUE,
            model=1, dataType="m", numCategories=NULL,
            numNoisyVar=0, numOutliers=0, rangeOutliers=c(1,10),
            inputType="csv2", inputHeader=TRUE, inputRowNames=TRUE,
            outputCsv="", outputCsv2="",
            outputColNames=TRUE, outputRowNames=TRUE)
numObjects |
number of objects in each cluster - a positive integer or a vector with the same length as nrow(means),
e.g. c(40,60,20), as in Example 4 |
means |
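matrix of cluster means (e.g. means=matrix(c(0,7,0,7),2,2), as in Example 1). If means=NULL, the means are read from the means_<modelNumber>.csv file |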
cov |
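covariance matrix, the same for each cluster (e.g. cov=matrix(c(1,0,0,1),2,2), as in Example 1). If cov=NULL, the matrix is read from the cov_<modelNumber>.csv file.
Note: you cannot use this argument to generate clusters with different covariance matrices.
That kind of generation should be done by setting fixedCov=FALSE and choosing an appropriate model (see Examples 5 and 7) |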
model |
model number; for details see $R_HOME\library\clusterSim\pdf\clusterGen_details.pdf.
For models that read their parameters from files (as in Examples 6 and 7): if fixedCov=TRUE, means are read from
means_<modelNumber>.csv and the covariance matrix for all clusters is read from cov_<modelNumber>.csv;
if fixedCov=FALSE, means are read from means_<modelNumber>.csv and covariance matrices are read separately for each cluster from cov_<modelNumber>_<clusterNumber>.csv |
fixedCov |
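if fixedCov=TRUE, the covariance matrix is the same for all clusters; if fixedCov=FALSE, each cluster has its own covariance matrix |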
dataType |
"m" - metric (ratio, interval), "o" - ordinal, "s" - symbolic interval |
numCategories |
number of categories (for ordinal data only). Positive integer value or vector with the same size as ncol(means) plus number of noisy variables. |
numNoisyVar |
number of noisy variables |
numOutliers |
number of outliers (for metric and symbolic interval data only). A positive integer gives the number of outliers; a value from <0,1> gives the fraction of outliers in the whole data set (e.g. numOutliers=0.1, as in Example 5) |
rangeOutliers |
range for outliers (for metric and symbolic interval data only). The default range is [1, 10]. The outliers are generated independently for each variable, for the whole data set, from a uniform distribution over this range. Each generated value is randomly either added to the maximum of the j-th variable or subtracted from the minimum of the j-th variable |
inputType |
"csv" - a dot as decimal point or "csv2" - a comma as decimal point in means_<modelNumber>.csv and cov_<modelNumber>.csv files |
inputHeader |
if TRUE, the input files (means_<modelNumber>.csv, cov_<modelNumber...>.csv) contain a header row |
inputRowNames |
if TRUE, the input files contain row names in the first column |
outputCsv |
optional, name of csv file with generated data (first column contains id, second - number of cluster and others - data) |
outputCsv2 |
optional, name of csv (a comma as decimal point and a semicolon as field separator) file with generated data (first column contains id, second - number of cluster and others - data) |
outputColNames |
if TRUE, the output file (given by outputCsv or outputCsv2) contains a first row with column names |
outputRowNames |
if TRUE, the output file (given by outputCsv or outputCsv2) contains row names |
See file $R_HOME\library\clusterSim\pdf\clusterGen_details.pdf for further details
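The outlier mechanism described for rangeOutliers can be illustrated in base R. The following is a minimal sketch, not the clusterSim implementation: the helper name add_outliers_sketch, the equal-probability choice between the two directions, and the toy data are assumptions made for illustration.

```r
# Hypothetical helper (not part of clusterSim): appends n_out outlier rows to x.
add_outliers_sketch <- function(x, n_out, range_out = c(1, 10)) {
  out <- matrix(NA_real_, nrow = n_out, ncol = ncol(x))
  for (j in seq_len(ncol(x))) {
    # Offsets drawn independently for each variable from a uniform
    # distribution over range_out
    offset <- runif(n_out, min = range_out[1], max = range_out[2])
    # Assumption: each value goes above the maximum or below the minimum
    # of variable j with equal probability
    above <- sample(c(TRUE, FALSE), n_out, replace = TRUE)
    out[, j] <- ifelse(above, max(x[, j]) + offset, min(x[, j]) - offset)
  }
  rbind(x, out)
}

set.seed(1)
x <- matrix(rnorm(40), ncol = 2)          # toy data: 20 objects, 2 variables
y <- add_outliers_sketch(x, n_out = 4)    # 4 outliers appended as rows 21-24
outliers <- y[21:24, , drop = FALSE]
```

With the default range, every outlier coordinate lies at least 1 unit outside the range of the corresponding variable in the clean data.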
clusters |
cluster number for each object |
data |
generated data: for metric and ordinal data - a matrix with objects in rows and variables in columns; for symbolic interval data - a three-dimensional structure in which the first dimension is the object number, the second is the variable number, and the third holds the lower and upper bounds of the intervals |
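For symbolic interval data, the three-dimensional layout described above can be pictured with a plain R array. This is a sketch on made-up numbers; the object name sym and the use of a base array are assumptions, and the exact class returned by cluster.Gen is not stated here.

```r
# Toy structure: 3 objects x 2 variables x 2 bounds (lower, upper)
lower <- matrix(c(1, 2, 3,
                  10, 20, 30), nrow = 3)   # columns are variables
upper <- lower + 0.5
sym <- array(c(lower, upper), dim = c(3, 2, 2))

interval_2_1 <- sym[2, 1, ]                # object 2, variable 1: lower, upper
widths  <- sym[, , 2] - sym[, , 1]         # interval widths, objects x variables
centers <- (sym[, , 1] + sym[, , 2]) / 2   # interval midpoints
```

Indexing the third dimension with 1 or 2 returns an ordinary objects-by-variables matrix of lower or upper bounds, which can then be passed to functions that expect metric data.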
Marek Walesiak marek.walesiak@ue.wroc.pl, Andrzej Dudek andrzej.dudek@ue.wroc.pl
Department of Econometrics and Computer Science, University of Economics, Wroclaw, Poland
Billard, L., Diday, E. (2006), Symbolic data analysis. Conceptual statistics and data mining, Wiley, Chichester. ISBN: 978-0-470-09016-9.
Qiu, W., Joe, H. (2006), Generation of random clusters with specified degree of separation, "Journal of Classification", vol. 23, 315-334. Available at: doi:10.1007/s00357-006-0018-y.
Steinley, D., Henson, R. (2005), OCLUS: an analytic method for generating clusters with known overlap, "Journal of Classification", vol. 22, 221-250. Available at: doi:10.1007/s00357-005-0015-6.
Walesiak, M., Dudek, A. (2008), Identification of noisy variables for nonmetric and symbolic data in cluster analysis, In: C. Preisach, H. Burkhardt, L. Schmidt-Thieme, R. Decker (Eds.), Data analysis, machine learning and applications, Springer-Verlag, Berlin, Heidelberg, 85-92. Available at: doi:10.1007/978-3-540-78246-9_11.
Walesiak, M. (2016), Uogólniona miara odległości GDM w statystycznej analizie wielowymiarowej z wykorzystaniem programu R. Wydanie 2 poprawione i rozszerzone [The Generalized Distance Measure GDM in multivariate statistical analysis with R], Wydawnictwo Uniwersytetu Ekonomicznego, Wroclaw.
# Example 1
library(clusterSim)
means <- matrix(c(0,7,0,7),2,2)
cov <- matrix(c(1,0,0,1),2,2)
grnd <- cluster.Gen(numObjects=60,means=means,cov=cov,model=2,
numOutliers=8)
colornames <- c("red","blue","green")
# objects labelled 0 (the outliers) get the last colour
grnd$clusters[grnd$clusters==0]<-length(colornames)
plot(grnd$data,col=colornames[grnd$clusters],ask=TRUE)
# Example 2
library(clusterSim)
grnd <- cluster.Gen(50,model=4,dataType="m",numNoisyVar=2)
data <- as.matrix(grnd$data)
colornames <- c("red","blue","green")
plot(grnd$data,col=colornames[grnd$clusters],ask=TRUE)
# Example 3
library(clusterSim)
grnd<-cluster.Gen(50,model=4,dataType="o",numCategories=7, numNoisyVar=2)
plotCategorial(grnd$data,,grnd$clusters,ask=TRUE)
# Example 4 (1 nonnoisy variable and 2 noisy variables, 3 clusters)
library(clusterSim)
grnd <- cluster.Gen(c(40,60,20), model=2, means=c(2,14,25),
cov=c(1.5,1.5,1.5),numNoisyVar=2)
colornames <- c("red","blue","green")
plot(grnd$data,col=colornames[grnd$clusters],ask=TRUE)
# Example 5
library(clusterSim)
grnd <- cluster.Gen(c(20,35,20,25),model=14,dataType="m",numNoisyVar=1,
fixedCov=FALSE, numOutliers=0.1)
# or
#grnd <- cluster.Gen(c(20,35,20,25),model=14,dataType="m",numNoisyVar=1,
#fixedCov=FALSE, numOutliers=0.1, outputCsv2="data14.csv")
data <- as.matrix(grnd$data)
colornames <- c("red","blue","green","brown","black")
grnd$clusters[grnd$clusters==0]<-length(colornames)
plot(grnd$data,col=colornames[grnd$clusters],ask=TRUE)
# Example 6 (this example needs files means_24.csv and cov_24.csv
# to be placed in the working directory)
# library(clusterSim)
# grnd<-cluster.Gen(c(50,80,20),model=24,dataType="m",numNoisyVar=1,
# numOutliers=10, rangeOutliers=c(1,5))
# print(grnd)
# data <- as.data.frame(grnd$data)
# colornames<-c("red","blue","green","brown")
# grnd$clusters[grnd$clusters==0]<-length(colornames)
# plot(data,col=colornames[grnd$clusters],ask=TRUE)
# Example 7 (this example needs files means_25.csv, cov_25_1.csv,
# cov_25_2.csv, cov_25_3.csv, cov_25_4.csv and cov_25_5.csv
# to be placed in the working directory)
# library(clusterSim)
# grnd<-cluster.Gen(c(40,30,20,35,45),model=25,numNoisyVar=3,fixedCov=FALSE)
# data <- as.data.frame(grnd$data)
# colornames<-c("red","blue","green","magenta","brown")
# plot(data,col=colornames[grnd$clusters],ask=TRUE)
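Examples 6 and 7 expect input files in the working directory. The sketch below writes a hypothetical means_24.csv and cov_24.csv pair in csv2 format (comma as decimal point, semicolon as field separator) with a header row and row names, in line with the defaults inputType="csv2", inputHeader=TRUE and inputRowNames=TRUE. The numeric values, the two-variable layout and the column names v1 and v2 are assumptions for illustration; the layout cluster.Gen actually expects is described in clusterGen_details.pdf.

```r
# Hypothetical values: 3 clusters (rows) and 2 variables (columns)
means_mat <- matrix(c(0, 6.5, 12,
                      0, 6.5, 0), nrow = 3,
                    dimnames = list(as.character(1:3), c("v1", "v2")))
# One covariance matrix shared by all clusters (the fixedCov=TRUE case)
cov_mat <- matrix(c(1.0, 0.3,
                    0.3, 1.0), nrow = 2,
                  dimnames = list(c("v1", "v2"), c("v1", "v2")))

# Written to a temporary directory here; for cluster.Gen the files
# would have to sit in the working directory instead
out_dir <- tempdir()
means_file <- file.path(out_dir, "means_24.csv")
cov_file   <- file.path(out_dir, "cov_24.csv")
write.csv2(means_mat, means_file)   # comma decimal, semicolon separator
write.csv2(cov_mat, cov_file)

# Read the files back as csv2 with a header row and row names in column 1
means_in <- as.matrix(read.csv2(means_file, row.names = 1))
cov_in   <- as.matrix(read.csv2(cov_file, row.names = 1))
```

Reading the files back before calling cluster.Gen is a quick way to confirm that the decimal mark and separator match the chosen inputType.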