Generate cluster data sets with specified degree of separation. The separation between any cluster and its nearest neighboring cluster can be set to a specified value. The covariance matrices of clusters can have arbitrary diameters, shapes and orientations.
genRandomClust(numClust,
sepVal=0.01,
numNonNoisy=2,
numNoisy=0,
numOutlier=0,
numReplicate=3,
fileName="test",
clustszind=2,
clustSizeEq=50,
rangeN=c(50,200),
clustSizes=NULL,
covMethod=c("eigen", "onion", "cvine", "unifcorrmat"),
rangeVar=c(1, 10),
lambdaLow=1,
ratioLambda=10,
alphad=1,
eta=1,
rotateind=TRUE,
iniProjDirMethod=c("SL", "naive"),
projDirMethod=c("newton", "fixedpoint"),
alpha=0.05,
ITMAX=20,
eps=1.0e-10,
quiet=TRUE,
outputDatFlag=TRUE,
outputLogFlag=TRUE,
outputEmpirical=TRUE,
outputInfo=TRUE)

numClust 
Number of clusters in a data set. 
sepVal 
Desired value of the separation index between a cluster
and its nearest neighboring cluster. Theoretically, 
numNonNoisy 
Number of nonnoisy variables. 
numNoisy 
Number of noisy variables.
The default values of 
numOutlier 
Number or ratio of outliers. If 
numReplicate 
Number of data sets to be generated for the same cluster structure specified
by the other arguments of the function 
fileName 
The first part of the names of data files that record the generated data sets
and associated information, such as cluster membership of data points, labels
of noisy variables, separation index matrix, projection directions, etc.
(see details). The default value of 
clustszind 
Cluster size indicator.

clustSizeEq 
Cluster size.
If the argument 
rangeN 
The range of cluster sizes.
If 
clustSizes 
The sizes of clusters.
If 
covMethod 
Method to generate covariance matrices for clusters (see details). The default method is 'eigen' so that the user can directly specify the range of the diameters of clusters. 
rangeVar 
Range for variances of a covariance matrix (see details). The default range is [1, 10] which can generate reasonable variability of variances. 
lambdaLow 
Lower bound of the eigenvalues of cluster covariance matrices.
If the argument “covMethod="eigen"”, we need to generate eigenvalues for cluster covariance matrices.
The eigenvalues are randomly generated from the
interval [ 
ratioLambda 
The ratio of the upper bound of the eigenvalues to the lower bound of the
eigenvalues of cluster covariance matrices.
If the argument 
alphad 
Parameter for the “unifcorrmat” method to generate a random correlation matrix.

eta 
Parameter for the “cvine” and “onion” methods to generate a random correlation matrix.

rotateind 
Rotation indicator.

iniProjDirMethod 
Indicating the method to get initial projection direction when calculating
the separation index between a pair of clusters (c.f. Qiu and Joe,
2006a, 2006b). 
projDirMethod 
Indicating the method to get the optimal projection direction when calculating
the separation index between a pair of clusters (c.f. Qiu and Joe,
2006a, 2006b). 
alpha 
Tuning parameter reflecting the percentage in the two
tails of a projected cluster that might be outlying.
We set 
ITMAX 
Maximum number of iterations allowed when iteratively calculating the optimal projection direction. The actual number of iterations is usually much smaller than the default value 20. 
eps 
Convergence threshold. A small positive number to check if a quantity q
is equal to zero. If q< 
quiet 
A flag to switch on/off the outputs of intermediate results and/or possible warning messages. The default value is 
outputDatFlag 
Indicates if data set should be output to file. 
outputLogFlag 
Indicates if log info should be output to file. 
outputEmpirical 
Indicates if empirical separation indices and projection directions should be
calculated. This option is useful when generating clusters with sizes which
are not large enough so that the sample covariance matrices may be singular.
Hence, by default, 
outputInfo 
Indicates if theoretical and empirical separation information data frames should be output to a file with format ‘[fileName]_info.log’. 
The function genRandomClust is an implementation of the random cluster generation method proposed in Qiu and Joe (2006a), which improves the cluster generation method proposed in Milligan (1985) so that the degree of separation between any cluster and its nearest neighboring cluster can be set to a specified value while the cluster covariance matrices can be arbitrary positive-definite matrices, and so that the generated clusters might not be visualized by pairwise scatterplots of variables. The separation between a pair of clusters is measured by the separation index proposed in Qiu and Joe (2006b).
The current version of the function genRandomClust implements two methods to generate covariance matrices for clusters. The first method, denoted by “eigen”, first randomly generates eigenvalues (λ_1,…,λ_p) for the covariance matrix (\boldsymbol{Σ}), then uses columns of a randomly generated orthogonal matrix (\boldsymbol{Q}=(\boldsymbol{α}_1,…,\boldsymbol{α}_p)) as eigenvectors. The covariance matrix \boldsymbol{Σ} is then constructed as \boldsymbol{Q}*diag(λ_1,…,λ_p)*\boldsymbol{Q}^T.
The second method, denoted by “unifcorrmat”, first generates a random correlation matrix (\boldsymbol{R}) via the method proposed in Joe (2006), then randomly generates variances (σ_1^2,…,σ_p^2) from an interval specified by the argument rangeVar. The covariance matrix \boldsymbol{Σ} is then constructed as diag(σ_1,…,σ_p)*\boldsymbol{R}*diag(σ_1,…,σ_p).
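The “eigen” construction described above can be sketched in base R as follows. This is only an illustration under stated assumptions: the eigenvalues are drawn uniformly between lambdaLow and lambdaLow*ratioLambda, and the orthogonal matrix comes from a QR decomposition of a Gaussian matrix. The package itself may draw both quantities differently.

```r
# Sketch of the "eigen" construction: Sigma = Q diag(lambda) Q^T.
set.seed(1)
p <- 4             # number of variables (illustrative value)
lambdaLow <- 1     # lower bound of the eigenvalues
ratioLambda <- 10  # upper bound divided by lower bound

# Assumption: eigenvalues are uniform on [lambdaLow, lambdaLow * ratioLambda].
lambda <- runif(p, min = lambdaLow, max = lambdaLow * ratioLambda)

# Assumption: a random orthogonal Q taken from the QR decomposition
# of a matrix with standard normal entries.
Q <- qr.Q(qr(matrix(rnorm(p * p), nrow = p)))

Sigma <- Q %*% diag(lambda) %*% t(Q)
Sigma <- (Sigma + t(Sigma)) / 2  # remove floating-point asymmetry

# The eigenvalues of Sigma recover lambda (up to ordering and rounding).
eigen(Sigma, symmetric = TRUE)$values
```

Because Q is orthogonal, the columns of Q are the eigenvectors of Sigma and the spread of lambda directly controls the diameter of the cluster.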
For each data set generated, the function genRandomClust outputs four files: data file, log file, membership file, and noisy set file. All four files have the same format: ‘[fileName]_[i].[extension]’, where i indicates the replicate number, and ‘extension’ can be ‘dat’, ‘log’, ‘mem’, and ‘noisy’.
The data file with file extension ‘dat’ contains n+1 rows and p columns, where n is the number of data points and p is the number of variables. The first row is the variable names. The log file with file extension ‘log’ contains information such as cluster sizes, mean vectors, covariance matrices, projection directions, separation index matrices, etc. The membership file with file extension ‘mem’ contains n rows and one column of cluster memberships for data points. The noisy set file with file extension ‘noisy’ contains a row of labels of noisy variables.
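As a small illustration of the naming scheme above, the following base R sketch lists the file names that would be expected for the default fileName="test" and numReplicate=3. It only builds the names from the stated pattern; it does not call genRandomClust or touch the file system.

```r
fileName <- "test"
numReplicate <- 3
extensions <- c("dat", "log", "mem", "noisy")

# One row per replicate, one column per file type,
# following the pattern [fileName]_[i].[extension].
files <- outer(seq_len(numReplicate), extensions,
               function(i, ext) paste0(fileName, "_", i, ".", ext))
dimnames(files) <- list(NULL, extensions)
files
```

Here files[1, "dat"] is "test_1.dat", the data file of the first replicate.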
When generating clusters, population covariance matrices are all positive-definite. However, sample covariance matrices might be positive-semidefinite due to small cluster sizes. In this case, the function genRandomClust will automatically use the “fixedpoint” method to search for the optimal projection direction.
The current version of the function genPositiveDefMat implements four methods to generate random covariance matrices. The first method, denoted by “eigen”, first randomly generates eigenvalues (λ_1,…,λ_p) for the covariance matrix (\boldsymbol{Σ}), then uses columns of a randomly generated orthogonal matrix (\boldsymbol{Q}=(\boldsymbol{α}_1,…,\boldsymbol{α}_p)) as eigenvectors. The covariance matrix \boldsymbol{Σ} is then constructed as \boldsymbol{Q}*diag(λ_1,…,λ_p)*\boldsymbol{Q}^T.
The remaining methods, denoted by “onion”, “cvine”, and “unifcorrmat” respectively, first generate a random correlation matrix (\boldsymbol{R}) via the methods mentioned and proposed in Joe (2006), then randomly generate variances (σ_1^2,…,σ_p^2) from an interval specified by the argument rangeVar. The covariance matrix \boldsymbol{Σ} is then constructed as diag(σ_1,…,σ_p)*\boldsymbol{R}*diag(σ_1,…,σ_p).
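The scaling step diag(σ_1,…,σ_p)*R*diag(σ_1,…,σ_p) can be sketched in base R as below. The correlation matrix here is a stand-in obtained with cov2cor() from a random cross-product matrix, not the Joe (2006) method, and the uniform draw of the variances from rangeVar is an assumption made for illustration.

```r
set.seed(2)
p <- 4                 # number of variables (illustrative value)
rangeVar <- c(1, 10)   # range for the variances

# Stand-in correlation matrix (NOT the Joe (2006) construction):
# rescale a random positive-definite cross-product matrix.
A <- matrix(rnorm(2 * p * p), ncol = p)
R <- cov2cor(crossprod(A))

# Assumption: variances are uniform on rangeVar.
sigma2 <- runif(p, min = rangeVar[1], max = rangeVar[2])
sigma <- sqrt(sigma2)

# Sigma = diag(sigma) R diag(sigma)
Sigma <- diag(sigma) %*% R %*% diag(sigma)

diag(Sigma)  # equals the drawn variances sigma2
```

The diagonal of Sigma reproduces the drawn variances, and cov2cor(Sigma) gives back R, so the correlation structure and the scale are controlled separately.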
The function outputs four data files for each data set (see details).
This function also returns separation information data frames infoFrameTheory and infoFrameData based on population and empirical mean vectors and covariance matrices of clusters for all the data sets generated. Both infoFrameTheory and infoFrameData contain the following seven columns:
Column 1: 
Labels of clusters (1, 2, …, numClust), where numClust is the number of clusters for the data set. 
Column 2: 
Labels of the corresponding nearest neighbors. 
Column 3: 
Separation indices of the clusters to their nearest neighboring clusters. 
Column 4: 
Labels of the corresponding farthest neighboring clusters. 
Column 5: 
Separation indices of the clusters to their farthest neighbors. 
Column 6: 
Median separation indices of the clusters to their neighbors. 
Column 7: 
Data file names with format ‘[fileName]_[i]’, where i indicates the replicate number. 
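To make the seven columns concrete, the sketch below derives them from a made-up matrix of pairwise separation indices for four clusters. The matrix values, the column names, and the helper summarise_cluster are all hypothetical; the data frames actually returned by genRandomClust may use different names.

```r
# Hypothetical symmetric matrix of pairwise separation indices
# (diagonal set to NA because a cluster is not its own neighbor).
sepMat <- matrix(c(  NA, 0.30, 0.55, 0.80,
                   0.30,   NA, 0.40, 0.65,
                   0.55, 0.40,   NA, 0.30,
                   0.80, 0.65, 0.30,   NA),
                 nrow = 4, byrow = TRUE)

# Hypothetical helper: one row of summary information for cluster k.
summarise_cluster <- function(k, m, dataFile) {
  s <- m[k, ]
  data.frame(cluster     = k,
             nearest     = which.min(s),            # smallest separation index
             sepNearest  = min(s, na.rm = TRUE),
             farthest    = which.max(s),            # largest separation index
             sepFarthest = max(s, na.rm = TRUE),
             sepMedian   = median(s, na.rm = TRUE),
             dataFile    = dataFile)
}

info <- do.call(rbind, lapply(seq_len(nrow(sepMat)), summarise_cluster,
                              m = sepMat, dataFile = "test_1"))
info
```

In this toy matrix the nearest neighbor of cluster 1 is cluster 2 with separation index 0.30, and its farthest neighbor is cluster 4 with separation index 0.80.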
The function also returns three lists: datList, memList, and noisyList.
datList: 
a list of data matrices for generated data sets. 
memList: 
a list of cluster memberships for data points for generated data sets. 
noisyList: 
a list of sets of noisy variables for generated data sets. 
This function might take a while to complete.
Weiliang Qiu
Harry Joe
Joe, H. (2006) Generating Random Correlation Matrices Based on Partial Correlations. Journal of Multivariate Analysis, 97, 2177–2189.
Milligan, G. W. (1985) An Algorithm for Generating Artificial Test Clusters. Psychometrika, 50, 123–127.
Qiu, W.L. and Joe, H. (2006a) Generation of Random Clusters with Specified Degree of Separation. Journal of Classification, 23(2), 315–334.
Qiu, W.L. and Joe, H. (2006b) Separation Index and Partial Membership for Clustering. Computational Statistics and Data Analysis, 50, 585–603.
Su, J. Q. and Liu, J. S. (1993) Linear Combinations of Multiple Diagnostic Markers. Journal of the American Statistical Association, 88, 1350–1355.
Ghosh, S. and Henderson, S. G. (2003) Behavior of the NORTA method for correlated random vector generation as the dimension increases. ACM Transactions on Modeling and Computer Simulation (TOMACS), 13(3), 276–294.
Kurowicka and Cooke (2006) Uncertainty Analysis with High Dimensional Dependence Modelling. Wiley.
## Not run: 
# Seven clusters, default "eigen" covariance method, two replicates.
tmp1 <- genRandomClust(numClust=7, sepVal=0.3, numNonNoisy=5,
  numNoisy=3, numOutlier=5, numReplicate=2, fileName="chk1")
## End(Not run)
## Not run: 
# Same structure, but covariance matrices built with "unifcorrmat".
tmp2 <- genRandomClust(numClust=7, sepVal=0.3, numNonNoisy=5,
  numNoisy=3, numOutlier=5, numReplicate=2,
  covMethod="unifcorrmat", fileName="chk2")
## End(Not run)
## Not run: 
# Two equal-sized clusters with "naive" and "fixedpoint" projection methods.
tmp3 <- genRandomClust(numClust=2, sepVal=0.1, numNonNoisy=2,
  numNoisy=6, numOutlier=30, numReplicate=1,
  clustszind=1, clustSizeEq=80, rangeVar=c(10, 20),
  covMethod="unifcorrmat", iniProjDirMethod="naive",
  projDirMethod="fixedpoint", fileName="chk3")
## End(Not run)
