HDDC is a model-based clustering method. It is based on the Gaussian Mixture Model and on the idea that the data live in subspaces of lower dimension than the original space. It uses the Expectation-Maximisation algorithm to estimate the parameters of the model.
hddc(data, K = 1:10, model = c("AkjBkQkDk"), threshold = 0.2,
criterion = "bic", com_dim = NULL, itermax = 200, eps = 0.001,
algo = "EM", d_select = "Cattell", init = "kmeans", init.vector,
show = getHDclassif.show(), mini.nb = c(5, 10), scaling = FALSE,
min.individuals = 2, noise.ctrl = 1e-8, mc.cores = 1,
nb.rep = 1, keepAllRes = TRUE, kmeans.control = list(),
d_max = 100, subset = Inf, d)

data 
A matrix or a data frame of observations, assuming the rows are the observations and the columns the variables. Note that NAs are not allowed. 
K 
A vector of integers specifying the number of clusters for which the criterion and the parameters are to be calculated; the function keeps the parameters that maximise the selection criterion (BIC by default). 
model 
A character string vector, or an integer vector, indicating the models to be used. The available models are: "AkjBkQkDk" (default), "AkBkQkDk", "ABkQkDk", "AkjBQkDk", "AkBQkDk", "ABQkDk", "AkjBkQkD", "AkBkQkD", "ABkQkD", "AkjBQkD", "AkBQkD", "ABQkD", "AjBQD", "ABQD". It is not case sensitive, and integers can be used instead of names; see details for more information. Several models can be run at once; in that case, only the results of the model that maximizes the selection criterion are kept. To run all models, use model = "ALL". 
threshold 
A float strictly between 0 and 1. It is the threshold used in Cattell's scree-test. 
criterion 
Either “BIC” or “ICL”. If several models are run, the best model is selected using the criterion defined by this argument. 
com_dim 
Used only for common-dimension models. The user can specify the common dimension to be used; if given, it must be an integer. The default is NULL. 
itermax 
The maximum number of iterations allowed. The default is 200. 
eps 
A positive double, default is 0.001. It is the stopping criterion: the algorithm stops when the difference between two successive log-likelihoods is lower than eps. 
algo 
A character string indicating the algorithm to be used. The available algorithms are the Expectation-Maximisation ("EM"), the Classification EM ("CEM") and the Stochastic EM ("SEM"). The default algorithm is "EM". 
d_select 
Either “Cattell” (default) or “BIC”. See details for more information. This parameter selects which method to use to select the intrinsic dimensions. 
init 
A character string or a vector of clusters. It is the way to initialize the EM algorithm. There are five possible initializations: “kmeans” (default), “param”, “random”, “mini-em” or “vector”. See details for more information. The algorithm can also be initialized directly with a vector containing the prior classes of the observations; to do so, use init = "vector" and provide the vector via the argument init.vector. 
init.vector 
A vector of integers or factors. It is a user-given initialization and should have the same length as the data. Only used when init = "vector". 
show 
Single logical. Whether to display summary information on the results once the algorithm is done; set it to FALSE to turn the display off. 
mini.nb 
A vector of integers of length two, used in the “mini-em” initialization. The first integer sets how many times the algorithm is repeated; the second sets the maximum number of iterations of each repetition. For example, with the default mini.nb = c(5, 10), the algorithm is launched 5 times, with at most 10 iterations each time. 
scaling 
Logical: whether to scale the dataset (mean 0 and standard deviation 1 for each variable) or not. By default the data is not scaled. 
min.individuals 
A positive integer, at least 2 (the default). This parameter controls the minimum population of a class: if the population of a class becomes strictly smaller than min.individuals, the algorithm stops with the message 'pop<min.indiv.'. Here the “population of a class” means the sum of its posterior probabilities. The value of min.individuals cannot be lower than 2. 
noise.ctrl 
This parameter prevents the 'noise' parameter b from taking too low a value. It guarantees that the dimension-selection process does not select too many dimensions (which would lead to a potentially too-low value of b). When selecting the intrinsic dimensions using Cattell's scree-test or BIC, the function ignores the eigenvalues smaller than noise.ctrl, so that the selected intrinsic dimensions cannot be larger than or equal to the order of these eigenvalues. 
mc.cores 
Positive integer, default is 1. If mc.cores > 1, the estimations are run in parallel over mc.cores cores. 
nb.rep 
A positive integer (default is 1). Each estimation (i.e. each combination of (model, K, threshold)) is repeated nb.rep times, and only the run leading to the highest log-likelihood is kept. 
keepAllRes 
Logical. Should the results of all runs be kept? If so, an element all_results containing the results of every estimation is added to the returned object. 
kmeans.control 
A list. The elements of this list should match the parameters of the kmeans initialization (see kmeans for details). 
d_max 
A positive integer, default is 100: the maximum number of dimensions to be computed. It means that the intrinsic dimension of any cluster cannot be larger than d_max. 
subset 
A positive integer, default is Inf. If subset is lower than the number of observations, the model parameters are estimated on a random subset of subset observations, and the posterior classification is then computed on the full dataset. 
d 
DEPRECATED. This parameter is kept for backward compatibility; please use d_select instead. 
Some information on the meaning of the model names:
Akj: each class has its own parameters, with one parameter per dimension
Ak: the classes have different parameters, but only one per class
Aj: all classes share the same parameters for each dimension (a particular case with a common orientation matrix)
A: all classes share one single parameter
Bk: each class has its own noise
B: all classes share the same noise
Qk: each class has its own orientation matrix
Q: all classes share the same orientation matrix
Dk: the dimensions are free and specific to each class
D: the dimension is common to all classes
The model “ALL” will compute all the models, give their BIC and keep the model with the highest BIC value. Instead of writing the model names, they can also be specified using an integer: 1 represents the most general model (“AkjBkQkDk”) while 14 is the most constrained (“ABQD”); the other number/name correspondences are given below. Note also that several models can be run at once by using a vector of models (e.g. model = c("AKBKQKD","AKJBQKDK","AJBQD") is equivalent to model = c(8,4,13); to run the first six models, use model = 1:6). If all the models are to be run, model = "all" is faster than model = 1:14.
AkjBkQkDk  1  AkjBkQkD  7  
AkBkQkDk  2  AkBkQkD  8  
ABkQkDk  3  ABkQkD  9  
AkjBQkDk  4  AkjBQkD  10  
AkBQkDk  5  AkBQkD  11  
ABQkDk  6  ABQkD  12  
AjBQD  13  ABQD  14 
The parameter d_select is used to select the intrinsic dimensions of the subclasses. Its possible values are:
“Cattell”: Cattell's scree-test is used to estimate the intrinsic dimension of each class. If the model has a common dimension (models 7 to 14), the scree-test is done on the covariance matrix of the whole dataset.
“BIC”: The intrinsic dimensions are selected with the BIC criterion. See Bouveyron et al. (2010) for a discussion of this topic. For common dimension models, the procedure is done on the covariance matrix of the whole dataset.
Note that "Cattell" (resp. "BIC") can be abreviated to "C" (resp. "B") and that this argument is not case sensitive.
The different initializations are:
“param”: the algorithm is initialized with the parameters, the means being generated by a multivariate normal distribution and the covariance matrix being common to the whole sample.
“mini-em”: the classes are randomly initialized and the EM algorithm makes several iterations; this is repeated a few times (by default, 5 repetitions of at most 10 iterations each), and the initialization kept is the one that maximises the log-likelihood (see mini.nb for more information about its parametrization).
“random”: the classes are randomly assigned using a multinomial distribution.
“kmeans”: the classes are initialized using the kmeans function (with algorithm = "Hartigan-Wong", nstart = 4, iter.max = 50); note that the user can pass his own arguments to kmeans via the kmeans.control argument.
“vector”: the algorithm is initialized directly with a vector containing the prior classes of the observations; use init = "vector" and provide the vector in the argument init.vector.
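For instance (a sketch assuming HDclassif is attached; the data matrix X and prior partition cls below are purely illustrative), the “vector” initialization is used as follows:

```r
library(HDclassif)  # assumed available

# Hypothetical data and prior partition; replace with your own.
set.seed(1)
X <- matrix(rnorm(150 * 8), nrow = 150)
cls <- sample(1:3, 150, replace = TRUE)

# Initialize the EM algorithm from the prior classes in 'cls'.
# Note: K should match the number of classes in init.vector.
prms <- hddc(X, K = 3, init = "vector", init.vector = cls)
```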
The BIC criterion used in this function is to be maximized and is defined as 2*LL - k*log(n), where LL is the log-likelihood, k is the number of parameters and n is the number of observations.
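As an illustration (not package code), this maximization-form BIC can be computed directly from a fitted log-likelihood:

```r
# The BIC in its "to be maximized" form, as defined above:
# 2 * log-likelihood minus k * log(n).
bic <- function(LL, k, n) 2 * LL - k * log(n)

# E.g. a hypothetical model with log-likelihood -1200 and 35 free
# parameters, fitted on 500 observations:
bic(-1200, 35, 500)  # ≈ -2617.51
```

With this sign convention, the model with the *highest* BIC is preferred, matching the selection rule described for model = "ALL" above.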
hddc returns an 'hdc' object; it is a list containing:
model 
The name of the model. 
K 
The number of classes. 
d 
The dimensions of each class. 
a 
The parameters of each class subspace. 
b 
The noise of each class subspace. 
mu 
The mean of each variable for each class. 
prop 
The proportion of each class. 
ev 
The eigenvalues of the variance/covariance matrix. 
Q 
The orthogonal matrix of orientation of each class. 
loglik 
The log-likelihood. 
loglik_all 
The log-likelihood of all iterations. 
posterior 
The matrix of the probabilities to belong to a class for each observation and each class. 
class 
The class vector obtained by the clustering. 
com_ev 
Only if this is a common-dimension model: the eigenvalues of the variance/covariance matrix of the whole dataset. 
N 
The number of observations. 
complexity 
The number of parameters of the model. 
threshold 
The threshold used for Cattell's scree-test. 
d_select 
The way the dimensions were selected. 
BIC 
The BIC of the model. 
ICL 
The ICL of the model. 
criterion 
The criterion used to select the model. 
call 
The call. 
allCriteria 
The data.frame with the combinations (model, K, threshold) and the associated values of the log-likelihood (LL), BIC and ICL, as well as the rank of each model with respect to the selection criterion. It also reports the original order in which the models were estimated, and each model's complexity. 
all_results 
Only if keepAllRes = TRUE: a list containing the results of all estimations. 
scaling 
Only if scaling = TRUE: the centering and scaling values applied to each variable. 
id_subset 
Only if subset was used: the indices of the observations on which the model was estimated. 
Laurent Berge, Charles Bouveyron and Stephane Girard
Bouveyron, C., Girard, S. and Schmid, C. (2007) “High-Dimensional Data Clustering”, Computational Statistics and Data Analysis, vol. 52 (1), pp. 502–519
Berge, L., Bouveyron, C. and Girard, S. (2012) “HDclassif: An R Package for Model-Based Clustering and Discriminant Analysis of High-Dimensional Data”, Journal of Statistical Software, 46(6), 1–29, url: http://www.jstatsoft.org/v46/i06/
# Example 1:
data <- simuldata(1000, 1000, 50)
X <- data$X
clx <- data$clx
Y <- data$Y
cly <- data$cly
# Clustering of the simulated dataset:
prms1 <- hddc(X, K=3, algo="CEM", init='param')
# Class vector obtained by the clustering:
prms1$class
# We can look at the adjusted Rand index to assess the goodness of fit
res1 <- predict(prms1, X, clx)
res2 <- predict(prms1, Y)
# The class predicted using hddc parameters on the test dataset:
res2$class
# Example 2:
data(Crabs)
# Clustering of the Crabs dataset:
prms3 <- hddc(Crabs[,-1], K=4, algo="EM", init='mini-em')
res3 <- predict(prms3, Crabs[,-1], Crabs[,1])
# Another example using the Crabs dataset
prms4 <- hddc(Crabs[,-1], K=1:8, model=c(1,2,7,9))
# model=c(1,2,7,9) is equivalent to:
# model=c("AKJBKQKDK","AKBKQKDK","AKJBKQKD","ABKQKD")
res4 <- predict(prms4, Crabs[,-1], Crabs[,1])
# PARALLEL COMPUTING
## Not run:
# Same example but with Parallel Computing => platform specific
# (slower for Windows users)
# To enable it, just use the argument 'mc.cores'
prms5 <- hddc(Crabs[,-1], K=1:8, model=c(1,2,7,9), mc.cores=2)
## End(Not run)
# LARGE DATASETS
# Assume you have a very large data set
# => you can use the argument 'subset' to obtain quick results:
## Not run:
# we take a subset of 10000 observations and run hddc
# once the classification is done, the posterior is computed
# on the full data
prms <- hddc(bigData, subset = 10000)
# You obtain a much faster (although less precise)
# classification of the full dataset:
table(prms$class)
## End(Not run)
