Implements the EM* (see the list of references) and EM algorithms for clustering univariate and multivariate Gaussian mixture data.
Random Initialization: Initializes the means randomly. See
meu_mv for random initialization
on multivariate data; univariate data is handled analogously.
Improved Initialization: Based on the K-means++ idea published in
"K-means++: The Advantages of Careful Seeding" by David Arthur and Sergei Vassilvitskii,
URL http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf. See
meu_mv_impr for details.
The choice of initialization scheme can be specified with the seeding
parameter during training. See
dcem_train for further details.
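The improved initialization can be illustrated with a minimal sketch of K-means++ style seeding. This is written in plain Python for illustration only and is not the package's meu_mv_impr implementation; the names kmeanspp_seed and sq_dist are hypothetical. The first mean is drawn uniformly at random, and each later mean is drawn with probability proportional to its squared distance from the nearest mean chosen so far.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_seed(points, k, rng=None):
    """Pick k initial means from `points` using K-means++ style seeding.

    Hypothetical helper for illustration; not part of the DCEM package.
    """
    rng = rng or random.Random()
    # First mean: chosen uniformly at random from the data.
    means = [rng.choice(points)]
    # Squared distance from every point to its nearest chosen mean.
    d2 = [sq_dist(p, means[0]) for p in points]
    while len(means) < k:
        total = sum(d2)
        if total == 0:
            # Every point coincides with a chosen mean; fall back to uniform.
            means.append(rng.choice(points))
            continue
        # Draw the next mean with probability proportional to d2.
        r = rng.random() * total
        acc = 0.0
        # Fallback for floating-point edge cases: the farthest point.
        idx = max(range(len(points)), key=d2.__getitem__)
        for i, w in enumerate(d2):
            acc += w
            if acc > r:
                idx = i
                break
        means.append(points[idx])
        # Refresh nearest-mean distances with the newly added mean.
        d2 = [min(old, sq_dist(p, points[idx])) for old, p in zip(d2, points)]
    return means


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    print(kmeanspp_seed(data, 2, random.Random(1)))
```

Points far from the existing means are favored, so the initial means tend to land in different clusters; random initialization gives no such bias.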
Cleaning the data:
The data should be cleaned before training, that is, redundant columns should be removed. Examples are
columns containing the labels or redundant entries (such as a column of
all 0's or all 1's). See
trim_data for details on
cleaning the data. Refer to
dcem_test for more details.
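The cleaning step described above can be sketched as follows. This is an illustrative Python helper, not the trim_data routine itself; the name drop_columns and its label_columns argument are assumptions made for the example. It removes the label columns the caller names, plus any column whose entries are all identical (for instance all 0's or all 1's).

```python
def drop_columns(rows, label_columns=()):
    """Return (cleaned_rows, kept_indices).

    Hypothetical helper for illustration; not part of the DCEM package.
    Drops the columns listed in `label_columns` and every constant column.
    """
    if not rows:
        return [], []
    n_cols = len(rows[0])
    drop = set(label_columns)
    for j in range(n_cols):
        # A column with a single distinct value carries no information.
        if len({row[j] for row in rows}) == 1:
            drop.add(j)
    keep = [j for j in range(n_cols) if j not in drop]
    return [[row[j] for j in keep] for row in rows], keep


if __name__ == "__main__":
    raw = [
        [1.2, 0, "a", 3.4],
        [2.3, 0, "b", 1.1],
        [0.7, 0, "a", 2.2],
    ]
    cleaned, kept = drop_columns(raw, label_columns=[2])
    print(kept)     # indices of the surviving columns
    print(cleaned)  # numeric data ready for clustering
```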
The function dcem_test() returns a list of objects. This list contains the parameters associated with the Gaussian(s): posterior probabilities (prob), means (meu), covariance/standard deviation (sigma), priors (prior), and cluster membership for the data (membership).
Note: The routine dcem_test() is for demonstration purposes only.
dcem_test calls the main routine
dcem_train; see dcem_train for further details and
dcem_star_train for examples.
The package is organized as a set of preprocessing functions and the core clustering modules. These functions are briefly described below.
trim_data: This is used to remove columns
from the dataset. The user should clean the dataset before
calling the dcem_train routine. Users can also clean the dataset themselves
(without using trim_data) and then pass it to the dcem_train function.
dcem_train and dcem_star_train: These are the primary
interfaces to the EM and EM* algorithms, respectively. These functions accept the cleaned dataset and other
parameters (number of iterations, convergence threshold, etc.) and run the algorithm until either:
The maximum number of iterations is reached.
Convergence is achieved.
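The two stopping conditions can be illustrated with a minimal univariate EM loop. This is a plain-Python sketch, not the package's dcem_train code: the function name em_univariate, its arguments, and the choice to measure convergence as the largest change in any mean between iterations are all assumptions made for the example. The keys of the returned dictionary reuse the output names described earlier (prob, meu, sigma, prior, membership).

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def em_univariate(data, init_means, max_iter=200, threshold=1e-5):
    """Fit a univariate Gaussian mixture with EM (illustrative sketch only)."""
    n, k = len(data), len(init_means)
    means = list(init_means)
    centre = sum(data) / n
    spread = math.sqrt(sum((x - centre) ** 2 for x in data) / n) or 1.0
    sds = [spread] * k        # start every component at the overall spread
    priors = [1.0 / k] * k    # start with equal priors
    prob, converged, iterations = [], False, 0
    for iterations in range(1, max_iter + 1):   # stop 1: iteration cap
        # E-step: posterior probability of each component for each point.
        prob = []
        for x in data:
            w = [priors[j] * normal_pdf(x, means[j], sds[j]) for j in range(k)]
            total = sum(w)
            prob.append([v / total for v in w] if total > 0 else [1.0 / k] * k)
        # M-step: re-estimate means, standard deviations and priors.
        new_means = []
        for j in range(k):
            nj = max(sum(p[j] for p in prob), 1e-12)
            mean_j = sum(p[j] * x for p, x in zip(prob, data)) / nj
            var_j = sum(p[j] * (x - mean_j) ** 2 for p, x in zip(prob, data)) / nj
            new_means.append(mean_j)
            sds[j] = max(math.sqrt(var_j), 1e-6)   # floor avoids a zero sd
            priors[j] = nj / n
        shift = max(abs(a - b) for a, b in zip(new_means, means))
        means = new_means
        if shift < threshold:                   # stop 2: convergence
            converged = True
            break
    # Hard assignment: the component with the highest posterior probability.
    membership = [max(range(k), key=row.__getitem__) for row in prob]
    return {"prob": prob, "meu": means, "sigma": sds, "prior": priors,
            "membership": membership, "iterations": iterations,
            "converged": converged}


if __name__ == "__main__":
    sample = [0.9, 1.0, 1.1, 1.05, 0.95, 5.0, 5.1, 4.9, 5.05, 4.95]
    fit = em_univariate(sample, init_means=[0.0, 6.0])
    print(fit["meu"], fit["iterations"], fit["converged"])
```

Whichever condition is met first ends the run, so a small iteration cap can return parameters that have not yet converged; the converged flag in this sketch makes that visible.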
Parichit Sharma firstname.lastname@example.org, Hasan Kurban, Mark Jenne, Mehmet Dalkilic
This work is partially supported by NCI Grant 1R01CA213466-01.
External Packages: DCEM requires the R packages 'mvtnorm', 'matrixcalc', 'Rcpp' and 'MASS' for multivariate density calculation, checking matrix singularity, compiling routines written in C++, and simulating mixtures of Gaussians, respectively.
For improving the initialization, ideas published in the K-means++ paper by Arthur and Vassilvitskii (listed below) are used.
 Alan Genz, Frank Bretz, Tetsuhisa Miwa, Xuefei Mi, Friedrich Leisch, Fabian Scheipl, Torsten Hothorn (2019). mvtnorm: Multivariate Normal and t Distributions. R package version 1.0-7. URL http://CRAN.R-project.org/package=mvtnorm
 Frederick Novomestky (2012). matrixcalc: Collection of functions for matrix calculations. R package version 1.0-3. https://CRAN.R-project.org/package=matrixcalc
 Dirk Eddelbuettel and Romain Francois (2011). Rcpp: Seamless R and C++ Integration. Journal of Statistical Software, 40(8), 1-18. URL http://www.jstatsoft.org/v40/i08/.
 Venables, W. N. & Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth Edition. Springer, New York. ISBN 0-387-95457-0
 K-Means++: The Advantages of Careful Seeding, David Arthur and Sergei Vassilvitskii. URL http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf
 Hasan Kurban, Mark Jenne, Mehmet M. Dalkilic (2016). Using data to build a better EM: EM* for big data. <https://doi.org/10.1007/s41060-017-0062-1>.