DCEM: Data clustering through Expectation-Maximization...

Description

Implements the EM and EM* algorithms (see the list of references) for clustering univariate and multivariate Gaussian mixture data.

DCEM supports the following initialization schemes

  1. Random Initialization: Initializes the means randomly. See meu_uv and meu_mv for initialization on univariate and multivariate data, respectively.

  2. Improved Initialization: Based on the K-means++ idea published in "K-means++: The Advantages of Careful Seeding" by David Arthur and Sergei Vassilvitskii. URL http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf. See meu_uv_impr and meu_mv_impr for details.

  The initialization scheme is selected through the seeding parameter during training. See dcem_train for further details.
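The idea behind the improved initialization can be illustrated with a minimal sketch of K-means++ style seeding. It is written in plain Python for illustration only and is not the code behind meu_uv_impr or meu_mv_impr; the function name kmeanspp_seed and its arguments are hypothetical. Each new mean is sampled with probability proportional to its squared distance from the nearest mean chosen so far.

```python
import random

def kmeanspp_seed(points, k, rng=None):
    """Pick k initial means from `points` (a list of tuples) by D^2 weighting.

    Hypothetical helper for illustration; not part of DCEM.
    """
    rng = rng or random.Random()
    # The first mean is drawn uniformly at random.
    means = [rng.choice(points)]
    while len(means) < k:
        # Squared distance from every point to its nearest chosen mean.
        d2 = [
            min(sum((a - b) ** 2 for a, b in zip(p, m)) for m in means)
            for p in points
        ]
        if sum(d2) == 0:
            # All points coincide with chosen means: fall back to uniform.
            means.append(rng.choice(points))
        else:
            # Far-away points are more likely to become the next mean.
            means.append(rng.choices(points, weights=d2, k=1)[0])
    return means
```

Univariate data can be passed as 1-tuples, for example `[(0.3,), (4.1,), (9.7,)]`.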

Demonstration and Testing

Cleaning the data: The data should be cleaned before clustering by removing redundant columns, for example columns containing the labels or columns with redundant entries (such as a column of all 0's or all 1's). See trim_data for details on cleaning the data and dcem_test for a demonstration.
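As a rough illustration of what cleaning means here, the following plain-Python sketch drops a set of named column positions (for example, a label column) together with any column whose entries are all identical. It is not the trim_data implementation; the function name drop_columns and its arguments are assumptions made for this example.

```python
def drop_columns(rows, drop=()):
    """Return `rows` without the column indices in `drop` and without
    constant columns. Hypothetical helper for illustration; not part of DCEM.
    """
    n_cols = len(rows[0])
    keep = [
        j for j in range(n_cols)
        # Keep a column only if it is not excluded and has at least
        # two distinct values.
        if j not in drop and len({row[j] for row in rows}) > 1
    ]
    return [[row[j] for j in keep] for row in rows]
```

For instance, with a dataset whose second column is all zeros and whose last column holds class labels, calling `drop_columns(rows, drop={3})` leaves only the informative numeric columns.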

Understanding the output of dcem_test

The function dcem_test() returns a list of objects. This list contains the parameters associated with the Gaussian(s): posterior probabilities (prob), means (meu), covariance or standard deviation (sigma), priors (prior), and the cluster membership of the data (membership).

Note: The routine dcem_test() is for demonstration purposes only. It calls the main routine dcem_train. See dcem_train for further details.

How to run on your dataset

See dcem_train and dcem_star_train for examples.

Package organization

The package is organized as a set of preprocessing functions and the core clustering modules. These functions are briefly described below.

  1. trim_data: Removes columns from the dataset. The user should clean the dataset before calling the dcem_train routine. Users can also clean the dataset themselves (without using trim_data) and then pass it to the dcem_train function.

  2. dcem_train and dcem_star_train: These are the primary interfaces to the EM and EM* algorithms, respectively. These functions accept the cleaned dataset and other parameters (number of iterations, convergence threshold, etc.) and run the algorithm until either:

    1. The maximum number of iterations is reached, or

    2. Convergence is achieved.
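The two stopping conditions can be sketched with a small EM loop for a univariate Gaussian mixture. This is a plain-Python illustration under stated assumptions, not the dcem_train code: the function name em_univariate, its argument names, and the convergence test (largest change in any mean falling below the threshold) are chosen for this example, and the sketch has no safeguards against numerical underflow for points far from every component.

```python
import math

def em_univariate(data, means, sigmas, priors, max_iter=100, threshold=1e-6):
    """Fit a 1-D Gaussian mixture with EM. Hypothetical helper for
    illustration; not part of DCEM.

    Stops when max_iter is reached or when no mean moves by more than
    `threshold` between two consecutive iterations.
    """
    k = len(means)
    for iteration in range(1, max_iter + 1):
        # E-step: posterior probability of each component for each point.
        post = []
        for x in data:
            w = [
                priors[j]
                * math.exp(-0.5 * ((x - means[j]) / sigmas[j]) ** 2)
                / (sigmas[j] * math.sqrt(2 * math.pi))
                for j in range(k)
            ]
            total = sum(w)
            post.append([wj / total for wj in w])
        # M-step: re-estimate priors, means and standard deviations.
        new_means, new_sigmas, new_priors = [], [], []
        for j in range(k):
            nj = sum(p[j] for p in post)
            mu = sum(p[j] * x for p, x in zip(post, data)) / nj
            var = sum(p[j] * (x - mu) ** 2 for p, x in zip(post, data)) / nj
            new_means.append(mu)
            # Floor the standard deviation to avoid division by zero.
            new_sigmas.append(max(math.sqrt(var), 1e-6))
            new_priors.append(nj / len(data))
        shift = max(abs(a - b) for a, b in zip(means, new_means))
        means, sigmas, priors = new_means, new_sigmas, new_priors
        if shift < threshold:
            break  # stopping condition 2: convergence
    # Falling out of the loop without `break` is stopping condition 1.
    return means, sigmas, priors, iteration
```

The returned iteration count shows which condition ended the run: a value below max_iter means the convergence test fired first.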


Author(s)

Parichit Sharma <parishar@iu.edu>, Hasan Kurban, Mark Jenne, Mehmet Dalkilic

This work is partially supported by NCI Grant 1R01CA213466-01.

References

External Packages: DCEM requires the R packages 'mvtnorm' [1], 'matrixcalc' [2], 'Rcpp' [3], and 'MASS' [4] for multivariate density calculation, checking matrix singularity, compiling routines written in C++, and simulating mixtures of Gaussians, respectively.

To improve the initialization, the ideas published in [5] are used.

[1] Alan Genz, Frank Bretz, Tetsuhisa Miwa, Xuefei Mi, Friedrich Leisch, Fabian Scheipl, Torsten Hothorn (2019). mvtnorm: Multivariate Normal and t Distributions. R package version 1.0-7. URL http://CRAN.R-project.org/package=mvtnorm

[2] Frederick Novomestky (2012). matrixcalc: Collection of functions for matrix calculations. R package version 1.0-3. URL https://CRAN.R-project.org/package=matrixcalc

[3] Dirk Eddelbuettel and Romain Francois (2011). Rcpp: Seamless R and C++ Integration. Journal of Statistical Software, 40(8), 1-18. URL http://www.jstatsoft.org/v40/i08/.

[4] Venables, W. N. & Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth Edition. Springer, New York. ISBN 0-387-95457-0

[5] K-Means++: The Advantages of Careful Seeding, David Arthur and Sergei Vassilvitskii. URL http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf


[6] Hasan Kurban, Mark Jenne, Mehmet M. Dalkilic (2016). Using data to build a better EM: EM* for big data. <https://doi.org/10.1007/s41060-017-0062-1>.

parichit/DCEM documentation built on Aug. 8, 2020, 1:17 a.m.