Description; DCEM supports the following initialization schemes; Demonstration and Testing; Understanding the output of dcem_test; How to run on your dataset; Package organization; Author(s); References
Implements the EM* (see the list of references) and EM algorithms for clustering univariate and multivariate Gaussian mixture data.
Random Initialization: Initializes the means randomly. Refer to meu_uv and meu_mv for initialization on univariate and multivariate data, respectively.
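As a rough illustration of random initialization, the base R sketch below picks k distinct observations as starting means. The helper name random_means and the choice to draw the means from the data rows are assumptions made for this example; they are not taken from meu_uv or meu_mv.

```r
# Hypothetical helper, not part of DCEM: use k randomly chosen rows as initial means.
random_means <- function(data, k) {
  data <- as.matrix(data)
  idx <- sample.int(nrow(data), k)   # k distinct row indices, uniformly at random
  data[idx, , drop = FALSE]          # k x d matrix of starting means
}

set.seed(1)
x_rand <- matrix(rnorm(200), ncol = 2)
init_rand <- random_means(x_rand, k = 3)
dim(init_rand)
```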
Improved Initialization: Based on the K-means++ idea published in K-means++: The Advantages of Careful Seeding, David Arthur and Sergei Vassilvitskii. URL http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf. See meu_uv_impr and meu_mv_impr for details.
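The K-means++ idea can be sketched in base R as follows: the first center is drawn uniformly, and each later center is drawn with probability proportional to the squared distance to the nearest center chosen so far. The function name kmeanspp_means is hypothetical, and this is a minimal sketch of the published seeding rule, not the code behind meu_uv_impr or meu_mv_impr.

```r
# Hypothetical helper (not DCEM's implementation): K-means++ style seeding.
kmeanspp_means <- function(data, k) {
  data <- as.matrix(data)
  n <- nrow(data)
  # squared Euclidean distance from every row to row i
  sq_dist_to <- function(i) {
    rowSums((data - matrix(data[i, ], n, ncol(data), byrow = TRUE))^2)
  }
  centers <- integer(k)
  centers[1] <- sample.int(n, 1)               # first center: uniform at random
  d2 <- sq_dist_to(centers[1])                 # distance to nearest chosen center
  for (j in seq_len(k)[-1]) {
    centers[j] <- sample.int(n, 1, prob = d2)  # draw proportionally to D(x)^2
    d2 <- pmin(d2, sq_dist_to(centers[j]))     # refresh nearest-center distances
  }
  data[centers, , drop = FALSE]
}

set.seed(2)
x_pp <- rbind(matrix(rnorm(100, 0), ncol = 2), matrix(rnorm(100, 8), ncol = 2))
init_pp <- kmeanspp_means(x_pp, k = 3)
dim(init_pp)
```

Already chosen points have zero distance to themselves, so they receive zero sampling weight and are not selected twice.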
The choice of initialization scheme is specified through the seeding parameter during training. See dcem_train for further details.
Cleaning the data: The data should be cleaned before training by removing redundant columns, for example columns containing the labels or redundant entries (such as a column of all 0's or 1's). See trim_data for details on cleaning the data. Refer to dcem_test for more details.
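For readers who prefer to clean the data themselves, the base R sketch below drops a label column and any constant column. The helper clean_columns and the toy data frame are made up for this example; it does not reproduce the interface of trim_data.

```r
# Hypothetical helper, not DCEM's trim_data: drop named and constant columns.
clean_columns <- function(df, drop = character(0)) {
  df <- df[, !(names(df) %in% drop), drop = FALSE]   # remove e.g. label columns
  is_constant <- vapply(df, function(col) length(unique(col)) <= 1, logical(1))
  df[, !is_constant, drop = FALSE]                   # remove all-0 / all-1 columns
}

raw <- data.frame(
  x1 = c(0.2, 1.5, 3.1),
  x2 = c(1, 1, 1),            # constant column, carries no information
  label = c("a", "b", "a")    # class labels, not a feature
)
cleaned <- clean_columns(raw, drop = "label")
names(cleaned)
```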
Understanding the output of dcem_test
The function dcem_test() returns a list of objects. This list contains the parameters associated with the Gaussian(s): posterior probabilities (prob), mean (meu), covariance/standard deviation (sigma), priors (prior), and cluster membership for the data (membership).
Note: The routine dcem_test() is for demonstration purposes only. The function dcem_test calls the main routine dcem_train. See dcem_train for further details.
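To make the relationship between posterior probabilities and cluster membership concrete, the sketch below assigns each observation to the component with the largest posterior. The toy matrix and its layout (one row per observation, one column per Gaussian) are assumptions for illustration; the actual shape of prob in the returned list may differ.

```r
# Toy posterior matrix (assumed layout: rows = observations, columns = Gaussians).
prob_toy <- rbind(
  c(0.9, 0.1),
  c(0.2, 0.8),
  c(0.6, 0.4)
)

# Hard membership: index of the most probable component for each observation.
membership_toy <- max.col(prob_toy, ties.method = "first")
membership_toy
```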
See dcem_train and dcem_star_train for examples.
The package is organized as a set of preprocessing functions and the core clustering modules. These functions are briefly described below.
trim_data: This is used to remove columns from the dataset. The user should clean the dataset before calling the dcem_train routine. Users can also clean the dataset themselves (without using trim_data) and then pass it to the dcem_train function.
dcem_star_train and dcem_train: These are the primary interfaces to the EM* and EM algorithms, respectively. These functions accept the cleaned dataset and other parameters (number of iterations, convergence threshold, etc.) and run the algorithm until one of the following holds:
The specified number of iterations is reached.
Convergence is achieved.
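The two stopping conditions above can be sketched with a plain EM loop for a univariate Gaussian mixture in base R. The function em_univariate, its argument names, and the use of the log-likelihood change as the convergence test are assumptions for this example; the package may measure convergence differently.

```r
# Hypothetical sketch, not DCEM's code: EM for a univariate Gaussian mixture.
em_univariate <- function(x, k, iteration_count = 200, threshold = 1e-6) {
  meu <- sample(x, k)                        # random starting means
  sigma <- rep(sd(x), k)
  prior <- rep(1 / k, k)
  old_loglik <- -Inf
  for (iter in seq_len(iteration_count)) {   # stop 1: iteration budget exhausted
    # E-step: prior-weighted densities, one column per component
    dens <- sapply(seq_len(k), function(j) prior[j] * dnorm(x, meu[j], sigma[j]))
    prob <- dens / rowSums(dens)             # posterior probabilities
    # M-step: re-estimate means, standard deviations and priors
    nk <- colSums(prob)
    meu <- colSums(prob * x) / nk
    sigma <- sqrt(colSums(prob * outer(x, meu, "-")^2) / nk)
    prior <- nk / length(x)
    loglik <- sum(log(rowSums(dens)))
    if (abs(loglik - old_loglik) < threshold) break   # stop 2: convergence
    old_loglik <- loglik
  }
  list(meu = meu, sigma = sigma, prior = prior, prob = prob, iterations = iter)
}

set.seed(42)
x_em <- c(rnorm(200, mean = 0), rnorm(200, mean = 10))
fit_em <- em_univariate(x_em, k = 2)
fit_em$iterations
```

Whichever condition is met first ends the loop, so a loose threshold trades accuracy for fewer iterations.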
Parichit Sharma parishar@iu.edu, Hasan Kurban, Mark Jenne, Mehmet Dalkilic
This work is partially supported by NCI Grant 1R01CA213466-01.
External Packages: DCEM requires the R packages 'mvtnorm' [1], 'matrixcalc' [2], 'Rcpp' [3] and 'MASS' [4] for multivariate density calculation, checking matrix singularity, compiling routines written in C, and simulating mixtures of Gaussians, respectively.
For improving the initialization, the ideas published in [5] are used.
[1] Alan Genz, Frank Bretz, Tetsuhisa Miwa, Xuefei Mi, Friedrich Leisch, Fabian Scheipl, Torsten Hothorn (2019). mvtnorm: Multivariate Normal and t Distributions. R package version 1.0-7. URL http://CRAN.R-project.org/package=mvtnorm
[2] Frederick Novomestky (2012). matrixcalc: Collection of functions for matrix calculations. R package version 1.0-3. URL https://CRAN.R-project.org/package=matrixcalc
[3] Dirk Eddelbuettel and Romain Francois (2011). Rcpp: Seamless R and C++ Integration. Journal of Statistical Software, 40(8), 1-18. URL http://www.jstatsoft.org/v40/i08/
[4] W. N. Venables and B. D. Ripley (2002). Modern Applied Statistics with S. Fourth Edition. Springer, New York. ISBN 0-387-95457-0
[5] David Arthur and Sergei Vassilvitskii. K-Means++: The Advantages of Careful Seeding. URL http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf
[6] Hasan Kurban, Mark Jenne, Mehmet M. Dalkilic (2016). Using data to build a better EM: EM* for big data. URL https://doi.org/10.1007/s41060-017-0062-1