---
title: "VarSelLCM"
author: "Variable Selection for Model-Based Clustering of Mixed-Type Data Set with Missing Values"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Vignette VarSelLCM}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

    knitr::opts_chunk$set(echo = TRUE, include = TRUE, cache = TRUE, comment = "",
                          cache.lazy = FALSE, out.width = "70%", fig.align = "center")

**References:**

- Marbac, M. and Sedki, M. (2017), Variable selection for model-based clustering using the integrated complete-data likelihood, Statistics and Computing, Volume 27, Issue 4, pp 1049–1063.
- Marbac, M., Patin, E. and Sedki, M. (2018), Variable selection for mixed data clustering: Application in human population genomics, Journal of Classification, to appear.

This section performs the whole analysis of the *Heart* data set. *Warning: the univariate marginal distributions are determined by the class of each feature: numeric columns imply Gaussian distributions, integer columns imply Poisson distributions, and factor (or ordered) columns imply multinomial distributions.*

    library(VarSelLCM)
    # Data loading:
    # x contains the observed variables
    # ztrue contains the known status (1: absence, 2: presence of heart disease)
    data(heart)
    ztrue <- heart[, "Class"]
    x <- heart[, -13]
    # Add a missing value artificially (just to show that it works!)
    x[1, 1] <- NA
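
Because the column class decides which distribution is fitted, it can help to inspect and, if needed, coerce the classes before clustering. The following is a minimal sketch in base R only; the data frame `toy` and its columns are hypothetical and are not part of the *Heart* data.

```r
# Toy data frame (hypothetical, not the heart data) with three column classes.
toy <- data.frame(
  pressure = c(120.5, 131.0, 118.2),  # numeric -> Gaussian
  visits   = c(2L, 0L, 5L),           # integer -> Poisson
  sex      = c("F", "M", "F"),        # character, not yet a factor
  stringsAsFactors = FALSE
)

# Inspect the class of every column before clustering.
sapply(toy, class)

# Coerce columns so that each one has the intended class.
toy$sex <- factor(toy$sex)                # factor -> multinomial
toy$pressure <- as.numeric(toy$pressure)  # make sure it is not stored as integer
sapply(toy, class)
```

The same `sapply(x, class)` check can be run on `x` to see which distribution each feature of the *Heart* data will receive.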

Clustering is performed both without and with variable selection. Model selection uses the BIC because the number of observations is large relative to the number of features. The number of components ranges from 1 to 3. Parallelization is recommended (here, only four cores are used).

    # Cluster analysis without variable selection
    res_without <- VarSelCluster(x, gvals = 1:3, vbleSelec = FALSE, crit.varsel = "BIC")
    # Cluster analysis with variable selection (with parallelisation)
    res_with <- VarSelCluster(x, gvals = 1:3, nbcores = 4, crit.varsel = "BIC")
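
The value of `nbcores` is hard-coded to 4 above. One possible way to choose it from the machine at hand is sketched below with the base `parallel` package. Leaving one core free and capping at four are assumptions made for illustration, not recommendations from the package.

```r
# Sketch: pick a core count to pass as `nbcores` (base R only).
library(parallel)

n_detected <- detectCores()   # may return NA on some platforms
if (is.na(n_detected)) n_detected <- 1L

# Leave one core free, but never go below 1 or above the 4 used in this vignette.
nbcores <- max(1L, min(4L, n_detected - 1L))
nbcores
```

The resulting value can then be supplied as `nbcores = nbcores` in the call to `VarSelCluster`.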

Comparison of the BIC for both models: variable selection improves the BIC.

    BIC(res_without)
    BIC(res_with)

Evaluation of the partition accuracy: the Adjusted Rand Index (ARI) is computed between the true partition (ztrue) and its estimates. The expected ARI is zero if the two partitions are independent, and the ARI equals one if the partitions are identical. Variable selection improves the ARI. Note that the ARI cannot be used for model selection in clustering, because in practice the true partition is unknown.

    ARI(ztrue, fitted(res_without))
    ARI(ztrue, fitted(res_with))
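
To make the index concrete, here is a minimal base R sketch of the standard Adjusted Rand Index computed from a contingency table. The helper `ari_sketch` is hypothetical and is not the package's `ARI` function; it only illustrates the usual formula on toy label vectors.

```r
# Hypothetical helper (not part of VarSelLCM): ARI from a contingency table.
# Returns NaN in the degenerate case where both partitions have a single group.
ari_sketch <- function(a, b) {
  tab <- table(a, b)
  sum_cells <- sum(choose(tab, 2))            # pairs grouped together in both partitions
  sum_rows  <- sum(choose(rowSums(tab), 2))   # pairs grouped together in a
  sum_cols  <- sum(choose(colSums(tab), 2))   # pairs grouped together in b
  expected  <- sum_rows * sum_cols / choose(length(a), 2)
  max_index <- (sum_rows + sum_cols) / 2
  (sum_cells - expected) / (max_index - expected)
}

z1 <- c(1, 1, 1, 2, 2, 2)
ari_sketch(z1, z1)                    # identical partitions: 1
ari_sketch(z1, c(2, 2, 2, 1, 1, 1))   # same partition, labels swapped: 1
ari_sketch(z1, c(1, 1, 2, 2, 2, 2))   # one observation moved: below 1
```

The second call shows that the index depends only on the grouping and not on the cluster labels, which is why it is suited to comparing an estimated partition with a known one.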

To obtain the partition and the probabilities of classification:

    # Estimated partition
    fitted(res_with)
    # Estimated probabilities of classification
    head(fitted(res_with, type = "probability"))
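
The link between the two outputs can be illustrated on a toy example, assuming the partition follows the usual maximum a posteriori (MAP) rule. The matrix `probs` below contains hypothetical values and is not produced by the package.

```r
# Toy matrix of classification probabilities (hypothetical values):
# one row per observation, one column per cluster.
probs <- matrix(c(0.90, 0.10,
                  0.20, 0.80,
                  0.55, 0.45),
                ncol = 2, byrow = TRUE,
                dimnames = list(NULL, c("class-1", "class-2")))

# MAP rule: assign each observation to its most probable cluster.
partition <- apply(probs, 1, which.max)
partition

# Largest probability per row: values near 1 indicate a confident assignment.
confidence <- apply(probs, 1, max)
confidence
```

Rows whose largest probability is close to 0.5, such as the third one here, correspond to observations that sit between clusters.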

To get a summary of the selected model.

    # Summary of the best model
    summary(res_with)

Discriminative power of the variables (here, the most discriminative variable is MaxHeartRate). The greater this index, the more the variable distinguishes the clusters.

```
plot(res_with)
```

Distribution of the most discriminative variable per cluster

    # Boxplot for the continuous variable MaxHeartRate
    plot(x = res_with, y = "MaxHeartRate")

Empirical and theoretical distributions of the most discriminative variable (to check that the distribution is well-fitted)

    # Empirical and theoretical distributions (to check that the distribution is well-fitted)
    plot(res_with, y = "MaxHeartRate", type = "cdf")

Distribution of a categorical variable per cluster

    # Summary of categorical variable
    plot(res_with, y = "Sex")

To have details about the selected model

    # More detailed output
    print(res_with)

To print the parameters

    # Print model parameters
    coef(res_with)

Probabilities of classification for new observations

    # Probabilities of classification for new observations
    predict(res_with, newdata = x[1:3, ])

The model can be used for imputation (of the clustered data or of a new observation)

    # Imputation of the first observation by sampling (method = "sampling")
    not.imputed <- x[1, ]
    imputed <- VarSelImputation(res_with, x[1, ], method = "sampling")
    rbind(not.imputed, imputed)

All the results can be analyzed by the Shiny application...

    # Start the shiny application
    VarSelShiny(res_with)

... but this analysis can also be done in R.
