VarSelLCM-package: Variable Selection for Model-Based Clustering of Mixed-Type...


Description

Model-based clustering with variable selection and estimation of the number of clusters. The data may be continuous, categorical, integer-valued or mixed. Missing values are handled directly and require no pre-processing. A Shiny application makes the results easy to interpret.

Details

Package: VarSelLCM
Type: Package
Version: 2.1.2
Date: 2018-06-04
License: GPL-3
LazyLoad: yes
URL: http://varsellcm.r-forge.r-project.org/

The main function is VarSelCluster, which carries out model selection (according to AIC, BIC or MICL) and maximum likelihood estimation.
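As a rough illustration of what a penalized-likelihood criterion measures, the sketch below computes the classical AIC and BIC from a log-likelihood, a number of free parameters and a sample size, and checks them against base R on a linear model. The helper name info_criteria is hypothetical and not part of VarSelLCM; the package may report its criteria on a different scale or with a different sign convention, and MICL is not covered here.

```r
# Hypothetical helper (not part of VarSelLCM): classical AIC and BIC,
# where smaller values indicate a better trade-off between fit and complexity.
info_criteria <- function(loglik, n_params, n_obs) {
  c(AIC = -2 * loglik + 2 * n_params,
    BIC = -2 * loglik + n_params * log(n_obs))
}

# Compare with base R on a simple linear model fitted to the built-in cars data
fit <- lm(dist ~ speed, data = cars)
ll <- logLik(fit)
ic <- info_criteria(as.numeric(ll), attr(ll, "df"), nobs(fit))
ic
```

Both criteria penalize the log-likelihood by the number of free parameters; BIC's penalty also grows with the sample size, so it tends to favor smaller models than AIC.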

Function VarSelShiny runs a Shiny application for easy interpretation of the clustering results.

Function VarSelImputation imputes missing values using the model parameters.
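To give an idea of how model parameters can drive imputation, here is a minimal base R sketch of posterior-mean imputation under a Gaussian mixture with conditionally independent variables. It is not taken from the package source: the function impute_posterior_mean and all parameter values are hypothetical, and the actual behaviour of VarSelImputation may differ.

```r
# Hypothetical parameters of a 2-cluster mixture with two continuous variables
proportions <- c(0.4, 0.6)
means <- rbind(c(0, 10), c(5, 20))  # rows: clusters, columns: variables
sds   <- rbind(c(1, 2),  c(1, 2))

# Hypothetical helper: replace each NA by its posterior mean given the observed values
impute_posterior_mean <- function(obs, proportions, means, sds) {
  observed <- !is.na(obs)
  # Log of (proportion * density of the observed variables) for each cluster
  log_w <- log(proportions)
  for (j in which(observed)) {
    log_w <- log_w + dnorm(obs[j], means[, j], sds[, j], log = TRUE)
  }
  # Posterior cluster probabilities, normalized in a numerically stable way
  post <- exp(log_w - max(log_w))
  post <- post / sum(post)
  # Missing entries: cluster means weighted by the posterior probabilities
  obs[!observed] <- as.vector(post %*% means[, !observed, drop = FALSE])
  obs
}

# The first variable is equally likely under both clusters here,
# so the posterior probabilities equal the mixing proportions
imputed_obs <- impute_posterior_mean(c(2.5, NA), proportions, means, sds)
imputed_obs
```

Drawing the cluster label and then the missing value from the fitted distributions, instead of averaging, would give a sampling-based variant of the same idea.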

Standard methods (e.g., summary, print, plot, coef, fitted, predict) are available to facilitate interpretation.

Author(s)

Matthieu Marbac and Mohammed Sedki. Maintainer: Mohammed Sedki <mohammed.sedki@u-psud.fr>

References

Marbac, M. and Sedki, M. (2017). Variable selection for model-based clustering using the integrated completed-data likelihood. Statistics and Computing, 27 (4), 1049-1063.

Marbac, M., Patin, E. and Sedki, M. (2018). Variable selection for mixed data clustering: Application in human population genomics. Journal of Classification, to appear.

Examples

## Not run: 
# Package loading
library(VarSelLCM)

# Data loading:
# x contains the observed variables
# ztrue contains the known status (1: absence, 2: presence of heart disease)
data(heart)
ztrue <- heart[,"Class"]
x <- heart[,-13]

# Cluster analysis without variable selection
res_without <- VarSelCluster(x, 2, vbleSelec = FALSE, crit.varsel = "BIC")

# Cluster analysis with variable selection (with parallelisation)
res_with <- VarSelCluster(x, 2, nbcores = 2, initModel=40, crit.varsel = "BIC")

# Comparison of the BIC for both models:
# variable selection improves the BIC
BIC(res_without)
BIC(res_with)

# Comparison of the partition accuracy.
# ARI is computed between the true partition (ztrue) and its estimators.
# ARI equals 1 when the partitions are identical and is close to 0
# (possibly slightly negative) when they are independent.
# Here, variable selection improves the ARI.
# Note that ARI cannot be used for model selection in clustering,
# because the true partition is unknown in practice.
ARI(ztrue, fitted(res_without))
ARI(ztrue, fitted(res_with))

# Estimated partition
fitted(res_with)

# Estimated probabilities of classification
head(fitted(res_with, type="probability"))

# Summary of the probabilities of misclassification
plot(res_with, type="probs-class")

# Confusion matrices (only possible because the "true" partition is known):
# variable selection decreases the misclassification error rate
table(ztrue, fitted(res_without))
table(ztrue, fitted(res_with))

# Summary of the best model
summary(res_with)

# Discriminative power of the variables (here, the most discriminative variable is MaxHeartRate)
plot(res_with)

# More detailed output
print(res_with)

# Print model parameters
coef(res_with)

# Boxplot for the continuous variable MaxHeartRate
plot(x=res_with, y="MaxHeartRate")

# Empirical and theoretical distributions of the most discriminative variable
# (to check that the distribution is well-fitted)
plot(res_with, y="MaxHeartRate", type="cdf")

# Summary of categorical variable
plot(res_with, y="Sex")

# Probabilities of classification for new observations 
predict(res_with, newdata = x[1:3,])

# Imputation by sampling for the first observation
not.imputed <- x[1,]
imputed <- VarSelImputation(res_with, x[1,], method = "sampling")
rbind(not.imputed, imputed)

# Opening Shiny application to easily see the results
VarSelShiny(res_with)



## End(Not run)
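
The adjusted Rand index used in the examples above can be sketched in base R from the contingency table of two partitions. The function name adjusted_rand_index is hypothetical and only illustrates the standard formula; it is not the package's own ARI implementation.

```r
# Hypothetical helper: adjusted Rand index between two partitions of the same observations
adjusted_rand_index <- function(z1, z2) {
  stopifnot(length(z1) == length(z2))
  tab <- table(z1, z2)
  n <- sum(tab)
  # Numbers of pairs of observations grouped together
  sum_cells <- sum(choose(tab, 2))           # in both partitions
  sum_rows  <- sum(choose(rowSums(tab), 2))  # in the first partition
  sum_cols  <- sum(choose(colSums(tab), 2))  # in the second partition
  # Expected agreement when the two partitions are independent
  expected  <- sum_rows * sum_cols / choose(n, 2)
  max_index <- (sum_rows + sum_cols) / 2
  (sum_cells - expected) / (max_index - expected)
}

# Identical partitions up to a relabelling of the clusters
ari_same <- adjusted_rand_index(c(1, 1, 2, 2, 3, 3), c(2, 2, 3, 3, 1, 1))
# Partitions that disagree on every pair of observations
ari_cross <- adjusted_rand_index(c(1, 1, 2, 2), c(1, 2, 1, 2))
c(ari_same, ari_cross)
```

The index is invariant to cluster relabelling, which is why it suits clustering, where labels are arbitrary. It is undefined (NaN) in the degenerate case where both partitions place all observations in a single cluster.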

VarSelLCM documentation built on Oct. 23, 2020, 8:20 p.m.