knitr::opts_chunk$set(echo = TRUE, message=FALSE, warnings=FALSE)

Introduction

This is a demonstration of using the R package ClustersAnalysis. You will see how to analyze classes according to one or more variables. The group variable must be of the type factor or character and the exploratory variables can be quantitative or qualitative. In this demonstration we are going to use natives dataset from R such as "iris", "infert" or "esoph".

Short Descriptions of datasets

Iris :

The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other.

summary(iris)

Infert :

This is a matched case-control study dating from before the availability of conditional logistic regression.

summary(infert)

Esoph :

Data from a case-control study of (o)esophageal cancer in Ille-et-Vilaine, France. This is a data frame with records for 88 age/alcohol/tobacco combinations.

summary(esoph)

CO2 :

The CO2 data frame has 84 rows and 5 columns of data from an experiment on the cold tolerance of the grass species Echinochloa crus-galli.

summary(CO2)

Import ClustersAnalysis from Github (using devtools)

Please download the latest R version (>4.0) before running the code below.

#Use the below line to install devtools if necessary
#install.packages("devtools")
#library(devtools)

#install package from github or download the folder and open it with R
#Use the below line to install ClustersAnalysis if necessary
#devtools::install_github("clepadellec/ClustersAnalysis")

#load package
library(ClustersAnalysis)

How to access to help

You can juste use the fonction help(function name) to see all the documentation about your function.

help("u_plot_size_effect")

Univariate Analysis for qualitatives variariables

To begin we will try to understand, for each qualitative explanatory variable, if it affects the group variable. It's necessary to create an object of univariate type. You can use the constructor Univariate_object.

#Creation of univariate object using esoph dataframe and "agegp" (first column) as the group variable
u_esoph<-Univariate_object(esoph,1)

#detail of attributes associated with the object
print(u_esoph)

Contingency table and size effect

The first thing we can do is to create a contingency table between the explanatory variable and the group variable and then visualize lines/columns profils. In our case the explanatory variable is "tobgp" which is the tobacco consumption (gm/day). To do this you can use the u_desc_profils

#use interact=TRUE to show an interactive graphique with widgets like zoom, comparisons...
ClustersAnalysis::u_desc_profils(u_esoph,3,interact=FALSE)

The distributions don't show any particular phenomena. The most represented classes are the 35-44 and 55-64 years. Then we can see that there are more than a half that smokes less than 20 g/days. The only fact that we can see is that the 75+ people are close to 75% to don't smoke a lot.

Now we are going to see in details if there is a size effect between these two variables. To do this we can use u_desc_size_effect which return the test statistic vt (comparison between proportions). Then we can also use u_plot_size_effect which create a mosaic plot.

u_desc_size_effect(u_esoph,3)
#use intecract=TRUE to show an interactive graphique with widgets like zoom, comparisons...
u_plot_size_effect(u_esoph,3,interact=FALSE)

If we refer to the results we can see that the biggest "vt" values are for 75+ peoples who smokes 0-9g/day or 10-19 g/day. So this is for these two combinations that one can most easily conclude that there is an over-representation. We can use the mosaic plot to confirm our comment.

You can do the sames analyses for the variable "tobgp".

Chisq test

We can now use u_chisq_test_all which is use to calculate the p-avlue for a chisq test between the group variable and each qualitatives variables.

u_chisq_test_all(u_esoph)

Here we can see that the p-avlues are very high which mean that we accept the null hypothesis and the variables are independant. Because of this, it's useless to calculate v-cramer.

If you want to see significant results you can run the code below, the aim is to predict the sexe according to many qualitatives attributs. We can see that for each chisq.test, we can conclude that the sex depends on the attributs. But the attributs which has the most influence is "qualif".

#install.packages("questionr")
#library(questionr)
# data(hdv2003)
# d<- hdv2003[,3:8]
# u_chisq_test_all(Univariate_object(d,1))

Correspondence analysis

To better visualize if some groups are close to a particular modality we can use correspondence analysis. The u_afc_plot function return a biplot to visualize our results.

u_afc_plot((Univariate_object(esoph,1)),3)

In our case the principal information that we can see is that the 25-34 peoples are quite close to 30+g/day. It's a new information that we didn’t have before.

Univariate Analysis for quantitatives variables

Now we are going to use the famous iris dataset. The aim si to understand how the Species can be described.

#Creation of univariate object with the "Species" as group variable
u_iris <-Univariate_object(iris,5)

Means comparisons (using student test)

To begin we are going to search for each group and each explanatory variable if the fact to belong to a species or not have an influence on the length and the width of Petal/Sepal. To do this we use u_ttest_all, this function for each group, separates the group variable into two modalities : target modality or others modality. Then we verify normality hypothesis and if it's possible we use a student test to define if it's significative or not.

 u_ttest_all(u_iris)

Here we can see that each group have one or many significatives variables which mean that the fact to belong to setosa,versicolor or virginica have an influence on the attributs.

Fisher Test on correlation

The aim is to established if there is correlation between group variable and an other quantitative variable.

u_fisher_test_all(u_iris)

Here we can see that each variable has a p-value lower than 0.05 which mean that there is a link between the group variable and quantitatives data. Moreover we can see that "eta2" which are the correlation values are quite high.

Multivariate Analysis

In this section, we characterize the group variable by using several qualitative or quantitative explanatory variables. If the data contains categorical variables, we encode it to numbers by using one-hot encoding. This is an intermediate step of the functions in our package, therefore our functions can take as input a qualitative data, quantitative data or mixed data.

To begin we compute the proportion between inter-class inertia and total inertia. It's necessary to create an object of multivariate type as Univariate Analysis part. You can use the constructor Multivariate_object.

#Creation of multivariate object in case of quantitative data using iris dataframe and
# "Species" (fifth column) as the group variable
m_iris<-multivariate_object(iris,5)

#Creation of multivariate object in case of mixed data using iris dataframe and 
# "Type" (second column) as the group variable
m_CO2<-multivariate_object(data.frame(CO2),2)

#detail of attributes associated with the object
#print(m_iris)
#print(m_CO2)

Proportion between inter-class inertia and total inertia (R square)

Starting with the quantitative case in Iris data, we have a high R square (0.87) which results a clear categorization of three types of the Iris flower by four explanatory variables.

#quantitative data
m_R2_multivariate(m_iris)

Turning to mixed data, we encode qualitative explanatory variables to numbers by using one-hot encoding. Also, we rescale quantitative explanatory ones (rescale = TRUE). R square is not high in this case.

#mixed data
m_R2_multivariate(m_CO2, rescale = TRUE)

Internal measures

Two internal measures are investigated in this section in order to evaluate the goodness of a clustering structure without respect to external information including Silhouette index and Davies Boulin index.

Silhouette index

Silhouette coefficient of Iris data is high for all three species (setorsa, versicolor and virginica), which unifies results of R square in the previous part. In other words, these four explanatory variables clearly categorize three types of flowers mentioned above. Moreover, silhouette coefficient of setosa is extremely higher than that of two types rest. This results provide that setosa is separated to the rest, which can be observed by graphic "ACP and Silhouette" representing individuals in the factorial plan in which different shapes represent variables groups, at the meanwhile, distinguish colors represent silhouette coefficients.

m_silhouette_ind(m_iris)
m_silhouette_plot(m_iris, interact = FALSE)
m_sil_pca_plot(m_iris, interact = FALSE)

In contrast, Silhouette coefficient of CO2 data is low for the origin of the plant (Quebec and Mississippi) unifying results of R square above.

m_silhouette_ind(m_CO2,rescale = TRUE)
m_silhouette_plot(m_CO2,rescale = TRUE, interact = FALSE)
m_sil_pca_plot(m_CO2, rescale=TRUE,interact = FALSE)
m_sil_pca_plot(m_CO2, rescale=TRUE,interact = FALSE, i=1,j=3)
m_sil_pca_plot(m_CO2, rescale=TRUE,interact = FALSE, i=2,j=3)

Davies Bouldin index

A good Davies Bouldin index is indicated at a low value. Based on the result analyzing, Davies Bouldin index of Iris data is low for all three flower species, which is consistent with results mentioned above. Specifically, Davies Bouldin index of setosa is seriously low, it, therefore, distinguishes to versicolor and virginica.

m_DB_index(m_iris)

Davies Bouldin index of the origin of the plants (CO2 data) are high (4.55), therefore, is consistent with results of R square and Silhouette coefficient.

m_DB_index(m_CO2, rescale = TRUE)

External measure

Two external measures in this section are investigated in order to comparing two partitions including Rand index and Ajusted Rand index.

Rand index

We establish an artificial partition from k-mean algorithm and we compute the Rand index between variables group and artificial partition. Rand index of Iris data is very high (0.83), hence, this partition and variables group are similar. Further, Rand index of CO2 data is equal 0.49, thus, Ajusted Rand index is examined.

m_kmean_rand_index(m_iris) 
m_kmean_rand_index(m_CO2, rescale=TRUE)

Ajusted Rand index

Based on the results of Ajusted Rand index of Iris data, we could strongly confirm that artificial partition and variables group are strongly similar. Furthermore, Ajusted Rand index of CO2 data is low, resulting that there is no similar of two partitions.

m_kmean_rand_ajusted(m_iris)
m_kmean_rand_ajusted(m_CO2, rescale = TRUE)

Comparison between kmean clustering and variable group

This graphic illustrates individuals on the factorial plan in which different shapes represent variables groups, at the meanwhile, distinguish colors represent artificial partitions.

m_kmean_clustering_plot(m_iris, interact = FALSE)
m_kmean_clustering_plot(m_iris, interact = FALSE, i=1,j=3)
m_kmean_clustering_plot(m_iris, interact = FALSE,i=2,j=3)
m_kmean_clustering_plot(m_CO2, rescale=TRUE, interact = FALSE)
m_kmean_clustering_plot(m_CO2, rescale=TRUE, interact = FALSE, i=1,j=3)
m_kmean_clustering_plot(m_CO2, rescale=TRUE, interact = FALSE, i=2,j=3)

Test value for clustering

Data analyzing allows us to several conclusions. First, Petal.Length shows the most effective characterization on Setosa and virginica. Second, Sepal.Width characterizes most powerful to versicolor.

m_test.value(m_iris)
m_test.value(m_iris,i=2)
m_test.value(m_iris,i=3)

Multiple component analysis

This function used to detect and represent underlying structures in a qualitative data set.

m_acm_plot(multivariate_object(CO2[,1:3],1))

Here we can see that the fours groups are very distinct and if we look at the data it's normal. On this dataset this function is a little bit useless because it return logical facts.

Let's practice on your own dataset

Now that you have seen how the package works, it is up to you to use it on your own data. Remember to use the help() function to access all documentation and **don't forget to use "interact=TRUE" on the plot to see interactive plot !



clepadellec/ClustersAnalysis documentation built on Dec. 31, 2020, 10:03 p.m.