In comeetie/greed: Clustering and Model Selection with the Integrated Classification Likelihood

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.align = "center"
)

```{css, echo=FALSE} p.caption { font-style: italic; }

Loads packages.
```r
library(greed)
library(dplyr)
library(ggplot2)
library(ggsoccer)
set.seed(2134)

Mixed-type data clustering

One of distinctive feature of greed is that you may combine several model together. Using such an approach one may perform a clustering of several view of the same individuals with different generative models. In all case this approach suppose that the different views are generated in an independent way when the clustering assignment is known. To explore this capability we will begin with the Fifa dataset and cluster simultaneously the categorical features with an LcaPrior() and the continuous one with a GmmPrior().

Lca + Gmm, the Fifa data-set

We first load the ?Fifa dataset (several features of Soccer players in the Fifa 2020 videogame). We remove the first two columns that give the players names and nationality, and we kept only the players which earn more than 1000000 euros.

data(Fifa)
X <- Fifa[,-c(1,2)] %>% 
  filter(value_eur>1000000) %>% 
  select(-value_eur,-GK,-pos_x,-pos_y)

This leave us with a dataset of $n=r nrow(X)$ rows and $p=r ncol(X)$ columns. The features correspond to players characteristics (age,height,weight), skills scores (pace,shooting,...) and a set of binary factors that describe the field positions were the players may be aligned.

Preparing the data and the model is a little bit more complicated than for the other models since we must specify the different models we want to use. The data are provided with a named list, and the model is created with the ?CombinedModels constructor, which must also be provided with a list of model priors, with the same named slots as in the data list. The following piece of code demonstrate this approach and the Fifa dataset:

library(future)
plan(multisession,workers=5)# may be increased
# prepare the dataset create a list with the factor columns in one slot 
# and the numeric columns in another slot
data <- list(categorical = X %>% select_if(is.factor),
            cont = X %>% select_if(function(v){!is.factor(v)}))
# create the CombinedModels, as expected we will use an LcaPrior for the factors 
# and a GmmPrior for the numeric columns.
# Be careful names in the models list and in the data list must match !
mixmod <- CombinedModels(models=list(categorical=LcaPrior(),cont=GmmPrior()))
# perform the clustering
sol = greed(data,model = mixmod)

When the model is fitted the extractSubModel() function allow to retrieve the fitted sub-models, and use the plotting capabilities of each of them. We may for example look at the categorical part of the model and make a marginals plot. We see with this figure that the clusters found agree quite strongly with the field positions features. The preferred foot feature has also an impact on cluster 2 and 3.

plot(extractSubModel(sol,"categorical"),type="marginals")

When looking at the continuous features the clusters are quite well organized, the hierarchical ordering has produced a meaningful ordering, look at the shooting feature. We will go back to this point later.

plot(extractSubModel(sol,"cont"),type="violins")

For the marginals plots, one things that pop out is the clear bimodal aspect of the defending features, the mixture of the remaining feature is clearly not so well separated.

plot(extractSubModel(sol,"cont"),type="marginals")

To clearly show, the alignment of the cluster with field position we will make a little figure and plot for each cluster their mean position on the field. This will be computed as the weighted average coordinate of each field position were the weight correspond to the probability that a cluster member can play at this position.

clust_positions = Fifa[,-c(1,2)] %>% 
  filter(value_eur>1000000) %>%
  select(pos_x,pos_y) %>%
  mutate(cluster=clustering(sol)) %>%
  group_by(cluster) %>%
  summarize(pos_x=mean(pos_x),pos_y=mean(pos_y))

ggplot(clust_positions)+
  annotate_pitch()+
  geom_text(aes(x=pos_x,y=pos_y,label=cluster),size=5,col="red")+
  theme_pitch()

The organization of the clusters clearly appears, cluster 7 correspond to central defensive players, cluster 5 and 6 to lateral defensive players (recall that the preferred foot of these player were quite different between this two clusters with almost only right footed players in cluster 6 and left footed players in cluster 5).

Sbm + Lca, the Ndrangheta data-set

Combined models may also be used to analyze graph datasets with node attribute for example. To demonstrate this feature we will use the Ndrangheta dataset. This dataset contains relationship information between members of the Ndrangheta mafia together with some information on their group affiliation and their role. We will use an Sbm model to fit the adjacency matrix and an Lca model for the factor describing the roles.

# laod the dataset
data(Ndrangheta)
# prepare the data 
data= list(graph=Ndrangheta$X,node_infos=data.frame(Role=Ndrangheta$node_meta$Role,stringsAsFactors = TRUE))
# build the combined model
mixmod = CombinedModels(models= list(graph=SbmPrior(),node_infos=LcaPrior()))
# cluster
sol=greed(data,model=mixmod)
# analyze the results
plot(sol)

plot(extractSubModel(sol,"graph"),type="blocks")

plot(extractSubModel(sol,"node_infos"),type="marginals")

comeetie/greed documentation built on Oct. 10, 2022, 5:37 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

comeetie/greed
Clustering and Model Selection with the Integrated Classification Likelihood

In comeetie/greed: Clustering and Model Selection with the Integrated Classification Likelihood

Mixed-type data clustering

Lca + Gmm, the Fifa data-set

Sbm + Lca, the Ndrangheta data-set

R Package Documentation

Browse R Packages

We want your feedback!

comeetie/greed Clustering and Model Selection with the Integrated Classification Likelihood

In comeetie/greed: Clustering and Model Selection with the Integrated Classification Likelihood

Mixed-type data clustering

Lca + Gmm, the Fifa data-set

Sbm + Lca, the Ndrangheta data-set

R Package Documentation

Browse R Packages

We want your feedback!

comeetie/greed
Clustering and Model Selection with the Integrated Classification Likelihood