knitr::opts_chunk$set(echo = TRUE)

This document is a short introduction to the package **geocmeans**. It implements a fuzzy classification method bringing spatial information and neighbouring in its calculation

In their article, @cai2007fast described the method, originally applied in the analysis of brain imagery. The generalized version of the spatial fuzzy c-means is presented by @zhao2013kernel.

@gelb2021apport applied the method to socio-residential and environmental data comparing the results with other unsupervised classification algorithms (in French).

There are actually numerous packages and functions to perform unsupervised classification in R (*hclust*, *kmeans*, *cmeans*, *factoextra*, etc.). However, these methods are not always well-suited to analyze spatial data. Indeed, they do not account for spatial information such as proximity or contiguity between observations. This may lead to solutions for which close observations end up in different groups event though they are very similar.

To our knowledge, the package **ClustGeo** is the only package proposing an unsupervised classification method which directly consider spatial proximity between observations. The proposed approach is appealing because the user can select a parameter (*alpha*) that controls the weight of the spatial distance matrix (calculated between observations with their locations) versus the semantic distance matrix (calculated between observations with their variables).

However, this method belongs to the category of "hard-clustering" algorithms. Each observation ends up in one cluster/group. The main draw-back here is the difficulty to identify observations that are undecided, at the frontier of two clusters/groups. The soft or fuzzy clustering algorithms provide more information because they calculate the "probability" of each observation to belong to each group.

The algorithms SFCM (spatial fuzzy c-means) and SGFCM (spatial generalized fuzzy c-means) propose to combine the best of both worlds.

The package **geocmeans** is an implementation in R of these methods (originally developed for analysis of brain imagery). It comes with a set of functions to facilitate the analysis of the final membership matrices :

- calculating many quality indices (coming mainly from the package
*fclust*) - mapping the results
- giving summary statistics for each cluster/group.

**geocmeans** comes with a toy dataset *LyonIris*, combining many demographic and environmental variables aggregated at the scale of the Iris (aggregated units for statistical information) in Lyon (France).

Before starting the analysis, the data must be standardized because most of the calculus is based on Euclidean distance (we plan to also include Manhattan distance in a future release).

#charging packages and data library(geocmeans) library(ggplot2) library(ggpubr) library(dplyr) library(viridis) library(spdep) data(LyonIris) #selecting the columns for the analysis AnalysisFields <-c("Lden","NO2","PM25","VegHautPrt","Pct0_14", "Pct_65","Pct_Img","TxChom1564","Pct_brevet","NivVieMed") #rescaling the columns Data <- LyonIris@data[AnalysisFields] for (Col in names(Data)){ Data[[Col]] <- scale(Data[[Col]]) } #preparing some elements for further mapping LyonIris$OID <- as.character(1:nrow(LyonIris)) FortiData <- broom::tidy(LyonIris,region="OID")

#loading the pre-calculated results load(system.file("extdata", "results_vignette_intro.rda", package = "geocmeans", mustWork = TRUE))

To explore the dataset and choose the right number of cluster/groups (k) we propose to start with a classical kmeans.

#finding the best k by using the r2 of the classification #trying for k from 2 to 10 R2s <- sapply(2:10,function(k){ Clust <- kmeans(Data,centers=k,iter.max = 150) R2 <- Clust$betweenss / Clust$totss return(R2) }) Df <- data.frame(K=2:10, R2 = R2s) ggplot(Df)+ geom_line(aes(x=K,y=R2s))+ geom_point(aes(x=K,y=R2s),color="red")+ xlab("Number of groups")+ ylab("R2 of classification")

By plotting the R-squared of the kmeans classification for each k between 2 and 10, we can see a first elbow at k=3. But this small number of groups leads to a classification explaining only 43% of the original data variance. We decide to keep k=4 to have one more group in the analysis.

Let us map the obtained groups.

KMeanClust <- kmeans(Data,centers=4,iter.max = 150) LyonIris$Cluster <-paste("cluster",KMeanClust$cluster,sep="_") #mapping the groups DFmapping <- merge(FortiData,LyonIris,by.x="id",by.y="OID") ggplot(data=DFmapping)+ geom_polygon(aes(x=long,y=lat,group=group,fill=Cluster),color=rgb(0,0,0,0))+ coord_fixed(ratio = 1)+ scale_fill_manual(name="Cluster",values = c("cluster_1"="palegreen3", "cluster_2"="firebrick", "cluster_3"="lightyellow2", "cluster_4"="steelblue"))+ theme( axis.title = ggplot2::element_blank(), axis.text = ggplot2::element_blank(), axis.ticks = ggplot2::element_blank() )

We can clearly distinguish 4 strong spatial structures, but with some mixing between the clusters.

We could now compare this solution with a classical c-means algorithm.

The classical c-means is a simple method to perform fuzzy unsupervised classification. The package **geocmeans** proposes the function *CMeans*. We set the fuzziness degree (m) to 1.5.

We do not present the algorithm here, but only the two main formulas.

The first one is used to update the values of the membership matrix at each iteration $u_ik$

$$u_{ik} = \frac{(||x_{k} - v{*i}||^{2}) ^{(-1/(m-1))}}{\sum*{j=1}^c(||x_{k} - v{_j}||^2 )^{(-1/(m-1))}}$$
And the second one to update the centers of the clusters

$$v_{i} = \frac{\sum_{k=1}^N u_{ik}^m(x_{k})}{\sum_{k=1}^N u_{ik}^m}$$

With :

- $x_k$ the values of the observation $k$
- $v_i$ the values of the center of the cluster $i$
- $c$ the number of clusters
- $m$ the fuzziness index

Cmean <- CMeans(Data,4,1.5,500,standardize = FALSE, seed = 456, tol = 0.00001, verbose = FALSE)

We can now use the function *calcqualityIndexes* which combines many indices from the package **fclust** to analyze the quality of the classification. We will use these values later for the purpose of comparisons among the different algorithms.

calcqualityIndexes(Data, Cmean$Belongings, m = 1.5)

Note, **geocmeans** also proposes a so-called generalized version of the c-means algorithm. It is known to accelerate convergence and yield less fuzzy results by adjusting the membership matrix at each iteration. It requires an extra *beta* parameter controlling the strength of the modification. The modification only affects the formula updating the membership matrix.

$$u_{ik} = \frac{(||x_{k} - v{*i}||^{2} - \beta_k) ^{(-1/(m-1))}}{\sum*{j=1}^c(||x_{k} - v{_j}||^2 - \beta_k)^{(-1/(m-1))}}$$

with $\beta_k = min(||x_{k} - v||^2)$ and $0 \leq \beta \leq 1$.

To select an appropriate value for this parameter, we will try all the possible values between 0 and 1 with a step of 0.05.

beta_values <- selectParameters("GFCM",data = Data, k = 4, m = 1.5, beta = seq(0,1,0.05), spconsist = FALSE, tol = 0.00001, seed = 456)

knitr::kable(beta_values[c("beta","Silhouette.index","XieBeni.index","Explained.inertia")], col.names = c("beta", "silhouette index", "Xie and Beni index", "explained inertia"),digits = 3)

Considering the table above, we select *beta* = 0.7, it maintains a good silhouette index, and increases Xie and Beni index and explained inertia. Let us compare the results of GFCM and FCM.

GCmean <- GCMeans(Data,k = 4,m = 1.5, beta = 0.7,500,standardize = FALSE, seed=456, tol = 0.00001, verbose = FALSE) r1 <- calcqualityIndexes(Data,GCmean$Belongings,m=1.5) r2 <- calcqualityIndexes(Data,Cmean$Belongings,m=1.5) df <- cbind(unlist(r1), unlist(r2)) knitr::kable(df, digits = 3,col.names = c("GFCM", "FCM"))

The results indicate that the GFCM provides a solution that is less fuzzy (higher explained inertia and lower partition entropy) but keeps a good silhouette index and an even better Xie and Beni index.

We can now map the two membership matrices and the most likely group for each observation. To do so, we use the function *mapClusters* from **geocmeans**. We propose here to define a threshold of 0.45. If an observation only has values below this probability in a membership matrix, it will be labeled as "undecided" (represented with transparency on the map).

We can compare the maps of the classical c-means and the generalized version.

cmeansMaps<- mapClusters(LyonIris,Cmean$Belongings,undecided = 0.45) GcmeansMaps<- mapClusters(LyonIris,GCmean$Belongings,undecided = 0.45) ggarrange(cmeansMaps$ProbaMaps[[1]],GcmeansMaps$ProbaMaps[[1]], nrow = 1, ncol = 2, common.legend = TRUE, legend = "bottom")

ggarrange(cmeansMaps$ProbaMaps[[2]],GcmeansMaps$ProbaMaps[[2]], nrow = 1, ncol = 2, common.legend = TRUE, legend = "bottom")

ggarrange(cmeansMaps$ProbaMaps[[3]],GcmeansMaps$ProbaMaps[[3]], nrow = 1, ncol = 2, common.legend = TRUE, legend = "bottom")

ggarrange(cmeansMaps$ProbaMaps[[4]],GcmeansMaps$ProbaMaps[[4]], nrow = 1, ncol = 2, common.legend = TRUE, legend = "bottom")

ggarrange(cmeansMaps$ClusterPlot,GcmeansMaps$ClusterPlot, nrow = 1, ncol = 2, common.legend = TRUE, legend = "bottom")

As expected, the results are very similar, but the generalized version provides a more clear-cut classification.

Now we can use the *SFCM* function to perform a spatial c-means. The first step is to define a spatial weight matrix indicating which observations are neighbours and the strength of their relationship. We propose here to use a basic queen neighbour matrix (built with **spdep**). The matrix must be row-standardized to ensure that the interpretation of all the parameters remains clear.

library(spdep) Neighbours <- poly2nb(LyonIris,queen = TRUE) WMat <- nb2listw(Neighbours,style="W",zero.policy = TRUE)

The main challenge with the SFCM method is to select the parameter *alpha*. It represents the weight of the spatial dimension (lagged values) in the calculus of the membership matrix and the cluster centers.

- If alpha=0, then we end up with a classical c-means algorithm.
- If alpha=1, then the original and the lagged values have the same weight
- If alpha=2, then the lagged values are twice more important than the original values
- end so on...

The two following formulas show how the functions updating the membership matrix and the centers of the clusters are modified.

$$u_{ik} = \frac{(||x_{k} - v{*i}||^2 + \alpha||\bar{x*{k}} - v{*i}||^2)^{(-1/(m-1))}}{\sum*{j=1}^c(||x_{k} - v{*j}||^2 + \alpha||\bar{x*{k}} - v{_j}||^2)^{(-1/(m-1))}}$$

$$v_{i} = \frac{\sum_{k=1}^N u_{ik}^m(x_{k} + \alpha\bar{x_{k}})}{(1 + \alpha)\sum_{k=1}^N u_{ik}^m}$$ with $\bar{x}$ the lagged version of x and $\alpha \geq 0$

As the formula suggests, the *SFCM* can be seen as a spatially smoothed version of the classical c-means and *alpha* controls the degree of spatial smoothness. This smoothing can be interpreted as an attempt to reduce spatial overfitting of the classical c-means.

To select *alpha*, we propose to check all possible values between 0 and 2 with a step of 0.05.

DFindices_SFCM <- selectParameters(algo = "SFCM", data = Data, k = 4, m = 1.5, alpha = seq(0,2,0.05), nblistw = WMat, standardize = FALSE, tol = 0.0001, verbose = FALSE, seed = 456)

Now we are able to check the indices to select the best *alpha*. The goal is to reduce spatial inconsistency as much as possible and to maintain a good classification quality.

Let us start with the spatial inconsistency. This indicator (developed for this package) calculates the sum of the squared differences between each observation and its neighbours on the membership matrix. Thus, the maximum for each observation is $k*j$ with *j* the number of neighbours for the observation and *k* the number of groups. A maximum is reached if each observation has 100% chance belonging to a cluster that is different from all its neighbours. So, when we sum up the values obtained for all the observations, we obtain a quantity of spatial inconsistency. This quantity is divided by the quantity obtained when randomly permuting the rows of the membership matrix. This second quantity represents the spatial inconsistency that we might expect if the observations were randomly scattered in space. We can repeat the permutation step (Monte Carlo approach) and keep the mean of the ratios to have a more robust indicator (see help(spConsistency) for details).

A smaller value indicates a smaller spatial inconsistency and thus a greater spatial consistency. 0 meaning that all observations have exactly the same values in the membership matrix as their neighbours (perfect spatial consistency).

ggplot(DFindices_SFCM)+ geom_smooth(aes(x=alpha,y=spConsistency), color = "black")+ geom_point(aes(x=alpha,y=spConsistency), color = "red")

Not surprisingly, increasing *alpha* leads to a decrease of the spatial inconsistency. This gain follows an inverse function.

Let us now check the explained inertia

ggplot(DFindices_SFCM)+ geom_smooth(aes(x=alpha,y=Explained.inertia), color = "black")+ geom_point(aes(x=alpha,y=Explained.inertia), color = "red")

As expected, the explained inertia decreases when alpha increases and again follows an inverse function. The classification has to find a compromise between the original values and the lagged values. However, the loss is very small here: only 3% between alpha = 0 and alpha = 2.

To finish here, we can observe the silhouette and Xie and Beni indicators.

ggplot(DFindices_SFCM)+ geom_smooth(aes(x=alpha,y=Silhouette.index), color = "black")+ geom_point(aes(x=alpha,y=Silhouette.index), color = "red")

ggplot(DFindices_SFCM)+ geom_smooth(aes(x=alpha,y=XieBeni.index), color = "black")+ geom_point(aes(x=alpha,y=XieBeni.index), color = "red")

The detail of the meaning of these indicators is beyond the scope of this vignette. Let us just stress that a larger silhouette index indicates a better classification, and a smaller Xie and beni index indicates a better classification.

After considering all the previous charts, we decide to keep alpha = 0.7 as it seems to provide a good balance between spatial consistency and classification quality in this case.

SFCM <- SFCMeans(Data, WMat, k = 4, m = 1.5, alpha = 0.7, tol = 0.0001, standardize = FALSE, verbose = FALSE, seed = 456)

It is also possible to use the so-called generalized version of the spatial c-means. In that case, we must define both *alpha* and *beta*.

The next formula shows how the membership matrix is updated at each iteration. Note that the centres of the clusters are updated with the same formula as SFCM.

$$u_{ik} = \frac{(||x_{k} - v{*i}||^2 -\beta_k + \alpha||\bar{x*{k}} - v{*i}||^2)^{(-1/(m-1))}}{\sum*{j=1}^c(||x_{k} - v{*j}||^2 -\beta_k + \alpha||\bar{x*{k}} - v{_j}||^2)^{(-1/(m-1))}}$$

Because we select a high resolution for our grid search of *alpha* and *beta*, we will use a multiprocessing approach.

future::plan(future::multiprocess(workers=4)) DFindices_SFGCM <- selectParameters.mc(algo = "SGFCM", data = Data, k = 4, m = 1.5, alpha = seq(0,2,0.05), beta = seq(0,0.85,0.05), nblistw = WMat, standardize = FALSE, chunk_size = 50, tol = 0.0001, verbose = FALSE, seed = 456)

ggplot(DFindices_SFGCM) + geom_raster(aes(x = alpha, y = beta, fill = Silhouette.index), size = 5) + scale_fill_viridis() + coord_fixed(ratio=1)

ggplot(DFindices_SFGCM) + geom_raster(aes(x = alpha, y = beta, fill = XieBeni.index), size = 5) + scale_fill_viridis() + coord_fixed(ratio=1)

ggplot(DFindices_SFGCM) + geom_raster(aes(x = alpha, y = beta, fill = spConsistency), size = 5) + scale_fill_viridis() + coord_fixed(ratio=1)

The first two plots indicate that some pairs of *alpha* and *beta* in particulat yield good results in the range 0.8 < *alpha* < 1.2 and 0.6 < *beta* < 0.8. The last plot shows that the selection of *beta* has no impact on the spatial consistency.

Considering the previous plots, we decide to retain the solution with *beta* = 0.65 and *alpha* = 0.95 which yield very good results for all the indices considered.

SGFCM <- SGFCMeans(Data,WMat,k = 4,m=1.5, alpha=0.95, beta = 0.65, tol=0.0001, standardize = FALSE, verbose = FALSE, seed = 456)

Again, we compare here the generalized and the classical version of the spatial c-means algorithm.

r1 <- calcqualityIndexes(Data, SFCM$Belongings,m = 1.5) r2 <- calcqualityIndexes(Data, SGFCM$Belongings,m = 1.5) diagSFCM <- spatialDiag(belongmatrix = SFCM$Belongings, nblistw = WMat, undecided = 0.45,nrep = 500) diagSGFCM <- spatialDiag(belongmatrix = SGFCM$Belongings, nblistw = WMat, undecided = 0.45,nrep = 500) df <- cbind( c(unlist(r1),diagSFCM$SpConsist), c(unlist(r2),diagSGFCM$SpConsist) ) row.names(df)[length(row.names(df))] <- "sp.consistency" knitr::kable(df,digits = 3,col.names = c("SFCM","SGFCM"))

The solution of the SGFCM is better on the semantic and the spatial aspects.

We can compare the maps

SFCMMaps <- mapClusters(geodata = LyonIris,belongmatrix = SFCM$Belongings,undecided = 0.45) SGFCMMaps <- mapClusters(geodata = LyonIris,belongmatrix = SGFCM$Belongings,undecided = 0.45) ggarrange(SFCMMaps$ProbaMaps[[1]],SGFCMMaps$ProbaMaps[[1]], nrow = 1, ncol = 2, common.legend = TRUE, legend = "bottom")

ggarrange(SFCMMaps$ProbaMaps[[2]],SGFCMMaps$ProbaMaps[[2]], nrow = 1, ncol = 2, common.legend = TRUE, legend = "bottom")

ggarrange(SFCMMaps$ProbaMaps[[3]],SGFCMMaps$ProbaMaps[[3]], nrow = 1, ncol = 2, common.legend = TRUE, legend = "bottom")

ggarrange(SFCMMaps$ProbaMaps[[4]],SGFCMMaps$ProbaMaps[[4]], nrow = 1, ncol = 2, common.legend = TRUE, legend = "bottom")

ggarrange(SFCMMaps$ClusterPlot,SGFCMMaps$ClusterPlot, nrow = 1, ncol = 2, common.legend = TRUE, legend = "bottom")

Now, we can do a deeper spatial analysis and compare the spatial consistency of the four classifications realized (FCM, GFCM, SFCM, SGFCM).

spdiag_1 <- spatialDiag(Cmean$Belongings, nblistw = WMat,nrep=250) spdiag_2 <- spatialDiag(GCmean$Belongings, nblistw = WMat,nrep=250) spdiag_3 <- spatialDiag(SFCM$Belongings, nblistw = WMat,nrep=250) spdiag_4 <- spatialDiag(SGFCM$Belongings, nblistw = WMat,nrep=250) #looking at the moran I values for each group moran_table <- data.frame(cbind(spdiag_1$MoranValues$MoranI, spdiag_2$MoranValues$MoranI, spdiag_3$MoranValues$MoranI, spdiag_4$MoranValues$MoranI )) row.names(moran_table) <- paste("cluster ",1:4,sep="") knitr::kable(moran_table, digits = 3, col.names = c("FCM","GFCM","SFCM","SGFCM"), caption = "Moran I index for the columns of the membership matrix" )

Not surprisingly, the Moran I values calculated on the membership matrices are higher for SFCM and SGFCM, indicating stronger spatial structures in the classifications.

print(c(spdiag_1$SpConsist, spdiag_2$SpConsist,spdiag_3$SpConsist,spdiag_4$SpConsist))

Considering the values of spatial inconsistency, we could check if the value obtained for SGFCM is significantly lower than the one of SFCM. Considering the previous `r length(spdiag_3$SpConsistSamples)`

permutated values, we can calculate a pseudo p-value:

sum(spdiag_4$SpConsist > spdiag_3$SpConsistSamples) / length(spdiag_3$SpConsistSamples)

It appears that out of `r length(spdiag_3$SpConsistSamples)`

permutations, the observed values of spatial inconsistency of SGFCM are always lower than that of SFCM. The difference is significant at the threshold `r 1/length(spdiag_3$SpConsistSamples)`

(=1/`r length(spdiag_3$SpConsistSamples)`

)

We can map the undecided observations of the final solution. These entities should be analyzed more precisely. Selecting them is easy with the function *undecidedUnits* of the **geocmeans** package.

Undecided <- undecidedUnits(SGFCM$Belongings,0.45) LyonIris$FinalCluster <- ifelse(Undecided=="Undecided", "Undecided",paste("cluster",Undecided,sep="_")) #mapping the groups DFmapping <- merge(FortiData,LyonIris,by.x="id",by.y="OID") ggplot(data=DFmapping)+ geom_polygon(aes(x=long,y=lat,group=group,fill=FinalCluster),color=rgb(0,0,0,0))+ coord_fixed(ratio = 1)+ scale_fill_manual(name="FinalCluster",values = c("cluster_V1"="palegreen3", "cluster_V2"="firebrick", "cluster_V3"="lightyellow2", "cluster_V4"="steelblue", "cluster_V5"="pink", "Undecided"=rgb(0,0,0,0.4)))+ theme( axis.title = element_blank(), axis.text = element_blank(), axis.ticks = element_blank() )

One can obtain a lot of descriptive information about the final groups with three functions of **geocmeans** :

*summarizeClusters*: calculate summary statistics for each group for a given dataset by using the membership matrix as weights (sticking with the fuzzy spirit).*spiderPlots*: display a spider plot allowing to compare quickly the differences between groups.*violinPlots*: display a violin plot for each variable in a given dataset. Observations must be grouped before.

summarizeClusters(LyonIris@data[AnalysisFields],belongmatrix = SGFCM$Belongings, weighted = TRUE, dec = 3)

spiderPlots(LyonIris@data[AnalysisFields], SGFCM$Belongings, chartcolors = c("darkorange3","grey4","darkgreen","royalblue")) violinPlots(LyonIris@data[AnalysisFields], SGFCM$Groups)

That's all, folks ! Following are the enhancements for the next version

- introduce other methods of spatial c-means
- open some other parameters to the user (such as the function defining the convergence criterion)
- work on documentation
- improve calculus speed by dropping some
*apply*in the code - adding the manhattan distance (useful when data have a high dimensionality)

**Any scripts or data that you put into this service are public.**

Embedding an R snippet on your website

Add the following code to your website.

For more information on customizing the embed code, read Embedding Snippets.