knitr::opts_chunk$set(echo = TRUE,eval=TRUE,fig.align="center",fig.width = 7,fig.height = 6)

The package **Clustgeo** is used here for clustering $n=303$ french cities. The R dataset **estuary** is a list of three objects: a matrix **dat** with the description of the 303 cities on 4 socio-demographic variables, a matrix **D.geo** with the distances between the town hall of the 303 cities, and an object **map** of class "SpatialPolygonsDataFrame".

library(ClustGeo) data(estuary) dat <- estuary$dat head(dat) D.geo <- estuary$D.geo map <- estuary$map

In this section, we show how to implement and interpret Ward hierarchical clustering when the dissimilarities are not necessary euclidean and the weights are not uniform.

We apply first standard **hclust** function.

n <- nrow(dat) D <- dist(dat) Delta <- D^2/(2*n) tree <- hclust(Delta,method="ward.D")

We can check that the sum of the heights in **hclust**'s dendrogram is equal to the total inertia of the dataset if the dissimilarities are euclidean and to the pseudo inertia otherwise.

?inertdiss #pseudo inertia when dissimilarities are non euclidean ?inert #standard inertia otherwise inertdiss(D) #pseudo inertia inert(dat) # inertia sum(tree$height)

The same result can be obtained with the function **hclustgeo** which is a wrapper of **hclust** taking $D$ as input instead of $\Delta$.

tree <- hclustgeo(D) sum(tree$height)

When the weights are not uniform, the calculation of the matrix $\Delta$ takes a few lines of code and the use of the function **hclustgeo** may be more convenient than **hclust**.

map <- estuary$map wt <- map@data$POPULATION # non uniform weights # with hclust Delta <- D for (i in 1:(n-1)) { for (j in (i+1):n) { Delta[n*(i-1) - i*(i-1)/2 + j-i] <- Delta[n*(i-1) - i*(i-1)/2 + j-i]^2*wt[i]*wt[j]/(wt[i]+wt[j]) }} tree <- hclust(Delta,method="ward.D",members=wt) sum(tree$height) #with hclustgeo tree <- hclustgeo(D,wt=wt) sum(tree$height)

Now we consider two dissimilarity matrices :

- $D_0$ is the euclidean distance matrix performed with the socio-demographic variables
- $D_1$ is a second dissimilarity matrix used to take the geographical proximity between the cities into account.

The function **hclustgeo** implements a Ward-like hierarchical clustering algorithm including soft contiguity constraints. This algorithm takes as input two dissimilarity matrices
D0 and D1 and a mixing parameter $\alpha$ between 0 an 1. The first matrix gives the dissimilarities in the "feature space" (here socio-demographic variables). The second matrix gives the dissimilarities in the "constraint" space (here the matrix of geographical distances or the matrix build from the contiguity matrix $C$). The mixing parameter $\alpha$ sets the importance of the constraint in the clustering procedure. We present here a procedure to choose the mixing parameter $\alpha$ with the function **choicealpha**.

First, we choose $K=5$ clusters from the Ward dendrogram obtained with the socio-demographic variables ($D_0$) only.

library(ClustGeo) data(estuary) dat <- estuary$dat D.geo <- estuary$D.geo map <- estuary$map D0 <- dist(dat) # the socio-demographic distances tree <- hclustgeo(D0) plot(tree,hang=-1,label=FALSE, xlab="",sub="", main="Ward dendrogram with D0 only",cex.main=0.8,cex=0.8,cex.axis=0.8,cex.lab=0.8) #plot(tree,hang=-1,xlab="",sub="",main="Ward dendrogram with D0 only", # cex.main=0.8,cex=0.8,labels=city_label,cex.axis=0.8,cex.lab=0.8) rect.hclust(tree,k=5,border=c(4,5,3,2,1)) legend("topright", legend= paste("cluster",1:5), fill=1:5, cex=0.8,bty="n",border="white")

postscript("dendroD0.eps") #plot(tree,hang=-1,label=FALSE, xlab="",sub="", # cex=0.8,cex.axis=0.8,cex.lab=0.8,main="") plot(tree,hang=-1,label=FALSE, xlab="",sub="", main="") rect.hclust(tree,k=5,border=c(4,5,3,2,1)) legend("topright", legend= paste("cluster",1:5), fill=1:5,bty="n",border="white") dev.off()

We can use the map given with the estuary data.

# the map of the cities is an object of class "SpatialPolygonsDataFrame" class(map) # the object map contains several informations names(map) head(map@data[,4:8]) # we check that the cities in map are the same than those in X identical(as.vector(map$"INSEE_COM"),rownames(dat)) # now we plot the cities on the map with the name of four cities city_label <- as.vector(map$"NOM_COMM") sp::plot(map,border="grey") text(sp::coordinates(map)[c(54,99,117,116),],labels=city_label[c(54,99,117,116)],cex=0.8)

Let us plot these 5 clusters on the map.

# cut the dendrogram to get the partition in 5 clusters P5 <- cutree(tree,5) names(P5) <- city_label sp::plot(map,border="grey",col=P5,main="5 clusters partition obtained with D0 only",cex.main=0.8) legend("topleft", legend=paste("cluster",1:5), fill=1:5, cex=0.8,bty="n",border="white")

postscript("map-P5.eps") sp::plot(map,border="grey",col=P5) #legend(locator(1), legend=paste("cluster",1:5), fill=1:5, bty="n",border="white") legend("topleft", inset=0.2,legend=paste("cluster",1:5), fill=1:5, bty="n",border="white") dev.off()

We can notice that the cities in the cluster 5 are geographically very homogeneous.

# list of the cities in cluster 5 city_label[which(P5==5)] #plot(map,border="grey") #text(coordinates(map)[which(P5==5),],labels=city_label[which(P5==5)],cex=0.8)

On the contrary the cities in cluster 3 are geographically very splitted.

In order to get more geographically compact clusters, we introduce now the matrix $D_1$ of the geographical distances in **hclustgeo**.

D1 <- as.dist(D.geo) # the geographic distances between the cities #

For that purpose, we have to choose a mixing parameter $\alpha$ to improve the geographical cohesion of the 5 clusters of the partition found previously without deteriorating the socio-demographic cohesion too much.

The mixing parameter $\alpha \in [0,1]$ sets the importance of $D_0$ and $D_1$ in the clustering process. When $\alpha=0$ the geographical dissimilarities are not taken into account and when $\alpha=1$ it is the socio-demographic distances which are not taken into account and the clusters are obtained with the geographical distances only.

The idea is then to calculate separately the socio-demographic homogeneity and the geographic cohesion of the partitions obtained for a range of different values of $\alpha$ and a given number of clusters $K$.

The idea is to plot the quality criterion $Q_O$ and $Q_1$ of the partitions $P_K^\alpha$ obtained with different values of $\alpha \in [0,1]$ and to choose the value of $\alpha$ which is a compromise between the lost of socio-demographic homogeneity and the gain of geographic cohesion. We use the function **choicealpha** for that purpose.

range.alpha <- seq(0,1,0.1) K <- 5 cr <- choicealpha(D0,D1,range.alpha,K,graph=TRUE) cr$Q # proportion of explained pseudo inertia cr$Qnorm # normalized proportion of explained pseudo inertia ?plot.choicealpha #plot(cr,norm=TRUE)

#postscript("plotQ-1.eps",width = 950,height = 489) plot(cr) #dev.off() #postscript("plotQnorm-1.eps") plot(cr,norm=TRUE) #dev.off()

We see on the plot the proportion of explained pseudo inertia calculated with $D_0$ (the socio-demographic distances) is equal to 0.81 when $\alpha=0$ and decreases when $\alpha$ inceases (black line). On the contrary the proportion of explained pseudo inertia calculated with $D_1$ (the socio-demographic distances) is equal to 0.87 when $\alpha=1$ and decreases when $\alpha$ decreases (red line).

Here the plot suggest to choose $\alpha=0.2$ which correponds to a lost of socio-demographic homogeneity of only 7 \% and a gain of geographic homogeneity of about 17 \%.

We perform **hclustgeo** with $D_0$ and $D_1$ and $\alpha=0.2$ and cut the tree to get the new partition in 5 clusters.

tree <- hclustgeo(D0,D1,alpha=0.2) P5bis <- cutree(tree,5)

The gain in geographic cohesion of this partition can also be visualized on the map.

tree <- hclustgeo(D0,D1,alpha=0.2) P5bis <- cutree(tree,5) sp::plot(map,border="grey",col=P5bis, main="5 clusters partition obtained \n with alpha=0.2 and geographical distances",cex.main=0.8) legend("topleft", legend=paste("cluster",1:5), fill=1:5, bty="n",border="white")

postscript("map-P5bis.eps") sp::plot(map,border="grey",col=P5bis) #legend(locator(1), legend=paste("cluster",1:5), fill=1:5, bty="n",border="white") legend("topleft", inset=0.2,legend=paste("cluster",1:5), fill=1:5, bty="n",border="white") dev.off()

Let us construct a different type of matrix $D_1$ to take the neighborhood between the regions into account for clustering the 303 cities.

Two regions with contiguous boundaries, that is sharing one or more boundary point are considered as neighbours. Let us first build the adjacency matrix $A$.

#library(spdep) list.nb <- spdep::poly2nb(map,row.names = rownames(dat)) #list of neighbours of each city city_label[list.nb[[117]]] # list of the neighbours of BORDEAUX A <- spdep::nb2mat(list.nb,style="B") diag(A) <- 1 colnames(A) <- rownames(A) <- city_label A[1:5,1:5]

The dissimilarity matrix $D_1$ is build from the adjacency matrix $A$ with $D_1=1-A$.

D1 <- as.dist(1-A)

The same procedure for the choice of $\alpha$ is then used with this neighborhood dissimilarity matrix $D_1$.

cr <- choicealpha(D0,D1,range.alpha,K,graph=TRUE) cr$Q # proportion of explained pseudo inertia cr$Qnorm # normalized proportion of explained pseudo inertia

#postscript("plotQ-1.eps",width = 950,height = 489) plot(cr) #dev.off() #postscript("plotQnorm-1.eps") plot(cr,norm=TRUE) #dev.off()

tree <- hclustgeo(D0,D1,alpha=0.2) P5ter <- cutree(tree,5) sp::plot(map,border="grey",col=P5ter, main="5 clusters partition obtained with \n alpha=0.2 and neighbours dissimilarities",cex.main=0.8) legend("topleft", legend=paste("cluster",1:5), fill=1:5, bty="n",border="white",cex=0.8)

postscript("map-P5ter.eps") sp::plot(map,border="grey",col=P5ter) #legend(locator(1), legend=paste("cluster",1:5), fill=1:5, bty="n",border="white") legend("topleft", inset=0.2,legend=paste("cluster",1:5), fill=1:5, bty="n",border="white") dev.off()

With this kind of local dissimilaritis in $D_1$, the neighborhood within cohesion is always very small. To overcome this problem, we can use the plot the normalized proportion of explained inertia ($Qnorm$) instead the proportion of explained inertia ($Q$). The plot of $Qnorm$ suggest again $\alpha=0.2$.

tree <- hclustgeo(D0,D1,alpha=0.2) P5bis <- cutree(tree,5) sp::plot(map,border="grey",col=P5bis, main="5 clusters partition obtained with \n alpha=0.2 and neighborhood dissimilarities",cex.main=0.8) legend("topleft", legend=1:5, fill=1:5, col=P5,cex=0.8)

These plots can be obtained with the plot method **plot.choicealpha**.

range.alpha <- seq(0,1,0.1) K <- 5 cr <- choicealpha(D0,D1,range.alpha,K,graph=FALSE) ?plot.choicealpha plot(cr,cex=0.8,norm=FALSE,cex.lab=0.8,ylab="pev",col=3:4,legend=c("socio-demo","geo"), xlab="mixing parameter")

plot(cr,cex=0.8,norm=TRUE,cex.lab=0.8,ylab="pev",col=5:6,pch=5:6,legend=c("socio-demo","geo"), xlab="mixing parameter")

**Any scripts or data that you put into this service are public.**

Embedding an R snippet on your website

Add the following code to your website.

For more information on customizing the embed code, read Embedding Snippets.