# Introduction to Clustgeo In ClustGeo: Hierarchical Clustering with Spatial Constraints

knitr::opts_chunkset(echo = TRUE,eval=TRUE,fig.align="center",fig.width = 7,fig.height = 6)  The package Clustgeo is used here for clusteringn=303$french cities. The R dataset estuary is a list of three objects: a matrix dat with the description of the 303 cities on 4 socio-demographic variables, a matrix D.geo with the distances between the town hall of the 303 cities, and an object map of class "SpatialPolygonsDataFrame". library(ClustGeo) data(estuary) dat <- estuary$dat
D.geo <- estuary$D.geo map <- estuary$map


## Ward hierarchical clustering with non euclidean dissimilarity measures and non uniform weights.

In this section, we show how to implement and interpret Ward hierarchical clustering when the dissimilarities are not necessary euclidean and the weights are not uniform.

We apply first standard hclust function.

n <- nrow(dat)
D <- dist(dat)
Delta <- D^2/(2*n)
tree <- hclust(Delta,method="ward.D")


We can check that the sum of the heights in hclust's dendrogram is equal to the total inertia of the dataset if the dissimilarities are euclidean and to the pseudo inertia otherwise.

?inertdiss #pseudo inertia when dissimilarities are non euclidean
?inert #standard inertia otherwise
inertdiss(D) #pseudo inertia
inert(dat) # inertia
sum(tree$height)  The same result can be obtained with the function hclustgeo which is a wrapper of hclust taking$D$as input instead of$\Delta$. tree <- hclustgeo(D) sum(tree$height)


When the weights are not uniform, the calculation of the matrix $\Delta$ takes a few lines of code and the use of the function hclustgeo may be more convenient than hclust.

map <- estuary$map wt <- map@data$POPULATION # non uniform weights
# with hclust
Delta <-  D
for (i in 1:(n-1)) {
for (j in (i+1):n) {
Delta[n*(i-1) - i*(i-1)/2 + j-i] <-
Delta[n*(i-1) - i*(i-1)/2 + j-i]^2*wt[i]*wt[j]/(wt[i]+wt[j]) }}
tree <- hclust(Delta,method="ward.D",members=wt)
sum(tree$height) #with hclustgeo tree <- hclustgeo(D,wt=wt) sum(tree$height)


## Ward-like hierarchical clustering with geographical constraints.

Now we consider two dissimilarity matrices :

• $D_0$ is the euclidean distance matrix performed with the socio-demographic variables
• $D_1$ is a second dissimilarity matrix used to take the geographical proximity between the cities into account.

The function hclustgeo implements a Ward-like hierarchical clustering algorithm including soft contiguity constraints. This algorithm takes as input two dissimilarity matrices D0 and D1 and a mixing parameter $\alpha$ between 0 an 1. The first matrix gives the dissimilarities in the "feature space" (here socio-demographic variables). The second matrix gives the dissimilarities in the "constraint" space (here the matrix of geographical distances or the matrix build from the contiguity matrix $C$). The mixing parameter $\alpha$ sets the importance of the constraint in the clustering procedure. We present here a procedure to choose the mixing parameter $\alpha$ with the function choicealpha.

### Choice of a partition with $D_0$ only

First, we choose $K=5$ clusters from the Ward dendrogram obtained with the socio-demographic variables ($D_0$) only.

library(ClustGeo)
data(estuary)
dat <- estuary$dat D.geo <- estuary$D.geo
map <- estuary$map D0 <- dist(dat) # the socio-demographic distances tree <- hclustgeo(D0) plot(tree,hang=-1,label=FALSE, xlab="",sub="", main="Ward dendrogram with D0 only",cex.main=0.8,cex=0.8,cex.axis=0.8,cex.lab=0.8) #plot(tree,hang=-1,xlab="",sub="",main="Ward dendrogram with D0 only", # cex.main=0.8,cex=0.8,labels=city_label,cex.axis=0.8,cex.lab=0.8) rect.hclust(tree,k=5,border=c(4,5,3,2,1)) legend("topright", legend= paste("cluster",1:5), fill=1:5, cex=0.8,bty="n",border="white")  postscript("dendroD0.eps") #plot(tree,hang=-1,label=FALSE, xlab="",sub="", # cex=0.8,cex.axis=0.8,cex.lab=0.8,main="") plot(tree,hang=-1,label=FALSE, xlab="",sub="", main="") rect.hclust(tree,k=5,border=c(4,5,3,2,1)) legend("topright", legend= paste("cluster",1:5), fill=1:5,bty="n",border="white") dev.off()  We can use the map given with the estuary data. # the map of the cities is an object of class "SpatialPolygonsDataFrame" class(map) # the object map contains several informations names(map) head(map@data[,4:8]) # we check that the cities in map are the same than those in X identical(as.vector(map$"INSEE_COM"),rownames(dat))

# now we plot the cities on the map with the name of four cities
cr$Qnorm # normalized proportion of explained pseudo inertia  #postscript("plotQ-1.eps",width = 950,height = 489) plot(cr) #dev.off() #postscript("plotQnorm-1.eps") plot(cr,norm=TRUE) #dev.off()  tree <- hclustgeo(D0,D1,alpha=0.2) P5ter <- cutree(tree,5) sp::plot(map,border="grey",col=P5ter, main="5 clusters partition obtained with \n alpha=0.2 and neighbours dissimilarities",cex.main=0.8) legend("topleft", legend=paste("cluster",1:5), fill=1:5, bty="n",border="white",cex=0.8)  postscript("map-P5ter.eps") sp::plot(map,border="grey",col=P5ter) #legend(locator(1), legend=paste("cluster",1:5), fill=1:5, bty="n",border="white") legend("topleft", inset=0.2,legend=paste("cluster",1:5), fill=1:5, bty="n",border="white") dev.off()  With this kind of local dissimilaritis in$D_1$, the neighborhood within cohesion is always very small. To overcome this problem, we can use the plot the normalized proportion of explained inertia ($Qnorm$) instead the proportion of explained inertia ($Q$). The plot of$Qnorm$suggest again$\alpha=0.2$. #### Modified partition obtained with$\alpha=0.2\$.

tree <- hclustgeo(D0,D1,alpha=0.2)
P5bis <- cutree(tree,5)
sp::plot(map,border="grey",col=P5bis, main="5 clusters partition obtained with \n alpha=0.2 and neighborhood dissimilarities",cex.main=0.8)
legend("topleft", legend=1:5, fill=1:5, col=P5,cex=0.8)


### The method plot.choicealpha

These plots can be obtained with the plot method plot.choicealpha.

range.alpha <- seq(0,1,0.1)
K <- 5
cr <- choicealpha(D0,D1,range.alpha,K,graph=FALSE)
?plot.choicealpha
plot(cr,cex=0.8,norm=FALSE,cex.lab=0.8,ylab="pev",col=3:4,legend=c("socio-demo","geo"),
xlab="mixing parameter")

plot(cr,cex=0.8,norm=TRUE,cex.lab=0.8,ylab="pev",col=5:6,pch=5:6,legend=c("socio-demo","geo"),
xlab="mixing parameter")


## Try the ClustGeo package in your browser

Any scripts or data that you put into this service are public.

ClustGeo documentation built on May 2, 2019, 10:15 a.m.