Nothing
#' R function for features clustering on the basis of distances/area
#'
#' The function provides the facility to cluster the features of the input dataset on the basis of
#' either their (projected) coordinates (for points; SpatialPointsDataFrame class)
#' or of their area (for polygons; SpatialPolygonsDataFrame class). If a target feature dataset
#' (to.feat) is provided, the clustering will be based
#' on the distance of the x feature to the nearest to.feature. When a to.feature is specified,
#' the x feature (i.e., the feature that the user wants to cluster)
#' can be either a point (SpatialPointsDataFrame class), or a polyline (SpatialLinesDataFrame class)
#' , or a polygon (SpatialPolygonsDataFrame class) feature.
#' Notice that if all the x features overlap with all the to.feature, all the minimum distances will
#' be 0, and the function will trow an error.\cr
#'
#' If the to.feature is not provided, the function internally calculates a distance matrix
#' (based on the Euclidean Distance) on the basis of the points' coordinates or polygons' area.
#' If the to.feature is provided, the distance matrix will be based on the distance of the x feature
#' to the nearest to.feature.
#' A dendrogram is produced which depicts the hierarchical clustering based (by default) on the
#' Ward's agglomeration method; rectangles identify the selected cluster partition.
#' Besides the dendrogram, a silhouette plot is produced, which allows to measure how 'good' is the
#' selected cluster solution.\cr
#'
#' As for the latter, if the parameter 'part' is left empty (default), an optimal cluster solution
#' is obtained.
#' The optimal partition is selected via an iterative procedure which locates at which cluster
#' solution the highest average silhouette width is achieved.
#' If a user-defined partition is needed, the user can input the desired number of clusters using
#' the parameter 'part'.
#' In either case, an additional plot is returned besides the cluster dendrogram and the silhouette
#' plot; it displays a scatterplot in which the cluster solution (x-axis)
#' is plotted against the average silhouette width (y-axis). A black dot represent the partition
#' selected either by the iterative procedure or by the user.\cr
#'
#' Notice that in the silhouette plot, the labels on the left-hand side of the chart show the point
#' ID number and the cluster to which each point is closer.\cr
#'
#' Also, the function returns a plot showing the input dataset, with features colored by cluster
#' membership. Two new variables are added to the
#' shapefile's dataframe, storing a point ID number and the corresponding cluster membership.\cr
#'
#' The silhouette plot is obtained from the 'silhouette()' function out from the 'cluster' package
#' (https://cran.r-project.org/web/packages/cluster/index.html).\cr
#' For a detailed description of the silhouette plot, its rationale, and its interpretation, see:\cr
#' Rousseeuw P J. 1987. "Silhouettes: A graphical aid to the interpretation and validation of
#' cluster analysis", Journal of Computational and Applied Mathematics 20, 53-65
#' (http://www.sciencedirect.com/science/article/pii/0377042787901257)
#'
#' For the hierarchical clustering of features, see: Conolly, J., & Lake, M. (2006).
#' Geographic Information Systems in Archaeology. Cambridge: Cambridge University Press, 168-173.\cr
#'
#' @param x Dataset whose feature are to be clustered; either points (SpatialPointsDataFrame class)
#' or polygons (SpatialPolygonsDataFrame class);
#' if the to.feat is specified, x can also be a polylines feature (SpatialLinesDataFrame class).
#' @param to.feat Dataset (NULL by default) representing the feature the distance toward which is
#' used as basis for clustering x;
#' either points (SpatialPointsDataFrame class), polygons (SpatialPolygonsDataFrame class), or
#' polylines (SpatialLinesDataFrame).
#' @param aggl.meth Agglomeration method ("ward.D2" by default).
#' @param part Desired number of clusters; if NULL (default), an optimal partition is calculated
#' (see Details).
#' @param showID TRUE (default) or FALSE if the used wants or does not want the ID of the clustered
#' features to be displayed in the plot where
#' the features are colored by cluster membership.
#' @param oneplot TRUE (default) or FALSE if the user wants or does not want the plots to be
#' visualized in a single window.
#' @param cex.dndr.lab Set the size of the labels used in the dendrogram.
#' @param cex.sil.lab Set the size of the labels used in the silhouette plot.
#' @param cex.feat.lab Set the size of the labels used (if 'showID' is set to TRUE) to show the
#' clustered features' IDs.
#' @param col.feat.lab Set the color of the clustered features' IDs ('black' by default).
#' @param export TRUE or FALSE (default) if the user wants or does not want the clustered input
#' dataset to be exported;
#' if TRUE, the input dataset with a new variable indicating the cluster membership will be exported
#' as a shapefile.
#'
#' @return The function returns a list storing the following components \itemize{
##' \item{$dist.matrix: }{distance matrix}
##' \item{$avr.silh.width.by.n.of.clusters: }{average silhouette width by number of clusters}
##' \item{$partition.silh.data: }{silhouette data for the selected partition}
##' \item{$coord.or.area.or.min.dist.by.clust: }{coordinates, area, or distance to the nearest
##' to.feat coupled with cluster membership}
##' \item{$dist.stats.by.cluster: }{by-cluster summary statistics of the x feature distance to
##' the nearest to.feature}
##' \item{$dataset: }{the input dataset with two variables added ($feat_ID and $clust, the latter
##' storing the cluster membership)}
##' }
#'
#' @keywords featClust
#'
#' @export
#'
#' @importFrom rgeos gArea
#' @importFrom raster extent
#' @importFrom rgdal writeOGR
#'
#' @examples
#' data(springs)
#'
#' #perform the analysis and automatically select an optimal partition
#' res <- featClust(springs)
#'
#' #as above, but selecting a 3-cluster partition
#' res <- featClust(springs, part=3)
#'
#' #cluster springs on the basis of their distance to the nearest geological fault
#' res <- featClust(springs, faults)
#'
#' #cluster polygonal areas on the basis of their distance to the nearest spring
#' res <- featClust(polygons, springs)
#'
#' #cluster points on the basis of their distance to the nearest polygon
#' res <- featClust(points, polygons)
#'
#'
featClust <- function(x, to.feat=NULL, aggl.meth = "ward.D2", part=NULL, showID=TRUE, oneplot=TRUE, cex.dndr.lab = 0.85, cex.sil.lab = 0.75, cex.feat.lab=0.65, col.feat.lab="black", export=FALSE) {
#if there is no to.feature, if the input dataset is a SpatialPolygonsDataframe, create a dataframe storing polygons area;
#otherwise store points' coordinates; this is skipped if there actually is a to.feature
if(is.null(to.feat)==TRUE){
ifelse(class(x)[1]=="SpatialPolygonsDataFrame",
df <- data.frame(area=gArea(x, byid=TRUE)),
df <- as.data.frame(coordinates(x)))
}
#if there is a to.feature...
if(is.null(to.feat)==FALSE){
#calculate all the pairwise distances between x and to.feat
dist.matrix <- gDistance(x, to.feat, byid=TRUE)
#then calculate the minimum distances of x toward to.feat by appling the 'min' function column-wise
df <- as.data.frame(apply(dist.matrix, 2, FUN=min))
#give name to the df column storing the minimum distances
colnames(df)[1] <- "minim.dist."
}
#store the number of features
nx <- length(x)
#calculate the Euclidean distance
d <- dist(df)
#calculate the hier. clustering
fit <- hclust(d, method = aggl.meth)
#calculate the max number of clusters to be later used in the loop which iterates the calculation of the average silhouette width
max.ncl <- nx - 1
#create a slot to store the values of the average silhouette value at increasing numbers of cluster solutions
#this will be used inside the loop
sil.width.val <- numeric(max.ncl - 1)
#create a slot to store the increasing number of clusters whose silhouette is iteratively computed
sil.width.step <- c(2:max.ncl)
#calculate the average silhouette width at each increasing cluster solution
for (i in 2:max.ncl) {
counter <- i - 1
clust <- silhouette(cutree(fit, k = i), d)
sil.width.val[counter] <- mean(clust[, 3])
}
#create a dataframe that stores the number of cluster solutions and the corresponding average silhouette values
sil.res <- as.data.frame(cbind(sil.width.step, sil.width.val))
#if the user does not enter the desired partition, the latter is equal to the optimal one, ie the one with the largest average silhouette;
#otherwise use the user-defined partition
ifelse(is.null(part)==TRUE,
select.clst.num <- sil.res$sil.width.step[sil.res$sil.width.val == max(sil.res$sil.width.val)],
select.clst.num <- part)
#create the final silhouette data using the partition established at the preceding step
final.sil.data <- silhouette(cutree(fit, k = select.clst.num), d)
#add the cluster membership to the points coordinates dataframe
df$clust <- assignCluster(df, df, cutree(fit, k = select.clst.num))
#if there is a to.feat, calculate distance summary statistics by cluster
if(is.null(to.feat)==FALSE){
dist.summary <- tapply(df$minim.dist., df$clust, summary)
} else {
dist.summary <- "NA"
}
#add a variable to the input shapefile to store an ID for each feature
x$feat_ID <- 1:nx
#add a variable to the input shapefile to store the cluster membership taken from the df dataframe
x$clust <- df$clust
#set the layout of the plot visualization according to whether or not the parameter 'oneplot' is set to TRUE
if(oneplot==TRUE){
m <- rbind(c(1,2), c(3,4))
layout(m)
}
#plot the points with colour according to the cluster membership
#if there is not to.feat...
if(is.null(to.feat)==TRUE){
#plot the x feature
raster::plot(x,
pch=20,
col=x$clust,
main="Plot of the features colored by cluster membership",
cex.main=0.90,
axes=TRUE)
#show the ID of the feature if showID is equal to TRUE and the feature is NOT a polyline
if(showID==TRUE & class(x)[1]!="SpatialLinesDataFrame"){
raster::text(coordinates(x),
labels=x$feat_ID,
pos = 4,
cex=cex.feat.lab,
col=col.feat.lab)
}
}
#if there is a to.feat...
if(is.null(to.feat)==FALSE){
#calculate the extent of the x and to.feature
area.x <- (extent(x)[2] - extent(x)[1]) * (extent(x)[4] - extent(x)[3])
area.to.feat <- (extent(to.feat)[2] - extent(to.feat)[1]) * (extent(to.feat)[4] - extent(to.feat)[3])
#give precedence to plotting according to which feature has the larger extent
if(area.x > area.to.feat){
raster::plot(x,
pch=20,
col=x$clust,
main="Plot of the features colored by cluster membership",
cex.main=0.90,
axes=TRUE)
#show the ID of the feature if showID is equal to TRUE and the feature is NOT a polyline
if(showID==TRUE & class(x)[1]!="SpatialLinesDataFrame"){
raster::text(coordinates(x),
labels=x$feat_ID,
pos = 4,
cex=cex.feat.lab,
col=col.feat.lab)
}
raster::plot(to.feat, add=TRUE)
} else {
raster::plot(to.feat,
main="Plot of the features colored by cluster membership",
cex.main=0.90,
axes=TRUE)
raster::plot(x,
pch=20,
col=x$clust,
add=TRUE)
#show the ID of the feature if showID is equal to TRUE and the feature is NOT a polyline
if(showID==TRUE & class(x)[1]!="SpatialLinesDataFrame"){
raster::text(coordinates(x),
labels=x$feat_ID,
pos = 4,
cex=cex.feat.lab,
col=col.feat.lab)
}
}
}
#plot the dendrogram
graphics::plot(fit, main = "Clusters Dendrogram",
sub = paste0("\nAgglomeration method: ", aggl.meth),
xlab = "",
cex = cex.dndr.lab,
cex.main = 0.9,
cex.sub = 0.75)
#add the rectangles representing the selected partitions
solution <- rect.hclust(fit, k = select.clst.num, border="black")
#modify the row names of the final silhouette dataframe to also store the cluster to which each feature is closest
rownames(final.sil.data) <- paste(1:nx, final.sil.data[, 2], sep = "_")
#plot the silhouette chart
graphics::plot(final.sil.data,
cex.names = cex.sil.lab,
max.strlen = 30,
nmax.lab = nx + 1,
main = "Silhouette plot")
#add a line representing the average silhouette width
abline(v = mean(final.sil.data[, 3]), lty = 2)
#plot the average silhouette value at each cluster solution
graphics::plot(sil.res, xlab = "number of clusters",
ylab = "average silhouette width",
ylim = c(0, 1),
xaxt = "n",
type = "b",
main = "Average silhouette width vs. number of clusters",
sub = paste0("values on the y-axis represent the average silhouette width distribution at each cluster solution"),
cex.main=0.90,
cex.sub = 0.75)
axis(1, at = 0:max.ncl, cex.axis = 0.7)
#draw a black dot corresponding to the selected partition
graphics::points(x=select.clst.num,
y=sil.res[select.clst.num-1,2],
pch=20)
#draw the label selecting the row from the sil.res dataframe corresponding to the selected partition
graphics::text(x = select.clst.num,
y = sil.res[select.clst.num-1,2],
labels= round(sil.res[select.clst.num-1,2],3),
cex = 0.65,
pos = 3,
offset = 1.2,
srt = 90)
if(export==TRUE){
writeOGR(x, ".", "clustered_dataset", driver="ESRI Shapefile")
}
results <- list("dist.matrix"=d,
"avr.silh.width.by.n.of.clusters"=sil.res,
"partition.silh.data"=final.sil.data,
"coord.or.area.or.min.dist.by.clust"=df,
"dist.stats.by.cluster"=dist.summary,
"dataset"=x)
# restore the original graphical device's settings
par(mfrow = c(1,1))
return(results)
}
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.