constr.lshclust: Space- And Time-Constrained Least Squares Clustering...

View source: R/constr.lshclust.R

constr.lshclustR Documentation

Space- And Time-Constrained Least Squares Clustering (Experimental)

Description

Function constr.lshclust carries out space-constrained or time-constrained agglomerative clustering from a data matrix.

Usage

constr.lshclust(x, links, coords, chron = FALSE, output = "RSS")

Arguments

x

A data matrix

links

A list of edges (or links) connecting the points. May be omitted in some cases; see details and examples

coords

Coordinates of the observations (data rows) in matrix d. The coordinates are used for plotting maps of the clustering results; may be omitted. A matrix or data frame with two columns, following the convention of the Cartesian plane: first column for abscissa, second column for ordinates. See examples

chron

Logical (TRUE or FALSE) indicating whether a chronological (i.e. time-constrained) clustering should be calculated (default: chron = FALSE)

output

The type of edge lengths to return: square root sums of squares ("RSS", the default), sums of squares ("SS"), Euclidean distance between centroids ("D"), square Euclidean distance between centroids ("D2")

Details

Agglomerative clustering can be carried out with a constraint of spatial or temporal contiguity. This means that only the objects that are linked in links are considered to be candidates for clustering: the next pair of objects to cluster will be the pair that has the lowest dissimilarity value among the pairs that are linked.

The same rule applies during the subsequent clustering steps, which involve groups of objects: the list of links is updated after each agglomeration step. All objects that are neighbours of one of the components that have fused are now neighbours of the newly formed cluster.

The edges (links) are specified using argument links, which can be an object of class nb (see, e.g., tri2nb), an object of class listw (see, e.g., nb2listw), a two-element list or an object coercible as a such (e.g., a two-column dataframe), or a two-column matrix with each row representing an edge and the columns representing the two ends of the edges. For lists with more than two elements, as well as dataframes or matrices with more than two-columns only the first two elements or columns are used for the analysis. The edges are interpreted as being non directional; there is no need to specify an edge going from point a to point b and one going from point b to point a. While doing so is generally inconsequential for the analysis, it carries some penalty in terms of computation time. It is a good practice to place the nodes in increasing order of numbers from the top to the bottom and from the left to the right of the list but this is not mandatory. A word of caution: in cases where clusters with identical minimum distances occur, the order of the edges in the list may have an influence on the result. Alternative results would be statistically equivalent.

When argument link is omitted, regular (unconstrained) clustering is performed and a hclust-class object is returned unless argument chron = TRUE. When argument chron = TRUE, chronological clustering is performed, taking the order of observations as their positions in the sequence. Argument links is not used when chron = TRUE. Argument chron allows one to perform a chronological clustering in the case where observations are ordered chronologically. Here, the term "chronologically" should not be taken restrictively: the method remains applicable to other sequential data sets such as spatial series made of observations along a transect.

When the graph described by link is not entirely connected, the resulting disjoint clusters (or singletons) are merged in the order of their indices (ie. the two clusters with smallest indices are merged until all of them have been merged), the dissimilarity at which these clusters are merged ($height) is assumed to be a missing value (NA), and a warning message is issued to warn the user about the presence and number of disjoint clusters.

Memory storage and time to compute constrained clustering for N objects. — The least squares clustering procedure generally uses less computer memory as does the Lance and Williams algorithm implemented by function constr.hclust because it does not use dissimilarity matrices. Internally, the function makes to copies of data matrix x, one that is used to accumulate the values as the clusters are being formed and one that is squared and used to accumulate the squared values during the clustering process. The amount of memory needed to store accumulation arrays for N observations described by M variables as 64-bit double precision floating point variables (IEEE 754) is the 8*N*M*2 bytes. It scales linearly with increasing sample size. By contrast, the dissimilarity matrix used by the Lance and Williams algorithm needs 8*N*(N-1)/2 bytes of storage; as much storage as when M = (N-1)/4. Since M is much smaller than (N-1)/4 for many practical cases, especially those when N is very large (e.g., many hundred thousands or millions), performing least squares hierarchical clustering will be faster (See the Benchmarking example below), require less storage, and be applicable to cases with larger N than the distance-based Lance and Williams algorithm, but at the price of a single classificatory criterion (i.e., within-cluster least squares).

With large data sets, a manageable output describing the classification of the sites is obtained with function cutree(x, k) where k is the number of groups.

Value

A constr.hclust-class object.

Author(s)

Pierre Legendre pierre.legendre@umontreal.ca and Guillaume Guénard guillaume.guenard@umontreal.ca

References

Guénard, G. and P. Legendre. 2022. Hierarchical clustering with contiguity constraint in R. Journal of Statistical Software 103(7): 1-26 \Sexpr[results=rd]{tools:::Rd_expr_doi("10.18637/jss.v103.i07")}

Legendre, P. and L. Legendre. 2012. Numerical ecology, 3rd English edition. Elsevier Science BV, Amsterdam. \Sexpr[results=rd]{tools:::Rd_expr_doi("10.1016/S0304-3800(00)00291-X")}

See Also

plot.constr.hclust, hclust, and cutree

Examples


## First example: Artificial map data from Legendre & Legendre
##                (2012, Fig. 13.26): n = 16

dat <- c(41,42,25,38,50,30,41,43,43,41,30,50,38,25,42,41)
coord.dat <- matrix(c(1,3,5,7,2,4,6,8,1,3,5,7,2,4,6,8,
                      4.4,4.4,4.4,4.4,3.3,3.3,3.3,3.3,
                      2.2,2.2,2.2,2.2,1.1,1.1,1.1,1.1),16,2)


## Obtaining a list of neighbours:
library(spdep)
listW <- nb2listw(tri2nb(coord.dat), style="B")
links.mat.dat <- listw2mat(listW)
neighbors <- listw2sn(listW)[,1:2]

## Display the points:
plot(coord.dat, type='n',asp=1)
title("Delaunay triangulation")
text(coord.dat, labels=as.character(as.matrix(dat)), pos=3)
for(i in 1:nrow(neighbors))
    lines(rbind(coord.dat[neighbors[i,1],],
                coord.dat[neighbors[i,2],]))

## Clustering without a contiguity constraint;
## the result is represented as a dendrogram:
grpWD2_constr_lshclust <- constr.lshclust(x=dat, output="RSS")
plot(grpWD2_constr_lshclust, hang=-1)

## Clustering with a contiguity constraint described by a list of
## links:
grpWD2cst_constr_lshclust <-
    constr.lshclust(
        dat, neighbors,
        coord.dat, output="RSS")

## Plot the results on a map with k=3 clusters:
plot(grpWD2cst_constr_lshclust, k=3, links=TRUE, las=1, xlab="Eastings",
     ylab="Northings", cex=3, lwd=3)

## Generic functions from hclust can be used, for instance to obtain
## a list of members of each cluster:
cutree(grpWD2cst_constr_lshclust, k=3)

## Now with k=5 clusters:
plot(grpWD2cst_constr_lshclust, k=5, links=TRUE, las=1, xlab="Eastings",
     ylab="Northings", cex=3, lwd=3)
cutree(grpWD2cst_constr_lshclust, k=5)

## End of the artificial map example

## Third example: Scotch Whiskey distilleries clustered using four tasting
## scores (nose, body, palate, and finish) constrained with respect to the
## distillery locations in Scotland.

## Documentation file about the Scotch Whiskey data: ?ScotchWhiskey

data(ScotchWhiskey)

## Cluster analyses for the colour, nose, body, palate, and finish using
## least squares on the basis of the Mahalanobis metric.

## Combining the data matrices:
cols <-
    contr.treatment(
        n = nlevels(ScotchWhiskey$colour)
    )[ScotchWhiskey$colour,]
dimnames(cols) <- list(
    rownames(ScotchWhiskey$geo[,"Distillery"]),
    levels(ScotchWhiskey$colour)[-1L]
)
WhiskeyDat <- cbind(
    cols,
    ScotchWhiskey$body,
    ScotchWhiskey$palate,
    ScotchWhiskey$finish
)
rm(cols)

## Transforming WhiskeyDat into an orthonormal matrix using the Cholesky
## factorization: the least squares will relate to the Mahalanobis metric.

WhiskeyTr <- WhiskeyDat %*% solve(chol(cov(WhiskeyDat)))
grpWD2cst_ScotchWhiskey <-
    constr.lshclust(
        x=WhiskeyTr,
        links=ScotchWhiskey$neighbors@data,
        coords=ScotchWhiskey$geo@coords/1000
    )

## A fonction to plot the Whiskey clustering results:

plotWhiskey <- function(k) {
    par(fig=c(0,1,0,1))
    plot(grpWD2cst_ScotchWhiskey, k=k, links=TRUE, las=1,
         xlab="Eastings (km)", ylab="Northings (km)", cex=0.1, lwd=3)
    text(ScotchWhiskey$geo@coords/1000,labels=1:length(ScotchWhiskey$geo))
    legend(x=375, y=700, lty=1L, lwd=3, col=rainbow(1.2*k)[1L:k],
           legend=sprintf("Group %d",1:k), cex=1.25)
    SpeyZoom <- list(xlim=c(314.7,342.2), ylim=c(834.3,860.0))
    rect(xleft=SpeyZoom$xlim[1L], ybottom=SpeyZoom$ylim[1L], col="#E6E6E680",
         xright=SpeyZoom$xlim[2L], ytop=SpeyZoom$ylim[2L], lwd=2, lty=1L)
    par(fig=c(0.01,0.50,0.46,0.99), new=TRUE)
    plot(grpWD2cst_ScotchWhiskey, xlim=SpeyZoom$xlim,
         ylim=SpeyZoom$ylim, k=k, links=TRUE, las=1, xlab="", ylab="",
         cex=0.1, lwd=3, axes=FALSE)
    text(ScotchWhiskey$geo@coords/1000,labels=1:length(ScotchWhiskey$geo))
    rect(xleft=SpeyZoom$xlim[1L], ybottom=SpeyZoom$ylim[1L],
         xright=SpeyZoom$xlim[2L], ytop=SpeyZoom$ylim[2L], lwd=2, lty=1L)
}

## Plot the clustering results on the map of Scotland for 5 groups.
## The inset map shows the Speyside distilleries in detail:

plotWhiskey(k=5L)

## End of the Scotch Whiskey tasting data example

## Not run: 

## Third example: Fish community composition along the Doubs River,
## France. The sequence is analyzed as a case of chronological
## clustering, substituting space for time.

library(ade4)
data(doubs, package="ade4")

## Using the Hellinger metric on the species abundances:
Doubs.hel <- sqrt(doubs$fish / rowSums(doubs$fish))
Doubs.hel[rowSums(doubs$fish)==0,] <- 0
grpWD2cst_fish <- constr.lshclust(x=Doubs.hel, chron=TRUE,
                                  coords=as.matrix(doubs$xy))

plot(grpWD2cst_fish, k=5, las=1, xlab="Eastings (km)",
     ylab="Northings (km)", cex=3, lwd=3)

## Repeat the plot with other values of k (number of groups)

## End of the Doubs River fish assemblages example


## Benchmarking example
## Benchmarking can be used to estimate computation time for different
## values of N (number of sites) and M (number of variables).

require(magrittr)
require(pryr)

benchmark <- function(N, M) {
    res <- matrix(NA,length(N)*length(M),4L) %>% as.data.frame
    colnames(res) <- c("N(obs)","M(var)","Storage (MiB)","Time (sec)")
    res[[1L]] <- rep(N,length(M))
    res[[2L]] <- rep(M,each=length(N))
    
    for(i in 1:nrow(res)) {
        ## i=1L
        cat("N:",res[i,1L]," M:",res[i,2L],"\n")
        coords.mem <- cbind(x=runif(res[i,1L],-1,1),y=runif(res[i,1L],-1,1))
        if(i>1L) rm(dat.mem) ; gc()
        dat.mem <- try(matrix(runif(res[i,1L]*res[i,2L],0,1),
                              res[i,1L],res[i,2L]))
        if(any(class(dat.mem)=="try-error"))
            break
        neighbors.mem <-
            (coords.mem %>%
             tri2nb %>%
             nb2listw(style="B") %>%
             listw2sn)[,1:2]
        {start.time = Sys.time()
         res.mem <- try(constr.lshclust(dat.mem, neighbors.mem))
         end.time = Sys.time()}
        if(any(class(res.mem)=="try-error"))
            break
        res[i,3L] <- (3*object_size(dat.mem) + object_size(neighbors.mem) +
                      object_size(res.mem))/1048576  # n. bytes per MiB
        res[i,4L] <- as.numeric(end.time) - as.numeric(start.time)
    }
    res[["N(obs)"]] <- as.integer(res[["N(obs)"]])
    res[["M(var)"]] <- as.integer(res[["M(var)"]])
    res
}

N <- c(1000,2000,5000,10000,20000,50000)
M <- c(1,2,5,10,20,50)
res <- benchmark(N, M)

## Plotting the results:
par(mar=c(3,6,2,2),mfrow=c(2L,1L))
barplot(height = matrix(res[,"Time (sec)"],length(N),length(M)),
        names.arg = N, ylab = "Time (seconds)\n", xlab = "",
        las = 1L, log = "y", beside=TRUE)
par(mar=c(5,6,0,2))
barplot(height = matrix(res[,"Storage (MiB)"],length(N),length(M)),
        names.arg = N, ylab = "Total storage (MB)\n",
        xlab = "Number of observations", las = 1L, log = "y", beside=TRUE)

## Examine the output file
res

## Analyze how computing time and storage scales up with increasing number
## of observations and variables.
lm(log(`Time (sec)`)~log(`N(obs)`)+log(`M(var)`), data=res)
lm(log(`Storage (MiB)`)~log(`N(obs)`)+log(`M(var)`), data=res)

## End(Not run)

## End of the benchmarking example

## End of examples


guenardg/constr.hclust documentation built on July 13, 2024, 3:03 p.m.