# fhclustd: Hierarchic cluster analysis of probability densities

In dad: Three-Way / Multigroup Data Analysis Through Densities

 fhclustd R Documentation

## Hierarchic cluster analysis of probability densities

### Description

Performs functional hierarchic cluster analysis of probability densities by applying `hclust` to the distance matrix between the `T` densities. It returns an object of class `fhclustd`.

### Usage

``````
fhclustd(xf, group.name = "group", gaussiand = TRUE,
         distance = c("jeffreys", "hellinger", "wasserstein", "l2", "l2norm"),
         windowh = NULL, data.centered = FALSE, data.scaled = FALSE,
         common.variance = FALSE, sub.title = "", filename = NULL,
         method.hclust = "complete")
``````

### Arguments

`xf` object of class `"folder"` or data frame.

• If it is an object of class `"folder"`, its elements are data frames with `p` numeric columns. A non-numeric column raises an error. The `t^{th}` element (`t = 1, \ldots, T`) corresponds to the `t^{th}` group.

• If it is a data frame, the column whose name is given by the `group.name` argument is a factor giving the groups. All the other columns must be numeric; otherwise, there is an error.

`group.name` string. If `xf` is an object of class `"folder"`, it is the name of the grouping variable in the returned results. The default is `group.name = "group"`. If `xf` is a data frame, it is the name of the column of `xf` containing the groups.

`gaussiand` logical. If `TRUE` (default), the probability densities are assumed to be Gaussian. If `FALSE`, densities are estimated using the Gaussian kernel method. If `distance` is `"hellinger"`, `"jeffreys"` or `"wasserstein"`, `gaussiand` is necessarily `TRUE` (see Details).

`distance` the distance or divergence used to compute the distance matrix between the densities. It can be:

• `"jeffreys"` (default): Jeffreys measure (symmetrised Kullback-Leibler divergence),

• `"hellinger"`: the Hellinger (Matusita) distance,

• `"wasserstein"`: the Wasserstein distance,

• `"l2"`: the `L^2` distance,

• `"l2norm"`: the densities are normed and the `L^2` distance between these normed densities is used.

If `gaussiand = FALSE`, the densities are estimated by the Gaussian kernel method and the distance can be `"l2"` (default) or `"l2norm"`.

`windowh` either a list of `T` bandwidths (one per density associated with a group), or a strictly positive number. If `windowh = NULL` (default), the bandwidths are automatically computed. See Details. Ignored when `distance` is `"hellinger"`, `"jeffreys"` or `"wasserstein"` (see Details).

`data.centered` logical. If `TRUE` (default is `FALSE`), the data of each group are centered.

`data.scaled` logical. If `TRUE` (default is `FALSE`), the data of each group are centered (even if `data.centered = FALSE`) and scaled.

`common.variance` logical. If `TRUE`, a common covariance matrix (or correlation matrix if `data.scaled = TRUE`), computed on the whole data, is used. If `FALSE` (default), a covariance (or correlation) matrix per group is used.

`sub.title` string. If provided, the subtitle for the graphs.

`filename` string. Name of the file in which the results are saved. By default (`filename = NULL`) the results are not saved.

`method.hclust` the agglomeration method to be used for the clustering. See the `method` argument of the `hclust` function.

### Details

In order to compute the distances/dissimilarities between the groups, the `T` probability densities `f_t` corresponding to the `T` groups of individuals are either parametrically estimated (`gaussiand = TRUE`) or estimated using the Gaussian kernel method (`gaussiand = FALSE`). In the latter case, the `windowh` argument provides the list of the bandwidths to be used. Notice that in the multivariate case (`p > 1`), the bandwidths are positive-definite matrices. The distances between the `T` groups of individuals are given by the distances or dissimilarities, chosen through the `distance` argument, between the `T` probability densities `f_t` corresponding to these groups. The `hclust` function is then applied to the distance matrix to perform the hierarchical clustering on the `T` groups.
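The two steps described above (a distance matrix between estimated densities, then `hclust`) can be sketched in base R. This is only an illustration under simplifying assumptions: a single variable (`p = 1`), Gaussian densities estimated by the mean and standard deviation of each group, the built-in `iris` data, and a hypothetical helper `l2_gauss` that is not part of the dad package. It is not the package's implementation.

```r
# Sketch only: univariate Gaussian densities, base R (stats) only.
# `l2_gauss` is a hypothetical helper, not a function of the dad package.

# L2 distance between N(m1, s1^2) and N(m2, s2^2), using the identity
#   integral of N(m1, s1^2) * N(m2, s2^2) = dnorm(m1 - m2, 0, sqrt(s1^2 + s2^2))
l2_gauss <- function(m1, s1, m2, s2) {
  ff <- dnorm(0, 0, sqrt(2 * s1^2))            # integral of f^2
  gg <- dnorm(0, 0, sqrt(2 * s2^2))            # integral of g^2
  fg <- dnorm(m1 - m2, 0, sqrt(s1^2 + s2^2))   # integral of f * g
  sqrt(max(ff + gg - 2 * fg, 0))
}

# One group per species; each density is estimated parametrically
groups <- split(iris$Sepal.Length, iris$Species)
m <- sapply(groups, mean)
s <- sapply(groups, sd)

# T x T distance matrix between the estimated densities
nT <- length(groups)
D <- matrix(0, nT, nT, dimnames = list(names(groups), names(groups)))
for (i in seq_len(nT)) {
  for (j in seq_len(nT)) {
    D[i, j] <- l2_gauss(m[i], s[i], m[j], s[j])
  }
}

# Hierarchical clustering of the groups from that matrix
hc <- hclust(as.dist(D), method = "complete")
print(round(D, 3))
print(cutree(hc, k = 2))
```

The `method` passed to `hclust` plays the role of the `method.hclust` argument.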

If `windowh` is a numerical value `h`, the bandwidth is of the form `h S`, where `S` is either the square root of the covariance matrix (`p > 1`) or the standard deviation (`p = 1`) of the estimated density.

If `windowh = NULL` (default), `h` in the above formula is computed using the `bandwidth.parameter` function.
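As a rough illustration of a bandwidth of the form `h S`, the sketch below builds `S` as the symmetric square root of the sample covariance matrix. The helpers `normal_reference_h` and `bandwidth_matrix` are hypothetical names, not functions of the dad package, and the normal reference rule is used here only as an assumed stand-in for an automatic choice of `h`; the value returned by `bandwidth.parameter` may differ.

```r
# Sketch only; `normal_reference_h` and `bandwidth_matrix` are hypothetical
# helpers, not functions of the dad package.

# Normal reference rule for a Gaussian kernel in dimension p with n observations.
# Assumption: used here as a stand-in for the automatic choice of h.
normal_reference_h <- function(p, n) (4 / (n * (p + 2)))^(1 / (p + 4))

# Bandwidth h * S, with S the symmetric square root of the covariance matrix
# (for p = 1 this reduces to h times the standard deviation).
bandwidth_matrix <- function(x, h = NULL) {
  x <- as.matrix(x)
  if (is.null(h)) h <- normal_reference_h(ncol(x), nrow(x))
  e <- eigen(cov(x), symmetric = TRUE)
  S <- e$vectors %*% diag(sqrt(pmax(e$values, 0)), ncol(x)) %*% t(e$vectors)
  h * S
}

x <- iris[iris$Species == "setosa", c("Sepal.Length", "Sepal.Width")]
H <- bandwidth_matrix(x)          # automatic h
H1 <- bandwidth_matrix(x, h = 1)  # h = 1, so H1 is S itself
print(H)
```

With `h = 1` the result is `S` itself, so multiplying it by itself should recover the covariance matrix.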

The distance or dissimilarity between the estimated densities is either the `L^2` distance, the Hellinger distance, Jeffreys measure (symmetrised Kullback-Leibler divergence) or the Wasserstein distance.

• If it is the `L^2` distance (`distance="l2"` or `distance="l2norm"`), the densities can be either parametrically estimated or estimated using the Gaussian kernel.

• If it is the Hellinger distance (`distance="hellinger"`), Jeffreys measure (`distance="jeffreys"`) or the Wasserstein distance (`distance="wasserstein"`), the densities are considered Gaussian and necessarily parametrically estimated.
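For Gaussian densities the three measures listed in the second case have closed forms. The sketch below gives them in the univariate case only, with hypothetical helper names that are not part of the dad package. Normalisation conventions vary between references (the Hellinger distance in particular is sometimes defined up to a constant factor), so the values need not match those returned by `fhclustd`.

```r
# Sketch only: closed forms for two univariate Gaussian densities
# N(m1, s1^2) and N(m2, s2^2). Hypothetical helpers, not dad functions.

# Hellinger distance, normalised here as sqrt(1 - Bhattacharyya coefficient)
hellinger_gauss <- function(m1, s1, m2, s2) {
  bc <- sqrt(2 * s1 * s2 / (s1^2 + s2^2)) *
    exp(-(m1 - m2)^2 / (4 * (s1^2 + s2^2)))
  sqrt(max(1 - bc, 0))
}

# Jeffreys measure: KL(f || g) + KL(g || f); the log terms cancel
jeffreys_gauss <- function(m1, s1, m2, s2) {
  d2 <- (m1 - m2)^2
  (s1^2 + d2) / (2 * s2^2) + (s2^2 + d2) / (2 * s1^2) - 1
}

# 2-Wasserstein distance
wasserstein_gauss <- function(m1, s1, m2, s2) {
  sqrt((m1 - m2)^2 + (s1 - s2)^2)
}

print(hellinger_gauss(0, 1, 1, 2))
print(jeffreys_gauss(0, 1, 1, 1))
print(wasserstein_gauss(0, 1, 3, 5))
```

All three functions return 0 for identical parameters and are symmetric in their two densities, which is what `hclust` expects of a dissimilarity.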

### Value

Returns an object of class `fhclustd`, that is a list including:

`distances` matrix of the distances between the estimated densities, computed with the measure selected by `distance` (for instance the `L^2`-distances).

`clust` an object of class `hclust`.

### Author(s)

Rachid Boumaza, Pierre Santagostini, Smail Yousfi, Gilles Hunault, Sabine Demotes-Mainard

### See Also

`fdiscd.predict`, `fdiscd.misclass`

### Examples

``````
data(castles.dated)
stones <- castles.dated$stones
periods <- castles.dated$periods

periods123 <- periods[periods$period %in% 1:3, "castle"]
stones123 <- stones[stones$castle %in% periods123, ]
stones123$castle <- as.factor(as.character(stones123$castle))
yf <- as.folder(stones123)

# Jeffreys measure (default):

resultjef <- fhclustd(yf)
print(resultjef)
print(resultjef, dist.print = TRUE)
plot(resultjef)
plot(resultjef, hang = -1)

# Use cutree (stats package) to get the partition
cutree(resultjef$clust, k = 1:4)
cutree(resultjef$clust, k = 5)
cutree(resultjef$clust, h = 0.041)

# Applied to a data frame (Jeffreys measure):

resultdf <- fhclustd(stones123, group.name = "castle")
print(resultdf)

# Use cutree (stats package) to get the partition
cutree(resultdf$clust, k = 1:4)
cutree(resultdf$clust, k = 5)
cutree(resultdf$clust, h = 0.041)

# Hellinger distance:

resulthel <- fhclustd(yf, distance = "hellinger")
print(resulthel)
print(resulthel, dist.print = TRUE)
plot(resulthel)
plot(resulthel, hang = -1)

# Use cutree (stats package) to get the partition
cutree(resulthel$clust, k = 1:4)
cutree(resulthel$clust, k = 5)
cutree(resulthel$clust, h = 0.041)

## Not run:
# L2-distance:

xf <- as.folder(stones)
result <- fhclustd(xf, distance = "l2")
print(result)
print(result, dist.print = TRUE)
plot(result)
plot(result, hang = -1)

# Use cutree (stats package) to get the partition
cutree(result$clust, k = 1:5)
cutree(result$clust, k = 5)
cutree(result$clust, h = 0.18)

## End(Not run)

periods123 <- periods[periods$period %in% 1:3, "castle"]
stones123 <- stones[stones$castle %in% periods123, ]
stones123$castle <- as.factor(as.character(stones123$castle))
yf <- as.folder(stones123)
result123 <- fhclustd(yf, distance = "l2")
print(result123)
print(result123, dist.print = TRUE)
plot(result123)
plot(result123, hang = -1)

# Use cutree (stats package) to get the partition
cutree(result123$clust, k = 1:4)
cutree(result123$clust, k = 5)
cutree(result123$clust, h = 0.041)
``````

dad documentation built on Aug. 30, 2023, 5:06 p.m.