fhclustd: Hierarchic cluster analysis of probability densities

fhclustd    R Documentation

Description

Performs functional hierarchic cluster analysis of probability densities: the hclust function is applied to the matrix of distances between the T densities. Returns an object of class fhclustd.

Usage

fhclustd(xf, group.name = "group", gaussiand = TRUE, distance = c("jeffreys",
             "hellinger", "wasserstein", "l2", "l2norm"), windowh = NULL,
             data.centered = FALSE, data.scaled = FALSE, common.variance = FALSE,
             sub.title = "", filename = NULL, method.hclust = "complete")

Arguments

xf

object of class "folder" or data.frame.

  • If it is an object of class "folder", its elements are data frames with p numeric columns. Non-numeric columns cause an error. The t-th element (t = 1, ..., T) corresponds to the t-th group.

  • If it is a data frame, the column whose name is given by the group.name argument is a factor giving the groups. All other columns must be numeric; otherwise, an error is raised.

group.name

string.

  • If xf is an object of class "folder", it is the name of the grouping variable in the returned results. The default is group.name = "group".

  • If xf is a data frame, it is the name of the column of xf containing the groups.

gaussiand

logical. If TRUE (default), the probability densities are assumed to be Gaussian. If FALSE, the densities are estimated using the Gaussian kernel method.

If distance is "hellinger", "jeffreys" or "wasserstein", gaussiand is necessarily TRUE (see Details).

distance

The distance or divergence used to compute the distance matrix between the densities. It can be:

  • "jeffreys" (default) Jeffreys measure (symmetrised Kullback-Leibler divergence),

  • "hellinger" the Hellinger (Matusita) distance,

  • "wasserstein" the Wasserstein distance,

  • "l2" the L^2 distance,

  • "l2norm" the densities are normed and the L^2 distance between these normed densities is used.

If gaussiand = FALSE, the densities are estimated by the Gaussian kernel method and the distance can be "l2" (default) or "l2norm".

windowh

either a list of T bandwidths (one per density, each density being associated with a group), or a strictly positive number. If windowh = NULL (default), the bandwidths are computed automatically. See Details.

Ignored when distance is "hellinger", "jeffreys" or "wasserstein" (see Details).

data.centered

logical. If TRUE (default is FALSE), the data of each group are centered.

data.scaled

logical. If TRUE (default is FALSE), the data of each group are centered (even if data.centered = FALSE) and scaled.
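
The effect of per-group centering and scaling can be illustrated in base R. This is only a sketch on hypothetical toy data (the data frame toy and its columns are made up for the illustration); it is not the package's internal code, and it assumes the usual standard deviation with denominator n - 1.

```r
# Hypothetical toy data: two groups measured on one numeric variable
toy <- data.frame(
  group = factor(rep(c("a", "b"), each = 4)),
  x = c(1, 2, 3, 4, 10, 20, 30, 40)
)

# One data frame per group, as in a "folder"
by_group <- split(toy["x"], toy$group)

# Centering only: each group gets mean 0, its spread is unchanged
centered <- lapply(by_group, scale, center = TRUE, scale = FALSE)

# Centering and scaling: each group gets mean 0 and standard deviation 1
scaled <- lapply(by_group, scale, center = TRUE, scale = TRUE)

sapply(centered, colMeans)
sapply(scaled, function(m) c(mean = mean(m), sd = sd(m)))
```

After scaling, the groups differ only in the shape of their distributions, no longer in location or spread.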

common.variance

logical. If TRUE, a common covariance matrix (or correlation matrix if data.scaled = TRUE), computed on the whole data, is used. If FALSE (default), one covariance (or correlation) matrix per group is used.

sub.title

string. If provided, the subtitle for the graphs.

filename

string. Name of the file in which the results are saved. By default (filename = NULL) the results are not saved.

method.hclust

the agglomeration method to be used for the clustering. See the method argument of the hclust function.
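
How the agglomeration method changes the resulting tree can be seen on a small example using only hclust from the stats package. The four points below are hypothetical and stand in for a distance matrix between densities.

```r
# Four hypothetical points on a line, forming a "chain"
pts <- c(a = 0, b = 1, c = 2.1, d = 3.3)
d <- dist(pts)

hc_single <- hclust(d, method = "single")
hc_complete <- hclust(d, method = "complete")

# Single linkage follows the chain and isolates the last point,
# complete linkage splits the chain into two compact pairs
cutree(hc_single, k = 2)
cutree(hc_complete, k = 2)
```

With single linkage the two-cluster partition is {a, b, c} and {d}; with complete linkage (the default of method.hclust) it is {a, b} and {c, d}.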

Details

To compute the distances or dissimilarities between the groups, the T probability densities f_t corresponding to the T groups of individuals are estimated either parametrically, as Gaussian densities (gaussiand = TRUE), or by the Gaussian kernel method (gaussiand = FALSE). In the latter case, the windowh argument provides the bandwidths to be used; in the multivariate case (p > 1), the bandwidths are positive-definite matrices. The distances between the T groups are the distances, in the sense of the distance argument, between the corresponding densities f_t. The hclust function is then applied to this distance matrix to perform the hierarchical clustering of the T groups.
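
The steps of this paragraph can be sketched in base R for the Gaussian case with the Jeffreys measure. This is a minimal illustration, not the code of fhclustd: the toy data and the helper jeffreys_mv are hypothetical, and the closed form used is the standard symmetrised Kullback-Leibler divergence between two multivariate Gaussian densities.

```r
set.seed(1)

# Hypothetical toy data: three groups, p = 2 numeric variables
toy <- data.frame(
  group = factor(rep(c("a", "b", "c"), each = 50)),
  x = c(rnorm(50, 0), rnorm(50, 0.2), rnorm(50, 5)),
  y = c(rnorm(50, 0), rnorm(50, 0.1), rnorm(50, 5))
)

# Step 1: parametric (Gaussian) estimation, one mean and covariance per group
groups <- split(toy[, c("x", "y")], toy$group)
means <- lapply(groups, colMeans)
covs <- lapply(groups, var)

# Jeffreys measure between N(m1, S1) and N(m2, S2) (hypothetical helper)
jeffreys_mv <- function(m1, S1, m2, S2) {
  p <- length(m1)
  S1i <- solve(S1)
  S2i <- solve(S2)
  d <- m1 - m2
  0.5 * sum(diag(S1i %*% S2 + S2i %*% S1)) - p +
    0.5 * drop(t(d) %*% (S1i + S2i) %*% d)
}

# Step 2: T x T matrix of dissimilarities between the estimated densities
nm <- names(groups)
D <- matrix(0, length(nm), length(nm), dimnames = list(nm, nm))
for (i in seq_along(nm)) {
  for (j in seq_along(nm)) {
    D[i, j] <- jeffreys_mv(means[[i]], covs[[i]], means[[j]], covs[[j]])
  }
}

# Step 3: hierarchical clustering of the groups
hc <- hclust(as.dist(D), method = "complete")
part <- cutree(hc, k = 2)
part
```

Groups a and b, whose densities nearly coincide, fall in the same cluster, while group c is left on its own.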

If windowh is a number h, the bandwidth matrix is of the form h S, where S is either the square root of the covariance matrix (p > 1) or the standard deviation (p = 1) of the estimated density.

If windowh = NULL (default), h in the above formula is computed using the bandwidth.parameter function.
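
For p = 1, a bandwidth of the form h S can be sketched in base R as follows. The value of h below uses the normal reference rule h = (4 / (n (p + 2)))^(1 / (p + 4)); that this is what bandwidth.parameter computes is an assumption to be checked against that function's help page. The sample x and the function kde are hypothetical.

```r
set.seed(2)

# Hypothetical sample for one group, p = 1 variable
x <- rnorm(200, mean = 3, sd = 2)
n <- length(x)
p <- 1

# Assumed normal reference rule for h (see the lead-in above)
h <- (4 / (n * (p + 2)))^(1 / (p + 4))

# Bandwidth of the form h * S, with S the standard deviation of the sample
bw <- h * sd(x)

# Gaussian kernel estimate: average of Gaussian densities centred at the data
kde <- function(t) sapply(t, function(ti) mean(dnorm(ti, mean = x, sd = bw)))

# The estimate is a probability density: it integrates to (about) 1
total <- integrate(kde, min(x) - 10 * bw, max(x) + 10 * bw)$value
total
```

The same estimate is obtained, on a grid, with density(x, bw = bw, kernel = "gaussian") from the stats package.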

The distance or dissimilarity between the estimated densities is either the L^2 distance, the Hellinger distance, Jeffreys measure (symmetrised Kullback-Leibler divergence) or the Wasserstein distance.

  • If it is the L^2 distance (distance="l2" or distance="l2norm"), the densities can be either parametrically estimated or estimated using the Gaussian kernel.

  • If it is the Hellinger distance (distance="hellinger"), Jeffreys measure (distance="jeffreys") or the Wasserstein distance (distance="wasserstein"), the densities are considered Gaussian and necessarily parametrically estimated.
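
For two univariate Gaussian densities, all four distances have closed forms, which is why no kernel estimate is needed in that case. The sketch below writes them in base R and checks them by numerical integration. The helper names are hypothetical and the conventions are assumptions: the Hellinger (Matusita) distance is taken as the square root of the integral of (sqrt(f1) - sqrt(f2))^2, and the Wasserstein distance is the 2-Wasserstein distance. The package's own normalisations may differ.

```r
# Distances between N(m1, s1^2) and N(m2, s2^2); s1, s2 are standard deviations.
# These helpers are hypothetical; they are not functions of the dad package.

jeffreys_gauss <- function(m1, s1, m2, s2) {
  0.5 * (s1^2 / s2^2 + s2^2 / s1^2 - 2) +
    0.5 * (m1 - m2)^2 * (1 / s1^2 + 1 / s2^2)
}

hellinger_gauss <- function(m1, s1, m2, s2) {
  # Integral of sqrt(f1 * f2) (Bhattacharyya coefficient)
  bc <- sqrt(2 * s1 * s2 / (s1^2 + s2^2)) *
    exp(-(m1 - m2)^2 / (4 * (s1^2 + s2^2)))
  sqrt(2 - 2 * bc)
}

wasserstein_gauss <- function(m1, s1, m2, s2) {
  sqrt((m1 - m2)^2 + (s1 - s2)^2)
}

l2_gauss <- function(m1, s1, m2, s2) {
  # Integral of f1 * f2 is a Gaussian density evaluated at m1 - m2
  cross <- dnorm(m1 - m2, mean = 0, sd = sqrt(s1^2 + s2^2))
  sqrt(1 / (2 * sqrt(pi) * s1) + 1 / (2 * sqrt(pi) * s2) - 2 * cross)
}

# Numerical check on one pair of densities
f1 <- function(x) dnorm(x, 0, 1)
f2 <- function(x) dnorm(x, 1, 2)
l2_num <- sqrt(integrate(function(x) (f1(x) - f2(x))^2, -Inf, Inf)$value)
hel_num <- sqrt(integrate(function(x) (sqrt(f1(x)) - sqrt(f2(x)))^2,
                          -Inf, Inf)$value)
jef_num <- integrate(function(x) {
  (f1(x) - f2(x)) * (dnorm(x, 0, 1, log = TRUE) - dnorm(x, 1, 2, log = TRUE))
}, -Inf, Inf)$value

all.equal(l2_gauss(0, 1, 1, 2), l2_num, tolerance = 1e-4)
all.equal(hellinger_gauss(0, 1, 1, 2), hel_num, tolerance = 1e-4)
all.equal(jeffreys_gauss(0, 1, 1, 2), jef_num, tolerance = 1e-4)
```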

Value

Returns an object of class fhclustd, that is a list including:

distances

matrix of the distances (as selected by the distance argument) between the estimated densities.

clust

an object of class hclust.

Author(s)

Rachid Boumaza, Pierre Santagostini, Smail Yousfi, Gilles Hunault, Sabine Demotes-Mainard

See Also

fdiscd.predict, fdiscd.misclass

Examples

data(castles.dated)
stones <- castles.dated$stones
periods <- castles.dated$periods

periods123 <- periods[periods$period %in% 1:3, "castle"]
stones123 <- stones[stones$castle %in% periods123, ]
stones123$castle <- as.factor(as.character(stones123$castle))
yf <- as.folder(stones123)


# Jeffreys measure (default):

resultjef <- fhclustd(yf)
print(resultjef)
print(resultjef, dist.print = TRUE)
plot(resultjef)
plot(resultjef, hang = -1)

# Use cutree (stats package) to get the partition
cutree(resultjef$clust, k = 1:4)
cutree(resultjef$clust, k = 5)
cutree(resultjef$clust, h = 0.041)


# Applied to a data frame (Jeffreys measure):

resultdf <- fhclustd(stones123, group.name = "castle")
print(resultdf)

# Use cutree (stats package) to get the partition
cutree(resultdf$clust, k = 1:4)
cutree(resultdf$clust, k = 5)
cutree(resultdf$clust, h = 0.041)


# Hellinger distance:

resulthel <- fhclustd(yf, distance = "hellinger")
print(resulthel)
print(resulthel, dist.print = TRUE)
plot(resulthel)
plot(resulthel, hang = -1)

# Use cutree (stats package) to get the partition
cutree(resulthel$clust, k = 1:4)
cutree(resulthel$clust, k = 5)
cutree(resulthel$clust, h = 0.041)


## Not run: 
# L2-distance:

xf <- as.folder(stones)
result <- fhclustd(xf, distance = "l2")
print(result)
print(result, dist.print = TRUE)
plot(result)
plot(result, hang = -1)

# Use cutree (stats package) to get the partition
cutree(result$clust, k = 1:5)
cutree(result$clust, k = 5)
cutree(result$clust, h = 0.18)

## End(Not run)

# L2-distance, restricted to the castles of periods 1 to 3:

periods123 <- periods[periods$period %in% 1:3, "castle"]
stones123 <- stones[stones$castle %in% periods123, ]
stones123$castle <- as.factor(as.character(stones123$castle))
yf <- as.folder(stones123)
result123 <- fhclustd(yf, distance = "l2")
print(result123)
print(result123, dist.print = TRUE)
plot(result123)
plot(result123, hang = -1)

# Use cutree (stats package) to get the partition
cutree(result123$clust, k = 1:4)
cutree(result123$clust, k = 5)
cutree(result123$clust, h = 0.041)

dad documentation built on Aug. 30, 2023, 5:06 p.m.