hclustdd: Hierarchic cluster analysis of discrete probability...
In dad: Three-Way / Multigroup Data Analysis Through Densities

hclustdd

R Documentation

Hierarchic cluster analysis of discrete probability distributions

Description

Performs functional hierarchic cluster analysis of discrete probability distributions. It returns an object of class hclustdd. It applies hclust to the distance matrix between the T distributions.

Usage

hclustdd(xf, group.name = "group", distance = c("l1", "l2", "chisqsym", "hellinger",
             "jeffreys", "jensen", "lp"), 
             sub.title = "", filename = NULL,
             method.hclust = "complete")

Arguments

`xf`	object of class `folder`, data frame, or list of arrays (or tables). If it is a folder, its elements are data frames with `q` columns (considered as factors); the `t^{th}` element (`t = 1, \ldots, T`) corresponds to the `t^{th}` group. If it is a data frame, the column whose name is given by the `group.name` argument is a factor giving the groups, and all the other columns are considered as factors. If it is a list of arrays (or tables), the `t^{th}` element (`t = 1, \ldots, T`) is the table of the joint frequency distribution of `q` variables within the `t^{th}` group, expressed with relative or absolute frequencies. These arrays have the same shape, that is, each array (or table) `xf[[i]]` has (1) the same dimension(s): if `q = 1` (univariate), `dim(xf[[i]])` is an integer; if `q > 1` (multivariate), `dim(xf[[i]])` is an integer vector of length `q`; and (2) the same dimension names `dimnames(xf[[i]])`, which must be non-`NULL`. These dimnames are the names of the variables. The elements of the arrays must be non-negative numbers (otherwise an error is raised).
`group.name`	string. Name of the grouping variable. Default: `group.name = "group"`.
`distance`	The distance or divergence used to compute the distance matrix between the discrete distributions (see Details). It can be: `"l1"` (default), the `L^p` distance with `p = 1`; `"l2"`, the `L^p` distance with `p = 2`; `"chisqsym"`, the symmetric Chi-squared distance; `"hellinger"`, the Hellinger metric (Matusita distance); `"jeffreys"`, the Jeffreys distance (symmetrised Kullback-Leibler divergence); `"jensen"`, the Jensen-Shannon distance; `"lp"`, the `L^p` distance with `p` given by the argument `p` of the function.
`sub.title`	string. If provided, the subtitle for the graphs.
`filename`	string. Name of the file in which the results are saved. By default (`filename = NULL`) the results are not saved.
`method.hclust`	the agglomeration method to be used for the clustering. See the `method` argument of the `hclust` function.
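
As an illustration of the list-of-arrays form of `xf` described above, here is a minimal base R sketch. The object names (`lev`, `make_table`, `xl_toy`) and the counts are hypothetical, chosen only for the example. The checks at the end restate the documented requirements (same dimensions, same non-`NULL` dimnames, non-negative entries); they are not the package's own validation code.

```r
# Hypothetical toy data: joint counts of two factors (colour x size)
# observed in three groups. All names and numbers are made up.
lev <- list(colour = c("red", "white"),
            size   = c("small", "medium", "large"))

# Build one 2 x 3 frequency table with the shared dimnames
make_table <- function(counts) {
  array(counts, dim = c(2, 3), dimnames = lev)
}

xl_toy <- list(
  g1 = make_table(c(5, 1, 3, 2, 0, 4)),
  g2 = make_table(c(2, 2, 2, 2, 2, 2)),
  g3 = make_table(c(0, 6, 1, 5, 1, 2))
)

# Checks mirroring the requirements stated for `xf`
ref <- xl_toy[[1]]
same_dim     <- all(sapply(xl_toy, function(a) identical(dim(a), dim(ref))))
same_names   <- all(sapply(xl_toy, function(a) identical(dimnames(a), dimnames(ref))))
non_negative <- all(sapply(xl_toy, function(a) all(a >= 0)))
stopifnot(same_dim, same_names, non_negative)
```

A list built this way has the shape described for the list-of-arrays case, with absolute frequencies as entries.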

Details

In order to compute the distances/dissimilarities between the groups, the T probability distributions f_t corresponding to the T groups of individuals are estimated from observations. Then the distances/dissimilarities between the estimated distributions are computed, using the distance or divergence defined by the distance argument:

If the distance is "l1", "l2" or "lp", the distances are computed by the function matddlppar. Otherwise, they are computed by matddchisqsympar ("chisqsym"), matddhellingerpar ("hellinger"), matddjeffreyspar ("jeffreys") or matddjensenpar ("jensen").
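
The two steps described above (estimate the distributions, then cluster on their pairwise distances) can be sketched with base R alone. This is an illustration only, not the package's implementation: it uses the textbook `L^1` distance (sum of absolute differences between probabilities) via `dist(..., method = "manhattan")`, which may differ in detail from what `matddlppar` computes, and the counts are hypothetical.

```r
# Hypothetical counts of one factor with levels a, b, c in four groups
counts <- rbind(
  g1 = c(a = 10, b = 5, c = 5),
  g2 = c(a = 9,  b = 6, c = 5),
  g3 = c(a = 2,  b = 2, c = 16),
  g4 = c(a = 1,  b = 3, c = 16)
)

# Step 1: estimate each group's distribution by its relative frequencies
probs <- prop.table(counts, margin = 1)

# Step 2: pairwise L^1 distances between the estimated distributions
# (assumption: plain sum of absolute differences)
d_l1 <- dist(probs, method = "manhattan")

# Step 3: agglomerative clustering on the distance matrix,
# with the same default linkage as method.hclust = "complete"
hc <- hclust(d_l1, method = "complete")

# Cut the tree into two clusters to inspect the grouping
groups <- cutree(hc, k = 2)
print(groups)
```

In this toy data, `g1` and `g2` have similar distributions, as do `g3` and `g4`, so the two-cluster cut separates those pairs.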

Value

Returns an object of class hclustdd, that is a list including:

`distances`	matrix of the distances between the estimated distributions, computed with the distance or divergence given by the `distance` argument.
`clust`	an object of class `hclust`.
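
Because `clust` is an ordinary `hclust` object, the usual base R tools should apply to it. The sketch below uses a stand-in list, `res`, built by hand from a hypothetical distance matrix so that it runs without the package; a real result returned by `hclustdd` would take the place of `res`.

```r
# Hypothetical distance matrix between four groups (made-up values)
lab <- paste0("g", 1:4)
m <- matrix(c(0.0, 0.1, 1.1, 1.0,
              0.1, 0.0, 1.0, 1.1,
              1.1, 1.0, 0.0, 0.1,
              1.0, 1.1, 0.1, 0.0),
            nrow = 4, dimnames = list(lab, lab))
d <- as.dist(m)

# Stand-in with the two documented components (not a real hclustdd object)
res <- list(distances = as.matrix(d),
            clust     = hclust(d, method = "complete"))

# Cluster membership for a two-cluster cut of the tree
membership <- cutree(res$clust, k = 2)

# Heights at which successive merges happen
merge_heights <- res$clust$height

print(membership)
print(merge_heights)
```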

Author(s)

Rachid Boumaza, Pierre Santagostini, Smail Yousfi, Gilles Hunault, Sabine Demotes-Mainard

Examples

# Example 1 with a folder (10 groups) of 3 factors 
# obtained by converting numeric variables 
data(roses)
xr = roses[,c("Sha", "Den", "Sym", "rose")]
xr = cut(xr, breaks = list(c(0, 5, 7, 10), c(0, 4, 6, 10), c(0, 6, 8, 10)))
xf = as.folder(xr, groups = "rose")
af = hclustdd(xf)
print(af)
print(af, dist.print = TRUE)
plot(af)
plot(af, hang = -1)

# Example 2 with a data frame obtained by converting numeric variables
ar = hclustdd(xr, group.name = "rose")
print(ar)
print(ar, dist.print = TRUE)
plot(ar)
plot(ar, hang = -1)

# Example 3 with a list of 7 arrays
data(dspg)
xl = dspg
hclustdd(xl)

dad documentation built on April 12, 2025, 1:49 a.m.