Fast Hierarchical Clustering in Arbitrary Spaces Equipped With a Dissimilarity Measure
Description
An implementation of the fast, outlier resistant Genie algorithm described in (Gagolewski, Bartoszuk, Cena, 2016).
Usage
1 2 
Arguments
d 
an object of class 
objects 

thresholdGini 
single numeric value in [0,1], threshold for the Gini index, 1 gives the standard single linkage algorithm 
useVpTree 
single logical value, whether to use a vantagepoint tree to speed up nearest neighbor searching in lowdimensional spaces 
... 
internal tuning parameters 
Details
The time needed to apply a hierarchical clustering algorithm is most often dominated by the number of computations of a pairwise dissimilarity measure. Such a constraint, for larger data sets, puts at a disadvantage the use of all the classical linkage criteria but the single linkage one. However, it is known that the single linkage clustering algorithm is very sensitive to outliers, produces highly skewed dendrograms, and therefore usually does not reflect the true underlying data structure – unless the clusters are wellseparated.
To overcome its limitations, in (Gagolewski, Bartoszuk, Cena, 2016) we proposed a new hierarchical clustering linkage criterion. Namely, our algorithm links two clusters in such a way that a chosen economic inequity measure (here, the Gini index) of the cluster sizes does not increase drastically above a given threshold. The benchmarks indicate a high practical usefulness of the introduced method: it most often outperforms the Ward or average linkage in terms of the clustering quality while retaining the single linkage speed. The algorithm can be run in parallel (via OpenMP) on multiple threads to speed up its execution further on. Its memory overhead is small: there is no need to precompute the complete distance matrix to perform the computations in order to obtain a desired clustering.
For compatibility with hclust
, d
may be an object
of class dist
. In such a case, the objects
argument is ignored. Note that such an object requires ca. 8n(n1)/2
bytes of computer's memory, where n is the number of objects to cluster,
and therefore this setting can be used to analyse data sets of sizes
up to about 10,00050,000.
If objects
is a character vector or a list, then d
should be a single string, one of: levenshtein
(or NULL
),
hamming
, dinu
(Dinu, Sgarro, 2006),
or euclinf
(Cena et al., 2015).
Note that the list must consist
either of integer or of numeric vectors only (depending on the dissimilarity
measure of choice). On the other hand, each string must be in ASCII,
but you can always convert it to UTF32 with
stri_enc_toutf32
.
Otherwise, if objects
is a numeric matrix (here, each row
denotes a distinct observation), then d
should be
a single string, one of: euclidean_squared
(or NULL
),
euclidean
(which yields the same results as euclidean_squared
)
manhattan
, maximum
, or hamming
.
If useVpTree
is FALSE
, then the dissimilarity measure
of choice is guaranteed to be computed for each unique pair of objects
only once.
Value
A named list of class hclust
, see hclust
,
with additional components:

stats
 performance statistics 
control
 internal tuning parameters used
References
Cena A., Gagolewski M., Mesiar R., Problems and challenges of information resources producers' clustering, Journal of Informetrics 9(2), 2015, pp. 273284.
Dinu L.P., Sgarro A., A Lowcomplexity Distance for DNA Strings, Fundamenta Informaticae 73(3), 2006, pp. 361372.
Gagolewski M., Bartoszuk M., Cena A., Genie: A new, fast, and outlierresistant hierarchical clustering algorithm, Information Sciences, 2016, doi:10.1016/j.ins.2016.05.003.
Examples
1 2 3 4 