merge_clusters: Iteratively merges clusters in a way that improves predictive...


Description

merge_clusters takes a clustering solution, generates all possible pairwise combinations of clusters, fits models to each combination, and merges the pair with the lowest delta AIC. The process is repeated iteratively.

Usage

merge_clusters(data, clustering, family, n.iter = NULL, K = 1,
  quietly = FALSE)

Arguments

data

a data frame (or object that can be coerced by as.data.frame) containing the "raw" multivariate data. This is not necessarily the data used by the clustering algorithm; it is the data on which you are testing the predictive ability of the clustering solution.

clustering

an initial clustering solution (to be iteratively merged) for data, that is, a vector of cluster labels (that can be coerced by as.factor). The number of cluster labels must match the number of rows of the object supplied in the data argument. The solution could, for example, come from a call to cutree; see Examples.

family

a character string denoting the error distribution to be used for model fitting. The options are similar to those in family, but are more limited - see Details.

n.iter

the number of merging iterations to perform. By default, merging continues down to 2 clusters.

K

number of trials in binomial regression. By default, K=1 for presence-absence data (with cloglog link)

quietly

suppress messages during merging procedure

Details

merge_clusters is built on the premise that a good clustering solution (i.e. a classification) should provide information about the composition and abundance of the multivariate data it is classifying. A natural way to formalize this is with a predictive model, where group membership (clusters) is the predictor, and the multivariate data (site by variables matrix) is the response. merge_clusters fits linear models to each pairwise combination of a given set of clusters, and calculates their delta sum-of-AIC (that is, the sum-of-AIC relative to the corresponding null model). The pair with the smallest delta AIC is taken to be the most similar, so it is merged, and the process is repeated. Lyons et al. (2016) provide background, a detailed description of the methodology, and an application of delta AIC to both real and simulated ecological multivariate abundance data.
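The merging step described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: it uses per-column Poisson GLMs from stats::glm instead of manyglm, the helper names pair_delta_aic and merge_once are hypothetical, the sign convention (null AIC minus cluster-model AIC) is assumed, and the data are simulated.

```r
# Simulated count data: clusters 1 and 2 share the same means, cluster 3 differs.
set.seed(1)
clustering <- rep(1:3, each = 20)
lambda <- c(5, 5, 50)[clustering]
toy <- data.frame(sp1 = rpois(60, lambda),
                  sp2 = rpois(60, lambda),
                  sp3 = rpois(60, lambda))

# Hypothetical helper: delta sum-of-AIC for one pair of clusters.
# For each column, compare an intercept-only (null) model with a model
# that uses cluster membership as the predictor, then sum over columns.
pair_delta_aic <- function(data, clustering, pair) {
  keep <- clustering %in% pair
  grp <- factor(clustering[keep])
  sub <- data[keep, , drop = FALSE]
  delta <- vapply(sub, function(y) {
    null_fit <- glm(y ~ 1, family = poisson())
    full_fit <- glm(y ~ grp, family = poisson())
    AIC(null_fit) - AIC(full_fit)  # assumed sign convention
  }, numeric(1))
  sum(delta)
}

# Hypothetical helper: one merge iteration. Evaluates every pair of
# clusters and relabels the pair with the smallest delta sum-of-AIC.
merge_once <- function(data, clustering) {
  pairs <- combn(sort(unique(clustering)), 2)
  deltas <- apply(pairs, 2, function(p) pair_delta_aic(data, clustering, p))
  best <- pairs[, which.min(deltas)]
  clustering[clustering == best[2]] <- best[1]
  clustering
}

merged <- merge_once(toy, clustering)
table(original = clustering, merged = merged)
```

Calling merge_once repeatedly on its own output would mimic the iterative behaviour; the real function additionally records the solution at each iteration.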

At present, merge_clusters supports the following error distributions for model fitting:

Gaussian LMs should be used for 'normal' data.

Negative Binomial and Poisson GLMs should be used for count data.

Binomial GLMs should be used for binary and presence/absence data (when K=1), or trials data (e.g. frequency scores). If Binomial regression is being used with K>1, then data should be numerical values between 0 and 1, interpreted as the proportion of successful cases, where the total number of cases is given by K (see Details in family).

Ordinal regression should be used for ordinal data, for example, cover-abundance scores. For ordinal regression, data should be supplied as either 1) factors, with the appropriate ordinal level order specified (see levels) or 2) numeric, which will be coerced into a factor with levels ordered in numerical order (e.g. cover-abundance/numeric response scores).

LMs fit via manylm; GLMs fit via manyglm; proportional odds model fit via clm.
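As a rough illustration of the data formats described above for Binomial regression with K>1 and for ordinal regression, the following base R sketch prepares both kinds of response. All values and object names are made up for illustration, and the glm call (with the default logit link and the number of trials passed as weights) is only a stand-in showing how proportions out of K are interpreted; it is not how merge_clusters fits its models.

```r
## Binomial with K > 1: frequency scores out of K trials become proportions
K <- 5
freq <- c(0, 1, 3, 5, 2, 4)      # hypothetical successes out of K
prop <- freq / K                 # values between 0 and 1
grp <- factor(c("a", "a", "a", "b", "b", "b"))

# Stand-in fit: proportions as the response, number of trials as weights
fit <- glm(prop ~ grp, family = binomial(), weights = rep(K, length(prop)))

## Ordinal data, option 1: a factor with the level order given explicitly
cover_chr <- factor(c("rare", "abundant", "common", "rare"),
                    levels = c("rare", "common", "abundant"),
                    ordered = TRUE)

## Ordinal data, option 2: numeric scores, ordered numerically on coercion
cover_num <- factor(c(3, 1, 2, 1), ordered = TRUE)
levels(cover_num)
```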

Value

a list containing the clustering solution (vector) at each merge iteration. The object is of class dsumaic, and can be directly passed to find_optimal.

Attributes of the returned object are:

family

which error distribution was used for modelling, see Arguments

K

number of cases for Binomial regression, see Arguments

Author(s)

Mitchell Lyons

References

Lyons et al. 2016. Model-based assessment of ecological community classifications. Journal of Vegetation Science, 27 (4): 704–715.

See Also

find_optimal, get_characteristic, S3 print function for 'daic' class, S3 residual plotting function

Examples

## Not run: 
## Prep the 'swamps' data
## ======================

library(optimus)
data(swamps) # see ?swamps
swamps <- swamps[,-1]

## Merge via AIC and compare to hclust hierarchy
## =============================================

## perhaps not the best clustering option, but this is base R
swamps_hclust <- hclust(d = dist(x = log1p(swamps), method = "canberra"),
                       method = "complete")

## generate iteratively merged clustering solutions, based on sum-of-AIC
clustering_aicmerge <- merge_clusters(swamps, cutree(tree = swamps_hclust, k = 30),
                                      family = "poisson", n.iter = 20)

## compare to hclust hierarchy
optimal_aicmerge <- find_optimal(data = swamps, clustering = clustering_aicmerge,
                                 family = "poisson")

optimal_hclust <- find_optimal(data = swamps, clustering = swamps_hclust,
                               family = "poisson", cutreeLevels = 10:30)

plot(optimal_aicmerge)
points(optimal_hclust, col = "red", pch = 16)

## End(Not run)

optimus documentation built on May 2, 2019, 12:07 p.m.