get_characteristic takes a clustering solution, fits models based on the underlying multivariate data, and determines 'important' variables for the clustering solution. In ecology, particularly vegetation science, this is the process of determining the characteristic (or diagnostic/indicator) species of a classification.
data: a data frame (or object that can be coerced by
clustering: a clustering solution for
family: a character string denoting the error distribution to be used for model fitting. The options are similar to those in
type: a character string, one of
signif: logical, denoting whether significance should be returned also when
K: number of trials in binomial regression. By default, K=1 for presence-absence data (with cloglog link).
get_characteristic is built on the premise that a good clustering solution (i.e. a classification) should provide information about the composition and abundance of the multivariate data it is classifying. A natural way to formalize this is with a predictive model, where group membership (clusters) is the predictor, and the multivariate data (site by variables matrix) is the response.
get_characteristic fits linear models to each variable. If type = "per.cluster", the coefficients corresponding to each level of the clustering solution for each variable are used to define the characteristic variables for each cluster level. If type = "global", characteristic variables are determined (via delta AIC; larger values = more important) for the overall classification. If signif = TRUE, delta AIC (that is, relative to the corresponding null model) and the coefficient standard errors are also returned with the per-cluster characteristic variables. We loosely define that the larger the coefficient (with larger delta AIC values and smaller standard errors guiding significance), the more characteristic that variable (species) is. Lyons et al. (2016) provide background, a detailed description of the methodology, and an application of delta AIC to both real and simulated ecological multivariate abundance data.
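The two modes described above can be illustrated with a minimal base-R sketch. This is not the package's implementation: it uses plain glm() on simulated data, and the toy species (sp1 to sp4), the cluster labels and the no-intercept parameterisation used to obtain one coefficient per cluster are all assumptions made for illustration.

```r
# Base-R sketch of the idea only; NOT how get_characteristic is implemented.
set.seed(42)

# Hypothetical toy data: 30 sites in 3 clusters, 4 species (counts)
n_per <- 10
cluster <- factor(rep(c("A", "B", "C"), each = n_per))
toy <- data.frame(
  sp1 = rpois(30, lambda = rep(c(20, 1, 1), each = n_per)),  # abundant in A
  sp2 = rpois(30, lambda = rep(c(1, 15, 1), each = n_per)),  # abundant in B
  sp3 = rpois(30, lambda = rep(c(2, 2, 12), each = n_per)),  # abundant in C
  sp4 = rpois(30, lambda = 5)                                # no cluster signal
)

# "global": delta AIC = AIC(null model) - AIC(cluster model), per species;
# larger values suggest the species is better explained by the classification
delta_aic <- sapply(toy, function(y) {
  AIC(glm(y ~ 1, family = poisson)) - AIC(glm(y ~ cluster, family = poisson))
})
global_char <- sort(delta_aic, decreasing = TRUE)

# "per.cluster": one coefficient per cluster level and species
# (no-intercept model, an assumption made here for readability)
coefs <- sapply(toy, function(y) {
  coef(glm(y ~ 0 + cluster, family = poisson))
})

# Rank species within each cluster by coefficient size (rows of coefs = clusters);
# the result has one column per cluster, most characteristic species first
per_cluster_char <- apply(coefs, 1, function(b) names(sort(b, decreasing = TRUE)))

global_char
per_cluster_char
```

In this toy setting the species simulated without any cluster signal (sp4) ends up at the bottom of the global ranking, while each cluster's dominant species heads its own column.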
get_characteristic supports the following error distributions for model fitting:
Gaussian (LM)
Negative Binomial (GLM with log link)
Poisson (GLM with log link)
Binomial (GLM with cloglog link for binary data, logit link otherwise)
Ordinal (Proportional odds model with logit link)
Gaussian LMs should be used for 'normal' data. Negative Binomial and Poisson GLMs should be used for count data. Binomial GLMs should be used for binary and presence/absence data (when K=1), or trials data (e.g. frequency scores). If Binomial regression is being used with K>1, data should be numerical values between 0 and 1, interpreted as the proportion of successful cases, where the total number of cases is given by K (see Details in family). Ordinal regression should be used for ordinal data, for example, cover-abundance scores. For ordinal regression, data should be supplied as either 1) factors, with the appropriate ordinal level order specified (see levels) or 2) numeric, which will be coerced into a factor with levels ordered in numerical order (e.g. cover-abundance/numeric response scores). LMs fit via manylm; GLMs fit via manyglm; proportional odds model fit via
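The data formats described above can be sketched in base R. This is an illustration under stated assumptions rather than the package's internals: it uses glm() instead of manyglm, passes the number of trials as weights (the convention described in the Details of ?family), and all variable names and simulated values are hypothetical.

```r
# Base-R sketch of the expected data formats; NOT the package's internal code.
set.seed(1)
cluster <- factor(rep(c("A", "B"), each = 20))

# Presence/absence data (K = 1): binomial GLM with cloglog link
pa <- rbinom(40, size = 1, prob = rep(c(0.7, 0.3), each = 20))
fit_pa <- glm(pa ~ cluster, family = binomial(link = "cloglog"))

# Trials data (K > 1): the response is a proportion in [0, 1],
# and the number of trials K enters the model as weights
K <- 5
freq <- rbinom(40, size = K, prob = rep(c(0.7, 0.3), each = 20))
prop <- freq / K
fit_trials <- glm(prop ~ cluster, family = binomial(link = "logit"),
                  weights = rep(K, length(prop)))

# Ordinal data given as numeric scores: coercing to an ordered factor
# puts the levels in numerical order
scores <- c(3, 1, 2, 5, 1, 4)
scores_ord <- factor(scores, ordered = TRUE)
levels(scores_ord)
```

If ordinal scores are stored as text labels rather than numbers, the level order would need to be given explicitly through the levels argument of factor().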
either a list of sorted characteristic variables for each cluster (of class perclustchar) or a data frame containing the delta AIC values for each variable (of class
If signif= is not "none", then the corresponding significance metrics are appended.
Attributes for the object are:
family: which error distribution was used for modelling, see Arguments
type: the type of characteristic variables calculated, see Arguments
K: number of cases for Binomial regression, see Arguments
Lyons et al. 2016. Model-based assessment of ecological community classifications. Journal of Vegetation Science, 27 (4): 704–715.
find_optimal, S3 for print 'top-n' variables for each cluster, S3 for residual plots (at some stage)
## Prep the 'swamps' data
## ======================

data(swamps) # see ?swamps
swamps <- swamps[,-1]

## Find characteristic species in a classification of the swamps data
## ==================================================================

## perhaps not the best clustering option, but this is base R
swamps_hclust <- hclust(d = dist(x = log1p(swamps), method = "canberra"),
                        method = "complete")

# calculate per cluster characteristic species
swamps_char <- get_characteristic(data = swamps,
                                  clustering = cutree(tree = swamps_hclust, k = 10),
                                  family = "poisson", type = "per.cluster")

# look at the top 10 characteristic species for cluster 1
# (the per-cluster result is a list with one element per cluster)
head(swamps_char[[1]], 10)

# calculate global characteristic species
swamps_char <- get_characteristic(data = swamps,
                                  clustering = cutree(tree = swamps_hclust, k = 10),
                                  family = "poisson", type = "global")

# top 10 characteristic species for the whole classification
head(swamps_char, 10)

## See vignette for more explanation than this example
## ============================================================