Introduction

Motivation: Why we need parametric model

General problem: find where species abundances are high vs. low in a way which leads to optimal classification by maximizing the contrast between the partitions.

Previous attempts: historical review, highlighting IndVal.

Issues with previous attempts:

Goals:

Why opticut?

The distributions of organisms on Earth show different levels of aggregation determined by climate, vegetation, species interactions, barriers to dispersal, and anthropogenic disturbances (cit. Humboldt, Whittaker, Rosenzweig). Species' responses to environmental factors thus allow predictions, i.e. identify regions or locations where a species can be found with higher probability, or in higher numbers. This knowledge of species-environment relationships has been at the core of human endeavor, e.g. knowing where to find food, medicinal plants, species used to make tools from or used as construction material, according to the archaeological (cit) record and studies in traditional ecological knowledge (cit). Species with strong environmental associations are often referred to as differential, character, or indicator species. These species are used to characterize certain habitats or vegetation types (Botta-Dukat), indicate naturalness or degradation of ecosystems (McGeoch & Chown), measure success of habitat restoration, alerting about critical community thresholds (TITAN), or indicate the presence of cryptic or rare species (Beals, Indpower, TWINSPAN).

There are three main types of algorithms that are used to quantify the degree of association for species. The contingency table based measures (e.g. the phi-coefficient, or Chi-square metric) that compare agreement in binary classifications based on some function of species' abundances and a priori classification of the environment (Chytry etc); Contingency table based methods quantify association as a correlation measure indicating the magnitude and sign (-1: maximum avoidance, 0: no association, +1: maximum preference). The major limitation of these correlation measures is that the use of binary classification for species observations is either based on presence/absence (detection/non-detection) data thus ignoring possibly available abundance information (e.g. >1 counts), or is based on arbitrary thresholds when binarizing the abundance data (Tsiripidis, Tichy). Uncertainty in the strength of associations can be expressed based on large sample asymptotics or randomization tests, comparing against the null hypothesis of no association (Agresti, Count data book).

The analysis of variance (ANOVA) based measures compare between and within group variance (F-ratio) in species abundance given an a priori classification (Jancey, Wildi). The F-ratio is used to rank species based on the degree of associations, but it is not explicitly testing the sign of the associations. Uncertainty in the strength of associations is commonly expressed using an F-test and corresponding p-value testing the null hypothesis of equal abundance. The parametric assumptions of the ANOVA imply normality and homoskedastic errors, which might not always be satisfied in most field situations (e.g. using 0/1, biomass, or % cover data).

The third and most widely adopted approach is the IndVal method that quantifies the concentration of species occurrence and abundances given an a-priory classification (Dufrene & Legendre). The IndVal index combines the species' abundance and occurrence information into a single index which reflects the magnitude of positive environment-associations (0-1). Uncertainty in the strength of associations can be quantified based on bootstrap (DeCaceres), but this approach is not testing against the null hypothesis of equal expected abundance within the partitions, because the method is non-parametric. The p-value for the null hypothesis is based on permutation tests. Randomization is used to derive the p-value is based on randomly placing samples or individuals, and this randomization might not always be meaningful for continuous input data (e.g. biomass, or % cover).

A common limitation of the available methods is that assumptions about the distribution (Binomial, Normal) or type (0/1, counts) of the species data are too restrictive. As a result, ecologists need to adjust the input data (binarization, rounding to integers) to meet the needs of the analysis options. However, non-count or non-normal data are commonplace, e.g. vegetation studies measuring the response on ordinal scale, % cover, or biomass. Besides, none of these approaches are designed to deal with some other aspects of field data, for example confounding variables, sample selection bias (presence-only data), sampling effort differences, or imperfect detection.

In this paper we introduce a general and extensible likelihood-based framework for indicator species analysis, that we call the opticut approach. The opticut approach provides a solution to the limitations of currently available and used options as listed above. We compare the power of traditional approaches and opticut to identify an indicator species when there is true indication in terms of abundance differences among partitions using simulations. We also show how to assess uncertainty in the strength of association, and also uncertainty in classification based on resampling, thus introducing a wider set of tools for statistical inference. Finally, we illustrate the breadth of situations where the use of opticut might be advantageous using case studies. We also provide the opticut R extension package that implements computationally efficient algorithm for finding indicator species, and tools for exploring and visualizing the results.

Theory

logLR, I, all/rank based combinations, uncertainty

Theory

The quest for optimal binary partitioning

$Y_{i}$'s are observations for a single species from $n$ locations ($i = 1, ..., n$). $g_{i}$'s are known discrete descriptors of the locations with $K$ levels ($K > 2$). $z^{(m)}$ is a binary reclassification of $g$ taking values (0, 1). The superscript $m = 1, ..., M$ indicates a possible combination of binary reclassification out of the total $M = 2^{K-1} - 1$ total combinations (excluding complements). See below for options for defining binary partitions. There can also be other site descriptors denoted as $x_{ij}$ taking discrete or continuous values ($j = 1, ..., p$; number of predictors).

A suitable parametric model describes the relationship between the observations and the site descriptors through the probability density function $P(Y_{i} = y_{i} \mid z_{i}^{(m)}, x_{ij}, \theta)$ where $\theta$ is the vector of model parameters: $\theta = (\beta_{0}, \beta_{1}, \alpha_{1}, ..., \alpha_{p})$. The choice of the parametric model depends on the nature of the observations. It can be Gaussian, Binomial, Poisson, ordinal, Beta regression, or zero-inflated models, with a suitable link function ($f$) for the mean: $f(\eta_{i}) = \beta_{0}^{(m)} + \beta_{1}^{(m)} z_{i}^{(m)} + \sum_{j=1}^{p} \alpha_{j}^{(m)} x_{ij}$.

$\widehat{\theta^{(m)}}$ is the maximum likelihood estimate (MLE) of the model parameters given the data and classification $m$, with corresponding log-likelihood value $l(\widehat{\theta^{(m)}}; y)$. Finding MLEs for all $M$ candidate binary partitions leads to a set of log-likelihood values. One can compare the log-likelihood values to a null model (no binary partition is necessary) where $\beta_{1} = 0$ leading to the MLE $\widehat{\theta^{(0)}}$ and corresponding log-likelihood value for the null model: $l(\widehat{\theta^{(0)}}; y)$.

The log-likelihood ratio for each candidate partition can be calculated as $l(\widehat{\theta^{(m)}}; y) - l(\widehat{\theta^{(0)}}; y)$. The best supported binary partition is the model with the highest log-likelihood ratio value.

One way of calculating the indicator value for each candidate partition is based on expected values using the inverse link function as $\mu_{0}^{(m)} = f^{-1}(\beta_{0}^{(m)})$ and $\mu_{1}^{(m)} = f^{-1}(\beta_{0}^{(m)} + \beta_{1}^{(m)})$. $I = 1 - min(\mu_{0}^{(m)}, \mu_{1}^{(m)}) / max(\mu_{0}^{(m)}, \mu_{1}^{(m)})$. Where $\mu_{0}^{(m)} = E[Y_{i} \mid z_{i}^{(m)}=0, x_{ij}=0]$ and $\mu_{1}^{(m)} = E[Y_{i} \mid z_{i}^{(m)}=1, x_{ij}=0]$ are expected values for the observations given the binary partition $z_{i}^{(m)}$ and at 0 value for all $x_{ij}$. This approach can be sensitive to the range of values supported by the link function. For example it works nicely with logarithmic or logistic link function where non-negativity of predicted values is ensured by definition. This is, however, not the case for the identity link in the Gaussian case, when negative values can invaludate the indicator value calculations as described above. (This usually happens when confounding variables are not centered and the intercept then reflects that difference as part of the baseline.)

As an alternative, one can use the estimate $\beta_{1}^{(m)}$ itself to express the contrast between the two strata. This also makes the index more comparable when different link functions are used. We used the hyperbolic tangent function (or inverse Fisher's $z$ transform) to scale the real valued $\beta_{1}^{(m)}$ into the unit range (0-1): $I = tanh(\mid \beta_{1}^{(m)} \mid) = \frac{exp(2 \mid \beta_{1}^{(m)} \mid) - 1}{exp(2 \mid \beta_{1}^{(m)} \mid) + 1}$. Positive and negative cases are taken as absolute values, so that the index reflects only the contrast between strata, and not the direction of it. Negative value can happen when using all combinations.

Finding all possible binary partitions

Finding all combinations does not require a model or observed responses. It only takes a classification vector with $K > 1$ partitions.

kComb returns a 'contrast' matrix corresponding to all possible binary partitions of the factor with K levels. Complements are not counted twice, i.e. (0,0,1,1) is equivalent to (1,1,0,0). The number of such possible combinations is $M = 2^{K-1} - 1$.

allComb this takes a classification vector with at least 2 levels and returns a model matrix with binary partitions. checkComb checks if combinations are unique and non-complementary (misfits are returned as attributes).

Rank based partitions

Blindly fitting a model to all possible partitions is wasteful use of resources. Instead, one can rank the $K$ partitions based on expected response values ($\mu_{1}, ..., \mu_{k}, ..., \mu_{K}$, where $\mu_{k}=E[Y_{i} \mid g_{i}=k, x_{ij}=0]$). This way we have to explore only $K-1$ partitions:

oComb return the 'contrast' matrix based on the rank vector as input. Rank 1 means lowest expected value among the partitions.

The function rankComb fits the model with multiple ($K > 2$) factor levels to find out the ranking, and returns a binary classification matrix similarly to allComb:

Note that the ranking varies from species to species, thus it is not possible to supply the resulting matrix as strata definition:

There is an overhead of fitting the model to calculate the ranking. But computing efficiencies can be still high compared to all partitions when the number of levels ($k$) is high.

A downside of this approach is that not all possible partitions are explored, thus the model weights do not represent all possible models, but only the top candidates. Thus model weight interpretation is different (i.e. cannot be used as a reliability matric, especially when support for the best model is not dominant).

Software implementation

Install, HPC, options

Distributions

Currently available distributions:

Special cases: ordered, RS(P)F

Extensions: N-mix, GAM, LMM/GLMM, etc.

Example workflow

summaries indices plots etc

Uncertainty

xxx

Extensibility

N-mix, mixed effects models

Conclusions

xxx

References

McGeoch MA and Chown SL (1998) Scaling up the value of bioindicators. Trends in Ecology and Evolution 13: 46--47.

Podani J and Csanyi B (2010) Detecting indicator species: some extensions of the INDVAL measure. Ecological Indicators 10: 1119-1124.

Hill MO (1979) TWINSPAN - A FORTRAN Program for Arranging Multivariate Data in an Ordered Two-way Table by Classification of the Individuals and Attributes. Ithaca, New York: Section of Ecology and Systematics, CornellUniversity.

Dufrene M and Legendre P (1997) Species assemblages and indicator species: The need for a flexible asymmetrical approach. Ecological Monographs 67: 345--366.

De Caceres M and Legendre P (2009) Associations between species and groups of sites: Indices and statistical inference. Ecology 90: 3566--3574.

De Caceres M, Legendre P, and Moretti M (2010) Improving indicator species analysis by combining groups of sites. Oikos 119: 1674--1684.

SANDER GREENLAND AND HAL MORGENSTERN 1989. Ecological Bias, Confounding, and Effect Modification. International Journal of Epidemiology, 18: 269--274.

Tsiripidis, Ioannis; Bergmeier, Erwin; Fotiadis, Georgios & Dimopoulos, Panayotis 2009. A new algorithm for the determination of differential taxa. Journal of Vegetation Science 20: 233--240.

Tichy, Lubomir & Chytry, Milan 2006. Statistical determination of diagnostic species for site groups of unequal size. Journal of Vegetation Science 17: 809--818.

Z. Botta-Dukat and A. Borhidi 1999. New objective method for calculating fidelity. Example: the Illyrian beechwoods. ANNALI DI BOTANICA LVII: 73--90.

Matthew E. Baker and Ryan S. King 2010 A new method for detecting and interpreting biodiversity and ecological community thresholds. Methods in Ecology and Evolution, 1, 25--37.

O. Wildi and E. Feldmeyer-Christe 2013. Indicator values (IndVal) mimic ranking by F-ratio in real-world vegetation data. COMMUNITY ECOLOGY 14(2): 139--143.

E. van der Maarel 1979. Transformation of Cover-Abundance Values in Phytosociology and Its Effects on Community Similarity. Vegetatio, Vol. 39: 97--114.

Chytry, Milan; Tichy, Lubomir; Holt, Jason & Botta-Dukat, Zoltan 2002. Determination of diagnostic species with statistical fidelity measures. Journal of Vegetation Science 13: 79--90.

KENNETH P. BURNHAM & DAVID R. ANDERSON 2004. Multimodel Inference - Understanding AIC and BIC in Model Selection. SOCIOLOGICAL METHODS & RESEARCH, 33: 261--304

Wildi, O. 1989. A new numerical solution to traditional phytosociological tabular classification. Vegetatio 81: 95-106.

McGeoch, M.A. and Chown, S.L.1998. Scaling up the value of bioindicators. Trends Ecol. Evol. 13: 46-47.

Jancey, R.C. 1979. Species ordering on a variance criterion. Vegetatio 39: 59--63.

library(opticut)
h <- as.factor(rep(letters[1:4], each=3))
y <- c(0,0,0, 0,0,1, 1,0,2, 3,2,1)
table(h, y)
ma <- opticut(y ~ 1, strata = h, comb = "all")
mr <- opticut(y ~ 1, strata = h, comb = "rank")
rc <- rankComb(y, matrix(1L, length(y), 1L), Z=h)

y
h
print(ma$species[[1]], cut=-Inf)
print(mr$species[[1]], cut=-Inf)
bestpart(ma)
bestpart(mr)
rc
round(attr(rc, "est"), 4)
allComb(h)

uc <- uncertainty(mr, type="multi", B=99)
uc$uncertainty


psolymos/opticut documentation built on Nov. 27, 2022, 11:29 a.m.