Introduction

Identifying and monitoring indicator species has long been considered a cost-effective way of tracking environmental change or the status of the biota. Examples include the characterization of vegetation types (Chytry et al. 2002), degradation of ecosystems (McGeoch & Chown 1998), or signalling cryptic or rare species (Halme et al. 2009). Across these examples, a key attribute of indicator species (also referred to as character or differential species) is that they have strong associations with the environmental variables they are supposed to indicate.

Approaches to quantifying the degree of environmental association for species (the indicator value) traditionally fall into three major types:

  1. contingency table based measures (De Caceres & Legendre 2009);
  2. analysis of variance (ANOVA; Wildi & Feldmeyer-Christe 2013); and
  3. the widely used non-parametric IndVal method (Dufrene & Legendre 1997).

While the different approaches have strong appeal and applications, they do not always meet the challenges presented by ecological data.

Ecological data come in different forms: binary, ordinal, count, abundance, or presence-only data. Some of these data types are suitable for a particular approach, while other formats need 'tweaking'. For example, binarizing abundance or count data for contingency tables leads to information loss. ANOVA, on the other hand, assumes normality and homoscedastic errors, which might not be satisfied by 0/1, ordinal, skewed, or percent cover data. Finally, the randomization test for the IndVal approach requires count data, which renders hypothesis testing difficult, if not impossible, for continuous or ordinal data.

Another staple of observational field studies is the presence of modifying or confounding variables, or of systematic biases (variable sampling effort, imperfect detectability, sample selection bias). Ignoring these effects can lead to erroneous indicator species analysis (Zettler et al. 2013). Controlling for them improves the assessment of species-environment relationships and thus leads to a better evaluation of indicator species.

To address these limitations, Kemencei et al. (2014) proposed a model-based indicator species analysis that accounted for the effects of modifying variables and for non-independence in the data due to a paired sampling design. This model-based approach has been generalized and made available in the opticut R extension package. The opticut package offers computationally efficient and extensible algorithms for finding indicator species, tools for exploring and visualizing the results, and for quantifying uncertainties. This manual showcases the functionality of the package.

Install

The opticut R package can be installed from the Comprehensive R Archive Network (CRAN) as:

install.packages("opticut")

Install development version from GitHub:

library(devtools)
install_github("psolymos/opticut")

User visible changes in the package are listed in the NEWS file.

Report a problem

Use the issue tracker to report a problem.

License

GPL-2

Loading the package

To get started, open R and load the opticut package as:

library(opticut)

Partitioning

Optimal partitioning (optimal cut, or in short: opticut) is found for each species independently of the others. We make observations ($y_i$) of possibly multiple species at $i=1,...,n$ sites. Now let us consider a discrete site descriptor ($g_i$) with $K$ levels or strata ($k=1,\ldots,K$; $K \geq 2$). This stratification might come from remotely sensed or other geospatial information, field measurements, or multivariate clustering. We can use $g_i$ to create $m=1,...,M$ possible binary partitions by coding one or more levels as 1s and the rest as 0s. We denote any such partition as $z^{(m)}$. The total number of binary partitions is $M=2^{K-1}-1$, not counting cases where the 0s or 1s are completely missing (the all-zero case corresponds to the null model). The opticut method, as opposed to, for example, the IndVal method, is invariant to the coding of 0s and 1s in $z^{(m)}$: complementary cases, such as $z^{(m)}$ and $1-z^{(m)}$, are treated as interchangeable. The opticut package provides utility functions to create and check binary partitions from multi-level vectors (kComb, allComb, checkComb).
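To get a quick sense of how the search space grows, here is a small sketch (plain base R, using only the formula above) of the number of candidate partitions $M$ for a few values of $K$:

## number of candidate binary partitions as a function of K
K <- 2:6
cbind(K = K, M = 2^(K - 1) - 1)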

Manipulating partitions

Finding all combinations does not require a model or observed responses. It only takes a classification vector with $K > 1$ strata. The kComb function returns a 'contrast' matrix corresponding to all possible binary partitions of the factor with $K$ levels:

kComb(k = 2)
kComb(k = 3)
kComb(k = 4)

allComb takes a classification vector with at least 2 levels and returns a model matrix with binary partitions. checkComb checks if combinations are unique and non-complementary (misfits are returned as attributes):

## finding all combinations
(f <- rep(LETTERS[1:4], each=2))
(mc <- allComb(f, collapse = "_"))
## checking for complementary entries
checkComb(mc) # TRUE
## adding complementary entries to the matrix
mc2 <- cbind(z = 1 - mc[,1], mc[,c(1:ncol(mc), 1)])
colnames(mc2) <- 1:ncol(mc2)
mc2
checkComb(mc2) # FALSE

Choosing a parametric model

A suitable parametric (or semi-parametric) model can be chosen to describe the relationship between the observations for a single species and the site descriptors. The choice of the parametric model depends on the nature of the observations and the goals of the study. The systematic component of the model (also called the linear predictor), $f(\mu_i)=\beta_0^{(m)}+\beta_1^{(m)} z_i^{(m)}+\sum_{j=1}^p \alpha_j^{(m)} x_{ij}$, is linked to the random component of the model through the link function $f$. The expected value is given by the inverse link function: $E[Y_i]=\mu_i=f^{-1}(\beta_0^{(m)}+\beta_1^{(m)} z_i^{(m)}+\sum_{j=1}^p \alpha_j^{(m)} x_{ij})$. Expected values can then be estimated for each partition. The symbol $x_{ij}$ denotes other site descriptors ($j=1,...,p$; number of predictors besides $g_i$) that can take discrete or continuous values. These variables might describe variation in the observations not fully explained by the partitions, e.g. due to spatially uneven distribution, differences in sampling effort, or environmental variables interfering with the observation process.
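To make the notation concrete, here is a minimal sketch (hypothetical coefficient values, base R only) of how the linear predictor and the inverse link produce expected values under a log link:

## hypothetical coefficients: intercept, partition contrast, modifier effect
b0 <- 0.3; b1 <- 1.2; a1 <- -0.5
z <- c(0, 0, 1, 1)            # binary partition z^(m)
x <- c(0.2, -0.1, 0.4, 0.0)   # a modifying variable x_ij
eta <- b0 + b1 * z + a1 * x   # linear predictor f(mu)
mu <- exp(eta)                # inverse log link gives E[Y]
cbind(z, x, eta, mu)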

We can estimate the parameters in the linear predictor (and possibly other "nuisance"" factors such as variance components in mixed effects models) and calculate expected values. The probability density function for the model $P(Y_i=y_i \mid z_i^{(m)},x_{ij},\theta)$ is used to find the maximum likelihood estimates (MLE) of the model parameters $\hat{\theta}^{(m)}=(\hat{\beta}_0^{(m)},\hat{\beta}_1^{(m)},\hat{\alpha}_1^{(m)},\ldots,\hat{\alpha}_p^{(m)})$ that jointly maximize the log-likelihood function. The log-likelihood function evaluated at the MLE is $l(\hat{\theta}^{(m)}; y)$.

The opticut package has several built-in distributions that can be specified through the dist argument. The distributions demonstrated in this manual include "gaussian", "poisson", "negbin", "binomial" (optionally with a link function, e.g. "binomial:cloglog"), "beta", the zero-inflated count models "zip", "zinb", "zip2", and "zinb2", and "rsf" for presence-only data.

Other distributions can be specified by user-defined functions as explained later in the manual.

All combinations

Fitting the model to all the $M$ candidate binary partitions leads to a set of log-likelihood values. One can compare the log-likelihood values $l(\hat{\theta}^{(m)}; y)$ to the log-likelihood value of the null model, $l(\hat{\theta}^{(0)}; y)$. We define the null model the same way as the other $M$ models but without the binary partition: $\beta_1^{(m)}=0$. The log of the likelihood ratio between the $M$ candidate models and the null model can be calculated as $l(\hat{\theta}^{(m)}; y)-l(\hat{\theta}^{(0)}; y)$.

The best-supported model $m'$ and the corresponding binary partition $z^{(m')}$ is the model with the highest log-likelihood ratio. Model weights are calculated as $w_m=\exp\{l(\hat{\theta}^{(m)}; y)-l(\hat{\theta}^{(m')}; y)\} / \sum_{m=1}^{M} \exp\{l(\hat{\theta}^{(m)}; y)-l(\hat{\theta}^{(m')}; y)\}$. These weights sum to 1 and indicate asymptotic probabilities of finding the same best partition when the sampling is replicated. The concentration of these probabilities among the models can be expressed through the Simpson index $H=\sum_{m=1}^{M} w_{m}^2$. High values of $H$ indicate that the model weights are concentrated on one or a few models out of the total number compared (Kemencei et al. 2014).
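The weight calculation is easy to mirror in a few lines. The following sketch uses hypothetical log-likelihood values for four candidate partitions:

## hypothetical log-likelihoods l(theta^(m); y) for M = 4 partitions
ll <- c(-120.3, -118.9, -118.7, -125.0)
w <- exp(ll - max(ll))
w <- w / sum(w)   # model weights, summing to 1
round(w, 3)
sum(w^2)          # Simpson index H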

The opticut package provides the opticut1 function to fit a chosen parametric model to a set of binary partitions; the summary returns the model output for each candidate partition with log-likelihood ratios and model weights.

## stratification
g <- c(1,1,1,1, 2,2,2,2, 3,3,3,3)
## abundance
y <- c(0,0,3,0, 2,3,0,5, 5,6,3,4)
mods <- opticut1(Y = y, Z = allComb(g), dist = "gaussian")
mods

Not all models are printed, only those where the log-likelihood ratio (logLR) is $\geq 2$ by default. We can change this by explicitly setting the cut argument in the print method (or by setting the cut global option via ocoptions, as explained later):

print(mods, cut = -Inf)

Model support across the partitions can be visualized by the model weight plot (wplot):

wplot(mods, cut = -Inf)

The opticut function can take matrices or a model formula as its input, and repeats the procedure done by opticut1 for multiple species in a community matrix. The summary shows the best-supported model for each species. opticut is the preferred interface for single species as well; single-species results are stored in the $species element of the output object:

(oc <- opticut(y, strata = g, comb = "all", dist = "gaussian"))
summary(oc)
oc$species

The use of the opticut1 function directly is generally discouraged: some of the internal checks are not guaranteed to flag issues when the formula-to-model-matrix translation is side-stepped (which is what happens when modifier variables are supplied as the X argument of opticut1). Use opticut with a single species instead, as shown above.

Indicator value

Once a model is fit to a given binary partition, we can quantify the indicator value for the species. The indicator value, denoted by $I^{(m)}$, describes the contrast between the two subsets of the data represented by the binary partition $z^{(m)}$. We define the indicator value as a scaled version of the $\beta_1^{(m)}$ coefficient estimate: $I^{(m)}= \mid \tanh(c\beta_1^{(m)}) \mid$, where $c=0.5$. The hyperbolic tangent function is used as an inverse Fisher transformation to scale real-valued coefficients into the $[-1, 1]$ range; taking the absolute value then maps the indicator value into the $[0, 1]$ range.

Here $c$ is a scaling constant that modifies the shape of the function. We chose 0.5 as the default because, in our experience with real-world data sets, it allows the indicator value to change more gradually. The $c=0.5$ setting is also identical to an inverse logistic function transformed onto the $[-1, 1]$ range.
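This equivalence is easy to verify numerically (a one-line check using base R only): tanh(0.5 * beta) equals the inverse logistic function, plogis, rescaled from [0, 1] to [-1, 1].

beta <- seq(-5, 5, 0.5)
all.equal(tanh(0.5 * beta), 2 * plogis(beta) - 1) # TRUE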

The beta2i function is used internally to calculate the indicator value. Here we show the effect of the scaling constant $c$ on the shape of the function, the default $c=0.5$ is in green:

beta <- seq(-5, 5, 0.1)
Col <- occolors(c("red", "blue"))(10)
Col[6] <- "#00FF00"
plot(beta, beta2i(beta), type = "n")
s <- seq(1, 0.1, -0.1)
for (i in 1:10) {
    lines(beta, beta2i(beta, scale = s[i]), col = Col[i])
    text(1.5 - 0.2, beta2i(1.5, scale = s[i]), s[i], col = Col[i])
}

An alternative way to define the indicator value would be to take the relative difference between the expected values for the 0 and 1 strata in $z^{(m)}$. This, however, depends on the response scale and on the baseline values chosen for possible modifying effects.
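Interestingly, this dependence disappears in the special case of a log link with no modifying variables: there, the relative difference $(\mu_1-\mu_0)/(\mu_1+\mu_0)$ algebraically reduces to $\tanh(\beta_1/2)$, matching the default $c=0.5$ scaling. A small sketch verifying this (hypothetical coefficients, base R only):

b0 <- 0.3             # hypothetical intercept (cancels out under the log link)
b1 <- seq(-3, 3, 0.5) # hypothetical contrasts
mu0 <- exp(b0)
mu1 <- exp(b0 + b1)
all.equal((mu1 - mu0) / (mu1 + mu0), tanh(b1 / 2)) # TRUE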

Our definition of the indicator value is on the linear predictor scale and is more readily compared across species, irrespective of their relative abundance and of the values of other modifying factors. Note, however, that the meaning of the indicator value can be quite different for studies using different parametric models: it reflects a difference in means in the Gaussian case, a multiplier in log-linear models, and a change in log-odds in logistic or ordinal regression. The indicator value given a binary partition is returned by the model summaries and used in visualization:

plot(oc)

Use the ocoptions function to change the default scale ($c$) setting.

User defined combinations

It is possible to supply user-defined binary partitions as a matrix via the Z argument:

comb <- cbind(
    A = c(rep(1, 4), rep(0, 8)),
    B = c(rep(0, 4), rep(1, 4), rep(0, 4)))
comb
print(opticut1(Y = y, Z = comb, dist = "gaussian"), cut = -Inf)

If the user happens to define complementary partitions, an error is thrown:

comb <- cbind(comb, 1-comb)
colnames(comb) <- LETTERS[1:4]
comb
try(opticut1(Y = y, Z = comb, dist = "gaussian"))
checkComb(comb)

The global option check_comb can be set to override this default behavior, although there is no real point in duplicating reparametrized but otherwise identical models:

op <- ocoptions(check_comb = FALSE, cut = -Inf)
opticut1(Y = y, Z = comb, dist = "gaussian")
ocoptions(op)

Rank based combinations

The IndVal method requires the algorithm to evaluate $2^K-1$ binary partitions. Our opticut approach is parametrization invariant with respect to the coding of levels in the binary partitions (coding affects the intercept term but not the contrast or the log-likelihood ratio). This effectively halves the number of partitions we need to compare ($2^{K-1}-1$; comb = "all" in opticut). Still, the number of partitions grows with powers of 2. Here we propose an approach whose cost increases linearly with $K$. The algorithm sorts the $K$ strata of $g$ in increasing order of the linear predictor estimates for the $K$ level-specific coefficients (as opposed to estimating 2 coefficients for a binary partition). The logic follows from the fact that optimal binary partitioning seeks the best split in terms of the likelihood ratio, with lower estimates on one side and higher estimates on the other side of the split. As a consequence, we only need to try $K-1$ binary partitions to find the optimal $z^{(m')}$. This algorithm is implemented in the rankComb function, which is called by opticut with the argument comb = "rank".
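Here is a minimal sketch of the ranking idea, using hypothetical stratum-level estimates: sort the $K$ estimates, then form only the $K-1$ partitions that split the ordering (this mirrors what the oComb function does, demonstrated later in this section):

est <- c(A = 0.1, B = 1.4, C = 0.3, D = 2.0) # hypothetical stratum estimates
o <- order(est)
names(est)[o]                                # strata in increasing order
## the K-1 candidate splits (1 = strata above the split)
Z <- sapply(1:(length(est) - 1), function(i)
    as.integer(seq_along(est) %in% o[-(1:i)]))
rownames(Z) <- names(est)
Z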

The function rankComb evaluates a model with a $K$-level factor and returns a corresponding partitioning matrix. Attributes hold the estimates for the $K$ levels:

rankComb(Y = y, Z = as.factor(g), dist = "gaussian", collapse = "_")

The collapse argument can be important for making partitions more distinguishable. The global default is " " (a space), which can be modified via the ocoptions function. Problems arise when the collapse character is part of the factor levels used for $g$. The fix_levels function comes in handy for fixing such levels by replacing the collapse character with something else:

getOption("ocoptions")$collapse
fix_levels(as.factor(c("A b", "C d")), sep=":")
fix_levels(as.factor(c("A b", "C d")), sep="")

There is an overhead in fitting the model to calculate the ranking first, but computing efficiency can still be high compared to evaluating all partitions. As a consequence of the ranking process, we do not have summaries for all possible binary partitions, only for the top candidates. Moreover, the partitions produced for each species might not be identical. Therefore the model weights ($w$) and the Simpson index ($H$) have a different interpretation and cannot easily be compared across species, unless the model weights are highly concentrated on the top models, in which case the summed weight of the missing models becomes negligible. Another consequence of the ranking process is that the $\beta_1^{(m)}$ estimates are always positive. When all partitions are compared, the set of partitions is fixed across species, and some species respond positively while others respond negatively to the same binary variable in terms of $\beta_1^{(m)}$. The sign of the relationship is therefore stored as part of the summary.

comb = "rank" is the default setting in opticut. The two approaches lead to identical best partitions; only the model weights ($w$) differ:

summary(opticut(y, strata = g, comb = "all", dist = "gaussian"))
summary(opticut(y, strata = g, comb = "rank", dist = "gaussian"))

Here is how the ranking info is turned into binary partitions internally:

## simulate some data
set.seed(1234)
n <- 200
x0 <- sample(1:4, n, TRUE)
x1 <- ifelse(x0 %in% 1:2, 1, 0)
x2 <- rnorm(n, 0.5, 1)
lam <- exp(0.5 + 0.5*x1 + -0.2*x2)
Y <- rpois(n, lam)

## binary partitions
head(rc <- rankComb(Y, model.matrix(~x2), as.factor(x0), dist="poisson"))
attr(rc, "est") # expected values in factor levels
aggregate(exp(0.5 + 0.5*x1), list(x0=x0), mean) # true values (at x2 = 0)

## simple example
oComb(1:4, "+")
## using estimates
oComb(attr(rc, "est"))

Partitioning for multiple species

The opticut function can take matrices or a model formula as its input, and repeats the procedure done by opticut1 for multiple species in a community matrix. The summary shows the best-supported model for each species, in this case based on a Poisson count model (dist = "poisson"). The plot method uses the indicator value (I in the summary) to represent the contrast between the two strata of the best supported binary partition:

## stratification
g <-    c(1,1,1,1, 2,2,2,2, 3,3,3,3)

## community matrix
y <- cbind(
    Sp1=c(4,6,3,5, 5,6,3,4, 4,1,3,2),
    Sp2=c(0,0,0,0, 1,0,0,1, 4,2,3,4),
    Sp3=c(0,0,3,0, 2,3,0,5, 5,6,3,4))

oc <- opticut(formula = y ~ 1, strata = g, dist = "poisson", comb = "all")
summary(oc)

plot(oc, cut = -Inf)

Model support across the partitions can be visualized by the model weight plot (wplot).

wplot(oc, cut = -Inf)

Note that wplot uses 'splits' (binary partitions), whereas plot uses the $K$ levels. Therefore, wplot does not work for multiple species when comb = "rank": the splits can then differ across species, whereas comb = "all" uses the same pre-defined partitions for all species.

Compare the "all" and "rank" based combinations: they give the same best partitions, complementarity aside:

op <- ocoptions(cut = -Inf)
summary(opticut(formula = y ~ 1, strata = g, dist = "poisson", comb = "all"))
summary(opticut(formula = y ~ 1, strata = g, dist = "poisson", comb = "rank"))
ocoptions(op)

Accessing models and partitions

The bestmodel method returns the best-supported model for further model diagnostics and prediction; getMLE returns the estimated coefficients and the variance-covariance matrix:

mods <- bestmodel(oc)
mods
## explore further
str(predict(mods[[1]]))
confint(mods[[1]])

## MLE and variance-covariance matrix (species 1)
getMLE(oc, which = 1)

The bestpart method returns the binary partition for the best-supported model:

bestpart(oc)

Quantifying uncertainty

Uncertainty in the estimated coefficients, and in the derived indicator value for the best-supported model ($I^{(m')}$), is quantified based on the estimated Hessian matrix, assuming asymptotic normality of the MLE; the distribution of $I^{(m')}$ is then obtained by parametric bootstrap. This approach (type = "asymp" in uncertainty) is suitable when the asymptotic normality assumption is reasonable, i.e. when the sample size is large. For small-sample situations, a non-parametric bootstrap algorithm is implemented (type = "boot") to estimate uncertainty in $I^{(m')}$. The summary contains lower and upper confidence limits (for a given error rate) representing the asymptotic or bootstrap distribution (based on $B$ iterations), given the fixed partition of the best model, $z^{(m')}$.

The output is summarized; the $uncertainty component contains the individual species results:

g <-    c(1,1,1,1, 2,2,2,2, 3,3,3,3)

y <- cbind(
    Sp1=c(4,6,3,5, 5,6,3,4, 4,1,3,2),
    Sp2=c(0,0,0,0, 1,0,0,1, 4,2,3,4),
    Sp3=c(0,0,3,0, 2,3,0,5, 5,6,3,4))

oc <- opticut(formula = y ~ 1, strata = g, dist = "poisson")

uc <- uncertainty(oc, type = "asymp", B = 999)
summary(uc)
uc$uncertainty
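For intuition, here is a rough sketch of what type = "asymp" does under the hood: draw coefficients from the asymptotic multivariate normal approximation and transform the contrast to the indicator value scale. This assumes getMLE returns components named coef and vcov and that the contrast is the second coefficient; inspect str(getMLE(oc, which = 1)) before relying on these names.

mle <- getMLE(oc, which = 1)
cf <- MASS::mvrnorm(999, mle$coef, mle$vcov) # draws from the MVN approximation
I_draws <- beta2i(cf[, 2])                   # indicator value for each draw
quantile(I_draws, c(0.05, 0.5, 0.95))        # bootstrap-style summary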

The bootstrap can be problematic for small sample sizes, because strata can go completely missing from a bootstrap sample:

try(uc <- uncertainty(oc, type = "boot", B = 99))

A general requirement for the bootstrap approaches ("boot" and "multi") is that the bootstrap samples contain observations from each stratum. We recommend having at least 5 observations per stratum. Possible problems with missing strata in small data sets can be remedied by supplying pre-defined indices for resampling, for example based on a jackknife (leave-one-out) approach. The resampling scheme can be customized for such needs; use the check_strata function to verify representation:

B <- sapply(1:length(g), function(i) which((1:length(g)) != i))
check_strata(oc, B) # check representation
summary(uncertainty(oc, type = "boot", B = B))

The reliability of the best partition can also be assessed using the setting type = "multi" (as in multiple models). In this case, the model partitions are re-evaluated for each bootstrap sample. Model uncertainty is assessed as the number of times a partition is supported out of the $B$ bootstrap runs ($b=1,\ldots,B$). The reliability ($R$) metric in the summary is the support proportion of the most frequently selected partition. The corresponding indicator value and confidence interval are conditional on this most commonly supported partition.

summary(ucm <- uncertainty(oc, type = "multi", B = B))

The bestpart method returns the selection frequencies for the best-supported models (based on type = "multi" with comb = "rank"):

bestpart(ucm)

The bootstrap averaged ('bagged') expected values for each $k=1,\ldots,K$ stratum in $g$ can be accessed using the bsmooth method that calculates $E[Y_i \mid g_i = k] = \frac{1}{B} \sum_{b=1}^{B} f^{-1}({}^{(b)}\hat{\beta}_0^{(m')} + {}^{(b)}\hat{\beta}_1^{(m')} {}^{(b)}z_i^{(m')})$:

bsmooth(ucm)
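The bagging formula itself is easy to mirror with hypothetical bootstrap draws. This toy sketch (not using the package internals) averages the inverse-link transformed linear predictor over runs for $K=3$ strata under a log link:

set.seed(1)
Bn <- 99
b0 <- rnorm(Bn, 0.2, 0.1) # hypothetical bootstrap intercepts
b1 <- rnorm(Bn, 1.0, 0.2) # hypothetical bootstrap contrasts
## hypothetical best partition per run: stratum 3 on the '1' side in every run
z <- matrix(rep(c(0, 0, 1), Bn), nrow = 3)
eta <- outer(rep(1, 3), b0) + z * outer(rep(1, 3), b1) # linear predictors
rowMeans(exp(eta))        # bagged expected value per stratum (log link)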

Distributions

For most distributions, we will use the dolina data set that ships with the opticut package: a comprehensive, micro-scale land snail data set (counts) from 16 dolines of the Aggtelek Karst Area, Hungary, as described in Kemencei et al. (2014).

data(dolina)
## stratum as ordinal
dolina$samp$stratum <- as.integer(dolina$samp$stratum)
## filter species to speed up things a bit
Y <- dolina$xtab[,colSums(dolina$xtab > 0) >= 20]

Real valued data: Gaussian

dol <- opticut(Y ~ stratum + lmoist + method, data=dolina$samp,
    strata=dolina$samp$mhab, dist="gaussian")
summary(dol)

## vertical plot orientation
plot(dol, horizontal=FALSE, pos=1, upper=0.8)

Nonnegative counts: Poisson and Negative Binomial

dol <- opticut(Y ~ stratum + lmoist + method, data=dolina$samp,
    strata=dolina$samp$mhab, dist="poisson")
summary(dol)

## horizontal plot orientation
plot(dol)

Because opticut uses the stats::glm function to fit the Poisson model, it accepts further arguments, e.g. offsets. Let's subset the dolina data set to the litter sampling method ("Q") and pool the abundances within each of the 16 dolines by microhabitat type. By doing so, we make the sampling effort uneven: the litter microhabitat was sampled along a North-South transect (7 locations), whereas the other three strata (rock, live trees, dead wood) were sampled at 3 random locations in each dolina.

DQ <- dolina$samp[dolina$samp$method == "Q",]
DQ$dol_mhab <- paste0(DQ$dolina, "_", DQ$mhab)
head(DQ)
YQ <- dolina$xtab[dolina$samp$method == "Q",]
YQ <- YQ[,colSums(YQ > 0) >= 20]
YQ <- mefa4::groupSums(YQ, 1, DQ$dol_mhab)
DQ <- mefa4::nonDuplicated(DQ, dol_mhab, TRUE)

Let's compare the results of ignoring the sampling effort differences with those using sampling effort as an offset. Offsets are defined as log(7) or log(3), representing the differences in sampled volume (more debris searched leads to more snails found). Using offsets results in fewer species associated with litter:

op <- ocoptions(collapse="_", sort=FALSE, cut=-Inf)
dol0 <- opticut(YQ, strata=DQ$mhab, dist="poisson")
off <- ifelse(DQ$mhab == "LI", log(7), log(3))
dol1 <- opticut(YQ, strata=DQ$mhab, dist="poisson", offset=off)
table(wo_offset=summary(dol0)$summary$split,
    with_offset=summary(dol1)$summary$split)
ocoptions(op)
opar <- par(mfrow=c(1,2))
plot(dol0, show_I=FALSE, show_S=FALSE, sort=FALSE, cex.axis=0.6)
plot(dol1, show_I=FALSE, show_S=FALSE, sort=FALSE, cex.axis=0.6)
par(opar)

The Negative Binomial model can be quite picky and can fail with an error partway through the fitting process:

try(dol <- opticut(Y ~ stratum + lmoist + method, data=dolina$samp,
    strata=dolina$samp$mhab, dist="negbin"))

Setting the try_error global option allows the process to continue after catching the error:

op <- ocoptions(try_error = TRUE)
dol <- opticut(Y ~ stratum + lmoist + method, data=dolina$samp,
    strata=dolina$samp$mhab, dist="negbin")
dol$failed
ocoptions(op)

It failed for the "mbor" (Macrogastra borealis) species, which is now excluded from the output.

Zero-inflated count distributions

The zero-inflated Negative Binomial implementation in the pscl package seems more robust and gives no error messages here. We use the dist = "zinb2" option so that we test for optimal partitioning in the zero-inflation (ZI) component. Using a mixture distribution can be important if 0s can occur not only through the ZI component but also due to low abundance in the count distribution (Poisson or Negative Binomial). Differentiating among these types of zeros is not possible by binarizing the data and using logistic regression.

In the case of "zip2" and "zinb2" distributions, the coefficients refer to the probability of non-zero, so that positive and negative effects are properly identified (as opposed to "zip" and "zinb" where the ZI coefficients refer to probability of zero):

dol <- opticut(Y ~ stratum + lmoist + method, data=dolina$samp,
    strata=dolina$samp$mhab, dist="zinb2")
plot(dol)
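To see why the "zip2" coefficients are sign-flipped versions of their "zip" counterparts, note that modelling the probability of non-zero instead of zero on the logit scale simply negates the linear predictor; a one-line check:

p <- c(0.1, 0.5, 0.9)                # probabilities of a structural zero
all.equal(qlogis(1 - p), -qlogis(p)) # logit(P(nonzero)) = -logit(P(zero))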

Presence-absence data: Binomial

The following example uses the dolina data set and illustrates how the link function can be specified in the Binomial case:

## dolina example
data(dolina)
## stratum as ordinal
dolina$samp$stratum <- as.integer(dolina$samp$stratum)
## filter species to speed up things a bit
Y <- ifelse(dolina$xtab[,colSums(dolina$xtab > 0) >= 20] > 0, 1, 0)
## opticut results, note the cloglog link function
dol <- opticut(Y ~ stratum + lmoist + method, data=dolina$samp,
    strata=dolina$samp$mhab, dist="binomial:cloglog")

## uncertainty with re-evaluated partitions (B kept small for speed)
ucdol <- uncertainty(dol, type="multi", B=25)

bestpart(ucdol)
heatmap(t(bestpart(ucdol)), scale="none", col=occolors()(25),
    distfun=function(x) dist(x, "manhattan"))

## See how indicator value changes with different partitions
with(ucdol$uncertainty[["pvic"]],
    boxplot(I ~ best, col="gold", ylab="Indicator value"))

Percent cover data: Binomial and Beta distribution

The data set, from the rioja package (Juggins 2015), is pollen stratigraphic data from Abernethy Forest, Scotland, spanning approximately 5500--12100 BP (Birks & Mathews 1978).

library(rioja)
data(aber)
strat.plot(aber$spec, aber$ages$Depth, scale.percent=TRUE, y.rev=TRUE)

We use the Beta distribution for the stratigraphy example. The only limitation of the Beta distribution in this context is that the boundary values (0 and 1) are not part of its support; these can be replaced by a tiny offset value (e.g. 0.0001). We focus on species whose maximum cover is at least 5 percent.

z <- as.factor(cut(aber$ages$Depth, 5))
ab <- as.matrix(aber$spec) / 100
ab[ab == 0] <- 0.0001
ab <- ab[,apply(ab, 2, max) > 0.05]

a <- opticut(ab, strata=z, comb="rank", dist="beta")
summary(a)

The plot shows results in a vertical orientation, resembling the stratigraphy plot.

plot(a, sort=FALSE, horizontal=FALSE, pos=1, upper=0.8,
    show_I=FALSE, show_S=FALSE, mar=c(6,6,1,1), xlab="", ylab="")

Let us overlay the best partitions on the raw data:

bp <- bestpart(a)
opar <- par(mfrow=c(3,4), mar=c(2,2,1,1))
for (i in 1:12) {
    plot(ab[,i], aber$ages$Depth, type="l", ann=FALSE)
    segments(x0=rep(0, nrow(ab)), y0=aber$ages$Depth, x1=ab[,i],
        col=ifelse(bp[,i] > 0, 2, 1))
    title(main=colnames(ab)[i])
}
par(opar)

The following data set is from the optpart package (Roberts 2016b): vascular plant species cover for the Shoshone National Forest, Wyoming, USA. We will try to find species associated with different elevations.

First we try a Binomial model for 0/1 data. Note that we set the try_error option to TRUE so that fitting does not stop when a species-level model runs into an error.

library(optpart)

data(shoshsite)
data(shoshveg)

elev <- cut(shoshsite$elevation, breaks=c(0, 7200, 8000, 9000, 20000))
levels(elev) <- c("low","mid1","mid2", "high")

sveg <- as.matrix(shoshveg)
sveg[sveg > 0] <- 1
op <- ocoptions(try_error = TRUE)
o <- opticut(sveg ~ 1, strata=elev, dist="binomial")
ocoptions(op)
plot(o, sort=1, mar=c(4,6,2,2), cex.axis=0.5, ylab="")

Next, we use the same data set as percent cover data and specify a Beta distribution:

sveg <- as.matrix(shoshveg)
sveg <- sveg[,colSums(sveg>0) >= 50]
table(sveg)
## recode the ordinal cover classes into approximate proportions
## suitable for the Beta distribution (0s get a small positive value)
sveg[sveg==0] <- 0.001
sveg[sveg==0.1] <- 0.01
sveg[sveg==0.5] <- 0.05
sveg[sveg==1] <- 0.15
sveg[sveg==2] <- 0.25
sveg[sveg==3] <- 0.35
sveg[sveg==4] <- 0.45
sveg[sveg==5] <- 0.55
sveg[sveg==6] <- 0.65
sveg[sveg==7] <- 0.75
sveg[sveg==8] <- 0.8
table(sveg)
o2 <- opticut(sveg ~ 1, strata=elev, dist="beta")
plot(o2, sort=1, mar=c(4,6,2,2), cex.axis=0.6, ylab="")

Presence-only (use-availability) data

Presence-only data arise, for example, from animal movement studies or museum specimens. Here is an example using Mountain Goat telemetry data collected in the Coast Mountains of northwest British Columbia, Canada, as described in Lele and Keim (2006).

A multiple-species analysis is not possible here because the number of used (presence) points differs among species, leading to a ragged-array format that is not straightforward to supply through the formula interface. The analysis of such presence-only data is also subject to conditions, as explained by Solymos & Lele (2016).

According to our expectations, Mountain Goats are associated with steeper slopes:

library(ResourceSelection)
data(goats)
slp <- cut(goats$SLOPE,c(0, 20, 30, 40, 50, 90), include.lowest=TRUE)
table(slp, goats$STATUS)
o <- opticut(STATUS ~ ELEVATION, data=goats, strata=slp, dist="rsf")
o$species
plot(o, sort=FALSE, show_S=FALSE)

Customizing the distribution

We may want to expand the zero-inflation component of a ZIP model: instead of a constant zero probability, we can include covariate effects to account for some of the variation in the counts. Note how the return value needs to be structured when supplying a function as the distribution.

data(dolina)
dolina$samp$stratum <- as.integer(dolina$samp$stratum)
Y <- dolina$xtab[,colSums(dolina$xtab > 0) >= 20]

fun <- function(Y, X, linkinv, zi_term, ...) {
    X <- as.matrix(X)
    mod <- pscl::zeroinfl(Y ~ X-1 | zi_term, dist = "poisson", ...)
    list(coef=coef(mod),
        logLik=logLik(mod),
        linkinv=mod$linkinv)
}
Xdol <- model.matrix(~ stratum + lmoist + method, data=dolina$samp)
fun(Y[,"amin"], Xdol, zi_term=dolina$samp$method)
opticut1(Y[,"amin"], Xdol, Z=dolina$samp$mhab,
    dist=fun, zi_term=dolina$samp$method)
dol2 <- opticut(Y ~ stratum + lmoist + method, data=dolina$samp,
    strata=dolina$samp$mhab, dist=fun, zi_term=dolina$samp$method)
summary(dol2)

Mixed-effects models (LMM, GLMM)

Here is an example using mixed models via the lme4 package (Bates et al. 2015) with the dolina data. Pairs of samples were taken at each sampling location using 2 methods. The method is now a fixed factor, while location is a random intercept (see Kemencei et al. 2014 for more explanation).

library(lme4)
lmefun <- function(Y, X, linkinv, gr, ...) {
    X <- as.matrix(X)
    m <- glmer(Y ~ X-1 + (1|gr), family=poisson("log"), ...)
    list(coef=fixef(m),
        logLik=logLik(m),
        linkinv=family(m)$linkinv)
}
lmefun(Y[,1], X=matrix(1, nrow(Y), 1), gr=dolina$samp$sample)

o <- opticut(Y[,"dper"] ~ method, data=dolina$samp, strata=dolina$samp$mhab, 
    dist=lmefun, gr=dolina$samp$sample)
print(o$species[[1]], cut=-Inf)

Generalized additive models (GAM)

We will use the gam function from the mgcv package (Wood 2006) and data of Ovenbird survey counts from Solymos et al. (2012).

First, we prepare the data and define the GAM wrapper function:

library(mgcv)
library(detect)
data(oven)
oven$veg <- factor(NA, c("open","decid","conif"))
oven$veg[oven$pforest < 0.5] <- "open"
oven$veg[oven$pforest >= 0.5 & oven$pdecid >= 0.5] <- "decid"
oven$veg[oven$pforest >= 0.5 & oven$pdecid < 0.5] <- "conif"
table(oven$veg, useNA="always")
oven$xlat <- scale(oven$lat)
oven$xlong <- scale(oven$long)
gamfun <- function(Y, X, linkinv, Data, ...) {
    X <- as.matrix(X)
    m <- mgcv::gam(Y ~ X-1 + s(xlat) + s(xlong), Data, ...)
    list(coef=coef(m),
        logLik=logLik(m),
        linkinv=family(m)$linkinv)
}

Try it on a single partition: deciduous vs. not:

x <- ifelse(oven$veg=="decid",1,0)
X <- model.matrix(~x)
gamfun(oven$count, X, Data=oven, family=poisson)
print(opticut1(oven$count, X=X[,1,drop=FALSE], oven$veg, dist=gamfun,
    Data=oven, family=poisson), cut=-Inf)

Poisson count example with GAM:

o <- opticut(count ~ 1, oven, strata=veg, dist=gamfun, Data=oven, family=poisson)
summary(o)
o <- opticut(count ~ 1, oven, strata=veg, dist="poisson")
summary(o)

Binomial GAM:

oven$pa <- ifelse(oven$count > 0, 1, 0)
o <- opticut(pa ~ 1, oven, strata=veg, dist=gamfun, Data=oven, family=binomial)
summary(o)
o <- opticut(pa ~ 1, oven, strata=veg, dist="binomial")
summary(o)

Incorporating detectability: N-mixture models

A single-visit based N-mixture model is an example where detection error is estimated (Solymos et al. 2012, Solymos & Lele 2016). We will use the Ovenbird data from the detect package (Solymos et al. 2016).

svfun <- function(Y, X, linkinv, ...) {
    X <- as.matrix(X)
    m <- detect::svabu(Y ~ X-1 | Xdet-1, ...)
    list(coef=coef(m),
        logLik=logLik(m),
        linkinv=poisson()$linkinv)
}
## detection model
oven$xjulian <- scale(oven$julian)
Xdet <- model.matrix(~ xjulian, oven)
svfun(oven$count, X=model.matrix(~ xlat + xlong, oven))

Let us compare results based on naive GLM and N-mixture:

## naive GLM
o1 <- opticut(count ~ xlat + xlong, oven, strata=veg, dist="poisson")
print(o1$species[[1]], cut=-Inf)
## N-mixture
summary(svabu(count ~ xlat + veg | xjulian, data=oven))
o2 <- opticut(count ~ xlat + xlong, oven, strata=veg, dist=svfun)
print(o2$species[[1]], cut=-Inf)

Package options

High performance computing

Several functions in the package accept a cl argument (a cluster object from the parallel package) so that calculations over species, partitions, or bootstrap iterations run in parallel. The timings below compare sequential and parallel evaluation:

data(dolina)
dolina$samp$stratum <- as.integer(dolina$samp$stratum)
Y <- ifelse(dolina$xtab > 0, 1, 0)
dol <- opticut(Y ~ stratum + lmoist + method, data=dolina$samp,
    strata=dolina$samp$mhab, dist="binomial")

## parallel computing comparisons
library(parallel)
cl <- makeCluster(2)

## sequential, all combinations (2^(K-1) - 1)
system.time(opticut(Y ~ stratum + lmoist + method, data=dolina$samp,
    strata=dolina$samp$mhab, dist="binomial", comb="all", cl=NULL))

## sequential, rank based combinations (K - 1)
system.time(opticut(Y ~ stratum + lmoist + method, data=dolina$samp,
    strata=dolina$samp$mhab, dist="binomial", comb="rank", cl=NULL))

## parallel, all combinations (2^(K-1) - 1)
system.time(opticut(Y ~ stratum + lmoist + method, data=dolina$samp,
    strata=dolina$samp$mhab, dist="binomial", comb="all", cl=cl))

## parallel, rank based combinations (K - 1)
system.time(opticut(Y ~ stratum + lmoist + method, data=dolina$samp,
    strata=dolina$samp$mhab, dist="binomial", comb="rank", cl=cl))

## compare timings for uncertainty calculations
system.time(uncertainty(dol, type="asymp", B=999))
system.time(uncertainty(dol, type="asymp", B=999, cl=cl))

system.time(uncertainty(dol, type="boot", B=4))
system.time(uncertainty(dol, type="boot", B=4, cl=cl))

system.time(uncertainty(dol, type="multi", B=4))
system.time(uncertainty(dol, type="multi", B=4, cl=cl))

stopCluster(cl)

Global options

The ocoptions function provides a convenient way of handling options related to the opticut package. The function takes arguments in <tag> = <value> form, or a list of tagged values. When parameters are set by ocoptions, their former values are returned in an invisible named list. Such a list can be passed as an argument to ocoptions to restore the parameter values. The tags used throughout this manual include collapse, sort, cut, try_error, check_comb, and scale.

## simple example from Legendre 2013
## Indicator Species: Computation, in
## Encyclopedia of Biodiversity, Volume 4
## http://dx.doi.org/10.1016/B978-0-12-384719-5.00430-5
gr <- as.factor(paste0("X", rep(1:5, each=5)))
spp <- cbind(Species1=rep(c(4,6,5,3,2), each=5),
    Species2=c(rep(c(8,4,6), each=5), 4,4,2, rep(0,7)),
    Species3=rep(c(18,2,0,0,0), each=5))
rownames(spp) <- gr

## current settings
str(ocoptions()) # these give identical answers
str(getOption("ocoptions"))
summary(ocall <- opticut(spp ~ 1, strata=gr, dist="gaussian", comb="all"))

## resetting ocoptions and checking new settings
ocop <- ocoptions(collapse="&", sort=FALSE)
str(ocoptions())
## running again with new settings
summary(ocall <- opticut(spp ~ 1, strata=gr, dist="gaussian", comb="all"))

## resetting original
ocoptions(ocop)
str(ocoptions())

Color themes

The opticut package uses color themes for plotting and provides a convenient way of setting color palettes via the occolors function. The function takes a single theme argument and returns a function as in colorRampPalette.

The theme argument can be a character value, character vector, or a function used to interpolate the colors. The built-in values are "br" (blue-red divergent palette, colorblind safe), "gr" (green-red divergent palette), "bw" (black and white). Hexadecimal values for the built-in palettes are taken from http://colorbrewer2.org.

plot(1:100, rep(2, 100), pch = 15,
    ylim = c(0, 21), axes = FALSE, ann = FALSE,
    col = occolors()(100)) # default 'bg'
text(50, 1, "theme = 'br'")
points(1:100, rep(5, 100), pch = 15,
    col=occolors("gr")(100))
text(50, 4, "theme = 'gr'")
points(1:100, rep(8, 100), pch = 15,
    col=occolors("bw")(100))
text(50, 7, "theme = 'bw'")
points(1:100, rep(11, 100), pch = 15,
    col=occolors(terrain.colors)(100))
text(50, 10, "theme = terrain.colors")
points(1:100, rep(14, 100), pch = 15,
    col=occolors(c("purple", "pink", "orange"))(100))
text(50, 13, "theme = c('purple', 'pink', 'orange')")
points(1:100, rep(17, 100), pch = 15,
    col=occolors(c("#a6611a", "#ffffbf", "#018571"))(100))
text(50, 16, "theme = c('#a6611a', '#ffffbf', '#018571')")
points(1:100, rep(20, 100), pch = 15,
    col=occolors(c("#7b3294", "#ffffbf", "#008837"))(100))
text(50, 19, "theme = c('#7b3294', '#ffffbf', '#008837')")

Progress bar

The expected completion time and the progress of extensive calculations are shown by a progress bar via the pbapply package. The default options with opticut are:

str(pboptions())

See ?pboptions for a description of these options. Use pboptions(type = "none") to turn off the progress bar in interactive R sessions. The progress bar is automatically turned off during non-interactive sessions.

Dynamic documents

opticut object summaries come with an as.data.frame method that turns the summary into a data frame, which is what, for example, the kable function from the knitr package expects. This greatly facilitates formatting the output, and the user does not have to dig into the structure of the summary object.

The GitHub repository has a minimal Rmarkdown example to demonstrate how to format opticut objects for best effect: Rmd source, knitted PDF.

library(knitr)

y <- cbind(
    Sp1=c(4,6,3,5, 5,6,3,4, 4,1,3,2),
    Sp2=c(0,0,0,0, 1,0,0,1, 4,2,3,4),
    Sp3=c(0,0,3,0, 2,3,0,5, 5,6,3,4))
g <-    c(1,1,1,1, 2,2,2,2, 3,3,3,3)
oc <- opticut(formula = y ~ 1, strata = g, dist = "poisson")
uc <- uncertainty(oc, type = "asymp", B = 999)

print(kable(as.data.frame(oc), digits=3))
print(kable(oc$species[[1]][,c(1,2,4,5,8,9,10)], digits=3))
print(kable(as.data.frame(uc), digits=3))

The kable output is rendered as nice tables (without the print wrapper):

kable(as.data.frame(oc), digits=3)
kable(oc$species[[1]][,c(1,2,4,5,8,9,10)], digits=3)
kable(as.data.frame(uc), digits=3)

Summary

The likelihood-based optimal partitioning framework implemented in the opticut R package provides a compelling alternative to other available R packages (indicspecies, De Caceres & Legendre 2009; labdsv and optpart, Roberts 2016a, 2016b; vegan, Oksanen et al. 2016), especially when the type of data or the presence of modifying effects calls for an alternative approach.

The package comes with many built-in parametric models for binary, count, abundance, percent cover, ordinal, and presence-only data, leveraging other R packages (MASS, Venables & Ripley 2002; betareg, Cribari-Neto & Zeileis 2010; pscl, Zeileis et al. 2008; ResourceSelection, Lele et al. 2016) for fitting. The approach can also be extended to more complex situations, such as linear and generalized linear mixed models (LMM, GLMM; see Kemencei et al. 2014), generalized additive models, and N-mixture models.

Computing times are shortened by the application of efficient algorithms and through high performance computing options. The opticut package provides progress bars with estimated remaining time for long-running evaluations (through the pbapply package, Solymos & Zawadzki 2016), natively supports several parallel back-ends (multicore machines, computing clusters, forking; through the parallel package, R Core Team 2016) to speed calculations up, and provides options and methods for dynamic report generation (coercion methods and color themes). Functions in the package can be customized and extended to meet the needs under a wide range of real-world situations.

Please cite the opticut package in scholarly publications as:

Peter Solymos and Ermias T. Azeria (2016). opticut: Likelihood Based Optimal Partitioning for Indicator Species Analysis. R package version <insert appropriate version number here>. https://github.com/psolymos/opticut

Use citation("opticut") for an up-to-date citation.

References

Bates, D., Maechler, M., Bolker, B. & Walker, S. (2015). Fitting Linear Mixed-Effects Models Using lme4. Journal of Statistical Software 67(1): 1--48.

Birks, H. H. & Mathews, R. W. (1978). Studies in the vegetational history of Scotland V. Late Devensian and early Flandrian macrofossil stratigraphy at Abernethy Forest, Invernessshire. New Phytologist 80: 455--84.

Chytry, M., Tichy, L., Holt, J. & Botta-Dukat, Z. (2002). Determination of diagnostic species with statistical fidelity measures. Journal of Vegetation Science 13: 79--90.

Cribari-Neto, F. & Zeileis, A. (2010). Beta Regression in R. Journal of Statistical Software 34(2): 1--24. http://www.jstatsoft.org/v34/i02/.

De Caceres, M. & Legendre, P. (2009). Associations between species and groups of sites: Indices and statistical inference. Ecology 90: 3566--3574.

Dufrene, M. & Legendre, P. (1997). Species assemblages and indicator species: The need for a flexible asymmetrical approach. Ecological Monographs 67: 345--366.

Halme, P., Monkkonen, M., Kotiaho, J. S. & Ylisirnio, A.-L. (2009). Quantifying the indicator power of an indicator species. Conservation Biology 23: 1008--1016.

Juggins, S. (2015). rioja: Analysis of Quaternary Science Data, R package version 0.9-9. http://cran.r-project.org/package=rioja

Kemencei, Z., Farkas, R., Pall-Gergely, B., Vilisics, F., Nagy, A., Hornung, E. & Solymos, P. (2014). Microhabitat associations of land snails in forested dolinas: implications for coarse filter conservation. Community Ecology 15: 180--186.

Lele, S. R., Keim, J. L. & Solymos, P. (2016). ResourceSelection: Resource Selection (Probability) Functions for Use-Availability Data. R package version 0.3-0. https://CRAN.R-project.org/package=ResourceSelection

McGeoch, M. A. & Chown, S. L. (1998). Scaling up the value of bioindicators. Trends in Ecology and Evolution 13: 46--47.

Oksanen, J., Blanchet, F. G., Friendly, M., Kindt, R., Legendre, P., McGlinn, D., Minchin, P. R., O'Hara, R. B., Simpson, G. L., Solymos, P., Stevens, M. H. H., Szoecs, E. & Wagner, H. (2016). vegan: Community Ecology Package. R package version 2.5-0. https://CRAN.R-project.org/package=vegan

R Core Team (2016). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.

Roberts, D. W. (2016a). labdsv: Ordination and Multivariate Analysis for Ecology. R package version 1.8-0. https://CRAN.R-project.org/package=labdsv

Roberts, D. W. (2016b). optpart: Optimal Partitioning of Similarity Relations. R package version 2.3-0. https://CRAN.R-project.org/package=optpart

Solymos, P., Moreno, M. & Lele, S. R. (2016). detect: Analyzing Wildlife Data with Detection Error. R package version 0.4-0. https://github.com/psolymos/detect

Solymos, P., Lele, S. R. & Bayne, E. (2012). Conditional likelihood approach for analyzing single visit abundance survey data in the presence of zero inflation and detection error. Environmetrics 23: 197--205.

Solymos, P. & Lele, S. R. (2016). Revisiting resource selection probability functions and single-visit methods: clarification and extensions. Methods in Ecology and Evolution 7: 196--205.

Solymos, P. & Zawadzki, Z. (2016). pbapply: Adding Progress Bar to '*apply' Functions. R package version 1.3-2. https://github.com/psolymos/pbapply

Venables, W. N. & Ripley, B. D. (2002). Modern Applied Statistics with S. Fourth Edition. Springer, New York. ISBN 0-387-95457-0

Wildi, O. & Feldmeyer-Christe, E. (2013). Indicator values (IndVal) mimic ranking by F-ratio in real-world vegetation data. Community Ecology 14(2): 139--143.

Wood, S. N. (2006). Generalized Additive Models: An Introduction with R. Chapman and Hall/CRC.

Zeileis, A., Kleiber, C. & Jackman, S. (2008). Regression Models for Count Data in R. Journal of Statistical Software, 27(8), 1-25. http://www.jstatsoft.org/v27/i08/

Zettler, M. L., Proffitt, C. E., Darr, A., Degraer, S., Devriese, L., Greathead, C., Kotta, J., Magni, P., Martin, G., Reiss, H., Speybroeck, J., Tagliapietra, D., Van Hoey, G. & Ysebaert, T. (2013). On the Myths of Indicator Species: Issues and Further Consideration in the Use of Static Concepts for Ecological Applications. PLoS ONE, 8(10):e78219.


