Introduction

Identification and monitoring of indicator species has long been considered a useful and cost-effective way to track changes in environmental conditions or in the status of habitat for biota. Applications of indicator species include characterizing habitats or vegetation types, indicating the naturalness or degradation of ecosystems, measuring the success of habitat restoration, alerting about critical community thresholds, aiding survey and monitoring design, and indicating the presence of cryptic or rare species.

A key attribute of indicator species (also referred to as character or differential species) is that they have a strong association with the environment or with whatever factor they are supposed to indicate. Approaches to quantify the degree of environmental association, or the indicator value, of species fall into three major types of algorithms:

  1. contingency table based measures,
  2. analysis of variance (ANOVA) based measures, and
  3. the widely used, non-parametric IndVal approach.

While these approaches have strong appeal and wide application, they do not always meet the challenges that ecological data present. Ecological data come in different forms and distribution types (typically binary, ordinal, abundance, or presence-only data) whose distributional assumptions are not always handled properly by these methods. For example, the contingency table approach is based on binary (presence/absence) data or an arbitrary categorization of abundance data; binarizing available abundance data to fit this framework effectively causes a loss of information.
In this regard, the ANOVA approach is better suited to handle abundance data; however, its parametric assumptions imply normality and homoskedastic errors, which are often not satisfied in field situations (e.g. with 0/1, skewed biomass, or % cover data). A non-parametric approach such as the IndVal index does not assume normality and can handle data of various distribution types. However, the randomization test used to assess species-environment associations might not always be meaningful for continuous (e.g. biomass or % cover) or ordinal (e.g. vegetation cover scores) responses.

Moreover, none of these approaches is designed to deal with other common aspects of field data, for example confounding variables, sample selection bias (presence-only data), differences in sampling effort, or imperfect detection. The issue of confounding variables can be particularly important in field situations: if the effects of other natural environmental gradients or regional factors are not controlled for, the assessment of the indicator value of a species for a particular condition of interest can be compromised. For example, the absence or low abundance of a 'good' indicator species at a given site might suggest that the site is in 'bad' condition, yet the species' population status might reflect naturally occurring variability rather than anthropogenic stressors. Having an option to control for putative confounding factors can therefore improve the assessment of the indicator value of species.

In this paper we introduce opticut (Optimal Cut), an R package developed to address the most common limitations of currently available methods: (1) the absence of a common framework that can accommodate the different distributional assumptions of field data, and (2) the lack of an option to control for putative confounding factors and to evaluate how the indicator value of species might change under different conditions. The critical development in opticut that addresses these limitations is the implementation of a general and extensible likelihood-based framework with multi-model inference for indicator species analysis. In addition, the opticut R extension package offers computationally efficient algorithms for finding indicator species, and tools for exploring and visualizing the results.

The package implements likelihood-based optimal partitioning for indicator species analysis: the best binary partition is found for each species based on model selection, possibly controlling for modifying/confounding variables as described in Kemencei et al. (2014, Community Ecology 15:180--186). The package also implements various measures of uncertainty based on binary partitions, optimal multinomial partitioning, and exploratory selection indices, with native support for parallel computations.

Why opticut?

Theory

The quest for optimal binary partitioning

$Y_{i}$'s are observations for a single species from $n$ locations ($i = 1, ..., n$). $g_{i}$'s are known discrete descriptors of the locations with $K$ levels ($K > 2$). $z^{(m)}$ is a binary reclassification of $g$ taking values (0, 1). The superscript $m = 1, ..., M$ indexes the possible binary reclassifications out of the $M = 2^{K-1} - 1$ total combinations (excluding complements). See below for options for defining binary partitions. There can also be other site descriptors denoted as $x_{ij}$ taking discrete or continuous values ($j = 1, ..., p$; the number of predictors).

A suitable parametric model describes the relationship between the observations and the site descriptors through the probability density function $P(Y_{i} = y_{i} \mid z_{i}^{(m)}, x_{ij}, \theta)$ where $\theta$ is the vector of model parameters: $\theta = (\beta_{0}, \beta_{1}, \alpha_{1}, ..., \alpha_{p})$. The choice of the parametric model depends on the nature of the observations. It can be Gaussian, Binomial, Poisson, ordinal, Beta regression, or zero-inflated models, with a suitable link function ($f$) for the mean: $f(\eta_{i}) = \beta_{0}^{(m)} + \beta_{1}^{(m)} z_{i}^{(m)} + \sum_{j=1}^{p} \alpha_{j}^{(m)} x_{ij}$.

$\widehat{\theta^{(m)}}$ is the maximum likelihood estimate (MLE) of the model parameters given the data and classification $m$, with corresponding log-likelihood value $l(\widehat{\theta^{(m)}}; y)$. Finding MLEs for all $M$ candidate binary partitions leads to a set of log-likelihood values. One can compare the log-likelihood values to a null model (no binary partition is necessary) where $\beta_{1} = 0$ leading to the MLE $\widehat{\theta^{(0)}}$ and corresponding log-likelihood value for the null model: $l(\widehat{\theta^{(0)}}; y)$.

The log-likelihood ratio for each candidate partition can be calculated as $l(\widehat{\theta^{(m)}}; y) - l(\widehat{\theta^{(0)}}; y)$. The best supported binary partition is the model with the highest log-likelihood ratio value.
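
As a minimal illustration (not part of the package workflow), this ratio can be computed directly for a single candidate partition, here with a Poisson GLM on simulated toy data:

## minimal illustration: log-likelihood ratio for one candidate partition,
## using a Poisson GLM on simulated toy data (z is the binary partition)
set.seed(1)
x <- rnorm(50)
z <- rbinom(50, 1, 0.5)
y <- rpois(50, exp(0.2 + 0.8*z - 0.2*x))
m1 <- glm(y ~ z + x, family=poisson)  # model with the binary partition
m0 <- glm(y ~ x, family=poisson)      # null model (beta_1 = 0)
as.numeric(logLik(m1) - logLik(m0))   # log-likelihood ratio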

One way of calculating the indicator value for each candidate partition is based on expected values obtained via the inverse link function, $\mu_{0}^{(m)} = f^{-1}(\beta_{0}^{(m)})$ and $\mu_{1}^{(m)} = f^{-1}(\beta_{0}^{(m)} + \beta_{1}^{(m)})$, as $I = 1 - min(\mu_{0}^{(m)}, \mu_{1}^{(m)}) / max(\mu_{0}^{(m)}, \mu_{1}^{(m)})$, where $\mu_{0}^{(m)} = E[Y_{i} \mid z_{i}^{(m)}=0, x_{ij}=0]$ and $\mu_{1}^{(m)} = E[Y_{i} \mid z_{i}^{(m)}=1, x_{ij}=0]$ are expected values for the observations given the binary partition $z_{i}^{(m)}$ and at 0 values for all $x_{ij}$. This approach can be sensitive to the range of values supported by the link function. For example, it works nicely with a logarithmic or logistic link function, where non-negativity of predicted values is ensured by definition. This is, however, not the case for the identity link in the Gaussian case, where negative values can invalidate the indicator value calculation described above. (This usually happens when confounding variables are not centered and the intercept then absorbs that difference as part of the baseline.)

As an alternative, one can use the estimate $\beta_{1}^{(m)}$ itself to express the contrast between the two strata. This also makes the index more comparable when different link functions are used. We use the hyperbolic tangent function (the inverse of Fisher's $z$ transform) to scale the real-valued $\beta_{1}^{(m)}$ into the unit range (0-1): $I = \tanh(|\beta_{1}^{(m)}|) = \frac{\exp(2 |\beta_{1}^{(m)}|) - 1}{\exp(2 |\beta_{1}^{(m)}|) + 1}$. The absolute value is used so that the index reflects only the magnitude of the contrast between the strata, not its direction (negative estimates can occur when all combinations are considered).
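
For illustration only (with arbitrarily chosen expected values), the two formulations of $I$ can be reproduced directly in R; under a log link the contrast $\beta_{1}$ equals the difference of the log expected values:

## arbitrary example values, for illustration only
mu0 <- 0.8; mu1 <- 3.5               # expected values in the two strata
1 - min(mu0, mu1) / max(mu0, mu1)    # I based on expected values
beta1 <- log(mu1) - log(mu0)         # contrast on the log (link) scale
tanh(abs(beta1))                     # I based on the scaled contrast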

Finding all possible binary partitions

Finding all combinations does not require a model or observed responses. It only takes a classification vector with $K > 1$ levels.

kComb returns a 'contrast' matrix corresponding to all possible binary partitions of the factor with K levels. Complements are not counted twice, i.e. (0,0,1,1) is equivalent to (1,1,0,0). The number of such possible combinations is $M = 2^{K-1} - 1$.

Get the package and load it:

#devtools::install_github("psolymos/opticut")
#devtools::install("~/repos/opticut")
#devtools::check("~/repos/opticut")
#devtools::build("~/repos/opticut", binary=TRUE)
library(opticut)
kComb(k = 2)
kComb(k = 3)
kComb(k = 4)

allComb takes a classification vector with at least 2 levels and returns a model matrix with binary partitions. checkComb checks whether the combinations are unique and non-complementary (misfits are returned as attributes).

(f <- rep(LETTERS[1:4], each=2))
(mc <- allComb(f, collapse = "_"))
checkComb(mc)
## add a complement of the first column plus a duplicate column to trigger the checks
mc2 <- cbind(z = 1 - mc[,1], mc[,c(1:ncol(mc), 1)])
colnames(mc2) <- 1:ncol(mc2)
mc2
checkComb(mc2)

Poisson count model example

set.seed(1234)
n <- 200
x0 <- sample(1:4, n, TRUE)
x1 <- ifelse(x0 %in% 1:2, 1, 0)
x2 <- rnorm(n, 0.5, 1)
table(x0,x1)
## SPP1: higher expected abundance in strata 1-2 (via x1)
lam1 <- exp(0.5 + 0.5*x1 + -0.2*x2)
boxplot(lam1~x0)
Y1 <- rpois(n, lam1)
## SPP2: higher expected abundance in stratum 4
lam2 <- exp(0.1 + 0.5*ifelse(x0==4,1,0) + 0.2*x2)
boxplot(lam2~x0)
Y2 <- rpois(n, lam2)
## SPP3: no association with the strata
lam3 <- exp(0.1 + -0.2*x2)
boxplot(lam3~x0)
Y3 <- rpois(n, lam3)
Y <- cbind(SPP1=Y1, SPP2=Y2, SPP3=Y3)
X <- model.matrix(~x2)
Z <- allComb(x0)
opticut1(Y1, X, Z, dist="poisson")
opticut1(Y2, X, Z, dist="poisson")
opticut1(Y3, X, Z, dist="poisson")
summary(m <- opticut(Y ~ x2, strata=x0, dist="poisson", comb="all"))
plot(m, cut=-Inf)

In the summary output each row corresponds to a species: split gives the best supported binary partition (the strata where the expected value is higher), I is the indicator value for that partition, and the log-likelihood ratio and model weight express its support relative to the other candidate partitions.

Not using all possible partitions

Blindly fitting a model to all possible partitions is a wasteful use of resources. Instead, one can rank the $K$ strata based on their expected response values ($\mu_{1}, ..., \mu_{k}, ..., \mu_{K}$, where $\mu_{k}=E[Y_{i} \mid g_{i}=k, x_{ij}=0]$). This way only $K-1$ partitions have to be explored:

oComb(1:4)

oComb returns the 'contrast' matrix based on the rank vector given as input. Rank 1 means the lowest expected value among the strata.

The function rankComb fits the model with multiple ($K > 2$) factor levels to determine the ranking, and returns a binary classification matrix similar to allComb:

head(rc <- rankComb(Y1, model.matrix(~x2), as.factor(x0), dist="poisson"))
attr(rc, "est")

Note that the ranking varies from species to species, thus the resulting matrix cannot be supplied as a common strata definition. Use comb="rank" instead, and opticut computes the ranking internally for each species:

summary(opticut(Y ~ x2, strata=x0, dist="poisson", comb="rank"))

There is an overhead of fitting an extra model to calculate the ranking, but the computational savings relative to exploring all partitions can still be substantial when the number of levels ($K$) is high.
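
For example, the number of candidate partitions per species grows exponentially with $K$ under comb="all" but only linearly under comb="rank":

## number of candidate partitions per species
K <- 10
2^(K-1) - 1  # comb="all": 511 candidate partitions
K - 1        # comb="rank": 9 candidate partitions (plus one fit for the ranking)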

A downside of this approach is that not all possible partitions are explored, thus the model weights do not represent all possible models, only the top candidates. The interpretation of the model weights is therefore different (i.e. they cannot be used as a reliability metric, especially when support for the best model is not dominant).

Distributions

Currently available distributions are illustrated in the sections below; besides the built-in options (e.g. gaussian, poisson, binomial, beta, ordered, and rsf/rspf for presence-only data), custom distributions can be supplied as functions.

Gaussian

Y <- rnorm(n, log(lam1) + 10, 0.5)
(mod <- opticut(Y ~ x2, strata=x0, dist="gaussian"))

Legendre example

gr <- rep(1:5, each=5)
spp <- cbind(Sp1=rep(c(4,6,5,3,2), each=5),
    Sp2=c(rep(c(8,4,6), each=5), 4,4,2, rep(0,7)),
    Sp3=rep(c(18,2,0,0,0), each=5))
rownames(spp) <- gr
spp
summary(mod <- opticut(spp ~ 1, strata=gr, dist="gaussian", comb="all"))
summary(mod <- opticut(spp ~ 1, strata=gr, dist="gaussian", comb="rank"))
## DeCaceres & Legendre 2013 Oikos example from Fig 1
## Oikos 119: 1674-1684, 2010
## doi: 10.1111/j.1600-0706.2010.18334.x
Y <- c(0, 0, 3, 0, 2, 3, 0, 5, 5, 6, 3, 4)
z <- rep(1:3, each=4)
Z <- allComb(z)
Z <- cbind(Z, 1-Z)
colnames(Z) <- c("1", "2", "3", "2 3", "1 3", "1 2")
Z
try(opticut1(Y, Z=Z))
oc <- ocoptions(check_comb=FALSE, cut=-Inf) # relax the checks
opticut1(Y, Z=Z) # identical results for complementary partitions
ocoptions(oc) # restore defaults

print(opticut1(Y, Z=allComb(z)), cut=-Inf)
print(opticut1(Y, Z=as.factor(z)), cut=-Inf)
## figure example
y <- cbind(
    Spp1=c(4,6,3,5,5,6,3,4,4,1,3,2),
    Spp2=c(0,0,0,0,1,0,0,1,4,2,3,4),
    Spp3=c(0,0,3,0,2,3,0,5,5,6,3,4))
g <-    c(1,1,1,1, 2,2,2,2, 3,3,3,3)
x <- c(0.1,-0.2,1,0,-0.5,-1,0,0.5,0,0.8,-0.3,0.1)
m <- opticut(formula = y ~ 1, strata = g, dist = "poisson")
print(summary(m), cut=-Inf)
plot(m, ylab="", cut=-Inf, sort=FALSE, show_I=FALSE, show_S=FALSE)
## this breaks: bootstrap resamples may not represent all strata
set.seed(1)
try(u <- uncertainty(m, type = "multi"))
## see uncertainty examples
## for more sophisticated ways of dealing with this issue:
## e.g. jackknife
B <- sapply(1:length(g), function(i) which((1:length(g)) != i))
check_strata(m, B) # check representation
u <- uncertainty(m, type="multi", B=B)
summary(u)
bestpart(u)

BCI data

library(vegan)
data(BCI)
library(BiodiversityR)  # available from R version 2.15.1, not older
data(BCI.env)
BCI.soil <- read.delim('http://www.davidzeleny.net/anadat-r/lib/exe/fetch.php?media=data:bci.soil.txt')
BCI.hab <- read.table("http://www.kharms.biology.lsu.edu/TORUS_Habitats.txt",
    sep="\t", header=TRUE)

Binomial

set.seed(1234)
n <- 1000
x0 <- sample(1:4, n, TRUE)
x1 <- ifelse(x0 %in% 1:2, 1, 0)
x2 <- rnorm(n, 0.5, 1)
table(x0,x1)
p1 <- plogis(0.5 + 0.5*x1 + -0.2*x2)
boxplot(p1~x0)
Y1 <- rbinom(n, 1, p1)
p2 <- plogis(0.1 + 0.5*ifelse(x0==4,1,0) + 0.2*x2)
boxplot(p2~x0)
Y2 <- rbinom(n, 1, p2)
Y <- cbind(SPP1=Y1, SPP2=Y2)
X <- model.matrix(~x2)
Z <- allComb(x0)

summary(opticut(Y ~ x2, strata=x0, dist="binomial"))

Poisson: Mite data set -- high performance computing

This example shows computing time differences and plotting options.

library(vegan)
data(mite)
data(mite.env)
mite.env$hab <- with(mite.env, interaction(Shrub, Topo, drop=TRUE))
summary(mod0 <- opticut(as.matrix(mite) ~ SubsDens, mite.env,
    strata=mite.env$hab, dist="poisson", comb="all"))
plot(mod0)

system.time(aa <- opticut(as.matrix(mite) ~ 1, strata=mite.env$hab, dist="poisson", comb="rank"))
system.time(bb <- opticut(as.matrix(mite) ~ 1, strata=mite.env$hab, dist="poisson", comb="all"))

## sequential
system.time(opticut(as.matrix(mite) ~ 1, strata=mite.env$hab, dist="poisson"))
## parallel -- compare system times
library(parallel)
cl <- makeCluster(3)
system.time(opticut(as.matrix(mite) ~ 1, strata=mite.env$hab, dist="poisson", cl=cl))
stopCluster(cl)
## forking -- will not work on Windows
if (!.Platform$OS.type == "windows")
system.time(opticut(as.matrix(mite) ~ 1, strata=mite.env$hab, dist="poisson", cl=3))

Percentages

Dune data, cover type data as ordinal

See http://www.davidzeleny.net/anadat-r/doku.php/en:data:dune

library(vegan)
data(dune)
data(dune.env)
dune.env$manure <- as.integer(dune.env$Manure) - 1
dune.env$moisture <- as.integer(dune.env$Moisture) - 1

oc <- ocoptions(collapse="+", cut=-Inf)

## ordinal regression
## (when nlevels() < 3, use logistic regression instead!)
#Dune <- as.matrix(dune)
#Dune <- Dune[,apply(Dune, 2, function(z) length(unique(z)))>2]
#x1 <- opticut(Dune ~ 1, dune.env, strata=Management, dist="ordered")
#summary(x1)

#plot(x1, mar=c(5,5,3,3))

## Binarizing data
Dune01 <- ifelse(as.matrix(dune)>0,1,0)
x2 <- opticut(Dune01 ~ 1, strata=dune.env$Management, dist="binomial")
summary(x2)
plot(x2)
x3 <- opticut(Dune01 ~ manure + moisture, dune.env, strata=dune.env$Management, dist="binomial")
summary(x3)
plot(x3)

## Beta regression
Dune2 <- as.matrix(dune+0.5) / 10
x4 <- opticut(Dune2 ~ 1, strata=dune.env$Management, dist="beta")
summary(x4)
plot(x4)
x5 <- opticut(Dune2 ~ manure, dune.env, strata=dune.env$Management, dist="beta")
summary(x5)
plot(x5)

xx <- data.frame(#ord0=summary(x1)$summary$split,
    bin0=summary(x2)$summary$split,
    binx=summary(x3)$summary$split,
    bet0=summary(x4)$summary$split,
    betx=summary(x5)$summary$split)
rownames(xx) <- rownames(summary(x2)$summary)
xx
ocoptions(oc)

Varespec data (% cover)

library(vegan)
data(varespec)
data(varechem)
y <- as.matrix(varespec / 100)
range(y[y>0])
y[y <= 0] <- 0.0001
y <- y[,apply(y, 2, max) > 0.05]
varechem$grazing <- as.factor(ifelse(rownames(varechem) %in% c(5,6,7,8,13,14,15,16,
    18,19,20,22,23,24,26), "grazed", "ungrazed"))

x <- opticut(y ~ 1, varechem, strata=grazing, dist="beta")
summary(x)
plot(x, cut=-Inf)

Implementing a zero-inflated Beta (ZI-Beta) regression as a custom distribution (quite unreliable for such a small data set):

zi_beta_fun <- function(Y, X, linkinv, ...) {
    kx <- ncol(X)
    id1 <- Y > 0
    id0 <- !id1
    nll_ZIB_ML <- function(parms) {
        mu <- plogis(X %*% parms[1:kx])
        gamma <- exp(parms[kx + 1]) # precision
        phi <- plogis(parms[kx+2])
        alpha <- mu * gamma
        beta <- (1 - mu) * gamma
        loglik0 <- log(phi)
        loglik1 <- log(1 - phi) + suppressWarnings(dbeta(Y,
            alpha, beta, log = TRUE))
        loglik <- sum(loglik0[id0]) + sum(loglik1[id1])
        if (!is.finite(loglik) || is.na(loglik))
            loglik <- -.Machine$double.xmax^(1/3)
        -loglik
    }
    Yv <- Y
    Yv[Y <= 0.001] <- 0.001
    ini <- c(coef(betareg::betareg(Yv ~ .-1, data=X)), zi=-5)
    X <- as.matrix(X)
    res <- optim(ini, nll_ZIB_ML, ...)
    list(coef=res$par,
        logLik=-res$value,
        linkinv=binomial("logit")$linkinv)
}
y <- as.matrix(varespec / 100)
range(y[y>0])
y <- y[,apply(y, 2, max) > 0.05]
zi_beta_fun(y[,3], data.frame(matrix(1, nrow(y), 1)))
opticut1(y[,1], matrix(1, nrow(y), 1), varechem$grazing, dist=zi_beta_fun)

Stratigraphy example

library(rioja)
data(aber)
strat.plot(aber$spec, aber$ages$Depth, scale.percent=TRUE, y.rev=TRUE)

z <- as.factor(cut(aber$ages$Depth, 5))
ab <- as.matrix(aber$spec) / 100
ab[ab == 0] <- 0.0001
ab <- ab[,apply(ab, 2, max) > 0.05]

a <- opticut(ab ~ 1, strata=z, comb="rank", dist="beta")
summary(a)
plot(a, sort=FALSE, horizontal=FALSE, pos=1)
bp <- bestpart(a)

op <- par(mfrow=c(3,4), mar=c(2,2,1,1))
for (i in 1:12) {
    plot(ab[,i], aber$ages$Depth, type="l", ann=FALSE)
    segments(x0=rep(0, nrow(ab)), y0=aber$ages$Depth, x1=ab[,i],
        col=ifelse(bp[,i] > 0, 2, 1))
    title(main=colnames(ab)[i])
}
par(op)

Presence-only data

With presence-only (used-available) data, the resource selection function (RSF, exponential model) describes relative selection only, whereas the resource selection probability function (RSPF, logistic link) estimates selection probabilities but requires at least one continuous covariate for identifiability. Note also that only single-species models are meaningful here, because the used-available design implies a different underlying distribution for each species.

## presence-only data
## single species model only:
## because the used distr is different for
## each species by definition.

library(ResourceSelection)
## settings
n.used <- 1000
m <- 10
n <- n.used * m
set.seed(1234)
x <- data.frame(x0=as.factor(sample(1:3, n, replace=TRUE)),
    x1=rnorm(n), x2=runif(n))
cfs <- c(1, -0.5, 0.1, -1, 0.5)
## Logistic RSPF model
dd <- simulateUsedAvail(x, cfs, n.used, m, link="logit")

Y <- dd$status
X <- model.matrix(~ x1 + x2, dd)
Z <- allComb(as.integer(dd$x0))

mod1 <- opticut(Y ~ x1 + x2, dd, strata=x0, dist="rsf")
mod1$species
mod2 <- opticut(Y ~ x1 + x2, dd, strata=x0, dist="rspf")
mod2$species

Custom distributions

The dist argument also accepts a function, so other parametric models can be supplied, for example ones that are not built in to avoid heavy package dependencies.
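
As a minimal sketch of the expected interface (inferred from the worked examples below, with an ordinary linear model used purely for illustration), the function takes the response, the model matrix, and the default inverse link, and returns the coefficients, the log-likelihood, and the inverse link function:

## minimal sketch of a custom dist function; the returned list elements
## follow the pattern of the worked examples below (lm used for illustration)
lmfun <- function(Y, X, linkinv, ...) {
    X <- as.matrix(X)
    m <- lm(Y ~ X - 1, ...)
    list(coef=coef(m),
        logLik=logLik(m),
        linkinv=gaussian()$linkinv)
}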

Mixed models

Here is an example using mixed models and the package lme4:

library(lme4)
set.seed(1234)
n <- 200
x0 <- sample(1:4, n, TRUE)
x1 <- ifelse(x0 %in% 1:2, 1, 0)
x2 <- rnorm(n, 0.5, 1)
ee <- rnorm(n/5)
g <- rep(1:5, each=n/5)
lam1 <- exp(0.5 + 0.5*x1 + -0.2*x2 + ee[g])
Y1 <- rpois(n, lam1)

X <- model.matrix(~x2)
Z <- allComb(x0)

lmefun <- function(Y, X, linkinv, gr, ...) {
    X <- as.matrix(X)
    m <- glmer(Y ~ X-1 + (1|gr), family=poisson("log"), ...)
    list(coef=fixef(m),
        logLik=logLik(m),
        linkinv=family(m)$linkinv)
}
lmefun(Y1, X, gr=g)

opticut1(Y1, X, Z, dist=lmefun, gr=g)

Imperfect detectability: N-mixture case

A single-visit N-mixture model is an example where detection error is estimated. Let us compare results based on a naive GLM and the N-mixture model:

library(detect)
set.seed(2345)
n <- 500
x0 <- sample(1:4, n, TRUE)
x1 <- ifelse(x0 %in% 1:2, 1, 0)
x2 <- rnorm(n, 0.5, 1)
x3 <- runif(n, 0, 1)
lam <- exp(0.5 + 1*x1 + -0.2*x2)
p <- plogis(2 + -2*x3)
Y <- rpois(n, lam*p)

X <- model.matrix(~x2)

op <- par(mfrow=c(1,2))
boxplot((lam*p) ~ x0, ylab="lam*p", xlab="x0")
boxplot(lam ~ x0, ylab="lam", xlab="x0")
par(op)

svfun <- function(Y, X, linkinv, ...) {
    X <- as.matrix(X)
    m <- svabu(Y ~ X-1 | x3, ...)
    list(coef=coef(m, "sta"),
        logLik=logLik(m),
        linkinv=poisson()$linkinv)
}
svfun(Y, X)

## naive GLM
print(opticut1(Y, X, as.factor(x0), dist="poisson"), cut=-Inf)
## N-mixture
print(opticut1(Y, X, as.factor(x0), dist=svfun), cut=-Inf)

Sampling differences: using offsets

Not accounting for unequal sampling effort can be quite misleading, especially if effort is related to the habitat classes. This example shows how to take advantage of additional arguments passed through ... in the opticut function.

set.seed(1234)
n <- 50
x0 <- sample(1:4, n, TRUE)
x1 <- ifelse(x0 %in% 1:2, 1, 0)
x2 <- rnorm(n, 0.5, 1)
lam <- exp(0.5 + 1*x1 + -0.2*x2)
A <- ifelse(x0 %in% c(1,3), 1, 2)
Y <- rpois(n, lam*A)

op <- par(mfrow=c(1,2))
boxplot((lam*A) ~ x0, ylab="lam*A", xlab="x0")
boxplot(lam ~ x0, ylab="lam", xlab="x0")
par(op)

## no offset: incorrect
opticut(Y ~ x2, strata=x0, dist="poisson", comb="rank")$species
## with offsets: log Area
opticut(Y ~ x2, strata=x0, dist="poisson", offset=log(A), comb="rank")$species

GAM models

library(mgcv)
library(detect)
data(oven)
oven$veg <- factor(NA, c("agr","open","decid","conif", "mix"))
oven$veg[oven$pforest < 0.5] <- "open"
oven$veg[oven$pagri > 0.5 & oven$pforest < 0.5] <- "agr"
oven$veg[oven$pforest >= 0.5] <- "mix"
oven$veg[oven$pforest >= 0.5 & oven$pdecid >= 0.8] <- "decid"
oven$veg[oven$pforest >= 0.5 & oven$pdecid < 0.2] <- "conif"
table(oven$veg, useNA="always")
oven$xlat <- scale(oven$lat)
oven$xlong <- scale(oven$long)
gamfun <- function(Y, X, linkinv, Data, ...) {
    X <- as.matrix(X)
    m <- mgcv::gam(Y ~ X-1 + s(xlat) + s(xlong), Data, ...)
    list(coef=coef(m),
        logLik=logLik(m),
        linkinv=family(m)$linkinv)
}
x <- ifelse(oven$veg=="agr",1,0)
X <- model.matrix(~x)
gamfun(oven$count, X, Data=oven, family=poisson)
print(opticut1(oven$count, X=X[,1,drop=FALSE], oven$veg, dist=gamfun,
    Data=oven, family=poisson), cut=-Inf)

o <- opticut(count ~ 1, oven, strata=veg, dist=gamfun, Data=oven, family=poisson)
summary(o)
o <- opticut(count ~ 1, oven, strata=veg, dist="poisson")
summary(o)
oven$pa <- ifelse(oven$count > 0, 1, 0)
o <- opticut(pa ~ 1, oven, strata=veg, dist=gamfun, Data=oven, family=binomial)
summary(o)
o <- opticut(pa ~ 1, oven, strata=veg, dist="binomial")
summary(o)

Finding best partitions

It is often useful to access the best binary partition:

set.seed(2345)
n <- 50
x0 <- sample(1:4, n, TRUE)
x1 <- ifelse(x0 %in% 1:2, 1, 0)
x2 <- rnorm(n, 0.5, 1)
lam <- exp(0.5 + 1*x1 + -0.2*x2)
Y <- rpois(n, lam)
o <- opticut(Y ~ x2, strata=x0, dist="poisson", comb="rank")
summary(o)

bp <- bestpart(o)
head(bp)

The model based on the best partition can be returned as:

bestmodel(o, which=1)

The which argument can be used to subset the species.

Uncertainty

Uncertainty in $I$ values might be of interest. The type argument of the uncertainty method can take the following values:

  1. "asymp": B samples are drawn from the asymptotic distribution of the estimates,
  2. "boot": a nonparametric bootstrap with B iterations, keeping the best partition fixed, and
  3. "multi": a nonparametric bootstrap with B iterations in which the best partition is re-evaluated in each iteration.

uc1 <- uncertainty(o, type="asymp", B=5000)
uc2 <- uncertainty(o, type="boot", B=200)
uc3 <- uncertainty(o, type="multi", B=200)

uc1$uncertainty[[1]]
uc2$uncertainty[[1]]
uc3$uncertainty[[1]]

## performance comparisons for 10 species
YYY <- cbind(Y, Y, Y, Y, Y, Y, Y, Y, Y, Y)
colnames(YYY) <- LETTERS[1:10]
o <- opticut(YYY ~ x2, strata=x0, dist="poisson", comb="rank")

library(parallel)
cl <- makeCluster(2)
system.time(uncertainty(o, type="asymp", B=5000))
system.time(uncertainty(o, type="asymp", B=5000, cl=cl))

system.time(uncertainty(o, type="boot", B=100))
system.time(uncertainty(o, type="boot", B=100, cl=cl))

system.time(uncertainty(o, type="multi", B=100))
system.time(uncertainty(o, type="multi", B=100, cl=cl))
stopCluster(cl)

