Identification and monitoring of indicator species has long been considered a useful and cost-effective way to track changes in environmental conditions or in the status of habitat for biota. Applications of indicator species include characterizing habitats or vegetation types, indicating the naturalness or degradation of ecosystems, measuring the success of habitat restoration, alerting about critical community thresholds, aiding survey and monitoring design, and indicating the presence of cryptic or rare species.
A key attribute of indicator species (also referred to as character or differential species) is that they have a strong association with the environment or with whatever factor they are supposed to indicate. Approaches to quantifying the degree of environmental association, or indicator value, of species fall into three major types of algorithms: contingency table based measures, analysis of variance (ANOVA) based comparisons, and non-parametric indices such as IndVal combined with randomization tests.
While these approaches have strong appeal and many applications, they do not always meet the challenges that ecological data present. Ecological data come in different distribution types (typically binary, ordinal, abundance, or presence-only data) whose distributional assumptions might not always be handled properly by these methods. For example, the contingency table approach is based on binary (presence/absence) data or on an arbitrary categorization of abundance data; this effectively causes a loss of information when available abundance data are binarized to meet the analysis framework.
In this regard, the ANOVA approach is better suited to handle abundance data; however, the parametric assumptions of ANOVA imply normality and homoskedastic errors, which are not satisfied in many field situations (e.g. with 0/1, skewed biomass, or percent cover data).
A non-parametric approach such as the IndVal index does not assume normality and can handle data of various distribution types. However, the randomization test used to assess species-environment associations might not always be meaningful for continuous (e.g. biomass or percent cover) or ordinal (e.g. vegetation cover scores) response data.
Besides, none of these approaches are designed to deal with other aspects of field data, for example confounding variables, sample selection bias (presence-only data), differences in sampling effort, or imperfect detection. The issue of confounding variables can be particularly significant in many field situations: if the effects of other natural environmental gradients or regional factors are not controlled for, the assessment of the indicator value of a species for a particular condition of interest can be compromised. For example, the absence or low abundance of a 'good' indicator species at a given site might suggest that the site is in 'bad' condition; however, the species' population status might be related to naturally occurring variability rather than to anthropogenic stressors. Thus, having an option to control for putative confounding factors can improve the assessment of the indicator value of species.
In this paper we introduce opticut (Optimal Cut), an R package developed to address the most common limitations of currently available methods. These limitations are (1) the absence of a common framework that can accommodate the different distributional assumptions of field data, and (2) the lack of an option to control for putative confounding factors and to evaluate how the indicator value of species might change under different conditions. The critical development in opticut to address these limitations is the implementation of a general and extensible likelihood-based framework and multi-model inference for indicator species analysis. Moreover, the opticut R extension package offers computationally efficient algorithms for finding indicator species, and tools for exploring and visualizing the results.
Likelihood-based optimal partitioning for indicator species analysis: finding the best binary partition for each species based on model selection, possibly controlling for modifying/confounding variables as described in Kemencei et al. (2014, Community Ecology 15:180--186). The package also implements various measures of uncertainty based on binary partitions, optimal multinomial partitioning, and exploratory selection indices, with native support for parallel computations.
Why opticut?
$Y_{i}$'s are observations for a single species from $n$ locations ($i = 1, ..., n$). $g_{i}$'s are known discrete descriptors of the locations with $K$ levels ($K > 2$). $z^{(m)}$ is a binary reclassification of $g$ taking values (0, 1). The superscript $m = 1, ..., M$ indexes the possible binary reclassifications out of the $M = 2^{K-1} - 1$ total combinations (excluding complements). See below for options for defining binary partitions. There can also be other site descriptors, denoted $x_{ij}$, taking discrete or continuous values ($j = 1, ..., p$; the number of predictors).
A suitable parametric model describes the relationship between the observations and the site descriptors through the probability density function $P(Y_{i} = y_{i} \mid z_{i}^{(m)}, x_{ij}, \theta)$ where $\theta$ is the vector of model parameters: $\theta = (\beta_{0}, \beta_{1}, \alpha_{1}, ..., \alpha_{p})$. The choice of the parametric model depends on the nature of the observations. It can be Gaussian, Binomial, Poisson, ordinal, Beta regression, or zero-inflated models, with a suitable link function ($f$) for the mean: $f(\eta_{i}) = \beta_{0}^{(m)} + \beta_{1}^{(m)} z_{i}^{(m)} + \sum_{j=1}^{p} \alpha_{j}^{(m)} x_{ij}$.
$\widehat{\theta^{(m)}}$ is the maximum likelihood estimate (MLE) of the model parameters given the data and classification $m$, with corresponding log-likelihood value $l(\widehat{\theta^{(m)}}; y)$. Finding MLEs for all $M$ candidate binary partitions leads to a set of log-likelihood values. One can compare the log-likelihood values to a null model (no binary partition is necessary) where $\beta_{1} = 0$ leading to the MLE $\widehat{\theta^{(0)}}$ and corresponding log-likelihood value for the null model: $l(\widehat{\theta^{(0)}}; y)$.
The log-likelihood ratio for each candidate partition can be calculated as $l(\widehat{\theta^{(m)}}; y) - l(\widehat{\theta^{(0)}}; y)$. The best supported binary partition is the model with the highest log-likelihood ratio value.
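To make this concrete, here is a minimal sketch (using simulated data and plain glm(), not the package internals; all variable names are ours) that fits a Poisson model for each candidate binary partition of a three-level classification and ranks the partitions by their log-likelihood ratio against the null model:

set.seed(1)
g <- factor(rep(LETTERS[1:3], each=20))      # K = 3 strata
y <- rpois(60, exp(0.2 + 1*(g == "A")))      # species associated with stratum A
Z <- list(A=as.numeric(g == "A"),            # M = 2^(3-1) - 1 = 3 partitions
    B=as.numeric(g == "B"),
    C=as.numeric(g == "C"))
ll0 <- logLik(glm(y ~ 1, family=poisson))    # null model (beta_1 = 0)
llr <- sapply(Z, function(z)
    as.numeric(logLik(glm(y ~ z, family=poisson)) - ll0))
llr                     # log-likelihood ratios for the M candidates
names(which.max(llr))   # best supported binary partition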
One way of calculating the indicator value for each candidate partition is based on expected values using the inverse link function: $\mu_{0}^{(m)} = f^{-1}(\beta_{0}^{(m)})$ and $\mu_{1}^{(m)} = f^{-1}(\beta_{0}^{(m)} + \beta_{1}^{(m)})$, giving $I = 1 - min(\mu_{0}^{(m)}, \mu_{1}^{(m)}) / max(\mu_{0}^{(m)}, \mu_{1}^{(m)})$, where $\mu_{0}^{(m)} = E[Y_{i} \mid z_{i}^{(m)}=0, x_{ij}=0]$ and $\mu_{1}^{(m)} = E[Y_{i} \mid z_{i}^{(m)}=1, x_{ij}=0]$ are expected values for the observations given the binary partition $z_{i}^{(m)}$ and a value of 0 for all $x_{ij}$. This approach can be sensitive to the range of values supported by the link function. For example, it works nicely with a logarithmic or logistic link function, where non-negativity of predicted values is ensured by definition. This is, however, not the case for the identity link in the Gaussian case, when negative values can invalidate the indicator value calculation described above. (This usually happens when confounding variables are not centered and the intercept reflects that difference as part of the baseline.)
As an alternative, one can use the estimate $\beta_{1}^{(m)}$ itself to express the contrast between the two strata. This also makes the index more comparable when different link functions are used. We use the hyperbolic tangent function (the inverse of Fisher's $z$ transform) to scale the real-valued $\beta_{1}^{(m)}$ into the unit range (0-1): $I = tanh(\mid \beta_{1}^{(m)} \mid) = \frac{exp(2 \mid \beta_{1}^{(m)} \mid) - 1}{exp(2 \mid \beta_{1}^{(m)} \mid) + 1}$. The absolute value is taken so that the index reflects only the magnitude of the contrast between the strata, not its direction (negative estimates can occur when evaluating all combinations).
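As a small numeric illustration (with hypothetical coefficient values on the log link scale, not output from any model above), both versions of the index can be computed directly:

beta0 <- 0.2
beta1 <- 1.3
mu0 <- exp(beta0)             # f^-1 at z = 0, log link
mu1 <- exp(beta0 + beta1)     # f^-1 at z = 1
1 - min(mu0, mu1) / max(mu0, mu1)   # expected-value based I
tanh(abs(beta1))                    # contrast based I, scaled to (0, 1)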
Finding all combinations does not require a model or observed responses; it only takes a classification vector with $K > 1$ levels.
kComb returns a 'contrast' matrix corresponding to all possible binary partitions of a factor with $K$ levels. Complements are not counted twice, i.e. (0,0,1,1) is equivalent to (1,1,0,0). The number of such possible combinations is $M = 2^{K-1} - 1$.
Get the package and load it:
#devtools::install_github("psolymos/opticut")
#devtools::install("~/repos/opticut")
#devtools::check("~/repos/opticut")
#devtools::build("~/repos/opticut", binary=TRUE)
library(opticut)
kComb(k = 2)
kComb(k = 3)
kComb(k = 4)
allComb takes a classification vector with at least 2 levels and returns a model matrix with binary partitions. checkComb checks if combinations are unique and non-complementary (misfits are returned as attributes).
(f <- rep(LETTERS[1:4], each=2))
(mc <- allComb(f, collapse = "_"))
checkComb(mc)
mc2 <- cbind(z = 1 - mc[,1], mc[,c(1:ncol(mc), 1)])
colnames(mc2) <- 1:ncol(mc2)
mc2
checkComb(mc2)
set.seed(1234)
n <- 200
x0 <- sample(1:4, n, TRUE)          # strata with 4 levels
x1 <- ifelse(x0 %in% 1:2, 1, 0)     # SPP1 prefers strata 1-2
x2 <- rnorm(n, 0.5, 1)              # continuous confounder
table(x0, x1)
lam1 <- exp(0.5 + 0.5*x1 + -0.2*x2)
boxplot(lam1 ~ x0)
Y1 <- rpois(n, lam1)
lam2 <- exp(0.1 + 0.5*ifelse(x0==4, 1, 0) + 0.2*x2)   # SPP2 prefers stratum 4
boxplot(lam2 ~ x0)
Y2 <- rpois(n, lam2)
lam3 <- exp(0.1 + -0.2*x2)          # SPP3 has no stratum preference
boxplot(lam3 ~ x0)
Y3 <- rpois(n, lam3)
Y <- cbind(SPP1=Y1, SPP2=Y2, SPP3=Y3)
X <- model.matrix(~x2)
Z <- allComb(x0)
opticut1(Y1, X, Z, dist="poisson")
opticut1(Y2, X, Z, dist="poisson")
opticut1(Y3, X, Z, dist="poisson")
summary(m <- opticut(Y ~ x2, strata=x0, dist="poisson", comb="all"))
plot(m, cut=-Inf)
The summary lists, for each species, the best supported binary partition (the strata lumped together as the 'indicated' side of the split), the estimated indicator value $I$, the log-likelihood ratio relative to the null model, and the corresponding model weight.
Blindly fitting a model to all possible partitions is a wasteful use of resources. Instead, one can rank the $K$ partitions based on expected response values ($\mu_{1}, ..., \mu_{k}, ..., \mu_{K}$, where $\mu_{k}=E[Y_{i} \mid g_{i}=k, x_{ij}=0]$). This way only $K-1$ partitions have to be explored:
oComb(1:4)
oComb returns the 'contrast' matrix based on the rank vector given as input; rank 1 means the lowest expected value among the partitions. The function rankComb fits the model with multiple ($K > 2$) factor levels to find out the ranking, and returns a binary classification matrix similar to allComb:
head(rc <- rankComb(Y1, model.matrix(~x2), as.factor(x0), dist="poisson"))
attr(rc, "est")
Note that the ranking varies from species to species, thus it is not possible to supply the resulting matrix as the strata definition:
summary(opticut(Y ~ x2, strata=x0, dist="poisson", comb="rank"))
There is an overhead of fitting the model to calculate the ranking, but computational efficiency can still be high compared to exploring all partitions when the number of levels ($K$) is large.
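The following back-of-the-envelope calculation (just counting candidate models, not a timing benchmark) shows how quickly the number of all combinations grows relative to the rank-based approach:

K <- 2:12
data.frame(K=K,
    all_comb=2^(K - 1) - 1,   # models fitted with comb="all"
    rank_comb=K - 1)          # models with comb="rank" (plus one fit for ranking)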
A downside of this approach is that not all possible partitions are explored, thus the model weights do not represent all possible models, only the top candidates. Model weight interpretation is therefore different (i.e. weights cannot be used as a reliability metric, especially when support for the best model is not dominant).
Currently available distributions:

"gaussian": real-valued continuous observations, e.g. biomass;
"poisson": Poisson count data;
"binomial": presence-absence type data;
"negbin": overdispersed Negative Binomial count data;
"beta": continuous response in the unit interval, e.g. percent cover;
"zip", "zip2": zero-inflated Poisson counts (partitioning in the count model: "zip", or in the zero model: "zip2");
"zinb", "zinb2": zero-inflated Negative Binomial counts (partitioning in the count model: "zinb", or in the zero model: "zinb2");
"ordered": response measured on an ordinal scale, e.g. ordinal vegetation cover;
"rsf", "rspf": presence-only data using resource selection functions and resource selection probability functions.

A Gaussian example:

Y <- rnorm(n, log(lam1) + 10, 0.5)
(mod <- opticut(Y ~ x2, strata=x0, dist="gaussian"))
Legendre example
gr <- rep(1:5, each=5)
spp <- cbind(Sp1=rep(c(4,6,5,3,2), each=5),
    Sp2=c(rep(c(8,4,6), each=5), 4,4,2, rep(0,7)),
    Sp3=rep(c(18,2,0,0,0), each=5))
rownames(spp) <- gr
spp
summary(mod <- opticut(spp ~ 1, strata=gr, dist="gaussian", comb="all"))
summary(mod <- opticut(spp ~ 1, strata=gr, dist="gaussian", comb="rank"))
## De Caceres & Legendre Oikos example from Fig 1
## Oikos 119: 1674-1684, 2010
## doi: 10.1111/j.1600-0706.2010.18334.x
Y <- c(0, 0, 3, 0, 2, 3, 0, 5, 5, 6, 3, 4)
z <- rep(1:3, each=4)
Z <- allComb(z)
Z <- cbind(Z, 1-Z)
colnames(Z) <- c("1", "2", "3", "2 3", "1 3", "1 2")
Z
try(opticut1(Y, Z=Z))
oc <- ocoptions(check_comb=FALSE, cut=-Inf)   # relax the checks
opticut1(Y, Z=Z)   # identical results for complementary partitions
ocoptions(oc)      # restore defaults
print(opticut1(Y, Z=allComb(z)), cut=-Inf)
print(opticut1(Y, Z=as.factor(z)), cut=-Inf)
## figure example
y <- cbind(
    Spp1=c(4,6,3,5,5,6,3,4,4,1,3,2),
    Spp2=c(0,0,0,0,1,0,0,1,4,2,3,4),
    Spp3=c(0,0,3,0,2,3,0,5,5,6,3,4))
g <- c(1,1,1,1, 2,2,2,2, 3,3,3,3)
x <- c(0.1,-0.2,1,0,-0.5,-1,0,0.5,0,0.8,-0.3,0.1)
m <- opticut(formula = y ~ 1, strata = g, dist = "poisson")
print(summary(m), cut=-Inf)
plot(m, ylab="", cut=-Inf, sort=FALSE, show_I=FALSE, show_S=FALSE)
## this breaks
set.seed(1)
try(u <- uncertainty(m, type = "multi"))
## see uncertainty examples for more sophisticated ways
## of dealing with this issue: e.g. jackknife
B <- sapply(1:length(g), function(i) which((1:length(g)) != i))
check_strata(m, B)   # check representation
u <- uncertainty(m, type="multi", B=B)
summary(u)
bestpart(u)
BCI data
library(vegan)
data(BCI)
library(BiodiversityR)   # available from R version 2.15.1, not older!
data(BCI.env)
BCI.soil <- read.delim('http://www.davidzeleny.net/anadat-r/lib/exe/fetch.php?media=data:bci.soil.txt')
### BCI.hab <- read.table("http://www.kharms.biology.lsu.edu/TORUS_Habitats.txt", sep="\t", header=TRUE)
set.seed(1234)
n <- 1000
x0 <- sample(1:4, n, TRUE)
x1 <- ifelse(x0 %in% 1:2, 1, 0)
x2 <- rnorm(n, 0.5, 1)
table(x0, x1)
p1 <- plogis(0.5 + 0.5*x1 + -0.2*x2)
boxplot(p1 ~ x0)
Y1 <- rbinom(n, 1, p1)
p2 <- plogis(0.1 + 0.5*ifelse(x0==4, 1, 0) + 0.2*x2)
boxplot(p2 ~ x0)
Y2 <- rbinom(n, 1, p2)
Y <- cbind(SPP1=Y1, SPP2=Y2)
X <- model.matrix(~x2)
Z <- allComb(x0)
summary(opticut(Y ~ x2, strata=x0, dist="binomial"))
The following example shows computing time differences (sequential vs. parallel evaluation) and plotting options:
library(vegan)
data(mite)
data(mite.env)
mite.env$hab <- with(mite.env, interaction(Shrub, Topo, drop=TRUE))
summary(mod0 <- opticut(as.matrix(mite) ~ SubsDens, mite.env,
    strata=mite.env$hab, dist="poisson", comb="all"))
plot(mod0)
system.time(aa <- opticut(as.matrix(mite) ~ 1,
    strata=mite.env$hab, dist="poisson", comb="rank"))
system.time(bb <- opticut(as.matrix(mite) ~ 1,
    strata=mite.env$hab, dist="poisson", comb="all"))
## sequential
system.time(opticut(as.matrix(mite) ~ 1,
    strata=mite.env$hab, dist="poisson"))
## parallel -- compare system times
library(parallel)
cl <- makeCluster(3)
system.time(opticut(as.matrix(mite) ~ 1,
    strata=mite.env$hab, dist="poisson", cl=cl))
stopCluster(cl)
## forking -- will not work on Windows
if (.Platform$OS.type != "windows")
    system.time(opticut(as.matrix(mite) ~ 1,
        strata=mite.env$hab, dist="poisson", cl=3))
See http://www.davidzeleny.net/anadat-r/doku.php/en:data:dune
library(vegan)
data(dune)
data(dune.env)
dune.env$manure <- as.integer(dune.env$Manure) - 1
dune.env$moisture <- as.integer(dune.env$Moisture) - 1
oc <- ocoptions(collapse="+", cut=-Inf)
## ordinal regression
## (when nlevels() < 3, use logistic regression instead!)
#Dune <- as.matrix(dune)
#Dune <- Dune[,apply(Dune, 2, function(z) length(unique(z)))>2]
#x1 <- opticut(Dune ~ 1, dune.env, strata=Management, dist="ordered")
#summary(x1)
#plot(x1, mar=c(5,5,3,3))
## binarizing the data
Dune01 <- ifelse(as.matrix(dune)>0, 1, 0)
x2 <- opticut(Dune01 ~ 1, strata=dune.env$Management, dist="binomial")
summary(x2)
plot(x2)
x3 <- opticut(Dune01 ~ manure + moisture, dune.env,
    strata=dune.env$Management, dist="binomial")
summary(x3)
plot(x3)
## Beta regression
Dune2 <- as.matrix(dune+0.5) / 10
x4 <- opticut(Dune2 ~ 1, strata=dune.env$Management, dist="beta")
summary(x4)
plot(x4)
x5 <- opticut(Dune2 ~ manure, dune.env,
    strata=dune.env$Management, dist="beta")
summary(x5)
plot(x5)
xx <- data.frame(#ord0=summary(x1)$summary$split,
    bin0=summary(x2)$summary$split,
    binx=summary(x3)$summary$split,
    bet0=summary(x4)$summary$split,
    betx=summary(x5)$summary$split)
rownames(xx) <- rownames(summary(x2)$summary)
xx
ocoptions(oc)
library(vegan)
data(varespec)
data(varechem)
y <- as.matrix(varespec / 100)
range(y[y>0])
y[y <= 0] <- 0.0001
y <- y[,apply(y, 2, max) > 0.05]
varechem$grazing <- as.factor(ifelse(rownames(varechem) %in%
    c(5,6,7,8,13,14,15,16,18,19,20,22,23,24,26), "grazed", "ungrazed"))
x <- opticut(y ~ 1, varechem, strata=grazing, dist="beta")
summary(x)
plot(x, cut=-Inf)
Implementing a custom zero-inflated Beta distribution (quite unreliable for such a small data set):
zi_beta_fun <- function(Y, X, linkinv, ...) {
    kx <- ncol(X)
    id1 <- Y > 0
    id0 <- !id1
    nll_ZIB_ML <- function(parms) {
        mu <- plogis(X %*% parms[1:kx])
        gamma <- exp(parms[kx + 1])    # precision
        phi <- plogis(parms[kx + 2])   # zero inflation probability
        alpha <- mu * gamma
        beta <- (1 - mu) * gamma
        loglik0 <- rep(log(phi), length(Y))   # zero part, vectorized over observations
        loglik1 <- log(1 - phi) +
            suppressWarnings(dbeta(Y, alpha, beta, log = TRUE))
        loglik <- sum(loglik0[id0]) + sum(loglik1[id1])
        if (!is.finite(loglik) || is.na(loglik))
            loglik <- -.Machine$double.xmax^(1/3)
        -loglik
    }
    Yv <- Y
    Yv[Y <= 0.001] <- 0.001
    ini <- c(coef(betareg::betareg(Yv ~ .-1, data=X)), zi=-5)
    X <- as.matrix(X)
    res <- optim(ini, nll_ZIB_ML, ...)
    list(coef=res$par, logLik=-res$value,
        linkinv=binomial("logit")$linkinv)
}
y <- as.matrix(varespec / 100)
range(y[y>0])
y <- y[,apply(y, 2, max) > 0.05]
zi_beta_fun(y[,3], data.frame(matrix(1, nrow(y), 1)))
opticut1(y[,1], matrix(1, nrow(y), 1), varechem$grazing, dist=zi_beta_fun)
library(rioja)
data(aber)
strat.plot(aber$spec, aber$ages$Depth, scale.percent=TRUE, y.rev=TRUE)
z <- as.factor(cut(aber$ages$Depth, 5))
ab <- as.matrix(aber$spec) / 100
ab[ab == 0] <- 0.0001
ab <- ab[,apply(ab, 2, max) > 0.05]
a <- opticut(ab ~ 1, strata=z, comb="rank", dist="beta")
summary(a)
plot(a, sort=FALSE, horizontal=FALSE, pos=1)
bp <- bestpart(a)
op <- par(mfrow=c(3,4), mar=c(2,2,1,1))
for (i in 1:12) {
    plot(ab[,i], aber$ages$Depth, type="l", ann=FALSE)
    segments(x0=rep(0, nrow(ab)), y0=aber$ages$Depth,
        x1=ab[,i], col=ifelse(bp[,i] > 0, 2, 1))
    title(main=colnames(ab)[i])
}
par(op)
The "rsf" and "rspf" options differ especially with respect to covariates: resource selection functions (RSF) are estimated only up to a proportionality constant under an exponential model, so the intercept is not identifiable and only relative differences among strata are interpretable; resource selection probability functions (RSPF) estimate absolute selection probabilities, which requires at least one continuous covariate for identifiability.
## presence-only data
## single species model only: the used distribution
## is different for each species by definition
library(ResourceSelection)
## settings
n.used <- 1000
m <- 10
n <- n.used * m
set.seed(1234)
x <- data.frame(x0=as.factor(sample(1:3, n, replace=TRUE)),
    x1=rnorm(n), x2=runif(n))
cfs <- c(1, -0.5, 0.1, -1, 0.5)
## Logistic RSPF model
dd <- simulateUsedAvail(x, cfs, n.used, m, link="logit")
Y <- dd$status
X <- model.matrix(~ x1 + x2, dd)
Z <- allComb(as.integer(dd$x0))
mod1 <- opticut(Y ~ x1 + x2, dd, strata=x0, dist="rsf")
mod1$species
mod2 <- opticut(Y ~ x1 + x2, dd, strata=x0, dist="rspf")
mod2$species
The dist argument accepts a function, so other parametric models can be supplied that are not built in (e.g. ones avoided due to package dependencies). Here is an example using mixed models and the package lme4:
library(lme4)
set.seed(1234)
n <- 200
x0 <- sample(1:4, n, TRUE)
x1 <- ifelse(x0 %in% 1:2, 1, 0)
x2 <- rnorm(n, 0.5, 1)
ee <- rnorm(n/5)
g <- rep(1:5, each=n/5)
lam1 <- exp(0.5 + 0.5*x1 + -0.2*x2 + ee[g])
Y1 <- rpois(n, lam1)
X <- model.matrix(~x2)
Z <- allComb(x0)
lmefun <- function(Y, X, linkinv, gr, ...) {
    X <- as.matrix(X)
    m <- glmer(Y ~ X-1 + (1|gr), family=poisson("log"), ...)
    list(coef=fixef(m), logLik=logLik(m), linkinv=family(m)$linkinv)
}
lmefun(Y1, X, gr=g)
opticut1(Y1, X, Z, dist=lmefun, gr=g)
A single-visit N-mixture model is an example where detection error is estimated. Let us compare results based on a naive GLM and the N-mixture model:
library(detect)
set.seed(2345)
n <- 500
x0 <- sample(1:4, n, TRUE)
x1 <- ifelse(x0 %in% 1:2, 1, 0)
x2 <- rnorm(n, 0.5, 1)
x3 <- runif(n, 0, 1)
lam <- exp(0.5 + 1*x1 + -0.2*x2)
p <- plogis(2 + -2*x3)
Y <- rpois(n, lam*p)
X <- model.matrix(~x2)
op <- par(mfrow=c(1,2))
boxplot((lam*p) ~ x0, ylab="lam*p", xlab="x0")
boxplot(lam ~ x0, ylab="lam", xlab="x0")
par(op)
svfun <- function(Y, X, linkinv, ...) {
    X <- as.matrix(X)
    m <- svabu(Y ~ X-1 | x3, ...)
    list(coef=coef(m, "sta"), logLik=logLik(m), linkinv=poisson()$linkinv)
}
svfun(Y, X)
## naive GLM
print(opticut1(Y, X, as.factor(x0), dist="poisson"), cut=-Inf)
## N-mixture
print(opticut1(Y, X, as.factor(x0), dist=svfun), cut=-Inf)
Not accounting for unequal sampling effort can be quite misleading, especially if effort is related to habitat classes. This example shows how to take advantage of the other arguments passed through ... in the opticut function.
set.seed(1234)
n <- 50
x0 <- sample(1:4, n, TRUE)
x1 <- ifelse(x0 %in% 1:2, 1, 0)
x2 <- rnorm(n, 0.5, 1)
lam <- exp(0.5 + 1*x1 + -0.2*x2)
A <- ifelse(x0 %in% c(1,3), 1, 2)   # unequal sampling effort (area)
Y <- rpois(n, lam*A)
op <- par(mfrow=c(1,2))
boxplot((lam*A) ~ x0, ylab="lam*A", xlab="x0")
boxplot(lam ~ x0, ylab="lam", xlab="x0")
par(op)
## no offset: incorrect
opticut(Y ~ x2, strata=x0, dist="poisson", comb="rank")$species
## with offsets: log Area
opticut(Y ~ x2, strata=x0, dist="poisson", offset=log(A), comb="rank")$species
library(mgcv)
library(detect)
data(oven)
oven$veg <- factor(NA, c("agr","open","decid","conif","mix"))
oven$veg[oven$pforest < 0.5] <- "open"
oven$veg[oven$pagri > 0.5 & oven$pforest < 0.5] <- "agr"
oven$veg[oven$pforest >= 0.5] <- "mix"
oven$veg[oven$pforest >= 0.5 & oven$pdecid >= 0.8] <- "decid"
oven$veg[oven$pforest >= 0.5 & oven$pdecid < 0.2] <- "conif"
table(oven$veg, useNA="always")
oven$xlat <- scale(oven$lat)
oven$xlong <- scale(oven$long)
gamfun <- function(Y, X, linkinv, Data, ...) {
    X <- as.matrix(X)
    m <- mgcv::gam(Y ~ X-1 + s(xlat) + s(xlong), Data, ...)
    list(coef=coef(m), logLik=logLik(m), linkinv=family(m)$linkinv)
}
x <- ifelse(oven$veg=="agr", 1, 0)
X <- model.matrix(~x)
gamfun(oven$count, X, Data=oven, family=poisson)
print(opticut1(oven$count, X=X[,1,drop=FALSE], oven$veg,
    dist=gamfun, Data=oven, family=poisson), cut=-Inf)
o <- opticut(count ~ 1, oven, strata=veg, dist=gamfun, Data=oven, family=poisson)
summary(o)
o <- opticut(count ~ 1, oven, strata=veg, dist="poisson")
summary(o)
oven$pa <- ifelse(oven$count > 0, 1, 0)
o <- opticut(pa ~ 1, oven, strata=veg, dist=gamfun, Data=oven, family=binomial)
summary(o)
o <- opticut(pa ~ 1, oven, strata=veg, dist="binomial")
summary(o)
It is often useful to access the best binary partition for each species:
set.seed(2345)
n <- 50
x0 <- sample(1:4, n, TRUE)
x1 <- ifelse(x0 %in% 1:2, 1, 0)
x2 <- rnorm(n, 0.5, 1)
lam <- exp(0.5 + 1*x1 + -0.2*x2)
Y <- rpois(n, lam)
o <- opticut(Y ~ x2, strata=x0, dist="poisson", comb="rank")
summary(o)
bp <- bestpart(o)
head(bp)
The model based on the best partition can be returned as:
bestmodel(o, which=1)
The which argument can be used to subset the species.
Uncertainty in $I$ values might be of interest.
The type argument for the uncertainty method can take the following values:
"asymp"
: asymptotic distribution of $I$, $\mu_{0}$ and $\mu_{1}$
based on best partition found for the input object."boot"
: non-parametric bootstrap distribution of $I$, $\mu_{0}$ and $\mu_{1}$
based on best partition found for the input object."multi"
: non-parametric bootstrap distribution of $I$, $\mu_{0}$ and $\mu_{1}$
based on best partition found for the bootstrap data (i.e.
the model ranking is re-evaluated each time).uc1 <- uncertainty(o, type="asymp", B=5000) uc2 <- uncertainty(o, type="boot", B=200) uc3 <- uncertainty(o, type="multi", B=200) uc1$uncertainty[[1]] uc2$uncertainty[[1]] uc3$uncertainty[[1]] ## performance comparisons for 10 species YYY <- cbind(Y, Y, Y, Y, Y, Y, Y, Y, Y, Y) colnames(YYY) <- LETTERS[1:10] o <- opticut(YYY ~ x2, strata=x0, dist="poisson", comb="rank") library(parallel) cl <- makeCluster(2) system.time(uncertainty(o, type="asymp", B=5000)) system.time(uncertainty(o, type="asymp", B=5000, cl=cl)) system.time(uncertainty(o, type="boot", B=100)) system.time(uncertainty(o, type="boot", B=100, cl=cl)) system.time(uncertainty(o, type="multi", B=100)) system.time(uncertainty(o, type="multi", B=100, cl=cl)) stopCluster(cl)