abdiv: Alpha and Beta Diversity Measures

#' Berger-Parker dominance
#'
#' The Berger-Parker dominance is the proportion of the most abundant species.
#'
#' @param x A numeric vector of species counts or proportions.
#' @return The Berger-Parker dominance, \eqn{0 < D_{BP} \leq 1}. If the vector
#'   sums to zero, the Berger-Parker dominance is undefined, and we return
#'   \code{NaN}.
#' @details
#' \itemize{
#'   \item Equivalent to \code{berger_parker_d()} in
#'     \code{skbio.diversity.alpha}.
#'   \item Equivalent to the \code{bergerparker} calculator in Mothur.
#' }
#' @references
#' Berger WH, Parker FL. Diversity of Planktonic Foraminifera in Deep-Sea
#' Sediments. Science. 1970;168(3937):1345-1347.
#' @examples
#' x <- c(15, 6, 4, 0, 3, 0)
#' berger_parker_d(x) # 15 / 28
#' @export
berger_parker_d <- function (x) {
  max(x) / sum(x)
}

#' Simpson's index and related measures
#'
#' These measures are based on the sum of squared species proportions. The
#' function \code{dominance()} gives this quantity, \code{simpson()} gives one
#' minus this quantity, \code{invsimpson()} gives the reciprocal of the
#' quantity, and \code{simpson_e} gives the reciprocal divided by the number
#' of species.
#'
#' @param x A numeric vector of species counts or proportions.
#' @return The value of the dominance (\eqn{0 < D \leq 1}), Simpson index, or
#'   inverse Simpson index. The dominance is undefined if the vector sums to
#'   zero, in which case we return \code{NaN}.
#' @details
#' For a vector of species counts \code{x}, the dominance index is defined as
#' \deqn{D = \sum_i p_i^2,} where \eqn{p_i} is the species proportion,
#' \eqn{p_i = x_i / N}, and \eqn{N} is the total number of counts. This is
#' equal to the probability of selecting two individuals from the same species,
#' with replacement. Relation to other definitions:
#' \itemize{
#'   \item Equivalent to \code{dominance()} in \code{skbio.diversity.alpha}.
#'   \item Similar to the \code{simpson} calculator in Mothur. They use the
#'     unbiased estimate \eqn{p_i = x_i (x_i - 1) / (N (N -1))}.
#' }
#'
#' Simpson's index is defined here as \eqn{1 - D}, or the probability of
#' selecting two individuals from different species, with replacement. Relation
#' to other definitions:
#' \itemize{
#'   \item Equivalent to \code{diversity()} in \code{vegan} with
#'     \code{index = "simpson"}.
#'   \item Equivalent to \code{simpson()} in \code{skbio.diversity.alpha}.
#' }
#'
#' The inverse Simpson index is \eqn{1/D}. Relation to other definitions:
#' \itemize{
#'   \item Equivalent to \code{diversity()} in \code{vegan} with
#'     \code{index = "invsimpson"}.
#'   \item Equivalent to \code{enspie()} in \code{skbio.diversity.alpha}.
#'   \item Similar to the \code{invsimpson} calculator in Mothur. They use
#'     the unbiased estimate \eqn{p_i = x_i (x_i - 1) / (N (N -1))}.
#' }
#'
#' Simpson's evenness index is the inverse Simpson index divided by the
#' number of species observed, \eqn{1 / (D S)}. Relation to other definitions:
#' \itemize{
#'   \item Equivalent to \code{simpson_e()} in \code{skbio.diversity.alpha}.
#' }
#'
#' Please be warned that the naming conventions vary between sources. For
#' example Wikipedia calls \eqn{D} the Simpson index and \eqn{1 - D} the
#' Gini-Simpson index. We have followed the convention from \code{vegan}, to
#' avoid confusion within the \code{R} ecosystem.
#' @examples
#' x <- c(15, 6, 4, 0, 3, 0)
#' dominance(x)
#'
#' # Simpson is 1 - D
#' simpson(x)
#' 1 - dominance(x)
#'
#' # Inverse Simpson is 1/D
#' invsimpson(x)
#' 1 / dominance(x)
#'
#' # Simpson's evenness is 1 / (D * S)
#' simpson_e(x)
#' 1 / (dominance(x) * richness(x))
#' @export
simpson <- function (x) {
  p <- x / sum(x)
  1 - sum(p ** 2)
}

#' @rdname simpson
#' @export
dominance <- function (x) {
  p <- x / sum(x)
  sum(p ** 2)
}

#' @rdname simpson
#' @export
invsimpson <- function (x) {
  p <- x / sum(x)
  1 / sum(p ** 2)
}

#' @rdname simpson
#' @export
simpson_e <- function (x) {
  p <- x / sum(x)
  D <- sum(p ** 2)
  S <- sum(x > 0)
  1 / (D * S)
}

#' Kempton-Taylor Q index
#'
#' The Kempton-Taylor Q index is designed to measure species in the middle of
#' the abundance distribution.
#'
#' @param x A numeric vector of species counts or proportions.
#' @param lower_quantile,upper_quantile Lower and upper quantiles of the
#'   abundance distribution. Default values are the ones suggested by Kempton
#'   and Taylor.
#' @return The Kempton-Taylor Q index, \eqn{Q < 0}. If the vector sums to zero,
#'   we cannot compute the quantiles, and this index is undefined. In that
#'   case, we return \code{NaN}.
#' @details
#' For a vector of species counts \code{x}, the Kempton-Taylor Q statistic is
#' equal to the slope of the cumulative abundance curve across a specified
#' quantile range. The cumulative abundance curve is the plot of the number of
#' species against the log-abundance.
#'
#' Kempton and Taylor originally defined the index as
#' \deqn{Q = \frac{\frac{1}{2}S}{\log{R_2} - \log{R_1}},} where \eqn{S} is the
#' total number of species observed, \eqn{R_1} is the abundance at the lower
#' quantile, and \eqn{R_2} is the abundance at the upper quantile. However,
#' this definition only holds if one uses the interquartile range. Because we
#' allow the user to adjust the upper and lower quantiles, we have to find the
#' number of species at these abundance values. Here, we follow the
#' implementation in \code{scikit-bio} and round inwards to find the quantile
#' values, taking the number of species and log-abundance values at these data
#' points exactly.
#'
#' \itemize{
#'   \item Equivalent to \code{kempton_taylor_q()} in
#'     \code{skbio.diversity.alpha}.
#'   \item Similar to the \code{qstat} calculator in Mothur. Our implementation
#'     differs slightly, and this difference affects the result.
#' }
#' @references
#' Kempton RA, Taylor LR. Models and statistics for species diversity. Nature.
#' 1976;262:818-820.
#' @export
kempton_taylor_q <- function (x, lower_quantile=0.25, upper_quantile=0.75) {
  # I'm sure there is a better way to do this with R's quantile function,
  # but not sure how to guarantee that the result always replicates the one
  # obtained via this algorithm.
  # !!!! The scikit-bio implementation seems to not match the paper here.
  # Must check this and make sure the answers line up with their results.
  n <- length(x)
  lower_idx <- ceiling(n * lower_quantile) + 1
  upper_idx <- floor(n * upper_quantile) + 1
  x_sorted <- sort(x)
  x_upper <- x_sorted[upper_idx]
  x_lower <- x_sorted[lower_idx]
  (upper_idx - lower_idx) / log(x_upper / x_lower)
}

#' Margalef's richness index
#'
#' @param x A numeric vector of species counts.
#' @return The value of Margalef's index, \eqn{D \geq 0}. This index is
#'   undefined when the total number of counts is 1 or 0, in which case we
#'   return \code{NaN}.
#' @details
#' For a vector \code{x} of species counts, Margalef's index is
#' \deqn{D = \frac{S -1}{\log N},} where \eqn{S} is the total number of species
#' observed and \eqn{N} is the total number of counts.
#'
#' This index is appropriate only for raw counts, not transformed counts or
#' proportions.
#'
#' Equivalent to \code{margalef()} in \code{skbio.diversity.alpha}.
#' @references
#' Margalef R. Information theory in ecology. General Systems 3. 1958;36-71.
#' @examples
#' x <- c(15, 6, 4, 0, 3, 0)
#' margalef(x)
#' @export
margalef <- function (x) {
  # Margalef is based on the slope of the species-area curve as proposed by
  # Gleason (1922), where the number of species increases with the log of the
  # area. Here, the number of individuals, n, is a stand-in for the area. The
  # equation is:
  #   s = s0 + z * log(N)
  # When the number of individuals, n, is one, then z * log(N) is zero. In this
  # case, s is one by definition, so s0 must be one.  Making this substitution
  # and solving for z, we get
  #   z = (s - 1) / log(N)
  # When we only observe one individual, we don't know the slope. Numerically,
  # we encounter zero divided by zero, so we expect NaN as a result. This is
  # what R returns.
  # When we observe no individuals, we also dont't know the slope. Numerically,
  # we encounter negative one over negative infinity, which R interprets as
  # zero. However, if we return to the original equation, we see that z must be
  # a number that multiplies negative infinity to produce negative one. Zero
  # does not fulfill this role, and is therefore not an acceptable answer.
  #   0 = 1 + z * (-Inf)
  # The quantity z is undefined at n = 1, and we also consider it to be
  # undefined at n = 0. Therefore, we take special care to return NaN when
  # n = 0.
  s <- sum(x > 0)
  n <- sum(x)
  if (n == 0) {
    NaN
  } else {
    (s - 1) / log(n)
  }
}

#' McIntosh dominance index D
#' @param x A numeric vector of species counts.
#' @return The McIntosh dominance index, \eqn{0 \leq D < 1}. The index is undefined
#'   when the total number of counts is 1 or 0, in which case we return
#'   \code{NaN}.
#' @details
#' For a vector \code{x} of raw species counts, the McIntosh dominance index is
#' defined as \deqn{D = \frac{N - U}{N - \sqrt{N}},} where \eqn{N} is the total
#' number of counts and \eqn{U = \sqrt{\sum_i x_i^2}}.
#'
#' This index is appropriate only for raw counts, not transformed counts or
#' proportions.
#'
#' Equivalent to \code{mcintosh_d()} in \code{skbio.diversity.alpha}.
#' @references
#' McIntosh RP. An index of diversity and the relation of certain concepts to
#' diversity. Ecology. 1967;48:1115-1126.
#' @examples
#' x <- c(15, 6, 4, 0, 3, 0)
#' mcintosh_d(x)
#' @export
mcintosh_d <- function (x) {
  # Equation 4
  n <- sum(x)
  u <- sqrt(sum(x ^ 2))
  (n - u) / (n - sqrt(n))
}

#' McIntosh's evenness measure E
#' @param x A numeric vector of species counts.
#' @return McIntosh's evenness measure, \eqn{0 < E \leq 1}.  The index is
#'   undefined when the total number of counts is 0, in which case we return
#'   \code{NaN}.
#' @details
#' For a vector \code{x} of raw species counts, the McIntosh evenness measure
#' is \deqn{E = \frac{\sqrt{\sum_i x_i^2}}{\sqrt{(N - S + 1)^2 + S - 1},}}
#' where \eqn{N} is the total number of counts and \eqn{S} is the total
#' number of species observed.
#'
#' This index is appropriate only for raw counts, not transformed counts or
#' proportions.
#'
#' Equivalent to \code{mcintosh_e()} in \code{skbio.diversity.alpha}.
#' @references
#' Heip C, Engels P. Comparing Species Diversity and Evenness Indices. J. Mar.
#' Bioi. Ass. U.K. 1974;54:559-563.
#' @examples
#' x <- c(15, 6, 4, 0, 3, 0)
#' mcintosh_e(x)
#' @export
mcintosh_e <- function (x) {
  n <- sum(x)
  s <- sum(x > 0)
  numerator <- sqrt(sum(x ^ 2))
  denominator <- sqrt((n - s + 1) ^ 2 + s - 1)
  numerator / denominator
}

#' Menhinick's richness index
#' @param x A numeric vector of species counts.
#' @return Menhinick's richness index, \eqn{R > 0}. The index is undefined when
#'   the total number of counts is 0, in which case we return \code{NaN}.
#' @details
#' For a vector \code{x} of raw species counts, the Menhinick's richness index
#' is \eqn{\frac{S}{\sqrt{N}}}, where \eqn{N} is the total number
#' of counts and \eqn{S} is the total number of species observed.
#'
#' This index is appropriate only for raw counts, not transformed counts or
#' proportions.
#'
#' Equivalent to \code{menhinick()} in \code{skbio.diversity.alpha}.
#' @examples
#' x <- c(15, 6, 4, 0, 3, 0)
#' menhinick(x)
#' @export
menhinick <- function (x) {
  n <- sum(x)
  s <- sum(x > 0)
  s / sqrt(n)
}

#' Richness or number of observed species
#' @param x A numeric vector of species counts or proportions.
#' @return The number of species observed, \eqn{R \geq 0}.
#' @details The richness is simply the number of nonzero elements in \code{x}.
#' Relation to other definitions:
#' \itemize{
#'   \item Equivalent to \code{observed_otus()} in \code{skbio.diversity.alpha}.
#'   \item Equivalent to \code{specnumber} in \code{vegan}.
#'   \item Equivalent to the \code{sobs} calculator in Mothur.
#' }
#' @examples
#' x <- c(15, 6, 4, 0, 3, 0)
#' richness(x) # 4
#' @export
richness <- function (x) {
  sum(x > 0)
}

#' Shannon diversity and related measures
#'
#' The Shannon index of diversity
#'
#' @param x A numeric vector of species counts or proportions.
#' @param base Base of the logarithm to use in the calculation.
#' @return The Shannon diversity, \eqn{H \geq 0}, or related quantity. The
#'   value of \eqn{H} is undefined if \code{x} sums to zero, and we return
#'   \code{NaN} in this case.  Heip's evenness measure and Pielou's Evenness
#'   index are undefined if only one element of \code{x} is nonzero, and again
#'   we return \code{NaN} if this is the case.
#' @details
#' The Shannon index of diversity or Shannon information entropy has deep roots
#' in information theory. It is defined as \deqn{H = - \sum_i p_i \log{p_i},}
#' where \eqn{p_i} is the species proportion. Relation to other definitions:
#' \itemize{
#'   \item Equivalent to \code{diversity()} in \code{vegan} with
#'     \code{index = "shannon"}.
#'   \item Equivalent to \code{shannon()} in \code{skbio.diversity.alpha}.
#' }
#'
#' The Brillouin index (Brillouin 1956) is similar to Shannon's index, but
#' accounts for sampling without replacement. For a vector of species counts
#' \code{x}, the Brillouin index is
#' \deqn{
#'   \frac{1}{N}\log{\frac{N!}{\prod_i x_i!}} =
#'   \frac{\log{N!} - \sum_i \log{x_i!}}{N}
#' } where \eqn{N} is the total number of counts. Relation to other definitions:
#' \itemize{
#'   \item Equivalent to \code{brillouin_d()} in \code{skbio.diversity.alpha}.
#'   \item Equivalent to the \code{shannon} calculator in Mothur.
#' }
#'
#' The Brillouin index accounts for the total number of individuals sampled,
#' and should be used on raw count data, not proportions.
#'
#' Heip's evenness measure is \deqn{\frac{e^H - 1}{S - 1},} where \eqn{S} is
#' the total number of species observed. Relation to other definitions:
#' \itemize{
#'   \item Equivalent to \code{heip_e()} in \code{skbio.diversity.alpha}.
#' }
#'
#' Pielou's Evenness index \eqn{J = H / \log{S}}. Relation to other
#' definitions:
#' \itemize{
#'   \item Equivalent to \code{peilou_e()} in \code{skbio.diversity.alpha}.
#' }
#'
#' @references
#' Brillouin L. Science and Information Theory. 1956;Academic Press, New York.
#'
#' Pielou EC. The Measurement of Diversity in Different Types of Biological
#' Collections. Journal of Theoretical Biology. 1966;13:131-144.
#' @examples
#' x <- c(15, 6, 4, 0, 3, 0)
#' shannon(x)
#'
#' # Using a different base is the same as dividing by the log of that base
#' shannon(x, base = 10)
#' shannon(x) / log(10)
#'
#' brillouin_d(x)
#'
#' # Brillouin index should be almost identical to Shannon index for large N
#' brillouin_d(10000 * x)
#' shannon(10000 * x)
#'
#' heip_e(x)
#' (exp(shannon(x)) - 1) / (richness(x) - 1)
#'
#' pielou_e(x)
#' shannon(x) / log(richness(x))
#' @export
shannon <- function (x, base=exp(1)) {
  p <- x / sum(x)
  # By convention, 0 * log(0) = 0
  p_is_defined_and_zero <- (p == 0) %in% TRUE
  p_logp <- ifelse(p_is_defined_and_zero, 0, p * log(p, base=base))
  -sum(p_logp)
}

#' @rdname shannon
#' @export
brillouin_d <- function (x) {
  n <- sum(x)
  nz <- x[x > 0]
  (lfactorial(n) - sum(lfactorial(nz))) / n
}

#' @rdname shannon
#' @export
heip_e <- function (x) {
  s <- sum(x > 0)
  h <- shannon(x)
  (exp(h) - 1) / (s - 1)
}

#' @rdname shannon
#' @export
pielou_e <- function (x) {
  h <- shannon(x)
  s <- sum(x > 0)
  h / log(s)
}

#' Strong's dominance index
#'
#' Strong's dominance index measures the maximum departure between the observed
#' proportions and a perfectly even community.
#' @param x A numeric vector of species counts.
#' @return Strong's dominance index, \eqn{0 \leq D_W < 1}. The index is
#'   undefined if \code{x} sums to 0, and we return \code{NaN} in this case.
#' @details
#' Strong's dominance index is defined as
#' \deqn{D_W = \max_i \left [ \frac{b_i}{N} - \frac{i}{S} \right ],} where
#' \eqn{b_i} is the abundance of the \eqn{i}th species, ordered from smallest
#' to largest, \eqn{N} is the total number of counts, and \eqn{S} is the number
#' of species observed.
#'
#' Equivalent to \code{strong()} in \code{skbio.diversity.alpha}.
#' @references
#' Strong WL. Assessing species abundance unevenness within and between plant
#' communities. Community Ecology. 2002;3:237-246.
#' @examples
#' x <- c(9, 0, 1, 2, 5, 2, 1, 1, 0, 7, 2, 1, 0, 1, 1)
#' strong(x)
#' @export
strong <- function (x) {
  n <- sum(x)
  s <- sum(x > 0)
  sorted_sum <- cumsum(sort(x, decreasing = TRUE))
  idx <- seq_along(x)
  max((sorted_sum / n) - (idx / s))
}

# Scikit-bio notes
# ace not implemented, TODO
# berger_parker_d implemented
# brillouin_d implemented
# chao1 not implemented, out of scope
# dominance implemented
# doubles not implemented, out of scope
# enspie implemented as invsimpson
# etsy_ci not implemented, out of scope
# faith_pd implemented
# fisher_alpha not implemented, TODO
# gini_index not implemented, TODO
# goods_coverage not implemented, out of scope
# heip_e implemented
# kempton_taylor_q implemented, but needs work
# lladser_ci not implemented, out of scope
# lladser_pe not implemented
# margalef implemented
# mcintosh_d implemented
# mcintosh_e implemented
# menhinick implemented
# michaelis_menten_fit not implemented
# observed_otus implemented as richness
# osd (obs. OTUs, singletons, doubletons) not implemented, out of scope
# pielou_e implemented
# robbins not implemented, out of scope
# shannon implemented
# simpson implemented
# simpson_e implemented
# singles not implemented, out of scope
# strong implemented

# Mothur notes
# ## Community richness
# sobs implemented as richness
# chao not implemented, out of scope
# ace not implemented, out of scope
# jack not implemented, out of scope
# bootstrap not implemented, out of scope
# ## Community evenness
# simpsoneven need to look at implementation, TODO
# shannoneven need to look at implementation, TODO
# heip need to look at implementation, TODO
# smithwilson need to look at implementation, TODO
# ## Community diversity
# bergerparker implemented as berger_parker_d
# shannon implemented
# npshannon not implemented, TODO
# simpson implemented (biased version) as dominance
# invsimpson implemented (biased version)
# coverage (aka Good's coverage) not implemented, out of scope
# qstat implemented as kempton_taylor_q, implementation does not match
# ## Estimates of number of additional OTUs
# boneh not implemented, out of scope
# efron not implemented, out of scope
# shen not implemented, out of scope
# solow not implemented, out of scope