Nothing
#' Euclidean and related distances
#'
#' These distance and diversity measures are mathematically similar to the
#' Euclidean distance between two vectors.
#'
#' @param x,y Numeric vectors
#' @return The distance between \code{x} and \code{y}. The chord distance,
#' Hellinger distance, and geodesic metric are not defined if all elements
#' of either vector are zero. We return \code{NaN} in this case.
#' @details
#' For vectors \code{x} and \code{y}, the Euclidean distance is defined as
#' \deqn{d(x, y) = \sqrt{\sum_i (x_i - y_i) ^ 2}.}
#' Relation of \code{euclidean()} to other definitions:
#' \itemize{
#' \item Equivalent to R's built-in \code{dist()} function with
#' \code{method = "euclidean"}.
#' \item Equivalent to \code{vegdist()} with \code{method = "euclidean"}.
#' \item Equivalent to the \code{euclidean()} function in
#' \code{scipy.spatial.distance}.
#' \item Equivalent to the \code{structeuclidean} calculator in Mothur, to
#' \code{speciesprofile} if \code{x} and \code{y} are transformed to
#' relative abundance, and to \code{memeuclidean} if \code{x} and \code{y}
#' are transformed to presence/absence.
#' \item Equivalent to \eqn{D_1}{D_1} in Legendre & Legendre.
#' \item Equivalent to the \emph{distance between species profiles},
#' \eqn{D_{18}}{D_18} in Legendre & Legendre if \code{x} and \code{y} are
#' transformed to relative abundance.
#' }
#'
#' The \emph{root-mean-square} distance or \emph{average} distance is similar
#' to Euclidean distance. As the name implies, it is computed as the square
#' root of the mean of the squared differences between elements of \code{x}
#' and \code{y}: \deqn{d(x, y) = \sqrt{\frac{1}{n} \sum_i^n (x_i - y_i) ^ 2}.}
#' Relation of \code{rms_distance()} to other definitions:
#' \itemize{
#' \item Equivalent to \eqn{D_2}{D_2} in Legendre & Legendre.
#' }
#'
#' The \emph{chord} distance is the Euclidean distance after scaling each
#' vector by its root sum of squares, \eqn{\sqrt{\sum_i x_i^2}}. The chord
#' distance between any two vectors ranges from 0 to \eqn{\sqrt{2}}{sqrt(2)}.
#' Relation of \code{chord()} to other definitions:
#' \itemize{
#' \item Equivalent to \eqn{D_3}{D_3} in Legendre & Legendre.
#' }
#'
#' The \emph{Hellinger} distance is equal to the chord distance computed after
#' a square-root transformation. Relation of \code{hellinger()} to other
#' definitions:
#' \itemize{
#' \item Equivalent to \eqn{D_{17}}{D_17} in Legendre & Legendre.
#' \item Equivalent to the \code{hellinger} calculator in Mothur.
#' }
#'
#' The \emph{geodesic metric} is a transformed version of the chord distance.
#' \deqn{d(x, y) = \textrm{arccos} \left(1 - \frac{d_c^2(x, y)}{2} \right),}
#' where \eqn{d_c} is the chord distance. It gives the length of the arc on a
#' hypersphere between the vectors, if the vectors are normalized to unit
#' length. Relation of \code{geodesic_metric()} to other definitions:
#' \itemize{
#' \item Equivalent to \eqn{D_4}{D_4} in Legendre & Legendre.
#' }
#' @examples
#' x <- c(15, 6, 4, 0, 3, 0)
#' y <- c(10, 2, 0, 1, 1, 0)
#' euclidean(x, y)
#' # The "distance between species profiles"
#' euclidean(x / sum(x), y / sum(y))
#' rms_distance(x, y)
#' chord(x, y)
#' hellinger(x, y)
#' # Hellinger is chord distance after square root transform
#' chord(sqrt(x), sqrt(y))
#' geodesic_metric(x, y)
#'
#' # No species in common with x
#' v <- c(0, 0, 0, 5, 0, 5)
#' chord(v, x)
#' sqrt(2)
#' @export
euclidean <- function (x, y) {
sqrt(sum((y - x) ^ 2))
}
#' @rdname euclidean
#' @export
rms_distance <- function (x, y) {
sqrt(mean((y - x) ^ 2))
}
#' @rdname euclidean
#' @export
chord <- function (x, y) {
x <- x / sqrt(sum(x ^ 2))
y <- y / sqrt(sum(y ^ 2))
euclidean(x, y)
}
#' @rdname euclidean
#' @export
hellinger <- function (x, y) {
x <- x / sum(x)
y <- y / sum(y)
chord(sqrt(x), sqrt(y))
}
#' @rdname euclidean
#' @export
geodesic_metric <- function (x, y) {
acos(1 - (chord(x, y) ^ 2) / 2)
}
#' Kullback-Leibler divergence
#'
#' @param x,y Numeric vectors representing probabilities
#' @return The Kullback-Leibler divergence between \code{x} and \code{y}. We
#' adopt the following conventions if elements of \code{x} or \code{y} are
#' zero: \eqn{0 \log (0 / y_i) = 0}, \eqn{0 \log (0 / 0) = 0}, and
#' \eqn{x_i \log (x_i / 0) = \infty}. As a result, if elements of \code{x} are
#' zero, they do not contribute to the sum. If elements of \code{y} are zero
#' where \code{x} is nonzero, the result will be \code{Inf}. If either
#' \code{x} or \code{y} sum to zero, we are not able to compute the
#' proportions, and we return \code{NaN}.
#' @details
#' Kullback-Leibler divergence is a non-symmetric measure of difference between
#' two probability vectors. In general, KL(x, y) is not equal to KL(y, x).
#'
#' Because this measure is defined for probabilities, the vectors x and y are
#' normalized in the function so they sum to 1.
#' @export
kullback_leibler_divergence <- function (x, y) {
x <- x / sum(x)
y <- y / sum(y)
terms <- x * log(x / y)
# By convention, 0 * log(0 / y) = 0
x_is_defined_and_zero <- (x == 0) %in% TRUE
sum(ifelse(x_is_defined_and_zero, 0, terms))
}
#' Manhattan and related distances
#'
#' The Manhattan or city block distance is the sum of absolute differences
#' between the elements of two vectors. The \emph{mean character} difference
#' is a closely related measure.
#'
#' @param x,y Numeric vectors
#' @return The distance between \code{x} and \code{y}. The modified mean
#' character difference is undefined if all elements in \code{x} and \code{y}
#' are zero, in which case we return \code{NaN}.
#' @details
#' For vectors \code{x} and \code{y}, the Manhattan distance is given by
#' \deqn{d(x, y) = \sum_i |x_i - y_i|.} Relation of \code{manhattan()} to
#' other definitions:
#' \itemize{
#' \item Equivalent to R's built-in \code{dist()} function with
#' \code{method = "manhattan"}.
#' \item Equivalent to \code{vegdist()} with \code{method = "manhattan"}.
#' \item Equivalent to the \code{cityblock()} function in
#' \code{scipy.spatial.distance}.
#' \item Equivalent to the \code{manhattan} calculator in Mothur.
#' \item Equivalent to \eqn{D_7}{D_7} in Legendre & Legendre.
#' \item Whittaker's index of association (\eqn{D_9}{D_9} in Legendre &
#' Legendre) is the Manhattan distance computed after transforming to
#' proportions and dividing by 2.
#' }
#'
#' The mean character difference is the Manhattan distance divided by the
#' length of the vectors. It was proposed by Cain and Harrison in 1958.
#' Relation of \code{mean_character_difference()} to other definitions:
#' \itemize{
#' \item Equivalent to \eqn{D_8} in Legendre & Legendre.
#' \item For binary data, equivalent to \eqn{1 - S_1} in Legendre & Legendre,
#' where \eqn{S_1} is the simple matching coefficient.
#' }
#'
#' The modified mean character difference is the Manhattan distance divided by
#' the number elements where either \code{x} or \code{y} (or both) are nonzero.
#' Relation of \code{modified_mean_character_difference()} to other
#' definitions:
#' \itemize{
#' \item Equivalent to \eqn{D_{19}} in Legendre & Legendre.
#' \item Equivalent to \code{vegdist()} with \code{method = "altGower"}.
#' \item For binary data, it is equivalent to the Jaccard distance.
#' }
#' @references
#' Cain AJ, Harrison GA. An analysis of the taxonomist's judgment of affinity.
#' Proceedings of the Zoological Society of London 1958;131:85-98.
#' @examples
#' x <- c(15, 6, 4, 0, 3, 0)
#' y <- c(10, 2, 0, 1, 1, 0)
#' manhattan(x, y)
#' # Whittaker's index of association
#' manhattan(x / sum(x), y / sum(y)) / 2
#'
#' mean_character_difference(x, y)
#' # Simple matching coefficient for presence/absence data
#' # Should be 2 / 6
#' mean_character_difference(x > 0, y > 0)
#'
#' modified_mean_character_difference(x, y)
#' # Jaccard distance for presence/absence data
#' modified_mean_character_difference(x > 0, y > 0)
#' jaccard(x, y)
#' @export
manhattan <- function (x, y) {
sum(abs(y - x))
}
#' @rdname manhattan
#' @export
mean_character_difference <- function (x, y) {
manhattan(x, y) / length(x)
}
#' @rdname manhattan
#' @export
modified_mean_character_difference <- function (x, y) {
pp <- sum((x > 0) | (y > 0))
manhattan(x, y) / pp
}
#' Canberra and related distances
#'
#' The Canberra distance and Clark's coefficient of divergence are measures
#' that use the absolute difference over the sum for each element of the
#' vectors.
#'
#' @param x,y Numeric vectors
#' @return The Canberra distance or Clark's coefficient of divergence. If every
#' element in \code{x} and \code{y} is zero, Clark's coefficient of
#' divergence is undefined, and we return \code{NaN}.
#' @details
#' For vectors \code{x} and \code{y}, the Canberra distance is defined as
#' \deqn{d(x, y) = \sum_i \frac{|x_i - y_i|}{x_i + y_i}.} Elements where
#' \eqn{x_i + y_i = 0} are not included in the sum. Relation of
#' \code{canberra()} to other definitions:
#' \itemize{
#' \item Equivalent to R's built-in \code{dist()} function with
#' \code{method = "canberra"}.
#' \item Equivalent to the \code{vegdist()} function with
#' \code{method = "canberra"}, multiplied by the number of entries where
#' \code{x > 0}, \code{y > 0}, or both.
#' \item Equivalent to the \code{canberra()} function in
#' \code{scipy.spatial.distance} for positive vectors. They take the
#' absolute value of \eqn{x_i} and \eqn{y_i} in the denominator.
#' \item Equivalent to the \code{canberra} calculator in Mothur, multiplied
#' by the total number of species in \code{x} and \code{y}.
#' \item Equivalent to \eqn{D_{10}} in Legendre & Legendre.
#' }
#'
#' Clark's coefficient of divergence involves summing squares and taking a
#' square root afterwards:
#' \deqn{
#' d(x, y) = \sqrt{
#' \frac{1}{n} \sum_i \left( \frac{x_i - y_i}{x_i + y_i} \right)^2
#' },}
#' where \eqn{n} is the number of elements where \code{x > 0}, \code{y > 0}, or
#' both. Relation of \code{clark_coefficient_of_divergence()} to other
#' definitions:
#' \itemize{
#' \item Equivalent to \eqn{D_{11}}{D_11} in Legendre & Legendre.
#' }
#' @examples
#' x <- c(15, 6, 4, 0, 3, 0)
#' y <- c(10, 2, 0, 1, 1, 0)
#' canberra(x, y)
#' clark_coefficient_of_divergence(x, y)
#' @export
canberra <- function (x, y) {
xy_diff <- abs(x - y)
xy_sum <- x + y
keep <- xy_sum != 0
sum(xy_diff[keep] / xy_sum[keep])
}
#' @rdname canberra
#' @export
clark_coefficient_of_divergence <- function (x, y) {
xy_diff <- x - y
xy_sum <- x + y
keep <- xy_sum != 0
pp <- sum(keep)
sqrt(sum((xy_diff[keep] / xy_sum[keep]) ^ 2) / pp)
}
#' Chebyshev distance
#'
#' The Chebyshev distance is the maximum absolute difference between the vector
#' elements.
#'
#' @param x,y Numeric vectors
#' @return The Chebyshev distance between \code{x} and \code{y}.
#' @details
#' For vectors \code{x} and \code{y}, the Chebyshev distance is defined as
#' \deqn{d(x, y) = \max_i |x_i - y_i|.} Relation to other definitions:
#' \itemize{
#' \item Equivalent to the \code{chebyshev()} function in
#' \code{scipy.spatial.distance}.
#' }
#' @examples
#' x <- c(15, 6, 4, 0, 3, 0)
#' y <- c(10, 2, 0, 1, 1, 0)
#' chebyshev(x, y) # should be 5
#' @export
chebyshev <- function (x, y) {
max(abs(x - y))
}
#' Correlation and cosine distance
#'
#' The correlation and cosine distances, which are derived from the dot
#' product of the two vectors.
#'
#' @param x,y Numeric vectors
#' @return The correlation or cosine distance. These are undefined if either
#' \code{x} or \code{y} contain all zero elements, that is, if \eqn{|x| = 0}
#' or \eqn{|y| = 0}. In this case, we return \code{NaN}.
#' @details
#' For vectors \code{x} and \code{y}, the cosine distance is defined as the
#' cosine of the angle between the vectors,
#' \deqn{d(x, y) = 1 - \frac{x \cdot y}{|x| |y|},} where \eqn{|x|} is the
#' magnitude or L2 norm of the vector, \eqn{|x| = \sqrt{\sum_i x_i^2}}.
#' Relation to other definitions:
#' \itemize{
#' \item Equivalent to the \code{cosine()} function in
#' \code{scipy.spatial.distance}.
#' }
#'
#' The correlation distance is simply equal to one minus the Pearson
#' correlation between vectors. Mathematically, it is equivalent to the cosine
#' distance between the vectors after they are centered (\eqn{x - \bar{x}}).
#' Relation to other definitions:
#' \itemize{
#' \item Equivalent to the \code{correlation()} function in
#' \code{scipy.spatial.distance}.
#' \item Equivalent to the \code{1 - mempearson} calculator in Mothur.
#' }
#' @examples
#' x <- c(2, 0)
#' y <- c(5, 5)
#' cosine_distance(x, y)
#' # The two vectors form a 45 degree angle, or pi / 4
#' 1 - cos(pi / 4)
#'
#' v <- c(3.5, 0.1, 1.4)
#' w <- c(3.3, 0.5, 0.9)
#' correlation_distance(v, w)
#' 1 - cor(v, w)
#' @export
correlation_distance <- function (x, y) {
x <- x - mean(x)
y <- y - mean(y)
cosine_distance(x, y)
}
#' @rdname correlation_distance
#' @export
cosine_distance <- function (x, y) {
xy_dot <- sum(x * y)
x_norm <- sqrt(sum(x ^ 2))
y_norm <- sqrt(sum(y ^ 2))
1 - xy_dot / (x_norm * y_norm)
}
#' Bray-Curtis distance
#'
#' The Bray-Curtis distance is the Manhattan distance divided by the sum of
#' both vectors.
#'
#' @param x,y Numeric vectors
#' @return The Bray-Curtis distance between \code{x} and \code{y}. The
#' Bray-Curtis distance is undefined if the sum of all elements in \code{x}
#' and \code{y} is zero, in which case we return \code{NaN}.
#' @details
#' For two vectors \code{x} and \code{y}, the Bray-Curtis distance is defined
#' as \deqn{d(x, y) = \frac{\sum_i |x_i - y_i|}{\sum_i x_i + y_i}.} The
#' Bray-Curtis distance is connected to many other distance measures in this
#' package; we try to list some of the more important connections here. Relation
#' to other definitions:
#' \itemize{
#' \item Equivalent to \code{vegdist()} with \code{method = "bray"}.
#' \item Equivalent to the \code{braycurtis()} function in
#' \code{scipy.spatial.distance} for positive vectors. They take the
#' absolute value of \eqn{x_i + y_i} in the denominator.
#' \item Equivalent to the \code{braycurtis} and \code{odum} calculators in
#' Mothur.
#' \item Equivalent to \eqn{D_{14} = 1 - S_{17}}{D_14 = 1 - S_17} in
#' Legendre & Legendre.
#' \item The Bray-Curtis distance on proportions is equal to half the
#' Manhattan distance.
#' \item The Bray-Curtis distance on presence/absence vectors is equal to the
#' Sorenson index of dissimilarity.
#' }
#' @examples
#' x <- c(15, 6, 4, 0, 3, 0)
#' y <- c(10, 2, 0, 1, 1, 0)
#' bray_curtis(x, y)
#'
#' # For proportions, equal to half the Manhattan distance
#' bray_curtis(x / sum(x), y / sum(y))
#' manhattan(x / sum(x), y / sum(y)) / 2
#' @export
bray_curtis <- function (x, y) {
sum(abs(x - y)) / sum(x + y)
}
#' Weighted Kulczynski distance
#'
#' The quantitative version of the second Kulczynski index
#'
#' @param x,y Numeric vectors
#' @return The weighted Kulczynski distance between \code{x} and \code{y}. The
#' distance is undefined if the sum of \code{x} or the sum of \code{y} is
#' zero, in which case we return \code{NaN}.
#' @details
#' The quantitative version of the second Kulczynski index is defined as
#' \deqn{
#' d(x, y) = 1 - \frac{1}{2} \left (
#' \frac{\sum_i \min{(x_i, y_i)}}{\sum_i x_i} +
#' \frac{\sum_i \min{(x_i, y_i)}}{\sum_i y_i}
#' \right ).
#' }
#' Relation of \code{weighted_kulczynski_second()} to other definitions:
#' \itemize{
#' \item Equivalent to \code{vegdist()} with \code{method = "kulczynski"}.
#' \item Equivalent to \code{structkulczynski} in Mothur.
#' \item Equivalent to \eqn{1 - S_{18}} in Legendre & Legendre.
#' }
#' @export
weighted_kulczynski_second <- function (x, y) {
xy_min <- sum(pmin(x, y))
1 - (1 / 2) * (xy_min / sum(x) + xy_min / sum(y))
}
#' Minkowski distance
#'
#' The Minkowski metric is a generalized form of Euclidean (p=2) and Manhattan
#' (p=1) distance.
#'
#' @param x,y Numeric vectors.
#' @param p Exponent parameter, a single number greater than zero.
#' @return The Minkowski distance between \code{x} and \code{y}.
#' @details
#' For vectors \code{x} and \code{y}, the Minkowski distance is defined as
#' \deqn{d(x, y) = \left( \sum_i |x_i - y_i|^p \right)^{1/p}.} Relation to
#' other definitions:
#' \itemize{
#' \item Equivalent to R's built-in \code{dist()} function with
#' \code{method = "minkowski"}.
#' \item Equivalent to the \code{minkowski()} function in
#' \code{scipy.spatial.distance}.
#' \item Equivalent to \eqn{D_6}{D_6} in Legendre & Legendre.
#' }
#' The default value of \code{p = 1} makes this distance equal to the Manhattan
#' distance.
#' @export
minkowski <- function (x, y, p = 1) {
stopifnot(p > 0, length(p) == 1)
sum(abs(x - y) ^ p) ^ (1 / p)
}
#' The Morisita index and Horn-Morisita index
#'
#' The Morisita and the Horn-Morisita indices measure the probability that
#' individuals drawn one from each vector will belong to different species,
#' relative to drawing from each vector separately. The Morisita index is
#' formulated for count data only, whereas the Horn-Morisita index can be
#' used with transformed counts or proportions.
#'
#' @param x,y Numeric vectors
#' @return The Morisita or Horn-Morisita index between \code{x} and \code{y}.
#' Both are undefined if \code{x} or \code{y} have no nonzero elements, in
#' which case we return \code{NaN}.
#' @details
#' For two vectors \code{x} and \code{y}, the Morisita index of dissimilarity
#' is
#' \deqn{d(x,y) = 1 - \frac{2 \sum_i x_i y_i}{(\lambda_x + \lambda_y) N_x N_y},}
#' where \deqn{\lambda_x = \frac{\sum_i x_i (x_i - 1)}{N_x (N_x - 1)}} and
#' \eqn{N_x = \sum_i x_i} The formula for \eqn{\lambda_x} is the unbiased
#' estimate for the probability of drawing two individuals of the same species
#' from \code{x}, without replacement. The correction for sampling without
#' replacement only makes sense for species count data.
#'
#' Relation of \code{morisita()} to other definitions:
#' \itemize{
#' \item Equivalent to \code{vegdist()} with \code{method = "morisita"}.
#' }
#'
#' Horn (1966) reformulated the index to use the equations for sampling with
#' replacement in \eqn{\lambda_x} and \eqn{\lambda_y}:
#' \deqn{\lambda_x = \frac{\sum_i x_i^2}{N_x^2}} With this modification,
#' the index is valid for proportions or transformed count data.
#'
#' Relation of \code{horn_morisita()} to other definitions:
#' \itemize{
#' \item Equivalent to \code{vegdist()} with \code{method = "horn"}.
#' \item Equivalent to the \code{morisitahorn} calculator in Mothur.
#' }
#' @references
#' Mosrisita M. Measuring of interspecific association and similarity between
#' communities. Memoirs of the Faculty of Science, Kyushu Univ., Series E
#' (Biology). 1959;3:65-80.
#'
#' Horn HS. Measurement of "Overlap" in Comparative Ecological Studies. The
#' American Naturalist, 1966;100(914):419-424.
#' @export
morisita <- function (x, y) {
# Vegan docs not consistent with the paper, but implementation is correct
Nx <- sum(x)
Ny <- sum(y)
# Eqn 1
lambda_x <- sum(x * (x - 1)) / (Nx * (Nx - 1))
# Eqn 2
lambda_y <- sum(y * (y - 1)) / (Ny * (Ny - 1))
# Eqn 3
1 - 2 * sum(x * y) / ((lambda_x + lambda_y) * Nx * Ny)
}
#' @rdname morisita
#' @export
horn_morisita <- function (x, y) {
Nx <- sum(x)
Ny <- sum(y)
# Horn defines new lambda estimates in middle of page 420
lambda_x <- sum(x ^ 2) / (Nx ^ 2)
lambda_y <- sum(y ^ 2) / (Ny ^ 2)
1 - 2 * sum(x * y) / ((lambda_x + lambda_y) * Nx * Ny)
}
#' Binomial deviance and CY index of dissimilarity
#'
#' The binomial deviance dissimilarity and the CY (or Cao) index of
#' dissimilarity were created to compare species counts at sites with moderate
#' to large differences.
#'
#' @param x,y Numeric vectors
#' @param base Base of the logarithm
#' @param min_value Replacement for zero or near-zero values. Values less than
#' \code{min_value} are replaced with \code{min_value}.
#' @return The Binomial deviance or CY index of dissimilarity. The CY index is
#' undefined if all elements of \code{x} and \code{y} are zero, in which case
#' we return \code{NaN}.
#' @details
#' Both of these measures were designed to be used with whole-numbered counts,
#' and may not make sense for comparing normalized vectors or vectors of
#' species proportions.
#'
#' For two vectors \code{x} and \code{y}, the binomial deviance dissimilarity
#' is
#' \deqn{
#' d(x,y) = \sum_i{
#' \frac{1}{n_i}
#' \left (
#' x_i \log{\frac{x_i}{n_i}} +
#' y_i \log{\frac{y_i}{n_i}} -
#' (x_i + y_i) log{2}
#' \right )
#' },
#' }
#' where \eqn{n_i = x_i + y_i}. This value is the weighted average of the
#' deviance for each species, under a binomial model where the expected counts
#' are \eqn{n_i / 2} at each site. It was proposed by Anderson and Millar in
#' 2004. Relation to other definitions:
#' \itemize{
#' \item Equivalent to vegdist() with method = "binomial".
#' }
#'
#' The CY index was proposed by Cao, Williams, and Bark in 1997. For two
#' vectors \code{x} and \code{y}, the CY index is
#' \deqn{
#' d(x,y) = \frac{1}{N} \sum_i
#' \left (
#' \frac{
#' (x_i + y_i) \log_{10} ( \frac{x_i + y_i}{2} ) -
#' x_i \log_{10}(y_i) - y_i \log_{10}(x_i)
#' }{
#' x_i + y_i
#' }
#' \right ),
#' }
#' where \eqn{N} is the total number of species in vectors \eqn{x} and \eqn{y}.
#' Double zeros are not considered in the measure.
#'
#' When either \eqn{x_i} or \eqn{y_i} are zero, they need to be replaced by
#' another value in the CY index to avoid infinities. Cao suggested replacing
#' zero values with \eqn{0.1}, which is one log lower than the minimum value
#' for whole-numbered counts. Here, we use a \code{min_value} argument to allow
#' the user set a lower limit on the values. For vectors of species counts,
#' this function follows the formulation of Cao by default.
#'
#' Relation of the CY index to other definitions:
#' \itemize{
#' \item Equivalent to the \code{vegdist()} function with
#' \code{method = "cao"}, if \code{base = exp(1)}.
#' }
#' @references
#' Anderson MJ, Millar RB. Spatial variation and effects of habitat on
#' temperate reef fish assemblages in northeastern New Zealand. Journal of
#' Experimental Marine Biology and Ecology 2004;305:191–221.
#'
#' Cao Y, Williams WP, Bark AW. Similarity measure bias in river benthic
#' Aufwuchs community analysis. Water Environment Research 1997;69(1):95-106.
#' @export
binomial_deviance <- function (x, y) {
x_is_defined_and_zero <- (x == 0) %in% TRUE
y_is_defined_and_zero <- (y == 0) %in% TRUE
n <- x + y
n_is_defined_and_zero <- x_is_defined_and_zero & y_is_defined_and_zero
# Formula at top of page 199
t1 <- ifelse(x_is_defined_and_zero, 0, x * log(x / n))
t2 <- ifelse(y_is_defined_and_zero, 0, y * log(y / n))
t3 <- n * log(2)
# BUG IN VEGAN?!?!?!
# Formula in paper subtracts t3, vegan function adds this term
terms <- ifelse(n_is_defined_and_zero, 0, (1 / n) * (t1 + t2 + t3))
sum(terms)
}
#' @rdname binomial_deviance
#' @export
cy_dissimilarity <- function (x, y, base = 10, min_value = 0.1) {
# Remove double zeros
keep <- (x > 0) | (y > 0)
x <- x[keep]
y <- y[keep]
# Substitute individual zeros with 0.1, just like vegan
x <- ifelse(x > min_value, x, min_value)
y <- ifelse(y > min_value, y, min_value)
N <- length(x)
xy_sum <- x + y
t1 <- xy_sum * log(xy_sum / 2, base = base)
t2 <- x * log(y, base = base)
t3 <- y * log(x, base = base)
# Formula 12 in the Cao paper
(1 / N) * sum((t1 - t2 - t3) / xy_sum)
}
#' Ruzicka or weighted Jaccard distance
#'
#' @param x,y Numeric vectors.
#' @return The Ruzicka distance between \code{x} and \code{y}. The distance is
#' not defined if all elements in \code{x} and \code{y} are zero, and we
#' return \code{NaN} in this case.
#' @details
#' For vectors \code{x} and \code{y}, the Ruzicka distance is defined as
#' \deqn{d(x, y) = 1 - \frac{\sum_i \min(x, y)}{\sum_i \max(x, y)}.} Relation
#' to other definitions:
#' \itemize{
#' \item Equivalent to vegdist() with method = "jaccard".
#' \item Related to the Bray-Curtis distance,
#' \eqn{d_r = 2 d_{bc} / (1 + d_{bc})}.
#' }
#' @export
ruzicka <- function (x, y) {
1 - sum(pmin(x, y)) / sum(pmax(x, y))
}
#' Beta diversity for presence/absence data
#'
#' These functions transform the input vectors to binary or presence/absence
#' format, then compute a distance or dissimilarity.
#'
#' @param x,y Numeric vectors
#' @return The dissimilarity between \code{x} and \code{y}, based on
#' presence/absence. The Jaccard, Sorenson, Sokal-Sneath, Yule, and both
#' Kulczynski dissimilarities are not defined if both \code{x} and \code{y}
#' have no nonzero elements. In addition, the second Kulczynski index and the
#' Yule index of dissimilarity are not defined if one of the vectors has no
#' nonzero elements. We return \code{NaN} for undefined values.
#' @details
#' Many of these indices are covered in Koleff et al. (2003), so we adopt their
#' notation. For two vectors \code{x} and \code{y}, we define three quantities:
#' \itemize{
#' \item \eqn{a} is the number of species that are present in both \code{x}
#' and \code{y},
#' \item \eqn{b} is the number of species that are present in \code{y} but
#' not \code{x},
#' \item \eqn{c} is the number of species that are present in \code{x} but
#' not \code{y}, and
#' \item \eqn{d} is the number of species absent in both vectors.
#' }
#' The quantity \eqn{d} is seldom used in ecology, for good reason. For
#' details, please see the discussion on the "double zero problem," in section
#' 2 of chapter 7.2 in Legendre & Legendre.
#'
#' The \emph{Jaccard} index of dissimilarity is \eqn{1 - a / (a + b + c)}, or
#' one minus the proportion of shared species, counting over both samples
#' together. Relation of \code{jaccard()} to other definitions:
#' \itemize{
#' \item Equivalent to R's built-in \code{dist()} function with
#' \code{method = "binary"}.
#' \item Equivalent to \code{vegdist()} with \code{method = "jaccard"}
#' and \code{binary = TRUE}.
#' \item Equivalent to the \code{jaccard()} function in
#' \code{scipy.spatial.distance}, except that we always convert vectors to
#' presence/absence.
#' \item Equivalent to \eqn{1 - S_7} in Legendre & Legendre.
#' \item Equivalent to \eqn{1 - \beta_j}, as well as \eqn{\beta_{cc}}, and
#' \eqn{\beta_g} in Koleff (2003).
#' }
#'
#' The \emph{\enc{Sørenson}{Sorenson}} or \emph{Dice} index of dissimilarity is
#' \eqn{1 - 2a / (2a + b + c)}, or one minus the average proportion of shared
#' species, counting over each sample individually. Relation of
#' \code{sorenson()} to other definitions:
#' \itemize{
#' \item Equivalent to the \code{dice()} function in
#' \code{scipy.spatial.distance}, except that we always convert vectors to
#' presence/absence.
#' \item Equivalent to the \code{sorclass} calculator in Mothur, and to
#' \code{1 - whittaker}.
#' \item Equivalent to \eqn{D_{13} = 1 - S_8}{D_13 = 1 - S_8} in Legendre &
#' Legendre.
#' \item Equivalent to \eqn{1 - \beta_{sor}} in Koleff (2003). Also
#' equivalent to Whittaker's beta diversity
#' (the second definition, \eqn{\beta_w = (S / \bar{a}) - 1}), as well as
#' \eqn{\beta_{-1}}, \eqn{\beta_t}, \eqn{\beta_{me}}, and
#' \eqn{\beta_{hk}}.
#' }
#'
#' I have not been able to track down the original reference for the first and
#' second Kulczynski indices, but we have good formulas from Legendre &
#' Legendre. The \emph{first Kulczynski index} is \eqn{1 - a / (b + c)}, or
#' one minus the ratio of shared to unshared species.
#'
#' Relation of \code{kulczynski_first} to other definitions:
#' \itemize{
#' \item Equivalent to \eqn{1 - S_{12}}{1 - S_12} in Legendre & Legendre.
#' \item Equivalent to the \code{kulczynski} calculator in Mothur.
#' }
#'
#' Some people refer to the \emph{second Kulczynski index} as the
#' Kulczynski-Cody index. It is defined as one minus the average proportion of
#' shared species in each vector,
#' \deqn{
#' d = 1 - \frac{1}{2} \left ( \frac{a}{a + b} + \frac{a}{a + c} \right ).
#' }
#' Relation of \code{kulczynski_second} to other definitions:
#' \itemize{
#' \item Equivalent to \eqn{1 - S_{13}}{1 - S_13} in Legendre & Legendre.
#' \item Equivalent to the \code{kulczynskicody} calculator in Mothur.
#' \item Equivalent to one minus the Kulczynski similarity in Hayek (1994).
#' \item Equivalent to \code{vegdist()} with \code{method = "kulczynski"} and
#' \code{binary = TRUE}.
#' }
#'
#' The \emph{Rogers-Tanimoto} distance is defined as
#' \eqn{(2b + 2c) / (a + 2b + 2c + d)}. Relation of \code{rogers_tanimoto()}
#' to other definitions:
#' \itemize{
#' \item Equivalent to the \code{rogerstanimoto()} function in
#' \code{scipy.spatial.distance}, except that we always convert vectors to
#' presence/absence.
#' \item Equivalent to \eqn{1 - S_2}{1 - S_2} in Legendre & Legendre.
#' }
#'
#' The \emph{Russel-Rao} distance is defined
#' \eqn{(b + c + d) / (a + b + c + d)}, or the fraction of elements not present
#' in both vectors, counting double absences. Relation of \code{russel_rao()} to
#' other definitions:
#' \itemize{
#' \item Equivalent to the \code{russelrao()} function in
#' \code{scipy.spatial.distance}, except that we always convert vectors to
#' presence/absence.
#' \item Equivalent to \eqn{1 - S_{11}}{1 - S_11} in Legendre & Legendre.
#' }
#'
#' The \emph{Sokal-Michener} distance is defined as
#' \eqn{(2b + 2c) / (a + 2b + 2c + d)}. Relation of \code{sokal_michener()} to
#' other definitions:
#' \itemize{
#' \item Equivalent to the \code{sokalmichener()} function in
#' \code{scipy.spatial.distance}, except that we always convert vectors to
#' presence/absence.
#' }
#'
#' The \emph{Sokal-Sneath} distance is defined as
#' \eqn{(2b + 2c) / (a + 2b + 2c)}. Relation of \code{sokal_sneath()} to other
#' definitions:
#' \itemize{
#' \item Equivalent to the \code{sokalsneath()} function in
#' \code{scipy.spatial.distance}, except that we always convert vectors to
#' presence/absence.
#' \item Equivalent to the \code{anderberg} calculator in Mothur.
#' \item Equivalent to \eqn{1 - S_{10}}{1 - S_10} in Legendre & Legendre.
#' }
#'
#' The \emph{Yule} dissimilarity is defined as \eqn{2bc / (ad + bc)}. Relation
#' of \code{yule_dissimilarity()} to other definitions:
#' \itemize{
#' \item Equivalent to the \code{yule()} function in
#' \code{scipy.spatial.distance}, except that we always convert vectors to
#' presence/absence.
#' \item Equivalent to \eqn{1 - S}, where \eqn{S} is the Yule coefficient
#' in Legendre & Legendre.
#' }
#' @export
jaccard <- function (x, y) {
x <- x > 0
y <- y > 0
sum(xor(x, y)) / sum(x | y)
}
#' @rdname jaccard
#' @export
sorenson <- function (x, y) {
x <- x > 0
y <- y > 0
a <- sum(x & y)
bc <- sum(xor(x, y))
bc / (2 * a + bc)
}
#' @rdname jaccard
#' @export
kulczynski_first <- function (x, y) {
x <- x > 0
y <- y > 0
a <- sum(x & y)
bc <- sum(xor(x, y))
1 - a / bc
}
#' @rdname jaccard
#' @export
kulczynski_second <- function (x, y) {
x <- x > 0
y <- y > 0
a <- sum(x & y)
1 - 0.5 * (a / sum(x) + a / sum(y))
}
#' @rdname jaccard
#' @export
rogers_tanimoto <- function (x, y) {
x <- x > 0
y <- y > 0
a <- sum(x & y)
bc <- sum(xor(x, y))
d <- sum((!x) & (!y))
2 * bc / (a + d + 2 * bc)
}
#' @rdname jaccard
#' @export
russel_rao <- function (x, y) {
x <- x > 0
y <- y > 0
a <- sum(x & y)
n <- length(x)
(n - a) / n
}
#' @rdname jaccard
#' @export
sokal_michener <- function (x, y) {
x <- x > 0
y <- y > 0
a <- sum(x & y)
bc <- sum(xor(x, y))
d <- sum((!x) & (!y))
2 * bc / (a + d + 2 * bc)
}
#' @rdname jaccard
#' @export
sokal_sneath <- function (x, y) {
x <- x > 0
y <- y > 0
a <- sum(x & y)
bc <- sum(xor(x, y))
2 * bc / (a + 2 * bc)
}
#' @rdname jaccard
#' @export
yule_dissimilarity <- function (x, y) {
x <- x > 0
y <- y > 0
a <- sum(x & y)
b <- sum((!x) & y)
c <- sum(x & (!y))
d <- sum((!x) & (!y))
2 * b * c / (a * d + b * c)
}
#' Hamming distance
#'
#' The Hamming distance is the number of positions where the values are
#' different.
#'
#' @param x,y Numeric vectors.
#' @return The Hamming distance between \code{x} and \code{y}.
#' @details
#' For vectors \code{x} and \code{y}, the Hamming distance is defined as
#' \deqn{d(x, y) = \sum_i [x_i \neq y_i],} where the quantity in the brackets
#' is 1 if the elements are not equal, and zero if the elements are equal.
#' Relation to other definitions:
#' \itemize{
#' \item The \code{hamming()} function in \code{scipy.spatial.distance}
#' divides the result by the vector length. Our function is equivalent to
#' the SciPy version multiplied by the vector length.
#' \item Equivalent to the \code{hamming} calculator in Mothur for
#' presence/absence vectors.
#' }
#' @export
hamming <- function (x, y) {
sum(x != y)
}
# Scipy notes:
# # Distance functions for continuous vectors
# braycurtis implemented as bray_curtis
# canberra implemented
# chebyshev implemented
# cityblock implemented as manhattan
# correlation implemented as correlation_distance
# cosine implemented
# euclidean implemented
# jensenshannon not implemented
# mahalanobis not implemented, available in R stats module
# minkowski implemented
# seuclidean not implemented
# sqeuclidean not implemented
# wminkowski not implemented
# # Distance functions for presence/absence vectors
# *** We convert the vectors to presence/absence, so our results do not match
# *** the SciPy tests when elements are greater than 1. The SciPy
# *** implementation is arguably wrong here.
# dice implemented as sorenson
# hamming implemented differently
# jaccard implemented
# kulsinski not implemented, not sure measure is correct in SciPy
# rogerstanimoto implemented as rogers_tanimoto
# russelrao implemented as russel_rao
# sokalmichener implemented as sokal_michener
# sokalsneath implemented as sokal_sneath
# yule implemented as yule_dissimilarity
# Mothur notes
# https://www.mothur.org/wiki/Calculators
# ## Community membership
# anderberg implemented as sokal_sneath
# hamming implemented as hamming for presence/absence vectors
# jclass implemented as jaccard (Mothur docs give formula for similarity)
# jest (Jaccard with Chao estimates) not implemented
# kulczynski implemented as kulczynski_first
# kulczynskicody implemented as kulczynski_second
# lennon not implemented, TODO
# memchi2 not implemented, needs full matrix
# memchord not implemented, Mothur docs seem off
# If the terms in the denominator were squared, this would be the chord
# distance for presence/absence data. Must check source code.
# memeuclidean equivalent to euclidean on presence/absence vectors
# mempearson equivalent to 1 - correlation_distance
# ochiai not implemented, TODO
# sorclass implemented as sorenson
# sorest not implemented
# whittaker equivalent to 1 - sorenson
# ## Community structure
# braycurtis implemented as bray_curtis
# canberra implemented, see note
# gower not implemented, needs full matrix
# hellinger implemented
# jabund (Chao-Jaccard or abundance-Jaccard) not implemented, TODO
# manhattan implemented
# morisitahorn implemented as horn_morisita
# odum is equivalent to bray_curtis
# soergel not implemented, TODO
# sorabund (Chao-Sorenson or abundance-Sorenson) not implemented, TODO
# spearman not implemented
# speciesprofile equivalent to Euclidean on proportions, noted
# structchi2 not implemented, needs full matrix
# structchord implemented as chord
# structeuclidean implemented as euclidean
# structkulczynski implemented as weighted_kulczynski_second
# structpearson implemented as correlation_distance
# thetan (community Jaccard index of Smith aka the Θn of Yue) not implemented,
# TODO
# thetayc (Yue & Clayton measure) not implemented, TODO
# Koleff notes:
# 1. \beta_w (Whittaker's beta diversity) implemented as sorenson
# 2. \beta_{-1} implemented as sorenson
# 3. \beta_c not implemented
# 4. \beta_{wb} (Weiher & Boylen) not implemented
# 5. \beta_r (Routledge) not implemented
# 6. \beta_I not implemented
# 7. \beta_e not implemented
# 8. \beta_t implemented as sorenson
# 9. \beta_{me} implemented as sorenson
# 10. \beta_j implemented as jaccard
# 11. \beta_{sor} implemented as sorenson
# 12. \beta_m (Magurran) not implemented
# 13. \beta_{-2} not implemented
# 14. \beta_{co} not implemented
# 15. \beta_{cc} implemented as jaccard
# 16. \beta_g implemented as jaccard
# 17. \beta_{-3} not implemented
# 18. \beta_1 not implemented
# 19. (Williams) not implemented
# 20. \beta_{hk} (Harte & Kinzig) implemented as sorenson
# 21. \beta_{rlb} not implemented
# 22. \beta_{sim} not implemented
# 23. \beta_{gl} not implemented
# 24. \beta_z not implemented
# Vegan notes:
# # TODO: document these functions better
# # Methods in vegdist
# euclidean implemented
# euclidean with binary = TRUE not implemented
# manhattan implemented
# manhattan with binary = TRUE not implemented
# gower not implemented, needs full matrix
# gower with binary = TRUE not implemented
# altGower implemented as modified_mean_character_difference
# altGower with binary = TRUE not implemented
# canberra implemented
# canberra with binary = TRUE not implemented
# clark not implemented?
# clark with binary = TRUE not implemented
# bray implemented as bray_curtis
# bray with binary = TRUE not implemented
# kulczynski implemented as weighted_kulczynski_second
# kulczynski with binary = TRUE implemented as kulczynski_second
# morisita implemented
# morisita with binary = TRUE can't be calculated
# horn implemented as horn_morisita
# horn with binary = TRUE not implemented
# binomial implemented as millar (from Anderson & Millar 2004)
# cao implemented
# cao with binary = TRUE not implemented
# jaccard implemented as ruzicka
# jaccard with binary = TRUE implemented as jaccard
# # Other stuff
# TODO: Mountford
# TODO: Raup
# TODO: Chao-Jaccard, vegan uses some correction from the paper
# Legendre & Legendre notes:
# D_1 implemented as euclidean
# D_2 (average distance) implemented as rms_distance
# D_3 implemented as chord
# D_4 implemented as geodesic_metric
# D_5 (Mahalanobis distance) not implemented, available in R stats module
# D_6 implemented as minkowski
# D_7 implemented as manhattan
# D_8 implemented as mean_character_difference
# D_19 implemented as modified_mean_character_difference
# D_9 (Whittaker's index of association) not implemented, noted under manhattan
# D_10 implemented as canberra
# D_11 implemented as clark_coefficient_of_divergence
# D_12 (Pearson coefficient of racial likeness) not implemented
# D_18 (distance between species profiles) not implemented,
# noted under euclidean
# D_15 (Chi-square metric) not implemented, needs full matrix
# D_16 (Chi-square distance) not implemented, needs full matrix
# D_17 implemented as hellinger
# D_13 implemented as sorenson
# D_14 implemented as bray_curtis
# S_1 (simple matching coefficient) is the mean_character_difference
# S_2 (coefficient of Rogers & Tanimoto) implemented as rogers_tanimoto
# S_3, S_4, S_5, S_6 (Sokal and Sneath) not implemented
# Hamann coefficient not implemented
# Yule coefficient implemented as yule_dissimilarity
# Pearson's phi not implemented
# S_7 implemented as jaccard
# S_8 implemented as sorenson
# S_9 = 3a / (3a + b + c) not implemented
# S_10 implemented as sokal_sneath
# S_11 implemented as russel_rao
# S_12 implemented as kulczynski_first
# S_13 implemented as kulczynski_second
# S_14 (name unclear) not implemented. It should be noted that S_14 as a
# distance is proportional to sqrt of chord or Hellinger for binary data.
# S_26 (name unclear) not implemented
# S_15 (Gower coefficient) not implemented. Equivalent to Hamming distance?
# S_16 (Estabrook and Rogers) not implemented
# S_17 (Steinhaus coefficient) implemented as bray_curtis
# S_18 (quantitative Kulcynski) implemented as weighted_kulczynski_second
# S_19 (modified Gower) not implemented
# S_20 (Legendre & Chodorowski) not implemented, also similar to Gower
# S_21 (Chi-square similarity) not implemented, see D_15
# S_22 (Goodall 1) not implemented, needs full matrix
# S_23 (Goodall 2) not implemented, needs full matrix
# Raup-Crick not implemented, available as raupcrick() in vegan
# S_27 (Raup-Crick p-value) not implemented
# Skip rest of Q-mode similarity for now
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.