# R/partition.loss.R

#' Compute Partition Loss or the Expectation of Partition Loss
#'
#' Given two partitions \eqn{\pi^*} and \eqn{\pi}, these functions compute the
#' specified loss when using \eqn{\pi^*} to estimate \eqn{\pi}. Smaller loss
#' values indicate higher concordance between partitions. These functions also
#' compute a Monte Carlo estimate of the expectation of the specified loss
#' based on samples or a pairwise similarity matrix, and they support
#' computing approximations to the expectation of several losses. Supported
#' criteria are described below. Some criteria only require the pairwise
#' similarity matrix (as computed, for example, by \code{\link{psm}}) whereas
#' others require samples from a partition distribution. For those criteria that
#' only need the pairwise similarity matrix, posterior samples can still be
#' provided in the \code{truth} argument and the pairwise similarity matrix will
#' automatically be computed as needed.
#'
#' The partition estimation criterion can be specified using the \code{loss}
#' argument, which is either a string or a result of calling the associated
#' functions. These losses are described below: \describe{
#'
#' \item{\code{"binder"}}{Binder. Whereas high values of the Rand index \eqn{R}
#' between \eqn{\pi^*} and \eqn{\pi} correspond to high concordance between the
#' partitions, the N-invariant Binder loss \eqn{L} for a partition \eqn{\pi^*} in
#' estimating \eqn{\pi} is \eqn{L = (1-R)*(n-1)/n}, meaning that low values
#' correspond to high concordance between the partitions. This package reports
#' the N-invariant Binder loss. The original Binder loss equals the N-invariant
#' Binder loss multiplied by \eqn{n^2 / 2}. Only the pairwise similarity matrix
#' is required for this loss, but samples can be provided. As originally
#' discussed by Binder (1978), two mistakes are possible: 1. Placing two items
#' in separate clusters when in truth they belong to the same cluster, and 2.
#' Placing two items in the same cluster when in truth they belong to separate
#' clusters. The cost of the first mistake defaults to one, but can be set with
#' the argument \code{a}, which must lie in \eqn{[0,2]}. Without loss of
#' generality, the cost of the second mistake is set as \code{2-a}. For a
#' discussion of general weights, see Dahl, Johnson, and Müller (2022). For a
#' discussion of the equal weights case, see also Dahl (2006), Lau and Green
#' (2007), Dahl and Newton (2007), Fritsch and Ickstadt (2009), and Wade and
#' Ghahramani (2018).}
#'
#' \item{\code{"omARI"}}{One Minus Adjusted Rand Index. Computes the expectation
#' of one minus the adjusted Rand index (Hubert and Arabie, 1985). Whereas high
#' values of the adjusted Rand index between \eqn{\pi^*} and \eqn{\pi} correspond
#' to high concordance between the partitions, the loss associated with the
#' adjusted Rand index for a partition \eqn{\pi^*} in estimating \eqn{\pi} is one
#' minus the adjusted Rand index between the partitions, meaning that low values
#' correspond to high concordance between the partitions. Samples from a
#' partition distribution are required for this loss. See Fritsch and Ickstadt
#' (2009).}
#'
#' \item{\code{"omARI.approx"}}{Approximation of One Minus Adjusted Rand Index.
#' Computes the first-order approximation of the expectation of one minus the
#' adjusted Rand index. The adjusted Rand index involves a ratio and the
#' first-order approximation of the expectation is based on \eqn{E(X/Y) \approx
#' E(X)/E(Y)}. Only the pairwise similarity matrix is required for this
#' criterion, but samples can be provided. See Fritsch and Ickstadt (2009).}
#'
#' \item{\code{"VI"}}{Variation of Information. Computes the expectation of the
#' (generalized) variation of information loss. Samples from a partition
#' distribution are required for this loss. See Meilă (2007), Wade and
#' Ghahramani (2018), and Rastelli and Friel (2018). The original variation of
#' information of Meilă (2007) has been extended to the generalized variation of
#' information of Dahl, Johnson, and Müller (2022) to allow for unequal
#' weighting of two possible mistakes: 1. Placing two items in separate clusters
#' when in truth they belong to the same cluster, and 2. Placing two items in
#' the same cluster when in truth they belong to separate clusters. The cost of
#' the first mistake defaults to one, but can be set with the argument \code{a}
#' in \eqn{[0,2]}. Without loss of generality, the cost of the second mistake
#' is controlled by \code{2-a}. See Dahl, Johnson, and Müller (2022).}
#'
#' \item{\code{"VI.lb"}}{Lower Bound of the Variation of Information. Computes
#' the lower bound of the expectation of the variation of information loss,
#' where the lower bound is obtained by Jensen's inequality. Only the pairwise
#' similarity matrix is required for this criterion, but samples can be
#' provided. See Wade and Ghahramani (2018).}
#'
#' \item{\code{"NVI"}}{Normalized Variation of Information. Computes the
#' expectation of the normalized variation of information loss. Samples from a
#' partition distribution are required for this loss. See Vinh, Epps, and Bailey
#' (2010) and Rastelli and Friel (2018).}
#'
#' \item{\code{"ID"}}{Information Distance. Computes the expectation of the
#' information distance (\eqn{D_{max}}) loss. Samples from a partition
#' distribution are required for this loss. See Vinh, Epps, and Bailey (2010).}
#'
#' \item{\code{"NID"}}{Normalized Information Distance. Computes the expectation
#' of the normalized information distance loss. Samples from a partition
#' distribution are required for this loss. See Vinh, Epps, and Bailey (2010)
#' and Rastelli and Friel (2018).}
#'
#' }
#'
#' The functions \code{\link{RI}} and \code{\link{ARI}} are convenience
#' functions. Note that:
#' \itemize{
#' \item \code{binder(p1, p2, a=1) = ( 1 - RI(p1, p2) )*(n-1)/n}
#' \item \code{omARI(p1, p2) = 1 - ARI(p1, p2)}
#' }
#' where \code{n} is the number of items.
#'
#' @param truth An integer vector of cluster labels for \eqn{n} items
#'   representing the true clustering. Two items are in the same cluster if
#'   their labels are equal. Or, a matrix of \eqn{n} columns where each row is a
#'   clustering.
#' @param estimate An integer vector of cluster labels having the same length as
#'   \code{truth} representing the estimated clustering. Or, a matrix of
#'   \eqn{n} columns where each row is a clustering.
#' @param loss The loss function to use, as indicated by \code{"binder"},
#'   \code{"omARI"}, \code{"VI"}, \code{"NVI"}, \code{"ID"}, \code{"NID"}, or
#'   the result of calling a function with these names. Also supported are
#'   \code{"binder.psm"}, \code{"VI.lb"}, \code{"omARI.approx"}, or the result
#'   of calling a function with these names, in which case \code{truth} above can
#'   optionally be a pairwise similarity matrix, i.e., \eqn{n}-by-\eqn{n}
#'   symmetric matrix whose \eqn{(i,j)} element gives the (estimated)
#'   probability that items \eqn{i} and \eqn{j} are in the same subset (i.e.,
#'   cluster) of a partition (i.e., clustering).
#' @param a (Only used for Binder and VI loss) The argument \code{a} is either:
#'   i. a nonnegative scalar in \eqn{[0,2]} giving (for Binder loss) the cost of
#'   placing two items in separate clusters when in truth they belong to the
#'   same cluster, ii. \code{NULL}, in which case the value of \code{a} that
#'   maximizes the expected loss is found, and iii. a list containing the
#'   desired number of clusters (\code{"nClusters"}) when searching for the
#'   \code{a} that yields this number of clusters. In all but the first case,
#'   one may want to modify \code{maxSize} in the \code{\link{salso}} function.
#'   To increase the probability of hitting exactly the desired number of
#'   clusters, \code{nRuns} in the \code{\link{salso}} function may need to be
#'   increased. Without loss of generality, the cost (under Binder loss) of
#'   placing two items in the same cluster when in truth they belong to
#'   separate clusters is fixed at \code{2-a}. For VI, \code{a} has a similar
#'   interpretation, although it is not a unit cost. See Dahl, Johnson, and
#'   Müller (2022).
#'
#' @return A numeric vector.
#'
#' @seealso \code{\link{psm}}, \code{\link{salso}}, \code{\link{dlso}}
#'
#' @references
#'
#' W. M. Rand (1971), Objective Criteria for the Evaluation of Clustering
#' Methods, \emph{Journal of the American Statistical Association}, \bold{66},
#' 846-850.
#'
#' D. A. Binder (1978), Bayesian cluster analysis, \emph{Biometrika}, \bold{65},
#' 31-38.
#'
#' L. Hubert and P. Arabie (1985), Comparing Partitions, \emph{Journal of
#' Classification}, \bold{2}, 193-218.
#'
#' D. B. Dahl (2006), Model-Based Clustering for Expression Data via a Dirichlet
#' Process Mixture Model, in \emph{Bayesian Inference for Gene Expression and
#' Proteomics}, Kim-Anh Do, Peter Müller, Marina Vannucci (Eds.), Cambridge
#' University Press.
#'
#' J. W. Lau and P. J. Green (2007), Bayesian model based clustering procedures,
#' \emph{Journal of Computational and Graphical Statistics}, \bold{16}, 526-558.
#'
#' M. Meilă (2007), Comparing Clusterings - an Information Based Distance,
#' \emph{Journal of Multivariate Analysis}, \bold{98}, 873-895.
#'
#' D. B. Dahl and M. A. Newton (2007), Multiple Hypothesis Testing by Clustering
#' Treatment Effects, \emph{Journal of the American Statistical Association},
#' \bold{102}, 517-526.
#'
#' A. Fritsch and K. Ickstadt (2009), An improved criterion for clustering based
#' on the posterior similarity matrix, \emph{Bayesian Analysis}, \bold{4},
#' 367-391.
#'
#' N. X. Vinh, J. Epps, and J. Bailey (2010), Information Theoretic Measures for
#' Clusterings Comparison: Variants, Properties, Normalization and Correction
#' for Chance, \emph{Journal of Machine Learning Research}, \bold{11},
#' 2837-2854.
#'
#' S. Wade and Z. Ghahramani (2018), Bayesian cluster analysis: Point estimation
#' and credible balls, \emph{Bayesian Analysis}, \bold{13}(2), 559-626.
#'
#' R. Rastelli and N. Friel (2018), Optimal Bayesian estimators for latent
#' variable cluster models, \emph{Statistics and Computing}, \bold{28},
#' 1169-1186.
#'
#' D. B. Dahl, D. J. Johnson, and P. Müller (2022), Search Algorithms and Loss
#' Functions for Bayesian Clustering, \emph{Journal of Computational and
#' Graphical Statistics}, \bold{31}(4), 1189-1201,
#' \doi{10.1080/10618600.2022.2069779}.
#'
#' @export
#' @examples
#' # For examples, use 'nCores=1' per CRAN rules, but in practice omit this.
#' data(iris.clusterings)
#' partitions <- iris.clusterings[1:5,]
#'
#' # R_CARGO \dontrun{
#' # R_CARGO # Example disabled since Cargo was not found when installing from source package.
#' # R_CARGO # You can still run the example if you install Cargo. Hint: cargo::install().
#' all.equal(partition.loss(partitions, partitions, loss=binder(a=1.4)),
#'           binder(partitions, partitions, a=1.4))
#' all.equal(partition.loss(partitions, partitions, loss=omARI()),
#'           omARI(partitions, partitions))
#' all.equal(partition.loss(partitions, partitions, loss=VI(a=0.8)),
#'           VI(partitions, partitions, a=0.8))
#'
#' truth <- iris.clusterings[1,]
#' estimate <- iris.clusterings[2,]
#'
#' VI(truth, estimate, a=1.0)
#' n <- length(truth)
#' all.equal(binder(truth, estimate), ( 1 - RI(truth, estimate) ) * (n-1) / n)
#' all.equal(omARI(truth, estimate), 1 - ARI(truth, estimate))
#' # R_CARGO }
#'
partition.loss <- function(truth, estimate, loss=VI()) {
  expected.loss(truth, estimate, loss)
}
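
# The 'loss' argument can be a string or the result of calling one of the loss
# functions below. A minimal sketch (wrapped in 'if (FALSE)' so it does not run
# when the file is sourced); equivalence of the two forms is assumed from the
# 'loss' parameter description above, not guaranteed here.
if (FALSE) {
  data(iris.clusterings)
  truth <- iris.clusterings[1, ]
  estimate <- iris.clusterings[2, ]
  all.equal(partition.loss(truth, estimate, loss = "VI"),
            partition.loss(truth, estimate, loss = VI()))
}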

checkAokay <- function(a) {
  if ( is.vector(a) && length(a) == 1 && is.numeric(a) && 0.0 <= a && a <= 2.0 ) {
      return()
  } else if ( is.null(a) ) {
      return()
  } else if ( is.list(a) && all(c("nClusters") %in% names(a)) &&
              is.vector(a$nClusters) && length(a$nClusters) == 1 && is.numeric(a$nClusters) && a$nClusters >= 1 ) {
      return()
  } else {
     stop("'a' should be in a scalar in [0,2].")
  }
}
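
# A sketch of the three accepted forms of 'a' (per the documentation above);
# each call returns NULL when the form is valid and stops otherwise.
if (FALSE) {
  checkAokay(1.4)                  # a scalar cost in [0, 2]
  checkAokay(NULL)                 # search for the 'a' maximizing the expected loss
  checkAokay(list(nClusters = 3))  # search for the 'a' yielding this many clusters
}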

#' @export
#' @rdname partition.loss
binder <- function(truth, estimate, a=1) {
  if ( missing(truth) && missing(estimate) ) {
    checkAokay(a)
    structure(list(loss="binder", a=a), class="salso.loss")
  } else {
    expected.loss(truth, estimate, Recall(a=a))
  }
}
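
# Illustrative brute-force version of the N-invariant Binder loss, summing the
# documented pairwise costs ('a' for mistake 1, '2 - a' for mistake 2) and
# scaling by 2 / n^2, since the original Binder loss equals the N-invariant
# loss times n^2 / 2. A sketch for intuition; 'binder.naive' is hypothetical,
# and agreement with binder() is assumed, not guaranteed.
if (FALSE) {
  binder.naive <- function(truth, estimate, a = 1) {
    n <- length(truth)
    loss <- 0
    for (i in seq_len(n - 1)) for (j in (i + 1):n) {
      sameT <- truth[i] == truth[j]
      sameE <- estimate[i] == estimate[j]
      if (sameT && !sameE) loss <- loss + a        # mistake 1: split a true pair
      if (!sameT && sameE) loss <- loss + (2 - a)  # mistake 2: merge a false pair
    }
    2 * loss / n^2
  }
  p1 <- c(1, 1, 2, 2, 3)
  p2 <- c(1, 1, 1, 2, 2)
  all.equal(binder.naive(p1, p2, a = 1.4), binder(p1, p2, a = 1.4))
}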

#' @export
#' @rdname partition.loss
RI <- function(truth, estimate) {
  1 - binder(truth, estimate) * length(truth) / (length(truth)-1)
}

#' @export
#' @rdname partition.loss
omARI <- function(truth, estimate) {
  if ( missing(truth) && missing(estimate) ) {
    structure(list(loss="omARI"), class="salso.loss")
  } else {
    expected.loss(truth, estimate, Recall())
  }
}

#' @export
#' @rdname partition.loss
omARI.approx <- function(truth, estimate) {
  if ( missing(truth) && missing(estimate) ) {
    structure(list(loss="omARI.approx"), class="salso.loss")
  } else {
    expected.loss(truth, estimate, Recall())
  }
}

#' @export
#' @rdname partition.loss
ARI <- function(truth, estimate) {
  1 - omARI(truth, estimate)
}
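
# Sketch of the Hubert-Arabie adjusted Rand index computed directly from the
# contingency table of the two clusterings; 'ari.naive' is hypothetical and
# for intuition only, with agreement with ARI() assumed up to floating point.
if (FALSE) {
  ari.naive <- function(truth, estimate) {
    ct <- table(truth, estimate)
    sij <- sum(choose(ct, 2))           # pairs together in both clusterings
    si  <- sum(choose(rowSums(ct), 2))  # pairs together in 'truth'
    sj  <- sum(choose(colSums(ct), 2))  # pairs together in 'estimate'
    expected <- si * sj / choose(sum(ct), 2)
    (sij - expected) / ((si + sj) / 2 - expected)
  }
  all.equal(ari.naive(c(1, 1, 2, 2, 3), c(1, 1, 1, 2, 2)),
            ARI(c(1, 1, 2, 2, 3), c(1, 1, 1, 2, 2)))
}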

#' @export
#' @rdname partition.loss
VI <- function(truth, estimate, a=1) {
  if ( missing(truth) && missing(estimate) ) {
    checkAokay(a)
    structure(list(loss="VI", a=a), class="salso.loss")
  } else {
    expected.loss(truth, estimate, Recall(a=a))
  }
}
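
# Entropy-based sketch of the unweighted (a = 1) variation of information,
# VI = H(truth) + H(estimate) - 2 I(truth, estimate). 'vi.naive' is
# hypothetical; base-2 logarithms are an assumption, so the package's VI()
# may differ by a constant factor depending on its choice of base.
if (FALSE) {
  vi.naive <- function(truth, estimate) {
    pij <- table(truth, estimate) / length(truth)  # joint distribution of labels
    pr <- rowSums(pij)                             # marginal of 'truth'
    pc <- colSums(pij)                             # marginal of 'estimate'
    H <- function(p) -sum(p[p > 0] * log2(p[p > 0]))
    mi <- sum(pij[pij > 0] * log2(pij[pij > 0] / outer(pr, pc)[pij > 0]))
    H(pr) + H(pc) - 2 * mi
  }
  vi.naive(c(1, 1, 2, 2, 3), c(1, 1, 1, 2, 2))
}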

#' @export
#' @rdname partition.loss
VI.lb <- function(truth, estimate) {
  if ( missing(truth) && missing(estimate) ) {
    structure(list(loss="VI.lb"), class="salso.loss")
  } else {
    expected.loss(truth, estimate, Recall())
  }
}

#' @export
#' @rdname partition.loss
NVI <- function(truth, estimate) {
  if ( missing(truth) && missing(estimate) ) {
    structure(list(loss="NVI"), class="salso.loss")
  } else {
    expected.loss(truth, estimate, Recall())
  }
}

#' @export
#' @rdname partition.loss
ID <- function(truth, estimate) {
  if ( missing(truth) && missing(estimate) ) {
    structure(list(loss="ID"), class="salso.loss")
  } else {
    expected.loss(truth, estimate, Recall())
  }
}

#' @export
#' @rdname partition.loss
NID <- function(truth, estimate) {
  if ( missing(truth) && missing(estimate) ) {
    structure(list(loss="NID"), class="salso.loss")
  } else {
    expected.loss(truth, estimate, Recall())
  }
}
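
# Sketch of the information distance and its normalizations following Vinh,
# Epps, and Bailey (2010): ID ("D_max") is max(H1, H2) - I, NID is
# 1 - I / max(H1, H2), and NVI is 1 - I / H(joint). 'info.distances' is
# hypothetical; the log base is an assumption, and agreement with ID(), NID(),
# and NVI() is not guaranteed.
if (FALSE) {
  info.distances <- function(truth, estimate) {
    pij <- table(truth, estimate) / length(truth)
    pr <- rowSums(pij)
    pc <- colSums(pij)
    H <- function(p) -sum(p[p > 0] * log2(p[p > 0]))
    mi <- sum(pij[pij > 0] * log2(pij[pij > 0] / outer(pr, pc)[pij > 0]))
    c(ID = max(H(pr), H(pc)) - mi,
      NID = 1 - mi / max(H(pr), H(pc)),
      NVI = 1 - mi / H(pij))
  }
  info.distances(c(1, 1, 2, 2, 3), c(1, 1, 1, 2, 2))
}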

expected.loss <- function(truth, estimate, loss) {
  truth <- unclass(truth)
  estimate <- unclass(estimate)
  if ( ! ( is.atomic(truth) && is.atomic(estimate) ) ) {
    stop("'truth' and 'estimate' must be atomic.")
  }
  if ( ! is.matrix(truth) ) dim(truth) <- c(1,length(truth))
  if ( ! is.matrix(estimate) ) dim(estimate) <- c(1,length(estimate))
  if ( nrow(estimate) == 0 ) return(numeric())
  if ( ncol(truth) != ncol(estimate) ) {
    stop("The number of items (i.e., number of columns) in 'truth' and 'estimate' are not the same.")
  }
  if ( ncol(truth) == 0 ) return(rep(NA, nrow(estimate)))
  z <- x2drawspsm(truth, loss)
  if ( is.list(z$a) ) stop("'a' must be explicitly provided when computing the expected loss.")
  .Call(.expected_loss, estimate, z$draws, z$psm, z$lossCode, z$a)
}
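
# Usage sketch: with multiple rows in 'truth', each loss is a Monte Carlo
# average over the sampled partitions; for the psm-based criteria, passing a
# precomputed pairwise similarity matrix is assumed equivalent per the 'loss'
# parameter description above.
if (FALSE) {
  data(iris.clusterings)
  draws <- iris.clusterings[1:100, ]
  estimate <- iris.clusterings[1, ]
  VI(draws, estimate)          # expectation over the sampled partitions
  VI.lb(draws, estimate)       # needs only the pairwise similarity matrix
  VI.lb(psm(draws), estimate)  # assumed equivalent, via a precomputed psm
}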
