View source: R/partition.loss.R
partition.loss | R Documentation |
Given two partitions \pi*
and \pi
, these functions compute the
specified loss when using \pi*
to estimate \pi
. Smaller loss
values indicate higher concordance between partitions. These functions also
compute a Monte Carlo estimate of the expectation for the specified loss
based on samples or a pairwise similarity matrix. This function also supports
computing approximations to the expectation of several losses. Supported
criteria are described below. Some criteria only require the pairwise
similarity matrix (as computed, for example, by psm
) whereas
others require samples from a partition distribution. For those criteria that
only need the pairwise similarity matrix, posterior samples can still be
provided in the truth
argument and the pairwise similarity matrix will
automatically be computed as needed.
partition.loss(truth, estimate, loss = VI())
binder(truth, estimate, a = 1)
RI(truth, estimate)
omARI(truth, estimate)
omARI.approx(truth, estimate)
ARI(truth, estimate)
VI(truth, estimate, a = 1)
VI.lb(truth, estimate)
NVI(truth, estimate)
ID(truth, estimate)
NID(truth, estimate)
truth |
An integer vector of cluster labels for |
estimate |
An integer vector of cluster labels having the same length as
|
loss |
The loss function to use, as indicated by |
a |
(Only used for Binder and VI loss) The argument |
The partition estimation criterion can be specified using the loss
argument, which is either a string or a result of calling the associated
functions. These losses are described below:
"binder"
Binder. Whereas high values of the Rand index R
between \pi*
and \pi
correspond to high concordance between the
partitions, the N-invariant Binder loss L
for a partition \pi*
in
estimating \pi
is L = (1-R)*(n-1)/n
, meaning that low values
correspond to high concordance between the partitions. This package reports
the N-invariant Binder loss. The original Binder loss equals the N-invariant
Binder loss multiplied by n^2 / 2
. Only the pairwise similarity matrix
is required for this loss, but samples can be provided. As originally
discussed by Binder (1978), two mistakes are possible: 1. Placing two items
in separate clusters when in truth they belong to the same cluster, and 2.
Placing two items in the same cluster when in truth they belong to separate
clusters. The default cost of the first mistake is one, but can be specified
with the argument a
in [0,2]
. Without loss of generality, the cost of
the second mistake is set as 2-a
. For a discussion of general
weights, see Dahl, Johnson, and Müller (2021). For a discussion of the equal
weights case, see also Dahl (2006), Lau and Green (2007), Dahl and Newton
(2007), Fritsch and Ickstadt (2009), and Wade and Ghahramani (2018).
"omARI"
One Minus Adjusted Rand Index. Computes the expectation
of one minus the adjusted Rand index (Hubert and Arabie, 1985). Whereas high
values of the adjusted Rand index between \pi*
and \pi
correspond
to high concordance between the partitions, the loss associated with the
adjusted Rand index for a partition \pi*
in estimating \pi
is one
minus the adjusted Rand index between the partitions, meaning that low values
correspond to high concordance between the partitions. Samples from a
partition distribution are required for this loss. See Fritsch and Ickstadt
(2009).
"omARI.approx"
Approximation of One Minus Adjusted Rand Index.
Computes the first-order approximation of the expectation of one minus the
adjusted Rand index. The adjusted Rand index involves a ratio and the
first-order approximation of the expectation is based on E(X/Y) \approx
E(X)/E(Y)
. Only the pairwise similarity matrix is required for this
criterion, but samples can be provided. See Fritsch and Ickstadt (2009).
"VI"
Variation of Information. Computes the expectation of the
(generalized) variation of information loss. Samples from a partition
distribution are required for this loss. See Meilă (2007), Wade and
Ghahramani (2018), and Rastelli and Friel (2018). The original variation of
information of Meilă (2007) has been extended to the generalized variation of
information of Dahl, Johnson, and Müller (2021) to allow for unequal
weighting of two possible mistakes: 1. Placing two items in separate clusters
when in truth they belong to the same cluster, and 2. Placing two items in
the same cluster when in truth they belong to separate clusters. The value
a
controls the cost of the first mistake and defaults to one, but can
be specified with the argument a
in [0,2]
. Without loss of generality,
the cost of the second mistake is controlled by 2-a
. See Dahl,
Johnson, Müller (2021).
"VI.lb"
Lower Bound of the Variation of Information. Computes the lower bound of the expectation of the variation of information loss, where the lower bound is obtained by Jensen's inequality. Only the pairwise similarity matrix is required for this criterion, but samples can be provided. See Wade and Ghahramani (2018).
"NVI"
Normalized Variation of Information. Computes the expectation of the normalized variation of information loss. Samples from a partition distribution are required for this loss. See Vinh, Epps, and Bailey (2010) and Rastelli and Friel (2018).
"ID"
Information Distance. Computes the expectation of the
information distance (D_{max}
) loss. Samples from a partition
distribution are required for this loss. See Vinh, Epps, and Bailey (2010).
"NID"
Normalized Information Distance. Computes the expectation of the normalized information distance loss. Samples from a partition distribution are required for this loss. See Vinh, Epps, and Bailey (2010) and Rastelli and Friel (2018).
The functions RI
and ARI
are convenience
functions. Note that:
binder(p1, p2, a=1) = ( 1 - RI(p1, p2) )*(n-1)/n
omARI(p1, p2) = 1 - ARI(p1, p2)
A numeric vector.
W. M. Rand (1971), Objective Criteria for the Evaluation of Clustering Methods. Journal of the American Statistical Association, 66: 846–850.
D. A. Binder (1978), Bayesian cluster analysis, Biometrika 65, 31-38.
L. Hubert and P. Arabie (1985), Comparing Partitions. Journal of Classification, 2, 193–218.
D. B. Dahl (2006), Model-Based Clustering for Expression Data via a Dirichlet Process Mixture Model, in Bayesian Inference for Gene Expression and Proteomics, Kim-Anh Do, Peter Müller, Marina Vannucci (Eds.), Cambridge University Press.
J. W. Lau and P. J. Green (2007), Bayesian model based clustering procedures, Journal of Computational and Graphical Statistics 16, 526-558.
M. Meilă (2007), Comparing Clusterings - an Information Based Distance. Journal of Multivariate Analysis, 98: 873–895.
D. B. Dahl and M. A. Newton (2007), Multiple Hypothesis Testing by Clustering Treatment Effects, Journal of the American Statistical Association, 102, 517-526.
A. Fritsch and K. Ickstadt (2009), An improved criterion for clustering based on the posterior similarity matrix, Bayesian Analysis, 4, 367-391.
N. X. Vinh, J. Epps, and J. Bailey (2010), Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance, Journal of Machine Learning Research, 11, 2837-2854.
S. Wade and Z. Ghahramani (2018), Bayesian cluster analysis: Point estimation and credible balls. Bayesian Analysis, 13:2, 559-626.
R. Rastelli and N. Friel (2018), Optimal Bayesian estimators for latent variable cluster models. Statistics and Computing, 28, 1169-1186.
D. B. Dahl, D. J. Johnson, and P. Müller (2022), Search Algorithms and Loss Functions for Bayesian Clustering, Journal of Computational and Graphical Statistics, 31(4), 1189-1201, \Sexpr[results=rd]{tools:::Rd_expr_doi("10.1080/10618600.2022.2069779")}.
psm
, salso
, dlso
# For examples, use 'nCores=1' per CRAN rules, but in practice omit this.
data(iris.clusterings)
partitions <- iris.clusterings[1:5,]
# R_CARGO \dontrun{
# R_CARGO # Example disabled since Cargo was not found when installing from source package.
# R_CARGO # You can still run the example if you install Cargo. Hint: cargo::install().
all.equal(partition.loss(partitions, partitions, loss=binder(a=1.4)),
binder(partitions, partitions, a=1.4))
all.equal(partition.loss(partitions, partitions, loss=omARI()),
omARI(partitions, partitions))
all.equal(partition.loss(partitions, partitions, loss=VI(a=0.8)),
VI(partitions, partitions, a=0.8))
truth <- iris.clusterings[1,]
estimate <- iris.clusterings[2,]
VI(truth, estimate, a=1.0)
n <- length(truth)
all.equal(binder(truth, estimate), ( 1 - RI(truth, estimate) ) * (n-1) / n)
all.equal(omARI(truth, estimate), 1 - ARI(truth, estimate))
# R_CARGO }
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.