salso: SALSO Greedy Search
In salso: Search Algorithms and Loss Functions for Bayesian Clustering

Description

This function provides a partition to summarize a partition distribution using the SALSO greedy search method (Dahl, Johnson, and Müller, 2022). The implementation currently supports the minimization of several partition estimation criteria. For details on these criteria, see partition.loss.

Usage

salso(
  x,
  loss = VI(),
  maxNClusters = 0,
  nRuns = 16,
  maxZealousAttempts = 10,
  probSequentialAllocation = 0.5,
  nCores = 0,
  ...
)

Arguments

`x`	A `B`-by-`n` matrix, where each of the `B` rows represents a clustering of `n` items using cluster labels. For the `b`th clustering, items `i` and `j` are in the same cluster if `x[b,i] == x[b,j]`.
`loss`	The loss function to use, as indicated by `"binder"`, `"omARI"`, `"VI"`, `"NVI"`, `"ID"`, `"NID"`, or the result of calling a function with these names. Also supported are `"binder.psm"`, `"VI.lb"`, `"omARI.approx"`, or the result of calling a function with these names, in which case `x` above can optionally be a pairwise similarity matrix, i.e., an `n`-by-`n` symmetric matrix whose `(i,j)` element gives the (estimated) probability that items `i` and `j` are in the same subset (i.e., cluster) of a partition (i.e., clustering). The loss functions `"binder.psm"`, `"VI.lb"`, and `"omARI.approx"` are generally not recommended, and the current implementation requires `maxZealousAttempts = 0` and `probSequentialAllocation = 1.0` when they are used.
`maxNClusters`	The maximum number of clusters that can be considered by the optimization algorithm, which has important implications for the interpretability of the resulting clustering and can greatly influence the RAM needed for the optimization algorithm. If the supplied value is zero and `x` is a matrix of clusterings, the optimization is constrained by the maximum number of clusters among the clusterings in `x`. If the supplied value is zero and `x` is a pairwise similarity matrix, there is no constraint.
`nRuns`	The number of runs to try, although the actual number may differ for the following reasons: 1. The actual number is a multiple of the number of cores specified by the `nCores` argument, and 2. The search is curtailed when the `seconds` threshold is exceeded.
`maxZealousAttempts`	The maximum number of attempts for zealous updates, in which entire clusters are destroyed and items are sequentially reallocated. While zealous updates may be helpful in optimization, they also take more CPU time, which might be better spent on additional runs.
`probSequentialAllocation`	For the initial allocation, the probability of sequential allocation instead of using `sample(1:K, ncol(x), TRUE)`, where `K` is set according to the `maxNClusters` argument.
`nCores`	The number of CPU cores to use, i.e., the number of simultaneous runs at any given time. A value of zero indicates to use all cores on the system.
`...`	Extra arguments not intended for the end user, including: 1. `seconds`: Instead of performing all the requested runs, curtail the search after the specified expected number of seconds. Note that the function will finish earlier if all the requested runs are completed. The specified number of seconds does not account for the overhead involved in starting the search and returning results. 2. `maxScans`: The maximum number of full reallocation scans. The actual number of scans may be less than `maxScans` since the method stops if the result does not change between scans. 3. `probSingletonsInitialization`: When doing a sequential allocation to obtain the initial allocation, the probability of placing the first `maxNClusters` randomly-selected items in singleton subsets.
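
To make the two accepted forms of `x` concrete, here is a minimal base R sketch. It builds a small `B`-by-`n` matrix of clusterings, checks co-clustering with `x[b,i] == x[b,j]`, and averages the co-clustering indicators over the rows to get an estimated pairwise similarity matrix. The toy data and the name `psm_est` are made up for illustration and are not part of the package.

```r
# Toy data (illustrative only): B = 3 clusterings of n = 4 items.
x <- rbind(c(1, 1, 2, 2),
           c(1, 1, 1, 2),
           c(2, 2, 1, 1))
B <- nrow(x)
n <- ncol(x)

# In the first clustering, items 1 and 2 share a cluster; items 1 and 3 do not.
x[1, 1] == x[1, 2]  # TRUE
x[1, 1] == x[1, 3]  # FALSE

# Estimated pairwise similarity matrix: the (i, j) element is the
# proportion of the B clusterings in which items i and j share a label.
psm_est <- matrix(0, n, n)
for (b in seq_len(B)) {
  psm_est <- psm_est + outer(x[b, ], x[b, ], "==")
}
psm_est <- psm_est / B

psm_est[1, 2]  # 1: items 1 and 2 are together in every clustering
psm_est[1, 3]  # 1/3: together only in the second clustering
```

Only co-membership matters here: the third row uses different label values than the first, yet both encode the same partition.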

Value

An integer vector giving the estimated partition, encoded using cluster labels.
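
Because a partition is encoded using cluster labels, the label values themselves are arbitrary: two vectors describe the same partition whenever they group the items identically. The base R sketch below shows one way to compare two label vectors by relabeling each in order of first appearance. The helper `canonical_labels` is hypothetical and not part of the package.

```r
# Hypothetical helper: relabel clusters 1, 2, 3, ... in order of first appearance.
canonical_labels <- function(z) match(z, unique(z))

z1 <- c(3L, 3L, 1L, 2L)
z2 <- c(2L, 2L, 3L, 1L)

canonical_labels(z1)  # 1 1 2 3
canonical_labels(z2)  # 1 1 2 3

# Same partition, different label values.
identical(canonical_labels(z1), canonical_labels(z2))  # TRUE
```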

References

D. B. Dahl, D. J. Johnson, and P. Müller (2022), Search Algorithms and Loss Functions for Bayesian Clustering, Journal of Computational and Graphical Statistics, 31(4), 1189-1201, doi:10.1080/10618600.2022.2069779.

Examples

# For examples, use 'nCores=1' per CRAN rules, but in practice omit this.
data(iris.clusterings)
draws <- iris.clusterings
salso(draws, loss=VI(), nRuns=1, nCores=1)
salso(draws, loss=VI(a=0.7), nRuns=1, nCores=1)
salso(draws, loss=binder(), nRuns=1, nCores=1)
salso(iris.clusterings, binder(a=NULL), nRuns=4, nCores=1)
salso(iris.clusterings, binder(a=list(nClusters=3)), nRuns=4, nCores=1)
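
As a possible follow-up to the calls above, the sketch below stores the estimate and inspects it with base R. It assumes the salso package is installed and skips the call otherwise. The variable name `est` is illustrative.

```r
# Runs only if the salso package is available.
if (requireNamespace("salso", quietly = TRUE)) {
  library(salso)
  data(iris.clusterings)
  est <- salso(iris.clusterings, loss = VI(), nRuns = 4, nCores = 1)
  length(est)  # one label per item, i.e., ncol(iris.clusterings)
  table(est)   # cluster sizes in the estimated partition
}
```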

salso documentation built on April 11, 2025, 5:56 p.m.