bqs: Bootstrapping quadratic scores

View source: R/bqs.R

bqsR Documentation

Bootstrapping quadratic scores

Description

Estimates the expected quadratic score for clustering solutions provided by a list of candidate models, methods or algorithmic settings.

Usage

bqs(
  data,
  methodset,
  B = 10,
  type = "smooth",
  oob = FALSE,
  ncores = detectCores() - 2,
  alpha = 0.05,
  rankby = ifelse(B == 0, "mean", "lq"),
  boot_na_share = 0.25,
  savescores = FALSE,
  saveparams = FALSE
)

Arguments

data

a numeric vector, matrix, or data frame of observations. Rows correspond to observations and columns correspond to variables/features. Categorical variables and NA values are not allowed.

methodset

a function, a list of functions, or a qcmethod object. Each method takes data as input and provides a clustering solution to be scored (see Details). See also mset_*() helpers and mbind.

B

an integer >=0. If B=0, methods are fitted and scored on the full data without resampling. If B>=1, B is the number of bootstrap replicates (see Details).

type

character string in c("smooth", "hard", "both"). "smooth" (default) estimates only the smooth score, "hard" only the hard score, and "both" estimates both.

oob

logical or character string in c(FALSE, TRUE, "only"). FALSE (default) skips out-of-bag scoring, TRUE adds it to the empirical bootstrap scoring, and "only" computes only the out-of-bag scores.

ncores

an integer, it defines the number of cores used for parallel computing (see Details). Setting ncores = 0 uses all detected cores.

alpha

a number in (0,1), the confidence-level for empirical bootstrap quantiles (both-tails).

rankby

character string in c("lq", "mean", "1se") or NA to skip ranking. "lq" (default) maximizes the lower limit of the 1-alpha bootstrap confidence interval, "mean" maximizes the estimated expected score, and "1se" maximizes the lower limit of a confidence interval whose semi-length is one estimated standard error.

boot_na_share

a numeric value in (0,1). During the bootstrapping a method's score is set to NA if the underlying computation runs into errors. Methods resulting in more than B * boot_na_share errors are excluded from the comparison.

savescores

logical, if =TRUE it returns estimated scores for each bootstrap sample.

saveparams

logical, if =TRUE it returns estimated cluster parameters for each bootstrap sample.

Details

The function implements the estimation and selection of an appropriate clustering based on the methodology proposed in Coraggio and Coretto (2023). It also allows out-of-bag score estimates alongside the empirical bootstrap estimates considered in that paper. Since out-of-bag estimates reuse the same bootstrap samples, oob=TRUE adds only a small extra cost. Setting B=0 gives a no-resampling baseline rather than a bootstrap estimate.

Choice of B. In theory B should be as large as possible, however, if the list of methods is large and the computational capacity is modest, a large B may require long run times. Coraggio and Coretto (2023) show experiments where changing from B=1000 to B=100 introduces a marginal increase in variability. B=100 should be considered as a lower bound. In the case where one has very large method lists, high-dimensional datasets and demanding methods, a possible strategy to reduce the computational cost is as follows:

  1. set a small value of B, e.g., B=50 or even less.

  2. Analyze the methods' ranking and identify those methods that report score values that are small compared to the top performers.

  3. Narrow down the methodset list and repeat the bootstrap estimation with a value of B that is as large as possible relative to available computational resources.

Parallel computing. Bootstrap sampling is performed using foreach-based parallel computation via the doParallel parallel backend. Note that depending on the system settings, and how the functions in methodset make use of parallelism and/or multi-threading computing, increasing ncores may not produce the desired reduction in computing time. A common source of inefficiency is nested parallelism: foreach distributes work across R workers while linear algebra libraries (e.g., OpenBLAS, Intel Math Kernel Library (MKL), BLIS, Accelerate) also start their own threads inside each worker. By default, bqs avoids this oversubscription by forcing each worker to use a single BLAS/OpenMP thread whenever process-level parallelism is active.

methodset argument. The methodset argument allows in input a function, list, or output from mset functions: mset_user, mset_gmix, mset_kmeans, mset_pam. It is also possible to give any combination of these, concatenated with the mbind function. When passing a function, either as a single element or in a list, this must take the data set as its first argument, and must return in output at least a list named "params", conforming with the return value of clust2params, i.e. a list containing proportion, mean and cov elements, representing the estimated clusters' parameters.

When rankby is not NA, the best_* components are computed by re-estimating the selected method on the full data set. For stochastic methods, use set.seed() if reproducibility is required.

Value

An S3 object of class bqs. The exact components depend on type, oob, rankby, savescores, and saveparams.

Any available score summary among smooth, hard, oob_smooth, and oob_hard is a data frame with one row per method and columns:

id

index of the method in methodset;

rank

rank according to rankby;

mean

estimated expected score;

sterr

standard error of the mean score;

lower_qnt

lower confidence limit for the mean score;

upper_qnt

upper confidence limit for the mean score;

n_obs

number of valid bootstrap samples;

n_missing

number of filtered erroneous cases.

Other output components are:

smooth

returned if type="smooth" or type="both".

hard

returned if type="hard" or type="both".

oob_smooth

returned if type="smooth" or type="both" and oob=TRUE or oob="only".

oob_hard

returned if type="hard" or type="both" and oob=TRUE or oob="only".

best_smooth

list with components method_name and solution for the best smooth-scoring method. Returned only when ranking is available and a rank-1 solution exists.

best_hard

analogous object for the hard score.

best_oob_smooth

analogous object for the out-of-bag smooth score.

best_oob_hard

analogous object for the out-of-bag hard score.

data

a list containing information about the input data set necessary for the fruition of the returned object.

B

number of bootstrap replicates.

methodset

the elements of methodset for which a solution is produced.

rankby

the ranking criterion.

raw

a list returned if savescores=TRUE and/or saveparams=TRUE. Let n=sample size, B=bootstrap samples, M= number of methods in methodset. It may contain:

boot_id:

an array of dimension n x B where the j-th column contains the indexes of the observed data points belonging to the j-th bootstrap sample. That is, data[boot_id[, j], ] gives the j-th bootstrap data set.

scores:

an array of dimension (M x 3 x B) returned if savescores=TRUE. It stores error codes and hard/smooth scores for each bootstrap replicate.

oob_scores:

returned if oob=TRUE or oob="only" and savescores=TRUE. It is an array organized as the previous object scores, but for out-of-bag estimates.

params:

a list returned if saveparams=TRUE. params[[m]] contains estimated cluster parameters for methodset[[m]] where m=1,...,M. Each member of the list is a list of length B where params[[m]][[b]] contains the cluster parameters fitted by the m-th method on the b-th bootstrap sample.

References

Coraggio, Luca and Pietro Coretto (2023). Selecting the number of clusters, clustering models, and algorithms. A unifying approach based on the quadratic discriminant score. Journal of Multivariate Analysis, Vol. 196(105181), 1-20. doi: \Sexpr[results=rd]{tools:::Rd_expr_doi("10.1016/j.jmva.2023.105181")}

See Also

mset_user, mset_gmix, mset_kmeans, pam, mbind, clust2params

Examples


# load data
data("banknote")
dat <- banknote[-1]

## set up methods
## see also help('mset_user') and related functions
KM   <- mset_kmeans(K = 3)
GMIX <- mset_gmix(K = 3, erc = c(1, 100))

# combine tuned methods
mlist <- mbind(KM, GMIX)

# perform bootstrap
# change B and ncores to a much larger value in real problems
res <- bqs(dat, mlist, B = 3, type = "both", rankby = "lq",
           ncores = 1, oob = TRUE, savescores = TRUE, saveparams = FALSE)
names(res)
res

## Not run: 
# The following example is more realistic but may take time
# ----------------------------------------------------------
# load data
data("banknote")
dat <- banknote[-1]

# set up kmeans, see help('mset_kmeans')
KM <- mset_kmeans(K = 2:5)

# set up Gaussian model-based clustering via gmix()
GMIX <- mset_gmix(K = 2:5, erc = c(1, 50, 100))

# set up Gaussian model-based clustering via library("mclust")
# see examples in help('mset_user')
require(mclust)
mc_wrapper <- function(data, K, ...){
  y <- Mclust(data, G = K, ...)
  y[["params"]] <- list(proportion = y$parameters$pro,
                        mean = y$parameters$mean,
                        cov = y$parameters$variance$sigma)
  return(y)
}
MC <- mset_user(fname = "mc_wrapper", K = 2:5,
                modelNames = c("EEI", "VVV"))

# combine tuned methods
mlist <- mbind(KM, GMIX, MC)

# perform bootstrap
# set 'ncores' to the number of available physical cores
res <- bqs(dat, mlist, B = 100, type = "both", rankby = "lq", ncores = 1,
           oob = TRUE, savescores = TRUE, saveparams = FALSE)
res

## End(Not run)


qcluster documentation built on June 5, 2026, 5:07 p.m.

Related to bqs in qcluster...