bqs: Bootstrapping quadratic scores

View source: R/bqs.R

bqs {qcluster}    R Documentation

Bootstrapping quadratic scores

Description

Estimates the expected quadratic score for clustering solutions provided by a list of candidate models, methods, or algorithmic settings.

Usage

   bqs(data,
       methodset,
       B = 10,
       type = "smooth",
       oob = FALSE, 
       ncores = detectCores() - 2,
       alpha = 0.05,
       rankby = ifelse(B == 0, "mean", "lq"),
       boot_na_share = 0.25,
       savescores = FALSE,
       saveparams = FALSE)

Arguments

data

a numeric vector, matrix, or data frame of observations. Rows correspond to observations and columns correspond to variables/features. Categorical variables and NA values are not allowed.

methodset

a list of functions. Each function in the list takes the data as input and provides a clustering to be scored (see Details).

B

an integer >= 0. If B=0, the function fits and scores the clustering methods on the entire data set without resampling (see Details). If B>=1, it sets the number of bootstrap replicates (see Details).

type

a character string specifying the type of score. Possible values are {"smooth", "hard", "both"}. If ="smooth" (default), only the smooth score is estimated. If ="hard", only the hard score is estimated. If ="both", both the smooth and the hard scores are estimated.

oob

a logical or character string specifying whether out-of-bag bootstrap is performed. Possible values are {FALSE, TRUE, "only"}. If =FALSE (default), out-of-bag bootstrap is not performed. If =TRUE, out-of-bag bootstrap is performed alongside the empirical bootstrap sampling. If ="only", only the out-of-bag bootstrap is performed.

ncores

an integer defining the number of cores used for parallel computing (see Details).

alpha

a number in (0,1); the empirical bootstrap confidence intervals are computed with confidence level 1-alpha (both tails).

rankby

a character string specifying how the scored solutions are ranked. Possible values are {"lq", "mean", "1se"}. With ="lq" (default for B>=1), solutions are ranked by maximizing the estimated lower limit of the (1-alpha) bootstrap confidence interval for the expected score. With ="mean" (default for B=0), solutions are ranked by maximizing the estimated expected score. With ="1se", solutions are ranked by maximizing the estimated lower limit of the confidence interval for the expected score whose semi-length equals one standard error. The expected score's standard error is approximated using the bootstrap distribution.

boot_na_share

a numeric value in (0,1). During bootstrapping, a method's score is set to NA if the underlying computation runs into errors. Methods producing more than B * boot_na_share errors are excluded from the comparison.

savescores

logical; if =TRUE, the estimated scores for each bootstrap sample are returned.

saveparams

logical; if =TRUE, the estimated cluster parameters for each bootstrap sample are returned.

Details

The function implements the estimation and selection of an appropriate clustering based on the methodology proposed in Coraggio and Coretto (2023). In addition, it offers the possibility of obtaining score estimates via out-of-bag bootstrap sampling alongside the empirical bootstrap estimates proposed in the aforementioned paper. Note that the out-of-bag bootstrap estimates are obtained from the same samples used for the empirical bootstrap; therefore, oob=TRUE adds only a small computational cost.

Choice of B. In theory, B should be as large as possible; however, if the list of methods is large and the computational capacity is modest, a large B may require long run times. Coraggio and Coretto (2023) show experiments where changing from B=1000 to B=100 introduces only a marginal increase in variability. B=100 should be considered a lower bound. When one has very large method lists, high-dimensional data sets, and demanding methods, a possible strategy to reduce the computational cost is as follows (a code sketch is given after the list):

  1. Set a small value of B, e.g., B=50 or even less.

  2. Analyze the methods' ranking and identify methods whose score values are small compared to the top performers.

  3. Narrow down the methodset list and repeat the bootstrap estimation with a value of B as large as possible relative to the available computational resources.
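
A minimal sketch of this strategy, in the style of the Examples below. The object names res1, keep, and res2 are illustrative, and subsetting the combined method list like an ordinary list is an assumption:

   ## Step 1: coarse screening with a small B
   mlist <- mbind(mset_kmeans(K = 2:10), mset_gmix(K = 2:10))
   res1  <- bqs(dat, mlist, B = 50, type = "smooth")
   res1$smooth   # inspect the ranking; spot clearly dominated methods

   ## Steps 2-3: retain only the competitive methods and increase B
   keep <- res1$smooth$id[1:5]   # rows are sorted by rank; e.g., keep the top 5
   res2 <- bqs(dat, mlist[keep], B = 1000, type = "smooth")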

Parallel computing. Bootstrap sampling is performed via foreach-based parallel computation using the doParallel parallel backend. Note that, depending on the system settings and on how the functions in methodset exploit parallelism and/or multi-threading, increasing ncores may not produce the desired reduction in computing time. For instance, this happens when using linear algebra routines optimized for multi-threaded computing (e.g., OpenBLAS, Intel Math Kernel Library (MKL), and so on). These optimized shared libraries already implement multi-threading, and one has to find the optimal trade-off between distributing processes over physical cores and multi-threading at the logical-unit level of the same physical unit. Unfortunately, there is no universal recipe, and much depends on hardware architecture, operating system, shared libraries, etc. We obtained the best results using OpenBLAS (the version tagged serial) and setting ncores equal to the number of physical cores.
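
For instance, one way to set ncores to the number of physical cores uses parallel::detectCores(); this is a suggestion consistent with the advice above, not a requirement:

   ## count physical cores only (logical = FALSE ignores hyper-threaded units)
   ncores_phys <- parallel::detectCores(logical = FALSE)
   res <- bqs(dat, mlist, B = 100, ncores = ncores_phys)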

methodset argument. The methodset argument accepts a function, a list of functions, or the output of the mset functions: mset_user, mset_gmix, mset_kmeans, mset_pam. It is also possible to give any combination of these, concatenated with the mbind function. When a function is passed, either alone or within a list, it must take the data set as its first argument, and must return an output containing at least a list named "params" conforming to the return value of clust2params, i.e. a list with elements proportion, mean, and cov representing the estimated clusters' parameters.
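
As an illustration of this interface, the following hypothetical wrapper around stats::kmeans returns a "params" list with proportion, mean, and cov elements. The name km_method is illustrative, and the shapes used (mean as a variables-by-clusters matrix, cov as a variables-by-variables-by-clusters array, as in the mc_wrapper example below) are assumptions for the sketch:

   km_method <- function(data, K = 3) {
       fit <- stats::kmeans(data, centers = K)
       ## build the "params" component expected by bqs
       params <- list(
           proportion = as.numeric(table(fit$cluster)) / nrow(data),
           mean       = t(fit$centers),   # variables x clusters
           cov        = sapply(seq_len(K), function(k) {
                            cov(data[fit$cluster == k, , drop = FALSE])
                        }, simplify = "array")   # variables x variables x clusters
       )
       list(params = params, cluster = fit$cluster)
   }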

Value

An S3 object of class bqs. Output components are as follows:

smooth

a data.frame returned if type="smooth" or type="both". It contains a summary of the estimated scores. Rows correspond to the competing methods in methodset, sorted by the specified ranking criterion. Columns are as follows:

id:

index of the method in the corresponding methodset list;

rank:

rank of the clustering solution according to rankby;

mean:

expected score;

sterr:

standard error for the mean score;

lower_qnt:

lower limit of the confidence interval for the mean score;

upper_qnt:

upper limit of the confidence interval for the mean score;

n_obs:

number of valid bootstrap samples after filtering erroneous cases;

n_missing:

number of filtered erroneous cases.

hard

a data.frame returned if type="hard" or type="both". It reports the results for the hard score, in analogy with the previous component smooth.

oob_smooth

a data.frame returned if type="smooth" or type="both", and oob=TRUE or oob="only". It reports the results for the smooth score estimated on the out-of-bag bootstrap samples, in analogy with the previous components smooth and hard.

oob_hard

a data.frame returned if type="hard" or type="both", and oob=TRUE or oob="only". It reports the results for the hard score estimated on the out-of-bag bootstrap samples, in analogy with the previous components smooth and hard.

best_smooth

clustering produced by the best method, according to the specified rankby criterion, applied to the smooth score estimated via empirical bootstrap sampling.

best_hard

clustering produced by the best method, according to the specified rankby criterion, applied to the hard score estimated via empirical bootstrap sampling.

best_oob_smooth

clustering produced by the best method, according to the specified rankby criterion, applied to the smooth score estimated on the out-of-bag bootstrap samples.

best_oob_hard

clustering produced by the best method, according to the specified rankby criterion, applied to the hard score estimated on the out-of-bag bootstrap samples.

data

a list containing information about the input data set needed to use the returned object.

B

number of bootstrap replicates.

methodset

the elements of methodset for which a solution is produced.

rankby

the ranking criterion.

raw

a list that allows tracing the bootstrap sampling at almost every stage (a short access example is given after the list). Let n = sample size, B = number of bootstrap samples, M = number of methods in methodset. It contains the following objects:

boot_id:

an array of dimension (n x B) whose j-th column contains the indexes of the observed data points belonging to the j-th bootstrap sample. That is, data[boot_id[, j], ] gives the j-th bootstrap data set.

scores:

an array of dimension (M x 3 x B) returned if savescores=TRUE. It reports both the hard and the smooth scores estimated in each bootstrap replicate. scores[, 1, ] reports code=1 if the corresponding bootstrap sample has been excluded because of errors (otherwise code=0).

oob_scores:

an array returned if oob=TRUE or oob="only" and savescores=TRUE. It is organized like the previous object scores, but contains the out-of-bag bootstrap estimates.

params:

a list returned if saveparams=TRUE. params[[m]] contains estimated cluster parameters for methodset[[m]] where m=1,...,M. Each member of the list is a list of length B where params[[m]][[b]] contains the cluster parameters fitted by the m-th method on the b-th bootstrap sample.
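
For instance, assuming res is a bqs object computed with savescores=TRUE as in the Examples below, the j-th bootstrap sample can be traced as follows (the index j is illustrative):

   j <- 1
   boot_dat <- dat[res$raw$boot_id[, j], ]   # j-th bootstrap data set
   res$raw$scores[, , j]                     # scores of all methods on sample j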

References

Coraggio, Luca and Pietro Coretto (2023). Selecting the number of clusters, clustering models, and algorithms. A unifying approach based on the quadratic discriminant score. Journal of Multivariate Analysis, Vol. 196(105181), 1-20. doi:10.1016/j.jmva.2023.105181

See Also

mset_user, mset_gmix, mset_kmeans, mset_pam, mbind, clust2params

Examples


# load data
data("banknote")
dat <- banknote[-1]

## set up methods
## see also help('mset_user')  and related functions 
KM   <- mset_kmeans(K = 3)
GMIX <- mset_gmix(K=3, erc=c(1,100)) 

# combine tuned methods
mlist <- mbind(KM, GMIX)

# perform bootstrap
# change B and ncores to a much larger value in real problems 
res <- bqs(dat, mlist, B = 3, type = "both", rankby="lq",
         ncores = 1, oob = TRUE, savescores = TRUE, saveparams = FALSE)
res


## Not run: 
# The following example is more realistic but may take time
# ----------------------------------------------------------
# load data
data("banknote")
dat <- banknote[-1]

# set up kmeans, see help('mset_kmeans')
KM    <- mset_kmeans(K = 2:5)

# set up Gaussian model-based clustering via gmix() 
GMIX  <- mset_gmix(K=2:5, erc=c(1, 50 , 100)) 

# set up Gaussian model-based clustering via library("mclust")
# see examples in help('mset_user')
require(mclust)
mc_wrapper <- function(data, K, ...){
    y <- Mclust(data, G = K, ...)
    y[["params"]] <- list(proportion = y$parameters$pro,
                          mean = y$parameters$mean,
                          cov = y$parameters$variance$sigma)
    return(y)
    }
MC <- mset_user(fname = "mc_wrapper", K = 2:5, modelNames = c("EEI", "VVV"))


# combine tuned methods
mlist <- mbind(KM, GMIX, MC)

# perform bootstrap
# set 'ncores' to the number of available physical cores
res <- bqs(dat, mlist, B = 100, type = "both", rankby="lq", ncores=1,
           oob = TRUE, savescores = TRUE, saveparams = FALSE)
res

## End(Not run)  
