bqs: Bootstrapping quadratic scores

View source: R/bqs.R

bqs {qcluster}    R Documentation

Bootstrapping quadratic scores

Description

Estimates the expected quadratic score for clustering solutions provided by a list of candidate models, methods, or algorithmic settings.

Usage

   bqs(data,
       methodset,
       B = 10,
       type = "smooth",
       oob = FALSE, 
       ncores = detectCores() - 2,
       alpha = 0.05,
       rankby = ifelse(B == 0, "mean", "lq"),
       boot_na_share = 0.25,
       savescores = FALSE,
       saveparams = FALSE)

Arguments

data

a numeric vector, matrix, or data frame of observations. Rows correspond to observations and columns correspond to variables/features. Categorical variables and NA values are not allowed.

methodset

a list of functions. Each function in the list takes the data as input and provides a clustering to be scored (see Details).

B

an integer >= 0. If B=0, the function fits and scores the clustering methods on the entire data set without resampling (see Details). If B>=1, it sets the number of bootstrap replicates (see Details).

type

a character string specifying the type of score. Possible values are {"smooth", "hard", "both"}. If ="smooth" (default), only the smooth score is estimated. If ="hard", only the hard score is estimated. If ="both", both the smooth and the hard scores are estimated.

oob

a logical or character string specifying whether out-of-bag bootstrap is performed. Possible values are {FALSE, TRUE, "only"}. If =FALSE (default), out-of-bag bootstrap is not performed. If =TRUE, out-of-bag bootstrap is performed alongside the empirical bootstrap sampling. If ="only", only the out-of-bag bootstrap is performed.

ncores

an integer defining the number of cores used for parallel computing (see Details).

alpha

a number in (0,1); the empirical bootstrap confidence intervals are computed with confidence level 1-alpha (both tails).

rankby

a character string specifying how the scored solutions are ranked. Possible values are {"lq", "mean", "1se"}. With ="lq" (default for B>=1), solutions are ranked by maximizing the estimated lower limit of the (1-alpha) bootstrap confidence interval for the expected score. With ="mean" (default for B=0), solutions are ranked by maximizing the estimated expected score. With ="1se", solutions are ranked by maximizing the estimated lower limit of the confidence interval for the expected score whose semi-length equals one standard error. The expected score's standard error is approximated using the bootstrap distribution.

boot_na_share

a numeric value in (0,1). During bootstrapping, a method's score is set to NA if the underlying computation runs into errors. Methods producing more than B * boot_na_share errors are excluded from the comparison.

savescores

logical; if =TRUE, the estimated scores for each bootstrap sample are returned.

saveparams

logical; if =TRUE, the estimated cluster parameters for each bootstrap sample are returned.

Details

The function implements the estimation and selection of an appropriate clustering based on the methodology proposed in Coraggio and Coretto (2023). In addition, it offers the possibility of obtaining score estimates via out-of-bag bootstrap sampling alongside the empirical bootstrap estimates proposed in the aforementioned paper. Note that the out-of-bag bootstrap estimates are obtained from the same samples used for the empirical bootstrap; therefore, oob=TRUE adds only a small computational cost.

Choice of B. In theory, B should be as large as possible; however, if the list of methods is large and the computational capacity is modest, a large B may require long run times. Coraggio and Coretto (2023) show experiments where changing from B=1000 to B=100 introduces only a marginal increase in variability. B=100 should be considered a lower bound. When one has very large method lists, high-dimensional data sets, and demanding methods, a possible strategy to reduce the computational cost is as follows (a code sketch is given after the list):

  1. Set a small value of B, e.g., B=50 or even less.

  2. Analyze the methods' ranking and identify methods whose score values are small compared to the top performers.

  3. Narrow down the methodset list and repeat the bootstrap estimation with a value of B as large as possible relative to the available computational resources.
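
A minimal sketch of this strategy, in the style of the Examples below. The object names res1, keep, and res2 are illustrative, and subsetting the combined method list like an ordinary list is an assumption:

   ## Step 1: coarse screening with a small B
   mlist <- mbind(mset_kmeans(K = 2:10), mset_gmix(K = 2:10))
   res1  <- bqs(dat, mlist, B = 50, type = "smooth")
   res1$smooth   # inspect the ranking; spot clearly dominated methods

   ## Steps 2-3: retain only the competitive methods and increase B
   keep <- res1$smooth$id[1:5]   # rows are sorted by rank; e.g., keep the top 5
   res2 <- bqs(dat, mlist[keep], B = 1000, type = "smooth")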

Parallel computing. Bootstrap sampling is performed via foreach-based parallel computation using the doParallel parallel backend. Note that, depending on the system settings and on how the functions in methodset exploit parallelism and/or multi-threading, increasing ncores may not produce the desired reduction in computing time. For instance, this happens when using linear algebra routines optimized for multi-threaded computing (e.g., OpenBLAS, Intel Math Kernel Library (MKL), and so on). These optimized shared libraries already implement multi-threading, and one has to find the optimal trade-off between distributing processes over physical cores and multi-threading at the logical-unit level of the same physical unit. Unfortunately, there is no universal recipe, and much depends on hardware architecture, operating system, shared libraries, etc. We obtained the best results using OpenBLAS (the version tagged serial) and setting ncores equal to the number of physical cores.
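
For instance, one way to set ncores to the number of physical cores uses parallel::detectCores(); this is a suggestion consistent with the advice above, not a requirement:

   ## count physical cores only (logical = FALSE ignores hyper-threaded units)
   ncores_phys <- parallel::detectCores(logical = FALSE)
   res <- bqs(dat, mlist, B = 100, ncores = ncores_phys)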

methodset argument. The methodset argument accepts a function, a list of functions, or the output of the mset functions: mset_user, mset_gmix, mset_kmeans, mset_pam. It is also possible to give any combination of these, concatenated with the mbind function. When a function is passed, either alone or within a list, it must take the data set as its first argument, and must return an output containing at least a list named "params" conforming to the return value of clust2params, i.e. a list with elements proportion, mean, and cov representing the estimated clusters' parameters.
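
As an illustration of this interface, the following hypothetical wrapper around stats::kmeans returns a "params" list with proportion, mean, and cov elements. The name km_method is illustrative, and the shapes used (mean as a variables-by-clusters matrix, cov as a variables-by-variables-by-clusters array, as in the mc_wrapper example below) are assumptions for the sketch:

   km_method <- function(data, K = 3) {
       fit <- stats::kmeans(data, centers = K)
       ## build the "params" component expected by bqs
       params <- list(
           proportion = as.numeric(table(fit$cluster)) / nrow(data),
           mean       = t(fit$centers),   # variables x clusters
           cov        = sapply(seq_len(K), function(k) {
                            cov(data[fit$cluster == k, , drop = FALSE])
                        }, simplify = "array")   # variables x variables x clusters
       )
       list(params = params, cluster = fit$cluster)
   }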

Value

An S3 object of class bqs. Output components are as follows:

smooth

a data.frame returned if type="smooth" or type="both". It contains a summary of the estimated scores. Rows correspond to the competing methods in methodset, sorted by the specified ranking criterion. Columns are as follows:

id:

index of the method in the corresponding methodset list;

rank:

rank of the clustering solution according to rankby;

mean:

expected score;

sterr:

standard error for the mean score;

lower_qnt:

lower limit of the confidence interval for the mean score;

upper_qnt:

upper limit of the confidence interval for the mean score;

n_obs:

number of valid bootstrap samples after filtering erroneous cases;

n_missing:

number of filtered erroneous cases.

hard

a data.frame returned if type="hard" or type="both". It reports the results for the hard score, in analogy with the previous component smooth.

oob_smooth

a data.frame returned if type="smooth" or type="both", and oob=TRUE or oob="only". It reports the results for the smooth score estimated on the out-of-bag bootstrap samples, in analogy with the previous components smooth and hard.

oob_hard

a data.frame returned if type="hard" or type="both", and oob=TRUE or oob="only". It reports the results for the hard score estimated on the out-of-bag bootstrap samples, in analogy with the previous components smooth and hard.

best_smooth

clustering produced by the best method, according to the specified rankby criterion, applied to the smooth score estimated via empirical bootstrap sampling.

best_hard

clustering produced by the best method, according to the specified rankby criterion, applied to the hard score estimated via empirical bootstrap sampling.

best_oob_smooth

clustering produced by the best method, according to the specified rankby criterion, applied to the smooth score estimated on the out-of-bag bootstrap samples.

best_oob_hard

clustering produced by the best method, according to the specified rankby criterion, applied to the hard score estimated on the out-of-bag bootstrap samples.

data

a list containing information about the input data set needed to use the returned object.

B

number of bootstrap replicates.

methodset

the elements of methodset for which a solution is produced.

rankby

the ranking criterion.

raw

a list that allows tracing the bootstrap sampling at almost every stage (a short access example is given after the list). Let n = sample size, B = number of bootstrap samples, M = number of methods in methodset. It contains the following objects:

boot_id:

an array of dimension (n x B) whose j-th column contains the indexes of the observed data points belonging to the j-th bootstrap sample. That is, data[boot_id[, j], ] gives the j-th bootstrap data set.

scores:

an array of dimension (M x 3 x B) returned if savescores=TRUE. It reports both the hard and the smooth scores estimated in each bootstrap replicate. scores[, 1, ] reports code=1 if the corresponding bootstrap sample has been excluded because of errors (otherwise code=0).

oob_scores:

an array returned if oob=TRUE or oob="only" and savescores=TRUE. It is organized like the previous object scores, but contains the out-of-bag bootstrap estimates.

params:

a list returned if saveparams=TRUE. params[[m]] contains estimated cluster parameters for methodset[[m]] where m=1,...,M. Each member of the list is a list of length B where params[[m]][[b]] contains the cluster parameters fitted by the m-th method on the b-th bootstrap sample.
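
For instance, assuming res is a bqs object computed with savescores=TRUE as in the Examples below, the j-th bootstrap sample can be traced as follows (the index j is illustrative):

   j <- 1
   boot_dat <- dat[res$raw$boot_id[, j], ]   # j-th bootstrap data set
   res$raw$scores[, , j]                     # scores of all methods on sample j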

References

Coraggio, Luca and Pietro Coretto (2023). Selecting the number of clusters, clustering models, and algorithms. A unifying approach based on the quadratic discriminant score. Journal of Multivariate Analysis, Vol. 196(105181), 1-20. doi:10.1016/j.jmva.2023.105181

See Also

mset_user, mset_gmix, mset_kmeans, mset_pam, mbind, clust2params

Examples


# load data
data("banknote")
dat <- banknote[-1]

## set up methods
## see also help('mset_user')  and related functions 
KM   <- mset_kmeans(K = 3)
GMIX <- mset_gmix(K=3, erc=c(1,100)) 

# combine tuned methods
mlist <- mbind(KM, GMIX)

# perform bootstrap
# change B and ncores to a much larger value in real problems 
res <- bqs(dat, mlist, B = 3, type = "both", rankby="lq",
         ncores = 1, oob = TRUE, savescores = TRUE, saveparams = FALSE)
res


## Not run: 
# The following example is more realistic but may take time
# ----------------------------------------------------------
# load data
data("banknote")
dat <- banknote[-1]

# set up kmeans, see help('mset_kmeans')
KM    <- mset_kmeans(K = 2:5)

# set up Gaussian model-based clustering via gmix() 
GMIX  <- mset_gmix(K=2:5, erc=c(1, 50 , 100)) 

# set up Gaussian model-based clustering via library("mclust")
# see examples in help('mset_user')
require(mclust)
mc_wrapper <- function(data, K, ...){
    y <- Mclust(data, G = K, ...)
    y[["params"]] <- list(proportion = y$parameters$pro,
                          mean = y$parameters$mean,
                          cov = y$parameters$variance$sigma)
    return(y)
    }
MC <- mset_user(fname = "mc_wrapper", K = 2:5, modelNames = c("EEI", "VVV"))


# combine tuned methods
mlist <- mbind(KM, GMIX, MC)

# perform bootstrap
# set 'ncores' to the number of available physical cores
res <- bqs(dat, mlist, B = 100, type = "both", rankby="lq", ncores=1,
           oob = TRUE, savescores = TRUE, saveparams = FALSE)
res

## End(Not run)  
