| bqs | R Documentation |
Estimates the expected quadratic score for clustering solutions provided by a list of candidate models, methods or algorithmic settings.
bqs(
data,
methodset,
B = 10,
type = "smooth",
oob = FALSE,
ncores = detectCores() - 2,
alpha = 0.05,
rankby = ifelse(B == 0, "mean", "lq"),
boot_na_share = 0.25,
savescores = FALSE,
saveparams = FALSE
)
data |
a numeric vector, matrix, or data frame of observations. Rows
correspond to observations and columns correspond to variables/features.
Categorical variables and |
methodset |
a function, a list of functions, or a |
B |
an integer |
type |
character string in |
oob |
logical or character string in |
ncores |
an integer, it defines the number of cores used for parallel
computing (see Details). Setting |
alpha |
a number in |
rankby |
character string in |
boot_na_share |
a numeric value in |
savescores |
logical, if |
saveparams |
logical, if |
The function implements the estimation and selection of an appropriate
clustering based on the methodology proposed in Coraggio and Coretto (2023).
It also allows out-of-bag score estimates alongside the empirical bootstrap
estimates considered in that paper. Since out-of-bag estimates reuse the
same bootstrap samples, oob=TRUE adds only a small extra cost.
Setting B=0 gives a no-resampling baseline rather than a bootstrap
estimate.
Choice of B. In theory B should be as large as
possible, however, if the list of methods is large and the computational
capacity is modest, a large B may require long run times. Coraggio
and Coretto (2023) show experiments where changing from B=1000 to
B=100 introduces a marginal increase in variability. B=100
should be considered as a lower bound. In the case where one has very large
method lists, high-dimensional datasets and demanding methods, a possible
strategy to reduce the computational cost is as follows:
set a small value of B, e.g., B=50 or even less.
Analyze the methods' ranking and identify those methods that report score values that are small compared to the top performers.
Narrow down the methodset list and repeat the bootstrap
estimation with a value of B that is as large as possible relative
to available computational resources.
Parallel computing. Bootstrap sampling is performed using
foreach-based parallel computation via the doParallel
parallel backend. Note that depending on the system settings, and how the
functions in methodset make use of parallelism and/or multi-threading
computing, increasing ncores may not produce the desired reduction in
computing time. A common source of inefficiency is nested parallelism:
foreach distributes work across R workers while linear algebra
libraries (e.g., OpenBLAS, Intel Math Kernel Library (MKL), BLIS,
Accelerate) also start their own threads inside each worker. By default,
bqs avoids this oversubscription by forcing each worker to use a
single BLAS/OpenMP thread whenever process-level parallelism is active.
methodset argument. The methodset argument allows in
input a function, list, or output from mset functions:
mset_user, mset_gmix, mset_kmeans, mset_pam. It
is also possible to give any combination of these, concatenated with the
mbind function. When passing a function, either as a single
element or in a list, this must take the data set as its first argument,
and must return in output at least a list named "params",
conforming with the return value of clust2params, i.e. a list
containing proportion, mean and cov elements,
representing the estimated clusters' parameters.
When rankby is not NA, the best_* components are
computed by re-estimating the selected method on the full data set. For
stochastic methods, use set.seed() if reproducibility is required.
An S3 object of class bqs. The exact components depend on
type, oob, rankby, savescores, and
saveparams.
Any available score summary among smooth, hard,
oob_smooth, and oob_hard is a data frame with one row per
method and columns:
idindex of the method in methodset;
rankrank according to rankby;
meanestimated expected score;
sterrstandard error of the mean score;
lower_qntlower confidence limit for the mean score;
upper_qntupper confidence limit for the mean score;
n_obsnumber of valid bootstrap samples;
n_missingnumber of filtered erroneous cases.
Other output components are:
returned if type="smooth" or type="both".
returned if type="hard" or type="both".
returned if type="smooth" or type="both" and
oob=TRUE or oob="only".
returned if type="hard" or type="both" and
oob=TRUE or oob="only".
list with components method_name and solution for the
best smooth-scoring method. Returned only when ranking is available and
a rank-1 solution exists.
analogous object for the hard score.
analogous object for the out-of-bag smooth score.
analogous object for the out-of-bag hard score.
a list containing information about the input data set necessary
for the fruition of the returned object.
number of bootstrap replicates.
the elements of methodset for which a solution is produced.
the ranking criterion.
a list returned if savescores=TRUE and/or saveparams=TRUE.
Let n=sample size, B=bootstrap samples,
M= number of methods in methodset. It may contain:
boot_id:an array of dimension n x B where the j-th column
contains the indexes of the observed data points belonging to the
j-th bootstrap sample. That is,
data[boot_id[, j], ] gives the j-th bootstrap data
set.
scores:an array of dimension (M x 3 x B) returned if
savescores=TRUE. It stores error codes and hard/smooth scores
for each bootstrap replicate.
oob_scores:returned if oob=TRUE or oob="only" and
savescores=TRUE. It is an array organized as the previous
object scores, but for out-of-bag estimates.
params:a list returned if saveparams=TRUE. params[[m]]
contains estimated cluster parameters for methodset[[m]]
where m=1,...,M. Each member of the list is a list of
length B where params[[m]][[b]] contains the cluster
parameters fitted by the m-th method on the b-th
bootstrap sample.
Coraggio, Luca and Pietro Coretto (2023). Selecting the number of clusters, clustering models, and algorithms. A unifying approach based on the quadratic discriminant score. Journal of Multivariate Analysis, Vol. 196(105181), 1-20. doi: \Sexpr[results=rd]{tools:::Rd_expr_doi("10.1016/j.jmva.2023.105181")}
mset_user, mset_gmix,
mset_kmeans, pam,
mbind, clust2params
# load data
data("banknote")
dat <- banknote[-1]
## set up methods
## see also help('mset_user') and related functions
KM <- mset_kmeans(K = 3)
GMIX <- mset_gmix(K = 3, erc = c(1, 100))
# combine tuned methods
mlist <- mbind(KM, GMIX)
# perform bootstrap
# change B and ncores to a much larger value in real problems
res <- bqs(dat, mlist, B = 3, type = "both", rankby = "lq",
ncores = 1, oob = TRUE, savescores = TRUE, saveparams = FALSE)
names(res)
res
## Not run:
# The following example is more realistic but may take time
# ----------------------------------------------------------
# load data
data("banknote")
dat <- banknote[-1]
# set up kmeans, see help('mset_kmeans')
KM <- mset_kmeans(K = 2:5)
# set up Gaussian model-based clustering via gmix()
GMIX <- mset_gmix(K = 2:5, erc = c(1, 50, 100))
# set up Gaussian model-based clustering via library("mclust")
# see examples in help('mset_user')
require(mclust)
mc_wrapper <- function(data, K, ...){
y <- Mclust(data, G = K, ...)
y[["params"]] <- list(proportion = y$parameters$pro,
mean = y$parameters$mean,
cov = y$parameters$variance$sigma)
return(y)
}
MC <- mset_user(fname = "mc_wrapper", K = 2:5,
modelNames = c("EEI", "VVV"))
# combine tuned methods
mlist <- mbind(KM, GMIX, MC)
# perform bootstrap
# set 'ncores' to the number of available physical cores
res <- bqs(dat, mlist, B = 100, type = "both", rankby = "lq", ncores = 1,
oob = TRUE, savescores = TRUE, saveparams = FALSE)
res
## End(Not run)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.