bqs | R Documentation

Description

Estimates the expected quadratic score for clustering solutions provided by a list of candidate models, methods, or algorithmic settings.
Usage

bqs(data,
methodset,
B = 10,
type = "smooth",
oob = FALSE,
ncores = detectCores() - 2,
alpha = 0.05,
rankby = ifelse(B == 0, "mean", "lq"),
boot_na_share = 0.25,
savescores = FALSE,
saveparams = FALSE)
Arguments

data: a numeric vector, matrix, or data frame of observations. Rows correspond to observations and columns correspond to variables/features. Categorical variables and NA values are not allowed.
methodset: a list of functions. A function in the list takes the data set as its first argument and returns a clustering solution (see Details).
B: an integer specifying the number of bootstrap replicates.
type: a character string specifying the type of score. Possible values are "smooth", "hard", or "both".
oob: logical or character string specifying whether out-of-bag bootstrap estimates are computed. Possible values are FALSE, TRUE, and "only".
ncores: an integer defining the number of cores used for parallel computing (see Details).
alpha: a number in (0, 1) giving the level at which bootstrap quantiles of the score distribution are computed (see Details).
rankby: a character string specifying how the scored solutions are ranked. Possible values are "lq" and "mean" (see Details).
boot_na_share: a numeric value in (0, 1). A method's score is reported as NA when the share of bootstrap replicates on which the method fails exceeds boot_na_share.
savescores: logical; if TRUE, the scores computed on each bootstrap replicate are kept in the output (see the raw component below).
saveparams: logical; if TRUE, the cluster parameters estimated on each bootstrap replicate are kept in the output (see the raw component below).
Details

The function implements the estimation and selection of an appropriate clustering based on the methodology proposed in Coraggio and Coretto (2023). In addition, it adds the possibility of obtaining score estimates using out-of-bag bootstrap sampling alongside the empirical bootstrap-based estimates proposed in the aforementioned paper. Note that the out-of-bag bootstrap estimates are obtained using the same samples used for the empirical bootstrap; therefore, oob=TRUE adds only a small computational cost.
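For instance, a minimal sketch (dat and mlist are assumed to be defined as in the Examples section below) requesting both estimate types in a single call:

## Sketch: empirical and out-of-bag estimates from the same bootstrap run;
## 'dat' and 'mlist' are defined as in the Examples section below
res <- bqs(dat, mlist, B = 100, oob = TRUE, type = "both")
res$smooth       # empirical bootstrap estimates (smooth score)
res$obb_smooth   # out-of-bag estimates from the same samples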
Choice of B. In theory, B should be as large as possible; however, if the list of methods is large and the computational capacity is modest, a large B may require long run times. Coraggio and Coretto (2023) show experiments where changing from B=1000 to B=100 introduces only a marginal increase in variability. B=100 should be considered a lower bound. In the case of very large method lists, high-dimensional datasets, and demanding methods, a possible strategy to reduce the computational cost is as follows (see the sketch after this list):

1. Set a small value of B, e.g., B=50 or even less.
2. Analyze the methods' ranking and identify those methods that report score values that are small compared to the top performers.
3. Narrow down the methodset list and repeat the bootstrap estimation with a value of B that is as large as possible relative to the available computational resources.
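A hypothetical sketch of this two-stage strategy (the pruning step and the retained methods below are purely illustrative):

## Stage 1: cheap screening run with a small B
pilot <- bqs(dat, mlist, B = 50, type = "smooth", ncores = 1)
print(pilot)
## Inspect the ranking and keep only the competitive methods
## (the narrowed-down list below is illustrative, not a recommendation)
mlist_top <- mbind(mset_kmeans(K = 2:3), mset_gmix(K = 2:3, erc = c(1, 100)))
## Stage 2: rerun with B as large as resources allow
final <- bqs(dat, mlist_top, B = 1000, type = "smooth", rankby = "lq", ncores = 1)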
Parallel computing. Bootstrap sampling is performed using foreach-based parallel computation via the doParallel parallel backend. Note that, depending on the system settings and on how the functions in methodset make use of parallelism and/or multi-threading, increasing ncores may not produce the desired reduction in computing time. For instance, this happens when using linear algebra routines optimized for multi-threaded computing (e.g., OpenBLAS, Intel Math Kernel Library (MKL), and so on). These optimized shared libraries already implement multi-threading, and it is necessary to find the optimal trade-off between distributing processes over physical cores and multi-threading at the logical unit level of the same physical unit. Unfortunately, there is no universal recipe, and much depends on hardware architectures, operating systems, shared libraries, etc. We obtained the best results using OpenBLAS (the version tagged serial) and setting ncores equal to the number of physical cores.
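As an illustration, assuming the third-party RhpcBLASctl package is available (it is an assumption of this sketch, not a dependency of this package), one can pin BLAS to a single thread and let bqs distribute work over the physical cores:

## Sketch: avoid oversubscription between doParallel workers and
## multi-threaded BLAS (RhpcBLASctl is an assumption, not a dependency)
library(RhpcBLASctl)
blas_set_num_threads(1)      # single-threaded linear algebra
n_phys <- get_num_cores()    # number of physical cores
res <- bqs(dat, mlist, B = 100, ncores = n_phys)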
The methodset argument. The methodset argument accepts a function, a list, or the output of the mset functions: mset_user, mset_gmix, mset_kmeans, mset_pam. It is also possible to give any combination of these, concatenated with the mbind function. When passing a function, either as a single element or in a list, it must take the data set as its first argument and must return in output at least a list named "params", conforming with the return value of clust2params, i.e. a list containing proportion, mean and cov elements, representing the estimated clusters' parameters. An illustrative wrapper is sketched below.
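For instance, a hypothetical wrapper around stats::kmeans conforming to this contract (the wrapper is illustrative; it relies on clust2params to build the "params" element from the hard partition):

## Illustrative wrapper: returns the fitted object augmented with the
## required "params" list (proportion, mean, cov) via clust2params()
km_wrapper <- function(data, K = 3, ...) {
  y <- stats::kmeans(data, centers = K, ...)
  y$params <- clust2params(data, y$cluster)
  y
}
res <- bqs(dat, list(km3 = km_wrapper), B = 10)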
Value

An S3 object of class bqs. Output components are as follows:
smooth: a data.frame with scores and rankings for the smooth score; returned if type = "smooth" or type = "both".
hard: a data.frame with scores and rankings for the hard score; returned if type = "hard" or type = "both".
obb_smooth: a data.frame with out-of-bag scores and rankings for the smooth score; returned if out-of-bag estimates are requested via oob and type = "smooth" or type = "both".
obb_hard: a data.frame with out-of-bag scores and rankings for the hard score; returned if out-of-bag estimates are requested via oob and type = "hard" or type = "both".
best_smooth: clustering produced by the best method for the smooth score, according to the specified rankby criterion.
best_hard: clustering produced by the best method for the hard score, according to the specified rankby criterion.
best_obb_smooth: clustering produced by the best method for the out-of-bag smooth score, according to the specified rankby criterion.
best_obb_hard: clustering produced by the best method for the out-of-bag hard score, according to the specified rankby criterion.
data: a list containing information about the input data.
B: number of bootstrap replicates.
methodset: the elements of the input methodset list.
rankby: the ranking criterion.
raw: a list that allows tracing the bootstrap sampling at almost every stage; its contents depend on the savescores and saveparams settings.
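For instance, a fitted object from the Examples below can be inspected as follows (a sketch; the availability of each component depends on the type, oob, savescores, and saveparams settings):

res$smooth        # scores and ranking for the smooth score
res$best_smooth   # best clustering under the 'rankby' criterion
res$B             # number of bootstrap replicates
str(res$raw)      # bootstrap traces (if savescores/saveparams were set)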
References

Coraggio, Luca and Pietro Coretto (2023). Selecting the number of clusters, clustering models, and algorithms. A unifying approach based on the quadratic discriminant score. Journal of Multivariate Analysis, Vol. 196(105181), 1-20. doi:10.1016/j.jmva.2023.105181
See Also

mset_user, mset_gmix, mset_kmeans, mset_pam, mbind, clust2params
Examples

# load data
data("banknote")
dat <- banknote[-1]
## set up methods
## see also help('mset_user') and related functions
KM <- mset_kmeans(K = 3)
GMIX <- mset_gmix(K=3, erc=c(1,100))
# combine tuned methods
mlist <- mbind(KM, GMIX)
# perform bootstrap
# change B and ncores to a much larger value in real problems
res <- bqs(dat, mlist, B = 3, type = "both", rankby="lq",
ncores = 1, oob = TRUE, savescores = TRUE, saveparams = FALSE)
res
## Not run:
# The following example is more realistic but may take time
# ----------------------------------------------------------
# load data
data("banknote")
dat <- banknote[-1]
# set up kmeans, see help('mset_kmeans')
KM <- mset_kmeans(K = 2:5)
# set up Gaussian model-based clustering via gmix()
GMIX <- mset_gmix(K = 2:5, erc = c(1, 50, 100))
# set up Gaussian model-based clustering via library("mclust")
# see examples in help('mset_user')
require(mclust)
mc_wrapper <- function(data, K, ...){
y <- Mclust(data, G = K, ...)
y[["params"]] <- list(proportion = y$parameters$pro,
mean = y$parameters$mean,
cov = y$parameters$variance$sigma)
return(y)
}
MC <- mset_user(fname = "mc_wrapper", K = 2:5, modelNames = c("EEI", "VVV"))
# combine tuned methods
mlist <- mbind(KM, GMIX, MC)
# perform bootstrap
# set 'ncores' to the number of available physical cores
res <- bqs(dat, mlist, B = 100, type = "both", rankby="lq", ncores=1,
oob = TRUE, savescores = TRUE, saveparams = FALSE)
res
## End(Not run)