mset_user | R Documentation |
The function generates a software abstraction of a list of clustering models implemented through a set of tuned methods and algorithms. The base clustering methodology is provided via a user-defined function. This prototype is expanded into a list of functions, each combining tuning parameters and other algorithmic settings. The generated functions are ready to be called on the data set.
mset_user(fname, .packages = NULL, .export = NULL, ...)
fname |
a function implementing a user-defined clustering method. It clusters
a data set and outputs cluster parameters. |
.packages |
character vector of packages that the tasks in |
.export |
character vector of variables to export that are needed by
|
... |
parameters passed to |
The function produces functions implementing competing clustering methods
based on a prototype methodology implemented by the user via
the input argument fname
.
In particular, it builds a list of fname
-type functions each
corresponding to a specific setup in terms of
hyper-parameters (e.g. the number of clusters) and algorithm's
control parameters (e.g. initialization).
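The expansion described above can be pictured as a Cartesian product of the values supplied for each tuning parameter. The following is a minimal base-R sketch under that assumption. It is not the package's implementation: expand_settings and toy_method are hypothetical names used only for illustration, and the element names merely mirror those listed in the Value section.

```r
# Hypothetical illustration: expand tuning values into a list of closures,
# one per combination. This is an assumption about the general idea, not
# the code used by mset_user().
expand_settings <- function(fun, ...) {
  force(fun)
  grid <- expand.grid(..., stringsAsFactors = FALSE)
  lapply(seq_len(nrow(grid)), function(i) {
    # arguments of the i-th combination as a plain named list
    args <- lapply(grid, function(col) col[[i]])
    list(
      fullname = paste(names(args), unlist(args), sep = "=", collapse = ", "),
      callargs = args,
      fn = function(data) do.call(fun, c(list(data = data), args))
    )
  })
}

# Toy base method: simply reports the arguments it received
toy_method <- function(data, K, init) {
  list(n = length(data), K = K, init = init)
}

settings <- expand_settings(toy_method, K = c(2, 3), init = c("random", "kmeans"))
length(settings)               # one element per combination
settings[[1]]$fullname
out <- settings[[4]]$fn(1:10)  # call the last setting on some data
```

Each element stores its own copy of the arguments, so the generated functions only need the data at call time.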
Requirements for fname
.
fname
is a function implementing the base clustering method of
interest. It must have the following input argument
data:
a numeric vector, matrix, or data frame of observations. Rows
correspond to observations and columns correspond to
variables/features.
Categorical variables and NA
values are not allowed.
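One way to meet the requirement on data before calling the generated functions is to keep only numeric columns and complete rows. The sketch below uses base R only; prepare_data is a hypothetical helper, not a function of the package, and the built-in iris data is used purely for illustration.

```r
# Hypothetical helper (not part of the package): keep numeric columns only
# and drop rows containing NA values.
prepare_data <- function(x) {
  x <- as.data.frame(x)
  is_num <- vapply(x, is.numeric, logical(1))
  x <- x[, is_num, drop = FALSE]
  x <- x[stats::complete.cases(x), , drop = FALSE]
  as.matrix(x)
}

# iris has four numeric columns and one factor column (Species)
dat_ok <- prepare_data(iris)
dim(dat_ok)
```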
Additionally, fname
can have any other input parameter controlling
the underlying clustering model/method/algorithm. All these additional
parameters are passed to mset_user
via ...
(see Arguments).
The output of fname
must contain a list named params
with cluster parameters describing size, centrality and scatter.
Let P=
number of variables/features and K=
number of clusters.
The elements of params
are as follows:
prop:
a vector of clusters' proportions;
mean:
a matrix of dimension (P x K)
containing the clusters' mean
parameters;
cov:
an array of size (P x P x K)
containing the clusters'
covariance matrices.
Note that params
can be easily obtained from a vector of cluster labels
using clust2params
.
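To make the required output structure concrete, here is a minimal sketch of a wrapper around stats::kmeans that builds params by hand. The helper labels_to_params and the wrapper km_wrapper are hypothetical names introduced for illustration. The helper is a stand-in that uses sample proportions, means and sample covariances, and it may differ from what clust2params actually computes.

```r
# Hypothetical helper: build the 'params' list from hard cluster labels.
# An assumption-based stand-in for illustration, not clust2params() itself.
labels_to_params <- function(data, cluster) {
  x <- as.matrix(data)
  ids <- sort(unique(cluster))
  P <- ncol(x)
  K <- length(ids)
  prop <- numeric(K)
  mu <- matrix(0, P, K)
  sigma <- array(0, dim = c(P, P, K))
  for (k in seq_len(K)) {
    xk <- x[cluster == ids[k], , drop = FALSE]
    prop[k] <- nrow(xk) / nrow(x)
    mu[, k] <- colMeans(xk)
    sigma[, , k] <- stats::cov(xk)  # sample covariance; needs >= 2 points
  }
  list(prop = prop, mean = mu, cov = sigma)
}

# Hypothetical wrapper with the required 'data' argument; any extra
# arguments in '...' are forwarded to stats::kmeans()
km_wrapper <- function(data, K, ...) {
  fit <- stats::kmeans(data, centers = K, ...)
  list(cluster = fit$cluster,
       params = labels_to_params(data, fit$cluster))
}

set.seed(1)
res <- km_wrapper(iris[, 1:4], K = 3, nstart = 5)
str(res$params)
```

With P = 4 variables and K = 3 clusters, mean is a 4 x 3 matrix and cov is a 4 x 4 x 3 array, matching the layout described above.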
.packages
and .export
. The user does not
normally need to specify .packages
and .export
.
These arguments are not needed if the functions generated by mset_user
will be called from an environment containing all variables and
functions needed to execute fname
.
Functions like bqs
will call the functions generated
by mset_user
within a parallel infrastructure
using foreach
. If the user specifies
.packages
and .export
, they will be passed to the
.packages
and .export
arguments of
foreach
.
Finally, note that the package already contains specialized versions of mset_user
that generate method settings for some popular algorithms
(see mset_gmix
, mset_kmeans
, mset_pam
).
An S3 object of class 'qcmethod'
. Each element of the list
represents a competing method containing the following objects
fullname |
a string identifying the setup. |
callargs |
a list with arguments that are passed to the base function. |
fn |
the function implementing the specified setting. This |
Coraggio, Luca, and Pietro Coretto (2023). Selecting the Number of Clusters, Clustering Models, and Algorithms. A Unifying Approach Based on the Quadratic Discriminant Score. Journal of Multivariate Analysis, Vol. 196(105181), pp. 1-20, doi:10.1016/j.jmva.2023.105181
clust2params
, mset_gmix
, mset_kmeans
, mset_pam
# load data
data("banknote")
dat <- banknote[-1]
# EXAMPLE 1: generate Hierarchical Clustering settings
# ----------------------------------------------------
# wrapper for the popular stats::hclust() for Hierarchical Clustering
# Note:
#  * the use of the optional arguments '...' passed to the underlying clustering function
#  * the use of 'clust2params' to add cluster parameters to the output
hc_wrapper <- function(data, K, ...){
    dm <- dist(data, method = "euclidean")
    ## ... = hc parameters
    hc <- hclust(dm, ...)
    cl <- cutree(hc, k = K)
    ## output with params
    res <- list()
    res$cluster <- cl
    res$params <- clust2params(data, cluster = cl)
    return(res)
}
# generate settings for Hierarchical Clustering with varying
# number of clusters K={2,3}, agglomeration method = {ward.D, complete}
# see help("hclust", package = "stats")
A <- mset_user(fname="hc_wrapper", K = c(2,3), method = c("ward.D", "complete"))
# get the setting with K=3 and method = "complete"
ma <- A[[4]]
ma
# cluster data with the selected setting 'ma'
fit_a1 <- ma$fn(dat)
fit_a1
## if only cluster parameters are needed
fit_a2 <- ma$fn(dat, only_params = TRUE)
fit_a2
## Not run:
# EXAMPLE 2: generate 'mclust' model settings
# -------------------------------------------
# mclust is a popular package for performing model-based clustering based on
# Gaussian mixtures. Please visit
# https://cran.r-project.org/web/packages/mclust/vignettes/mclust.html
require(mclust)
# wrapper for mclust::Mclust() for model-based clustering
# Notes:
# * optional arguments '...' are passed to the underlying
# 'mclust' clustering function
# * 'mclust' fits Gaussian Mixture models so cluster parameters are
# contained in the mclust object
mc_wrapper <- function(data, K, ...){
    y <- Mclust(data, G = K, ...)
    ## element names follow the required 'params' structure: prop, mean, cov
    y[["params"]] <- list(prop = y$parameters$pro,
                          mean = y$parameters$mean,
                          cov = y$parameters$variance$sigma)
    return(y)
}
# generate 'mclust' model settings by varying the number of clusters and
# covariance matrix models (see help("mclustModelNames", package = "mclust"))
B <- mset_user(fname = "mc_wrapper", K = c(2,3), modelNames = c("EEI", "VVV"))
# get the setting with K=3 and covariance model "EEI"
mb <- B[[2]]
mb
# cluster data with the selected setting 'mb'
fit_b <- mb$fn(dat)
fit_b ## class(fit_b) = "Mclust"
# if needed one can make sure that 'mclust' package is always available
# by setting the argument '.packages'
B <- mset_user(fname = "mc_wrapper", K = c(2,3), modelNames = c("EEI","VVV"),
               .packages = c("mclust"))
## End(Not run)
## Not run:
# EXAMPLE 3: generate 'dbscan' settings
# -------------------------------------
# DBSCAN is a popular nonparametric method for discovering clusters of
# arbitrary shapes with noise. The number of clusters is implicitly
# determined via two crucial tunings usually called 'eps' and 'minPts'
# See https://en.wikipedia.org/wiki/DBSCAN
require(dbscan)
# wrapper for dbscan::dbscan
db_wrap <- function(data, ...) {
    cl <- dbscan(data, borderPoints = TRUE, ...)$cluster
    ## the output must be a list containing an element named 'params'
    return(list(params = clust2params(data, cl)))
}
D <- mset_user(fname = "db_wrap", eps = c(0.5, 1), minPts=c(5,10))
md <- D[[2]]
fit_d <- md$fn(dat)
fit_d
class(fit_d)
## End(Not run)