mset_user: Generates Clustering Methods Settings for a Prototype...

mset_user    R Documentation

Generates Clustering Methods Settings for a Prototype Methodology Provided by the User

Description

The function generates a software abstraction of a list of clustering models implemented through a set of tuned methods and algorithms. The base clustering methodology is provided via a user-defined function. This prototype is expanded into a list of functions, each combining tuning parameters and other algorithmic settings. The generated functions are ready to be called on the data set.

Usage

mset_user(fname, .packages = NULL, .export = NULL, ...)

Arguments

fname

a function implementing a user-defined clustering method. It clusters a data set and outputs cluster parameters. fname must fulfill the requirements given in Details.

.packages

character vector of packages that the tasks in fname depend on (see Details).

.export

character vector of variables to export that are needed by fname and that are not defined in the current environment (see Details).

...

parameters passed to fname. If a given parameter is supplied as a vector or list, each of its members is expanded to obtain the final collection of fname specifications (see Details and Examples).
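
As a rough illustration of the expansion described above, the base R sketch below enumerates every combination of two vector-valued parameters with expand.grid(). It is not the package's internal code: the parameter values are hypothetical, and the ordering of the settings produced by mset_user is an assumption that may differ.

```r
# Hypothetical tuning parameters, as they might be passed via '...'
tuning <- list(K = c(2, 3), method = c("ward.D", "complete"))

# Enumerate every combination; each row is one candidate setting
grid <- expand.grid(tuning, stringsAsFactors = FALSE)
print(grid)

# One argument list per setting, ready for do.call() on a prototype function
settings <- lapply(seq_len(nrow(grid)),
                   function(i) as.list(grid[i, , drop = FALSE]))
length(settings)
```

Two parameters with two values each give four settings, which matches the size of the lists built in the Examples.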

Details

The function produces functions implementing competing clustering methods based on a prototype methodology implemented by the user via the input argument fname. In particular, it builds a list of fname-type functions each corresponding to a specific setup in terms of hyper-parameters (e.g. the number of clusters) and algorithm's control parameters (e.g. initialization).

Requirements for fname. fname is a function implementing the base clustering method of interest. It must have the following input argument:

  • data: a numeric vector, matrix, or data frame of observations. Rows correspond to observations and columns correspond to variables/features. Categorical variables and NA values are not allowed.

Additionally, fname can have any other input parameters controlling the underlying clustering model/method/algorithm. All these additional parameters are passed to mset_user via ... (see Arguments).

The output of fname must contain a list named params with cluster parameters describing size, centrality and scatter. Let P=number of variables/features and K=number of clusters. The elements of params are as follows:

  • prop: a vector of clusters' proportions;

  • mean: a matrix of dimension (P x K) containing the clusters' mean parameters;

  • cov: an array of size (P x P x K) containing the clusters' covariance matrices.

Note that params can be easily obtained from a vector of cluster labels using clust2params.
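
To make the expected layout of params concrete, the following base R sketch computes proportions, means, and covariances from a vector of labels. The helper labels_to_params is hypothetical and written only for illustration. In practice clust2params should be preferred, and its output may differ in details (for example, in how covariances are normalized).

```r
# Hypothetical helper (not part of qcluster): builds a 'params'-like list
labels_to_params <- function(data, cluster) {
  x   <- as.matrix(data)
  ids <- sort(unique(cluster))
  P   <- ncol(x)
  K   <- length(ids)
  prop <- numeric(K)
  mu   <- matrix(0, nrow = P, ncol = K)
  sig  <- array(0, dim = c(P, P, K))
  for (k in seq_len(K)) {
    xk         <- x[cluster == ids[k], , drop = FALSE]
    prop[k]    <- nrow(xk) / nrow(x)
    mu[, k]    <- colMeans(xk)
    sig[, , k] <- cov(xk)  # sample covariance; needs at least 2 points per cluster
  }
  list(prop = prop, mean = mu, cov = sig)
}

# Usage on a built-in data set with k-means labels
set.seed(1)
x  <- iris[, 1:4]
cl <- kmeans(x, centers = 3)$cluster
p  <- labels_to_params(x, cl)
dim(p$mean)  # P x K
dim(p$cov)   # P x P x K
```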

.packages and .export. The user does not normally need to specify .packages and .export. These arguments are not needed if the functions generated by mset_user are called from an environment containing all variables and functions needed to execute fname. Functions like bqs call the functions generated by mset_user within a parallel infrastructure using foreach. If the user specifies .packages and .export, they are passed to the .packages and .export arguments of foreach.

Finally, note that the package already contains specialized versions of mset_user that generate method settings for some popular algorithms (see mset_gmix, mset_kmeans, mset_pam).

Value

An S3 object of class 'qcmethod'. Each element of the list represents a competing method and contains the following objects:

fullname

a string identifying the setup.

callargs

a list with arguments that are passed to the base function.

fn

the function implementing the specified setting. This fn function can be executed on the data set. It has two arguments: data and only_params. data is a data matrix or data.frame; only_params is logical. If only_params==FALSE (default), fn returns the object returned by fname. If only_params==TRUE, fn returns only the cluster parameters (proportions, mean, and cov; see clust2params).
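
The behaviour of fn can be pictured as a closure that fixes the arguments of one setting and forwards the data to the prototype. The sketch below is a simplified assumption about how such a function might be built, not the package's implementation; make_fn and toy_method are hypothetical names introduced for illustration.

```r
# Hypothetical factory: binds one set of arguments to a prototype function
make_fn <- function(prototype, callargs) {
  function(data, only_params = FALSE) {
    res <- do.call(prototype, c(list(data = data), callargs))
    if (only_params) res$params else res
  }
}

# Toy prototype: splits a numeric vector at a threshold and reports parameters
toy_method <- function(data, cut) {
  cl <- ifelse(data <= cut, 1L, 2L)
  params <- list(prop = as.numeric(table(cl)) / length(cl),
                 mean = matrix(tapply(data, cl, mean), nrow = 1),
                 cov  = array(tapply(data, cl, var), dim = c(1, 1, 2)))
  list(cluster = cl, params = params)
}

fn <- make_fn(toy_method, list(cut = 5))
x  <- c(1, 2, 3, 8, 9, 10)
names(fn(x))                      # full output of the prototype
names(fn(x, only_params = TRUE))  # cluster parameters only
```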

References

Coraggio, Luca, and Pietro Coretto (2023). Selecting the Number of Clusters, Clustering Models, and Algorithms. A Unifying Approach Based on the Quadratic Discriminant Score. Journal of Multivariate Analysis, Vol. 196(105181), pp. 1-20, doi:10.1016/j.jmva.2023.105181

See Also

clust2params, mset_gmix, mset_kmeans, mset_pam

Examples

# load data
data("banknote")
dat  <- banknote[-1]



# EXAMPLE 1: generate Hierarchical Clustering settings
# ----------------------------------------------------

# wrapper for the popular stats::hclust() for Hierarchical Clustering
# Note:
#   the use of the optional arguments '...' passed to the underlying clustering function
#   the use of 'clust2params' to add cluster parameters to the output
hc_wrapper <- function(data, K, ...){ 
    dm  <- dist(data, method = "euclidean")
    ## ... = hc parameters 
    hc  <- hclust(dm, ...)
    cl  <- cutree(hc, k = K)
    ## output with params 
    res          <- list()
    res$cluster  <- cl
    res$params   <- clust2params(data, cluster = cl)
    return(res)
}

# generate settings for Hierarchical Clustering with varying
# number of clusters K={2,3},  agglomeration method = {ward.D, complete}
# see help("hclust", package = "stats")
A <- mset_user(fname="hc_wrapper", K = c(2,3), method = c("ward.D", "complete"))

# get the setting with K=3 and method = "complete" (the last of the four settings)
ma <- A[[4]]
ma

# cluster data with the selected setting
fit_a1 <- ma$fn(dat)
fit_a1

## if only cluster parameters are needed 
fit_a2 <- ma$fn(dat, only_params = TRUE)
fit_a2


 
## Not run: 
# EXAMPLE 2: generate 'mclust' model settings 
# -------------------------------------------
# mclust is a popular package for performing model-based clustering based on
# Gaussian mixture. Please visit
# https://cran.r-project.org/web/packages/mclust/vignettes/mclust.html
require(mclust)

# wrapper for mclust::Mclust() for Gaussian mixture model clustering
# Notes:
#  * optional arguments '...' are passed to the underlying
#    'mclust' clustering function
#  * 'mclust' fits Gaussian Mixture models so cluster parameters are 
#     contained in the mclust object  
mc_wrapper <- function(data, K, ...){
    y <- Mclust(data, G = K, ...)
    y[["params"]] <- list(proportion = y$parameters$pro,
                          mean = y$parameters$mean,
                          cov = y$parameters$variance$sigma)
    return(y)
    }

# generate 'mclust' model settings by varying the number of clusters and
# covariance matrix models (see help("mclustModelNames", package = "mclust"))
B <- mset_user(fname = "mc_wrapper", K = c(2,3), modelNames = c("EEI", "VVV"))

    
# get the setting with K=3 and covariance model "EEI"
mb <- B[[2]]
mb

# cluster data with the selected setting
fit_b <- mb$fn(dat)
fit_b ## class(fit_b) = "Mclust"

   
# if needed one can make sure that the 'mclust' package is always available
# by setting the argument '.packages'
B <- mset_user(fname = "mc_wrapper", K = c(2,3), modelNames = c("EEI","VVV"),
               .packages=c("mclust"))

## End(Not run)


## Not run: 
# EXAMPLE 3: generate 'dbscan' settings 
# -------------------------------------
# DBSCAN is a popular nonparametric method for discovering clusters of
# arbitrary shapes with noise. The number of clusters is implicitly
# determined via two crucial tunings usually called 'eps' and 'minPts'
# See https://en.wikipedia.org/wiki/DBSCAN
require(dbscan)

# wrapper for dbscan::dbscan
db_wrap <- function(data, ...) {
  cl <- dbscan(data, borderPoints = TRUE, ...)$cluster
  ## return a list containing 'params', as required for fname
  return(list(cluster = cl, params = clust2params(data, cl)))
}

D  <- mset_user(fname = "db_wrap", eps = c(0.5, 1), minPts=c(5,10))
md    <- D[[2]]
fit_d <- md$fn(dat)
fit_d
class(fit_d)

## End(Not run)

qcluster documentation built on April 3, 2025, 6:16 p.m.
