bootclustrange: Cluster Quality Indices estimation by subsampling
In WeightedCluster: Clustering of Weighted Data

bootclustrange

R Documentation

Cluster Quality Indices estimation by subsampling

Description

bootclustrange estimates the quality of the clustering based on subsamples of the data to avoid computational overload.

Usage

bootclustrange(object, seqdata, seqdist.args = list(method = "LCS"),
               R = 100, sample.size = 1000, parallel = FALSE,
               progressbar = FALSE, sampling = "clustering",
               strata = NULL)
## S3 method for class 'bootclustrange'
plot(x, stat = "noCH", legendpos = "bottomright",
                              norm = "none", withlegend = TRUE, lwd = 1,
                              col = NULL, ylab = "Indicators", 
                              xlab = "N clusters", conf.int = 0.95, 
                              ci.method = "perc", ci.alpha = 0.3, 
                              line = "median", ...)
## S3 method for class 'bootclustrange'
print(x, digits = 2, bootstat = c("mean"), ...)

Arguments

`object`	A `seqclararange` object or a `data.frame` containing the clustering to be evaluated.
`seqdata`	State sequence object of class `stslist`. The sequence data to use. Use `seqdef` to create such an object.
`seqdist.args`	List of arguments passed to `seqdist` for computing the distances.
`R`	Numeric. The number of subsamples to use.
`sample.size`	Numeric. The size of the subsamples. Values between 1,000 and 10,000 are recommended.
`parallel`	Logical. Whether to initialize parallel processing with the `future` package. If `FALSE` (default), the current `plan` is used. If `TRUE`, a `multisession` `plan` is initialized using default values.
`progressbar`	Logical. Whether to initialize a progress bar using the `future` package. If `FALSE` (default), the current progress bar `handlers` are used. If `TRUE`, new global progress bar `handlers` are initialized.
`sampling`	Character. The sampling procedure to use: `"clustering"` (default) stratifies the sampling by the clustering with the maximum number of clusters, `"medoids"` adds the medoids to each subsample, `"strata"` stratifies by the `strata` argument, and `"random"` uses random sampling.
`strata`	An optional stratification variable.
`x`	A `bootclustrange` object to be plotted or printed.
`stat`	Character. The list of statistics to plot, "noCH" to plot all statistics except "CH" and "CHsq", or "all" to plot all statistics. See `as.clustrange` for a list of possible values.
`legendpos`	Character. Legend position, see `legend`.
`norm`	Character. Normalization method of the statistics. One of "none" (no normalization), "range" (given as (value - min)/(max - min)), "zscore" (adjusted by mean and standard deviation) or "zscoremed" (adjusted by median and median of the difference to the median).
`withlegend`	Logical. If `FALSE`, the legend is not plotted.
`lwd`	Numeric. Line width, see `par`.
`col`	A vector of line colors, see `par`. If `NULL`, a default set of colors is used.
`xlab`	x axis label.
`ylab`	y axis label.
`conf.int`	Confidence to build the confidence interval (default: 0.95).
`ci.method`	Method used to build the confidence interval (only if bootstrap has been used, see `R` above). One of "none" (do not plot confidence interval), "norm" (based on normal approximation) or "perc" (default, based on percentiles).
`ci.alpha`	Alpha (transparency) value of the color used to plot the interval.
`line`	Which value should be plotted by the line? One of "mean" (average over all bootstraps) or "median" (default, median over all bootstraps).
`digits`	Number of digits to be printed.
`bootstat`	The summary statistic to use: `"mean"` or `"median"`.
`...`	Additional parameters passed to/from methods.

Details

bootclustrange estimates the quality of the clustering based on subsamples of the data to avoid computational overload. It draws R subsamples of sample.size sequences each from seqdata, using the sampling procedure defined by the sampling argument. In each subsample, a distance matrix is computed on the selected sequences with the seqdist.args arguments, and the cluster quality indices are then estimated using as.clustrange.
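
The procedure described above can be illustrated with a minimal base R sketch. It is not the package's implementation: it assumes toy numeric data instead of sequences, Euclidean distances from dist() instead of seqdist, simple random sampling, and a single quality index (the average silhouette width, computed with silhouette() from the cluster package). All object names are hypothetical.

```r
library(cluster)  # recommended package shipped with R; provides silhouette()

set.seed(1)
# Toy data (assumption): two well-separated groups of 500 points each
x <- rbind(matrix(rnorm(1000, mean = 0), ncol = 2),
           matrix(rnorm(1000, mean = 4), ncol = 2))
clustering <- rep(1:2, each = 500)  # the clustering to be evaluated

R <- 20             # number of subsamples
sample.size <- 100  # size of each subsample

# One average silhouette width per subsample
asw <- vapply(seq_len(R), function(r) {
  idx <- sample.int(nrow(x), sample.size)  # simple random sampling
  d <- dist(x[idx, ])                      # distances on the subsample only
  sil <- silhouette(clustering[idx], d)    # quality index on the subsample
  mean(sil[, "sil_width"])
}, numeric(1))

# Summaries across subsamples
boot.median <- median(asw)
boot.ci <- quantile(asw, c(0.025, 0.975))  # percentile interval
```

Each distance matrix here has only sample.size rows, which is what keeps the computation tractable when the full data set is too large for a complete pairwise distance matrix.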

The clustering can be specified either as a seqclararange object or a data.frame.

Value

A clustrange object (see as.clustrange) with the bootstrapped values.
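
A possible usage sketch follows. It assumes the biofam data set shipped with TraMineR and a clustering produced by seqclararange. The subset, the "HAM" distance and the very small R and sample.size values are chosen only to keep the run short and are not recommendations.

```r
library(TraMineR)
library(WeightedCluster)

data(biofam)
# Small subset so the sketch runs quickly; real applications use far larger data
bf.seq <- seqdef(biofam[1:200, 10:25])

# CLARA clustering for 2 to 4 groups (deliberately small R and sample.size)
bf.clara <- seqclararange(bf.seq, R = 5, sample.size = 40, kvals = 2:4,
                          seqdist.args = list(method = "HAM"),
                          parallel = FALSE)

# Quality indices estimated on 5 subsamples of 50 sequences each
bc <- bootclustrange(bf.clara, bf.seq,
                     seqdist.args = list(method = "HAM"),
                     R = 5, sample.size = 50, parallel = FALSE)

print(bc, bootstat = "median")  # median of each index over the subsamples
plot(bc)                        # default: all statistics except CH and CHsq
```

With data of realistic size, the defaults (R = 100, sample.size = 1000) or larger subsamples are more appropriate.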

References

Studer, M., R. Sadeghi and L. Tochon (2024). Sequence Analysis for Large Databases. LIVES Working Papers 104. doi:10.12682/lives.2296-1658.2024.104