seqclararange: CLARA Clustering for Sequence Analysis
In WeightedCluster: Clustering of Weighted Data

Description

Cluster large databases of sequences into different numbers of groups using the CLARA algorithm, which relies on subsampling to reduce the computational burden. Crisp, fuzzy, noise and representativeness clustering are available. The function also computes several cluster quality measures.

Usage

seqclararange(seqdata, R = 100, sample.size = 40 + 2 * max(kvals), 
  kvals = 2:10,  seqdist.args = list(method = "LCS"),
  method=c("crisp", "fuzzy", "representativeness", "noise"), 
  m = 1.5, criteria = c("distance"), stability = FALSE, dnoise=NULL,
  parallel = FALSE, progressbar = FALSE, keep.diss = FALSE, 
  max.dist = NULL)

Arguments

`seqdata`	State sequence object of class `stslist`. The sequence data to use. Use `seqdef` to create such an object.
`R`	Numeric. The number of subsamples to use.
`sample.size`	Numeric. The size of the subsamples. The default value is the one proposed by Kaufman and Rousseeuw (1990). However, larger values (typically between 1000 and 10 000) are recommended.
`kvals`	Numeric vector. The numbers of groups to compute.
`seqdist.args`	List of arguments passed to `seqdist` for computing the distances.
`method`	Character. The clustering approach to use. Defaults to "crisp" clustering; the "fuzzy", "noise" and "representativeness" approaches can also be used.
`m`	Numeric. Only used for fuzzy clustering, the value of the fuzzifier.
`criteria`	Character. The name of the criterion used for selecting the best clustering among the different runs. The following values are accepted: "distance" (default, average distance to cluster medoids), "db" (Davies-Bouldin Index), "xb" (Xie-Beni Index), "pbm" (PBM Index), "ams" (Average medoid silhouette value).
`stability`	Logical. If `TRUE`, stability measures are computed (can be time consuming, especially for fuzzy clustering). Defaults to `FALSE`.
`dnoise`	Numeric. The theoretically defined distance to the noise cluster. Mandatory for noise clustering.
`parallel`	Logical. Whether to initialize the parallel processing of the `future` package using the default `multisession` strategy. If `FALSE` (default), the current `plan` is used. If `TRUE`, a `multisession` `plan` is initialized using default values.
`progressbar`	Logical. Whether to initialize a progress bar using the `future` package. If `FALSE` (default), the current progress bar `handlers` are used. If `TRUE`, new global progress bar `handlers` are initialized.
`keep.diss`	Logical. Whether to keep the distances to the medoids. Set to `FALSE` by default.
`max.dist`	Numeric. Maximal theoretical distance value between sequences. Required for `method="representativeness"` clustering.
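
As a small worked example of the `sample.size` default in the Usage section, the base R lines below evaluate `40 + 2 * max(kvals)` for the default `kvals` and count the pairwise distances a subsample of that size involves. The value 5000 is only an illustrative pick from the recommended range, not a package default.

```r
# Default subsample size for the default kvals = 2:10
kvals <- 2:10
default.size <- 40 + 2 * max(kvals)
default.size                  # 60

# Number of pairwise distances needed within one subsample
choose(default.size, 2)       # 1770

# Illustrative larger subsample, taken from the recommended 1000 to 10 000 range
larger.size <- 5000
choose(larger.size, 2)        # 12497500
```

The number of distances grows quadratically with the subsample size, so `sample.size` trades clustering quality against computation time for each of the `R` subsamples.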

Details

seqclararange relies on the CLARA algorithm to cluster large databases. The algorithm works as follows.

1. Randomly take a subsample of the data of size sample.size.
2. Cluster the subsample using the PAM algorithm, initialized with Ward clustering to speed up the computations (see wcKmedoids).
3. Use the identified medoids to assign cluster membership in the whole dataset.
4. Evaluate the resulting clustering using a criterion (see the criteria argument), the average distance to the medoids by default.

These steps are repeated R times and the best solution according to the given criterion is kept.
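
The loop described above can be sketched in a few lines of R. This is a minimal illustration under stated assumptions, not the code used by seqclararange: it works on toy numeric data with Euclidean distances instead of sequences, it calls `pam()` from the cluster package with its default initialization rather than the Ward-initialized `wcKmedoids`, and all object names (`x`, `best`, `d2med`, and so on) are hypothetical.

```r
library(cluster)  # pam(); a recommended package shipped with R

set.seed(1)
# Toy numeric data standing in for sequences: two well-separated groups
x <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
           matrix(rnorm(200, mean = 6), ncol = 2))
n <- nrow(x)
k <- 2                     # number of groups
R <- 5                     # number of subsamples
sample.size <- 40 + 2 * k  # subsample size

best <- list(criterion = Inf)
for (r in seq_len(R)) {
  # Step 1: draw a random subsample
  idx <- sample(n, sample.size)
  # Step 2: cluster the subsample with PAM (default initialization here)
  fit <- pam(dist(x[idx, ]), k = k)
  medoids <- idx[fit$id.med]  # medoid positions in the whole dataset
  # Step 3: distance from every observation to each medoid,
  # then assignment to the closest medoid
  d2med <- sapply(medoids, function(m) sqrt(rowSums(sweep(x, 2, x[m, ])^2)))
  membership <- apply(d2med, 1, which.min)
  # Step 4: evaluate with the average distance to the assigned medoid
  criterion <- mean(d2med[cbind(seq_len(n), membership)])
  # Keep the best of the R solutions (lower is better for this criterion)
  if (criterion < best$criterion) {
    best <- list(criterion = criterion, medoids = medoids,
                 membership = membership)
  }
}
table(best$membership)
```

Only distances within the subsample and distances to the `k` medoids are computed, which is what keeps the cost low compared with a full pairwise distance matrix.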

To minimize computation, the operation is repeated for different numbers of groups, which then makes it possible to choose the best number of groups according to different cluster quality indices. The following indices are computed automatically: "Avg dist" (Average distance to cluster medoids), "PBM" (PBM Index), "DB" (Davies-Bouldin Index), "XB" (Xie-Beni Index), "AMS" (Average medoid silhouette width), "ARI>0.8" (Number of iterations similar to the current best, only if stability=TRUE), "JC>0.8" (Number of iterations similar to the current best, only if stability=TRUE).
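
To illustrate how such indices can guide the choice of the number of groups, the sketch below computes two of them for several values of k on toy numeric data. It rests on assumptions: the medoid silhouette of an observation is taken to be (d2 - d1) / d2, with d1 and d2 the distances to its closest and second closest medoids, which may differ in detail from the package's computation; `pam()` from the cluster package is run on the full toy data rather than on subsamples; and all object names are hypothetical.

```r
library(cluster)  # pam(); a recommended package shipped with R

set.seed(42)
# Toy data with three groups, standing in for sequence dissimilarities
centers <- rbind(c(0, 0), c(8, 0), c(4, 7))
x <- centers[rep(1:3, each = 50), ] + matrix(rnorm(300), ncol = 2)
dx <- dist(x)
d <- as.matrix(dx)

kvals <- 2:4
stats <- t(sapply(kvals, function(k) {
  med <- pam(dx, k = k)$id.med
  # For each observation, sort its distances to the medoids (closest first)
  d2med <- t(apply(d[, med, drop = FALSE], 1, sort))
  d1 <- d2med[, 1]  # distance to the closest medoid
  d2 <- d2med[, 2]  # distance to the second closest medoid
  c("Avg dist" = mean(d1),             # lower is better
    "AMS" = mean((d2 - d1) / d2))      # higher is better
}))
rownames(stats) <- paste0("cluster", kvals)
stats

# Number of groups with the highest average medoid silhouette
best_k <- kvals[which.max(stats[, "AMS"])]
best_k
```

The average distance to the medoids tends to shrink as groups are added, so it is mainly useful for comparing runs with the same k, whereas a silhouette-type index can be compared across numbers of groups.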

Value

A seqclararange object with the following components:

`kvals:`	The different number of groups evaluated.
`clustering:`	The retained clustering for each number of groups. For `"crisp"` clustering, a `data.frame` with the clusterings in columns named clusterX, with X the number of groups. For `"fuzzy"` and `"representativeness"`, a list of membership matrices, with each element named clusterX, with X the number of groups.
`stats:`	A `matrix` containing the clustering statistics of each cluster solution.
`clara:`	Detailed information on the best clustering for each number of groups, in the same order as kvals.

References

Studer, M., R. Sadeghi and L. Tochon (2024). Sequence Analysis for Large Databases. LIVES Working Papers 104. doi:10.12682/lives.2296-1658.2024.104

Examples

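library(TraMineR)        # seqdef(), seqdplot() and the biofam data
library(WeightedCluster) # seqclararange(), bootclustrange(), fuzzyseqplot()
data(biofam) # load illustrative data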
## Defining the new state labels 
statelab <- c("Parent", "Left", "Married", "Left/Married",  "Child", 
            "Left/Child", "Left/Married/Child", "Divorced")
## Creating the state sequence object from the first 100 cases
biofam.seq <- seqdef(biofam[1:100, 10:25], alphabet=0:7, states=statelab)



## Clara clustering
bfclara <- seqclararange(biofam.seq, R = 3, sample.size = 10, kvals = 2:3, 
  seqdist.args = list(method = "HAM"), parallel=FALSE, 
  stability=TRUE)


## Show the cluster quality measures.
bfclara
## Plot and normalize the values for easier identification of minimum and maximum values.
plot(bfclara, norm="range")
## Stability values.
plot(bfclara, stat="stabmean")
plot(bfclara, stat="stability")

seqdplot(biofam.seq, group=bfclara$clustering$cluster3)

## Cluster quality indices estimation using bootstrap

bCQI <- bootclustrange(bfclara, biofam.seq, seqdist.args = list(method = "HAM"), 
  R = 3, sample.size = 10,  parallel=FALSE)

bCQI
plot(bCQI, norm="zscore")

## Not run: 
## Fuzzy clustering
bfclaraf <- seqclararange(biofam.seq, R = 3, sample.size = 20, kvals = 2:3, 
  method="fuzzy", seqdist.args = list(method = "HAM"), 
  parallel=FALSE)


bfclaraf
plot(bfclaraf, norm="zscore")


fuzzyseqplot(biofam.seq, group=bfclaraf$clustering$cluster3, type="I", 
  sortv="membership", membership.threashold=0.2)

## Noise clustering
bfclaran <- seqclararange(biofam.seq, R = 3, sample.size = 20, kvals = 2:3, 
  method="noise", seqdist.args = list(method = "HAM"), dnoise=6,
  parallel=FALSE)

fuzzyseqplot(biofam.seq, group=bfclaran$clustering$cluster3, type="I", 
  sortv="membership", membership.threashold=0.2)


## End(Not run)

WeightedCluster documentation built on April 12, 2025, 9:13 a.m.