seqclararange: CLARA Clustering for Sequence Analysis

Description

Cluster large databases of sequences for different numbers of groups using the CLARA algorithm, which relies on subsampling to reduce the computational burden. Crisp, fuzzy, noise and representativeness clustering are available. The function also computes several cluster quality measures.

Usage

seqclararange(seqdata, R = 100, sample.size = 40 + 2 * max(kvals), 
  kvals = 2:10,  seqdist.args = list(method = "LCS"),
  method=c("crisp", "fuzzy", "representativeness", "noise"), 
  m = 1.5, criteria = c("distance"), stability = FALSE, dnoise=NULL,
  parallel = FALSE, progressbar = FALSE, keep.diss = FALSE, 
  max.dist = NULL)

Arguments

seqdata

State sequence object of class stslist. The sequence data to use. Use seqdef to create such an object.

R

Numeric. The number of subsamples to use.

sample.size

Numeric. The size of the subsamples. The default value is the one proposed by Kaufman and Rousseeuw (1990). However, larger values (typically between 1000 and 10 000) are recommended.

kvals

Numeric vector. The different numbers of groups to compute.

seqdist.args

List of arguments passed to seqdist for computing the distances.

method

Character. The clustering approach to use. Defaults to "crisp" clustering. The "fuzzy", "noise" or "representativeness" approaches can also be used.

m

Numeric. The value of the fuzzifier. Only used for fuzzy clustering.

criteria

Character. The name of the criterion used for selecting the best clustering among the different runs. The following values are accepted: "distance" (default, average distance to cluster medoids), "db" (Davies-Bouldin index), "xb" (Xie-Beni index), "pbm" (PBM index), "ams" (average medoid silhouette value).

stability

Logical. If TRUE, stability measures are computed (can be time-consuming, especially for fuzzy clustering). Defaults to FALSE.

dnoise

Numeric. The theoretically defined distance to the noise cluster. Mandatory for noise clustering.

parallel

Logical. Whether to initialize parallel processing with the future package using the default multisession strategy. If FALSE (default), the current plan is used. If TRUE, a multisession plan is initialized using default values.

progressbar

Logical. Whether to initialize a progress bar using the future package. If FALSE (default), the current progress bar handlers are used. If TRUE, new global progress bar handlers are initialized.

keep.diss

Logical. Whether to keep the distances to the medoids. Set to FALSE by default.

max.dist

Numeric. Maximal theoretical distance value between sequences. Required for method="representativeness" clustering.

Details

seqclararange relies on the CLARA algorithm to cluster large databases. The algorithm works as follows.

  1. Randomly take a subsample of the data of size sample.size.

  2. Cluster the subsample using the PAM algorithm initialized using Ward to speed up the computations (see wcKmedoids).

  3. Use the identified medoids to assign cluster membership in the whole dataset.

  4. Evaluate the resulting clustering using a criterion (see the criteria argument), the average distance to the medoids by default.

These steps are repeated R times and the best solution according to the given criterion is kept.
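
The four steps above can be sketched in base R on numeric toy data. This is only an illustration under stated assumptions, not the package implementation: clara_sketch() and simple_kmedoids() are hypothetical helpers, Euclidean distances replace seqdist, and a simple alternating k-medoids routine stands in for the Ward-initialized PAM of wcKmedoids.

```r
# Hypothetical helper: alternating k-medoids on a full distance matrix d.
# Stand-in for Ward-initialized PAM; returns the row indices of the medoids.
simple_kmedoids <- function(d, k, max.iter = 20) {
  n <- nrow(d)
  medoids <- sample(n, k)
  for (iter in seq_len(max.iter)) {
    # Assign each point to its closest medoid
    cl <- apply(d[, medoids, drop = FALSE], 1, which.min)
    new.medoids <- medoids
    for (j in seq_len(k)) {
      members <- which(cl == j)
      if (length(members) > 0) {
        # New medoid: the member with the smallest summed distance to the others
        cost <- colSums(d[members, members, drop = FALSE])
        new.medoids[j] <- members[which.min(cost)]
      }
    }
    if (all(new.medoids == medoids)) break
    medoids <- new.medoids
  }
  medoids
}

# Hypothetical helper: CLARA loop for a single number of groups k.
# x is a numeric matrix with one observation per row.
clara_sketch <- function(x, k, R = 10, sample.size = 40 + 2 * k) {
  n <- nrow(x)
  best <- list(criterion = Inf)
  for (r in seq_len(R)) {
    # 1. Randomly take a subsample of the data
    idx <- sample(n, min(sample.size, n))
    # 2. Cluster the subsample only
    d.sub <- as.matrix(dist(x[idx, , drop = FALSE]))
    medoids <- idx[simple_kmedoids(d.sub, k)]
    # 3. Assign every observation to its closest medoid (n x k distances)
    d.med <- vapply(medoids, function(m) {
      sqrt(rowSums(sweep(x, 2, x[m, ])^2))
    }, numeric(n))
    clustering <- apply(d.med, 1, which.min)
    # 4. Evaluate: average distance to the assigned medoid
    criterion <- mean(d.med[cbind(seq_len(n), clustering)])
    # Keep the best of the R runs
    if (criterion < best$criterion) {
      best <- list(medoids = medoids, clustering = clustering,
                   criterion = criterion)
    }
  }
  best
}

set.seed(1)
x <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
           matrix(rnorm(200, mean = 5), ncol = 2))
res <- clara_sketch(x, k = 2, R = 5)
table(res$clustering)
```

Only a sample.size by sample.size distance matrix and an n by k matrix of distances to the medoids are ever computed, which is what keeps the approach tractable for large databases.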

To minimize computation, the operation is repeated for different numbers of groups, which then makes it possible to choose the best number of groups according to different cluster quality indices. The following indices are computed automatically: "Avg dist" (average distance to cluster medoids), "PBM" (PBM index), "DB" (Davies-Bouldin index), "XB" (Xie-Beni index), "AMS" (average medoid silhouette width), "ARI>0.8" (number of iterations similar to the current best, only if stability=TRUE) and "JC>0.8" (number of iterations similar to the current best, only if stability=TRUE).

Value

A seqclararange object with the following components:

kvals:

The different number of groups evaluated.

clustering:

The retained clustering for each number of groups. For "crisp" clustering, a data.frame with the clusterings in columns named clusterX, with X the number of groups. For "fuzzy" and "representativeness", a list of membership matrices, with each element named clusterX, with X the number of groups.

stats:

A matrix containing the clustering statistics of each cluster solution.

clara:

Detailed information on the best clustering for each number of groups, in the same order as kvals.

References

Studer, M., R. Sadeghi and L. Tochon (2024). Sequence Analysis for Large Databases. LIVES Working Papers 104. doi:10.12682/lives.2296-1658.2024.104

See Also

See plot.seqclararange to plot the results.

Examples

data(biofam) ## Load illustrative data
## Defining the new state labels
statelab <- c("Parent", "Left", "Married", "Left/Married",  "Child", 
            "Left/Child", "Left/Married/Child", "Divorced")
## Creating the state sequence object
biofam.seq <- seqdef(biofam[1:100, 10:25], alphabet=0:7, states=statelab)



## Clara clustering
bfclara <- seqclararange(biofam.seq, R = 3, sample.size = 10, kvals = 2:3, 
  seqdist.args = list(method = "HAM"), parallel=FALSE, 
  stability=TRUE)


## Show the cluster quality measures.
bfclara
## Plot and normalize the values for easier identification of minimum and maximum values.
plot(bfclara, norm="range")
## Stability values.
plot(bfclara, stat="stabmean")
plot(bfclara, stat="stability")

seqdplot(biofam.seq, group=bfclara$clustering$cluster3)

## Cluster quality indices estimation using bootstrap

bCQI <- bootclustrange(bfclara, biofam.seq, seqdist.args = list(method = "HAM"), 
  R = 3, sample.size = 10,  parallel=FALSE)

bCQI
plot(bCQI, norm="zscore")

## Not run: 
## Fuzzy clustering
bfclaraf <- seqclararange(biofam.seq, R = 3, sample.size = 20, kvals = 2:3, 
  method="fuzzy", seqdist.args = list(method = "HAM"), 
  parallel=FALSE)


bfclaraf
plot(bfclaraf, norm="zscore")


fuzzyseqplot(biofam.seq, group=bfclaraf$clustering$cluster3, type="I", 
  sortv="membership", membership.threashold=0.2)

## Noise clustering
bfclaran <- seqclararange(biofam.seq, R = 3, sample.size = 20, kvals = 2:3, 
  method="noise", seqdist.args = list(method = "HAM"), dnoise=6,
  parallel=FALSE)

fuzzyseqplot(biofam.seq, group=bfclaran$clustering$cluster3, type="I", 
  sortv="membership", membership.threashold=0.2)


## End(Not run)

WeightedCluster documentation built on Oct. 2, 2024, 5:06 p.m.