seqclararange | R Documentation |
Cluster large databases of sequences for different numbers of groups using the CLARA algorithm, which relies on subsampling to reduce the computational burden. Crisp, fuzzy and representativeness clustering are available. The function also computes several cluster quality measures.
seqclararange(seqdata, R = 100, sample.size = 40 + 2 * max(kvals),
kvals = 2:10, seqdist.args = list(method = "LCS"),
method=c("crisp", "fuzzy", "representativeness", "noise"),
m = 1.5, criteria = c("distance"), stability = FALSE, dnoise=NULL,
parallel = FALSE, progressbar = FALSE, keep.diss = FALSE,
max.dist = NULL)
seqdata |
State sequence object of class |
R |
Numeric. The number of subsamples to use. |
sample.size |
Numeric. The size of the subsamples. The default value is the one proposed by Kaufman and Rousseeuw (1990). However, larger values (typically between 1000 and 10 000) are recommended. |
kvals |
Numeric vector. The different numbers of groups to compute. |
seqdist.args |
List of arguments passed to |
method |
Character. The clustering approach to use; defaults to "crisp" clustering. The "fuzzy", "noise" or "representativeness" approaches can also be used. |
m |
Numeric. Only used for fuzzy clustering, the value of the fuzzifier. |
criteria |
Character. The name of the criterion used for selecting the best clustering among the different runs. The following values are accepted: "distance" (default, average distance to cluster medoids), "db" (Davies-Bouldin Index), "xb" (Xie-Beni Index), "pbm" (PBM Index), "ams" (Average medoid silhouette value). |
stability |
Logical. If |
dnoise |
Numerical. The theoretically defined distance to the noise cluster. Mandatory for noise clustering. |
parallel |
Logical. Whether to initialize the parallel processing of the |
progressbar |
Logical. Whether to initialize a progressbar using the |
keep.diss |
Logical. Whether to keep the distances to the medoids. Set to |
max.dist |
Numeric. Maximal theoretical distance value between sequences. Required for |
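As a small worked check of the sample.size default shown in the usage above: it is computed from the largest value in kvals. With the default kvals = 2:10 this gives 60, well below the 1000 to 10 000 range recommended for sample.size. The variable name default.sample.size is illustrative only and is not part of the package.

```r
## Default subsample size implied by the usage: 40 + 2 * max(kvals)
kvals <- 2:10
default.sample.size <- 40 + 2 * max(kvals)
default.sample.size  # 60
```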
seqclararange relies on the CLARA algorithm to cluster large databases. The algorithm works as follows.
1. Randomly take a subsample of the data of size sample.size.
2. Cluster the subsample using the PAM algorithm, initialized using Ward to speed up the computations (see wcKmedoids).
3. Use the identified medoids to assign cluster membership in the whole dataset.
4. Evaluate the resulting clustering using a criterion (see the criteria argument), the average distance to the medoids by default.
These steps are repeated R times and the best solution according to the given criterion is kept.
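The steps above can be sketched on toy numeric data as follows. This is a minimal illustration under stated assumptions, not the package's implementation: it uses cluster::pam with its default initialization instead of wcKmedoids with Ward initialization, Euclidean distances instead of sequence distances, and the "distance" criterion only. The function name clara_sketch and all variable names are hypothetical.

```r
library(cluster)  # pam(); a recommended package shipped with R

set.seed(1)
## Toy data standing in for sequences: two groups of 100 points each
x <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
           matrix(rnorm(200, mean = 5), ncol = 2))

## Hypothetical helper illustrating the CLARA loop (not part of any package)
clara_sketch <- function(x, k, R = 10, sample.size = 40 + 2 * k) {
  n <- nrow(x)
  best <- list(criterion = Inf)
  for (r in seq_len(R)) {
    ## Step 1: draw a random subsample
    idx <- sample(n, sample.size)
    ## Step 2: run PAM on the subsample only
    fit <- pam(dist(x[idx, , drop = FALSE]), k = k, diss = TRUE)
    medoids <- idx[fit$id.med]  # medoid positions in the full data
    ## Step 3: distance from every observation to each medoid,
    ## then assign each observation to its closest medoid
    d2med <- sapply(medoids, function(m) {
      sqrt(rowSums(sweep(x, 2, x[m, ])^2))
    })
    clustering <- max.col(-d2med, ties.method = "first")
    ## Step 4: average distance to the assigned medoid
    criterion <- mean(d2med[cbind(seq_len(n), clustering)])
    ## Keep the best of the R runs
    if (criterion < best$criterion) {
      best <- list(criterion = criterion, medoids = medoids,
                   clustering = clustering)
    }
  }
  best
}

res <- clara_sketch(x, k = 2)
table(res$clustering)
```

Only the subsample's pairwise distances and the n-by-k distances to the medoids are ever computed, which is where the saving over a full n-by-n distance matrix comes from.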
To minimize computation, the operation is repeated for different numbers of groups, which then makes it possible to choose the best number of groups according to different cluster quality indices. The following indices are computed automatically: "Avg dist" (Average distance to cluster medoids), "PBM" (PBM Index), "DB" (Davies-Bouldin Index), "XB" (Xie-Beni Index), "AMS" (Average medoid silhouette width), "ARI>0.8" (Number of iterations similar to the current best, only if stability=TRUE), "JC>0.8" (Number of iterations similar to the current best, only if stability=TRUE).
A seqclararange object with the following components:
kvals: |
The different numbers of groups evaluated. |
clustering: |
The retained clustering for each number of groups. For |
stats: |
A |
clara: |
Detailed information on the best clustering for each number of groups, in the same order as kvals. |
Studer, M., R. Sadeghi and L. Tochon (2024). Sequence Analysis for Large Databases. LIVES Working Papers 104. doi:10.12682/lives.2296-1658.2024.104
See also plot.seqclararange to plot the results.
library(WeightedCluster) # also attaches TraMineR (biofam, seqdef, seqdplot)
data(biofam) # Load illustrative data
## Defining the new state labels
statelab <- c("Parent", "Left", "Married", "Left/Married", "Child",
              "Left/Child", "Left/Married/Child", "Divorced")
## Creating the state sequence object
biofam.seq <- seqdef(biofam[1:100, 10:25], alphabet=0:7, states=statelab)
## Clara clustering
bfclara <- seqclararange(biofam.seq, R = 3, sample.size = 10, kvals = 2:3,
                         seqdist.args = list(method = "HAM"), parallel=FALSE,
                         stability=TRUE)
## Show the cluster quality measures
bfclara
## Plot and normalize the values for easier identification of minimum and maximum values
plot(bfclara, norm="range")
## Stability values
plot(bfclara, stat="stabmean")
plot(bfclara, stat="stability")
## Plot the sequences by cluster for the three-group solution
seqdplot(biofam.seq, group=bfclara$clustering$cluster3)
## Cluster quality indices estimation using bootstrap
bCQI <- bootclustrange(bfclara, biofam.seq, seqdist.args = list(method = "HAM"),
                       R = 3, sample.size = 10, parallel=FALSE)
bCQI
plot(bCQI, norm="zscore")
## Not run:
## Fuzzy clustering
bfclaraf <- seqclararange(biofam.seq, R = 3, sample.size = 20, kvals = 2:3,
                          method="fuzzy", seqdist.args = list(method = "HAM"),
                          parallel=FALSE)
bfclaraf
plot(bfclaraf, norm="zscore")
fuzzyseqplot(biofam.seq, group=bfclaraf$clustering$cluster3, type="I",
             sortv="membership", membership.threashold=0.2)
## Noise clustering
bfclaran <- seqclararange(biofam.seq, R = 3, sample.size = 20, kvals = 2:3,
                          method="noise", seqdist.args = list(method = "HAM"), dnoise=6,
                          parallel=FALSE)
fuzzyseqplot(biofam.seq, group=bfclaran$clustering$cluster3, type="I",
             sortv="membership", membership.threashold=0.2)
## End(Not run)