synth: Illustrative dataset: sequences of five states
In ClickClust: Model-Based Clustering of Categorical Sequences

synth

R Documentation

Description

This synthetic dataset is the illustrative example used in the Journal of Statistical Software paper that describes the package.
It contains categorical sequences over five states, denoted A, B, C, D, and E, with sequence lengths ranging from 10 to 50.
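
The stated properties (five states, lengths between 10 and 50) can be checked with a few lines of base R. The sketch below uses a small hypothetical vector, toy, in place of synth$data, and assumes the sequences are space-separated strings, as the Examples section suggests.

```r
# Hypothetical toy sequences standing in for synth$data (assumed space-separated)
toy <- c("A B B C", "D E A", "C C C D E A")

# Split each string into its individual states
tokens <- strsplit(toy, " ", fixed = TRUE)

# Length of each sequence, and the shortest and longest lengths
seq.len <- lengths(tokens)
len.range <- range(seq.len)

# Distinct states observed across all sequences
states <- sort(unique(unlist(tokens)))

print(len.range)
print(states)
```

With the package data loaded, replacing toy with synth$data should report the length range and state set described above.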

Usage

data(synth)

Format

$data is a vector of 250 strings, each representing one categorical sequence; $id is the original classification vector, with one label per sequence.
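
As a rough illustration of how the two components relate, the sketch below builds a hypothetical object, toy, with the same $data and $id layout and groups the sequences by their original class. The values and the numeric type of the labels are assumptions made for illustration, not taken from the dataset.

```r
# Hypothetical object mimicking the $data / $id layout of synth
toy <- list(
  data = c("A B B C", "D E A", "C C C D E A"),
  id   = c(1, 2, 1)   # assumed numeric class labels, one per sequence
)

# Each sequence should have exactly one class label
stopifnot(length(toy$data) == length(toy$id))

# Number of sequences per original class
class.sizes <- table(toy$id)

# Sequences grouped by original class
by.class <- split(toy$data, toy$id)

print(class.sizes)
print(by.class)
```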

Source

Melnykov, V. (2015)

References

Melnykov, V. (2016) Model-Based Biclustering of Clickstream Data, Computational Statistics and Data Analysis, 93, 31-45.

Melnykov, V. (2016) ClickClust: An R Package for Model-Based Clustering of Categorical Sequences, Journal of Statistical Software, 74, 1-34.

Examples


data(synth)
head(synth$data)

# FUNCTION THAT REPLACES CHARACTER STATES WITH NUMERIC VALUES
# x: character vector of states; ch.lev: character vector of all state labels
repl.levs <- function(x, ch.lev){
	for (j in seq_along(ch.lev)) x <- gsub(ch.lev[j], j, x, fixed = TRUE)
	return(x)
}

# DETECT ALL STATES IN THE DATASET
d <- paste(synth$data, collapse = " ")
d <- strsplit(d, " ")[[1]]
ch.levs <- levels(as.factor(d))

# CONVERT DATA TO THE FORM USED BY click.read()
S <- strsplit(synth$data, " ")
# lapply() always returns a list, even if all sequences have equal length
S <- lapply(S, repl.levs, ch.levs)
S <- lapply(S, as.numeric)
head(S)
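
The conversion above can also be sketched without string substitution by looking up each state's position in the vector of levels with match(). This is an alternative written for illustration, not the package's own method; toy is a hypothetical stand-in for synth$data.

```r
# Hypothetical toy sequences standing in for synth$data
toy <- c("A B B C", "D E A", "C C C D E A")

# Split into states and collect the sorted set of distinct labels
tokens <- strsplit(toy, " ", fixed = TRUE)
ch.levs <- sort(unique(unlist(tokens)))

# Map every state to its index in ch.levs; the result is a list of integer vectors
S <- lapply(tokens, match, table = ch.levs)

print(S)
```

Because match() compares whole labels, this version is not affected by state names that contain one another or that contain digits, which repeated gsub() calls could mangle.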

ClickClust documentation built on June 22, 2024, 12:23 p.m.