compare_clusterings: Compare different clustering configurations
In dtwclust: Time Series Clustering Along with Optimizations for the Dynamic Time Warping Distance

Description

Compare many different clustering algorithms with support for parallelization.

Usage

compare_clusterings(
  series = NULL,
  types = c("p", "h", "f", "t"),
  configs = compare_clusterings_configs(types),
  seed = NULL,
  trace = FALSE,
  ...,
  score.clus = function(...) stop("No scoring"),
  pick.clus = function(...) stop("No picking"),
  shuffle.configs = FALSE,
  return.objects = FALSE,
  packages = character(0L),
  .errorhandling = "stop"
)

Arguments

`series`	A list of series, a numeric matrix or a data frame. Matrices and data frames are coerced to a list row-wise (see `tslist()`).
`types`	Clustering types. It can be any combination of (possibly abbreviated): "partitional", "hierarchical", "fuzzy", "tadpole".
`configs`	A list of data frames with the desired configurations to run. See `pdc_configs()` and `compare_clusterings_configs()`.
`seed`	Seed for random reproducibility.
`trace`	Logical indicating that more output should be printed to screen.
`...`	Further arguments for `tsclust()`, `score.clus` or `pick.clus`.
`score.clus`	A function that gets the list of results (and `...`) and scores each one. It may also be a named list of functions, one for each type of clustering. See Scoring section.
`pick.clus`	A function to pick the best result. See Picking section.
`shuffle.configs`	Randomly shuffle the order of configs, which can be useful to balance load when using parallel computation.
`return.objects`	Logical indicating whether the objects returned by `tsclust()` should be given in the result.
`packages`	A character vector with the names of any packages needed for any functions used (distance, centroid, preprocessing, etc.). The name "dtwclust" is added automatically. Relevant for parallel computation.
`.errorhandling`	This will be passed to `foreach::foreach()`. See Parallel section below.

Details

This function calls tsclust() with different configurations and evaluates the results with the provided functions. Parallel support is included. See the examples.

Parameters specified in configs whose values are NA will be ignored automatically.
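As a rough illustration of what ignoring NA values means, the following base R sketch turns one row of a hypothetical configuration data frame into a list of arguments and keeps only the non-NA entries. The column names and the helper row_to_args() are made up for this example, and the sketch is not meant to reproduce how dtwclust handles this internally.

```r
# Hypothetical flattened configuration: one row per clustering to run.
# Column names are illustrative only, not the special names used by dtwclust.
configs <- data.frame(
    k = c(19L, 20L),
    distance = c("sbd", "dtw_basic"),
    window.size = c(NA, 20L),
    stringsAsFactors = FALSE
)

# Hypothetical helper: turn one row into a list of arguments, dropping NAs
row_to_args <- function(config_row) {
    args <- as.list(config_row)
    args[!vapply(args, function(value) is.na(value), logical(1L))]
}

args_first <- row_to_args(configs[1L, ])
args_second <- row_to_args(configs[2L, ])

names(args_first)   # window.size is absent because it was NA in the first row
names(args_second)  # window.size is kept because it has a value
```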

The scoring and picking functions are provided for convenience; if they are not specified, the scores and pick elements of the result will be NULL.

If return.objects = FALSE, see repeat_clustering() for a way to obtain the cluster object of a specific configuration afterwards.

Value

A list with:

results: A list of data frames with the flattened configs and the corresponding scores returned by score.clus.
scores: The scores given by score.clus.
pick: The object returned by pick.clus.
proc_time: The measured execution time, using base::proc.time().
seeds: A list of lists with the random seeds computed for each configuration.

The cluster objects are also returned if return.objects = TRUE.

Parallel computation

The configurations for each clustering type can be evaluated in parallel (multi-processing) with the foreach package. A parallel backend can be registered, e.g., with doParallel.

If the .errorhandling parameter is changed to "pass" and a custom score.clus function is used, said function should be able to deal with possible error objects.

If .errorhandling is changed to "remove", it might not be possible to attach the scores to the results data frame, or the attached scores may be inconsistent. Additionally, if return.objects is TRUE, the names given to the objects might also be inconsistent.
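For the "pass" case, a custom scoring function could check each result before scoring it. The sketch below assumes that failed configurations show up as R condition objects inheriting from "error". The names score_one() and safe_score() and the mock results are hypothetical and only serve to make the example runnable without dtwclust.

```r
# Hypothetical scorer for a single successful result; here it just reads a
# made-up 'score' element so that the example runs without dtwclust.
score_one <- function(result) result$score

# score.clus-style function: receives the list of results for one clustering
# type and returns NA for every entry that is an error object
safe_score <- function(results, ...) {
    vapply(results, function(result) {
        if (inherits(result, "error")) NA_real_ else score_one(result)
    }, numeric(1L))
}

# Mock results: two successes and one error object
mock_results <- list(
    list(score = 0.8),
    simpleError("clustering failed"),
    list(score = 0.5)
)

scores <- safe_score(mock_results)
scores
```

Returning NA instead of dropping the entry keeps the scores aligned with the rows of the results data frame.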

Parallelization can incur a lot of deep copies of data when returning the cluster objects, since each one will contain a copy of datalist. If you want to avoid this, consider specifying score.clus and setting return.objects to FALSE, and then using repeat_clustering().

Scoring

The clustering results are organized in a list of lists in the following way (only the applicable types exist; the first-level list names are partitional, hierarchical, fuzzy and tadpole):

partitional - list with
    - Clustering results from first partitional config
    - etc.
hierarchical - list with
    - Clustering results from first hierarchical config
    - etc.
fuzzy - list with
    - Clustering results from first fuzzy config
    - etc.
tadpole - list with
    - Clustering results from first tadpole config
    - etc.

If score.clus is a function, it will be applied to the available partitional, hierarchical, fuzzy and/or tadpole results via:

scores <- lapply(list_of_lists, score.clus, ...)

Otherwise, score.clus should be a list of functions with the same names as the list above, so that score.clus$partitional is used to score list_of_lists$partitional and so on (via base::Map()).

Therefore, the scores returned shall always be a list of lists with first-level names as above.
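The two ways of scoring can be mimicked in base R with a mock list of lists. In this sketch plain numbers stand in for the cluster objects, and the scoring functions are arbitrary placeholders; only lapply() and base::Map() correspond to what is described above.

```r
# Mock list of lists standing in for the clustering results; real entries
# would be cluster objects rather than plain numbers.
list_of_lists <- list(
    partitional = list(1, 2),
    fuzzy = list(3)
)

# Case 1: a single function applied to the results of every type
score_single <- function(results, ...) {
    vapply(results, function(x) x * 10, numeric(1L))
}
scores_single <- lapply(list_of_lists, score_single)

# Case 2: a named list of functions, matched to the types by name
score_list <- list(
    partitional = function(results, ...) vapply(results, function(x) x + 1, numeric(1L)),
    fuzzy = function(results, ...) vapply(results, function(x) x - 1, numeric(1L))
)
scores_by_type <- Map(
    function(results, scorer) scorer(results),
    list_of_lists,
    score_list[names(list_of_lists)]
)

str(scores_single)
str(scores_by_type)
```

In both cases the first-level names of the scores match those of the results.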

Picking

If return.objects is TRUE, the results' data frames and the list of TSClusters objects are given to pick.clus as first and second arguments respectively, followed by .... Otherwise, pick.clus will receive only the data frames and the contents of ... (since the objects will not be returned by the preceding step).
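A custom picking function might look like the following sketch. It assumes that each results data frame carries a score column, here given the made-up name my_score, where higher is better. The function pick_best() and the mock data are hypothetical. The objects argument defaults to NULL so that the same function can be used whether or not return.objects is TRUE, provided that nothing else is passed positionally through the ellipsis.

```r
# Sketch of a pick.clus-style function. 'results' is assumed to be a list of
# data frames (one per clustering type) with a made-up column 'my_score';
# 'objects' would only be supplied if return.objects = TRUE.
pick_best <- function(results, objects = NULL, ...) {
    # Best row within each clustering type
    best_per_type <- lapply(results, function(df) {
        df[which.max(df$my_score), , drop = FALSE]
    })
    # Best type overall
    best_scores <- vapply(best_per_type, function(df) df$my_score, numeric(1L))
    best_type <- names(best_scores)[which.max(best_scores)]
    best_per_type[[best_type]]
}

# Mock results data frames standing in for the flattened configs plus scores
mock_results <- list(
    partitional = data.frame(config_id = c("config1", "config2"),
                             my_score = c(0.4, 0.7)),
    fuzzy = data.frame(config_id = "config3", my_score = 0.6)
)

picked <- pick_best(mock_results)
picked
```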

Limitations

Note that the configurations returned by the helper functions assign special names to preprocessing/distance/centroid arguments, and these names are used internally to recognize them.

If some of these arguments are more complex (e.g. matrices) and should not be expanded, consider passing them directly via the ellipsis (...) instead of using pdc_configs(). This assumes that said arguments can be passed to all functions without affecting their results.

The distance matrices (if calculated) are not re-used across configurations. Given the way the configurations are created, this shouldn't matter, because clusterings with arguments that can use the same distance matrix are already grouped together by compare_clusterings_configs() and pdc_configs().

Author(s)

Alexis Sarda-Espinosa

Examples

# Fuzzy preprocessing: calculate autocorrelation up to 50th lag
acf_fun <- function(series, ...) {
    lapply(series, function(x) {
        as.numeric(acf(x, lag.max = 50, plot = FALSE)$acf)
    })
}

# Define overall configuration
cfgs <- compare_clusterings_configs(
    types = c("p", "h", "f", "t"),
    k = 19L:20L,
    controls = list(
        partitional = partitional_control(
            iter.max = 30L,
            nrep = 1L
        ),
        hierarchical = hierarchical_control(
            method = "all"
        ),
        fuzzy = fuzzy_control(
            # notice the vector
            fuzziness = c(2, 2.5),
            iter.max = 30L
        ),
        tadpole = tadpole_control(
            # notice the vectors
            dc = c(1.5, 2),
            window.size = 19L:20L
        )
    ),
    preprocs = pdc_configs(
        type = "preproc",
        # shared
        none = list(),
        zscore = list(center = c(FALSE)),
        # only for fuzzy
        fuzzy = list(
            acf_fun = list()
        ),
        # only for tadpole
        tadpole = list(
            reinterpolate = list(new.length = 205L)
        ),
        # specify which should consider the shared ones
        share.config = c("p", "h")
    ),
    distances = pdc_configs(
        type = "distance",
        sbd = list(),
        fuzzy = list(
            L2 = list()
        ),
        share.config = c("p", "h")
    ),
    centroids = pdc_configs(
        type = "centroid",
        partitional = list(
            pam = list()
        ),
        # special name 'default'
        hierarchical = list(
            default = list()
        ),
        fuzzy = list(
            fcmdd = list()
        ),
        tadpole = list(
            default = list(),
            shape_extraction = list(znorm = TRUE)
        )
    )
)

# Number of configurations is returned as attribute
num_configs <- sapply(cfgs, attr, which = "num.configs")
cat("\nTotal number of configurations without considering optimizations:",
    sum(num_configs),
    "\n\n")

# Define evaluation functions based on CVI: Variation of Information (only crisp partition)
vi_evaluators <- cvi_evaluators("VI", ground.truth = CharTrajLabels)
score_fun <- vi_evaluators$score
pick_fun <- vi_evaluators$pick

# ====================================================================================
# Short run with only fuzzy clustering
# ====================================================================================

comparison_short <- compare_clusterings(CharTraj, types = c("f"), configs = cfgs,
                                        seed = 293L, trace = TRUE,
                                        score.clus = score_fun, pick.clus = pick_fun,
                                        return.objects = TRUE)

## Not run: 
# ====================================================================================
# Parallel run with all comparisons
# ====================================================================================

require(doParallel)
registerDoParallel(cl <- makeCluster(detectCores()))

comparison_long <- compare_clusterings(CharTraj, types = c("p", "h", "f", "t"),
                                       configs = cfgs,
                                       seed = 293L, trace = TRUE,
                                       score.clus = score_fun,
                                       pick.clus = pick_fun,
                                       return.objects = TRUE)

# Using all external CVIs and majority vote
external_evaluators <- cvi_evaluators("external", ground.truth = CharTrajLabels)
score_external <- external_evaluators$score
pick_majority <- external_evaluators$pick

comparison_majority <- compare_clusterings(CharTraj, types = c("p", "h", "f", "t"),
                                           configs = cfgs,
                                           seed = 84L, trace = TRUE,
                                           score.clus = score_external,
                                           pick.clus = pick_majority,
                                           return.objects = TRUE)

# best results
plot(comparison_majority$pick$object)
print(comparison_majority$pick$config)

stopCluster(cl); registerDoSEQ()

# ====================================================================================
# A run with only partitional clusterings
# ====================================================================================

p_cfgs <- compare_clusterings_configs(
    types = "p", k = 19L:21L,
    controls = list(
        partitional = partitional_control(
            iter.max = 20L,
            nrep = 8L
        )
    ),
    preprocs = pdc_configs(
        "preproc",
        none = list(),
        zscore = list(center = c(FALSE, TRUE))
    ),
    distances = pdc_configs(
        "distance",
        sbd = list(),
        dtw_basic = list(window.size = 19L:20L,
                         norm = c("L1", "L2")),
        gak = list(window.size = 19L:20L,
                   sigma = 100)
    ),
    centroids = pdc_configs(
        "centroid",
        partitional = list(
            pam = list(),
            shape = list()
        )
    )
)

# Remove redundant (shape centroid always uses zscore preprocessing)
id_redundant <- p_cfgs$partitional$preproc == "none" &
    p_cfgs$partitional$centroid == "shape"
p_cfgs$partitional <- p_cfgs$partitional[!id_redundant, ]

# LONG! 30 minutes or so, sequentially
comparison_partitional <- compare_clusterings(CharTraj, types = "p",
                                              configs = p_cfgs,
                                              seed = 32903L, trace = TRUE,
                                              score.clus = score_fun,
                                              pick.clus = pick_fun,
                                              shuffle.configs = TRUE,
                                              return.objects = TRUE)

## End(Not run)

dtwclust documentation built on Sept. 11, 2024, 9:07 p.m.
