ssdtwclust: A shiny app for semi-supervised DTW-based clustering

View source: R/SHINY-ssdtwclust.R

ssdtwclustR Documentation

A shiny app for semi-supervised DTW-based clustering

Description

Display a shiny user interface that implements the approach in Dau et al. (2016).

Usage

ssdtwclust(series, ..., complexity = NULL)

Arguments

series

Time series in the formats accepted by compare_clusterings().

...

More arguments for shiny::runApp().

complexity

A function to calculate a constraint's complexity. See details in the Cluster section.

Details

The approach developed in Dau et al. (2016) argues that finding a good value of window.size for the DTW distance is very important, and suggests how to find one by using user-provided feedback. After clustering is done, a pair of series is presented at a time, and the user must annotate the pair as:

  • Must link: the series should be in the same cluster.

  • Cannot link: the series should not be in the same cluster.

  • Skip: the choice is unclear.

After each step, a good value of the window size is suggested by evaluating which clusterings fulfill the constraint(s) so far, and how (see Dau et al. (2016) for more information), and performing a majority vote using the window sizes inferred from each constraint. The (main) procedure is thus interactive and can be abandoned at any point.

Explore

This part of the app is simply to see some basic characteristics of the provided series and plot some of them. The field for integer IDs expects a valid R expression that specifies which of the series should be plotted. Multivariate series are plotted with each variable in a different facet.

Cluster

This part of the app implements the main procedure by leveraging compare_clusterings(). The interface is similar to interactive_clustering(), so it's worth checking its documentation too. Since compare_clusterings() supports parallelization with foreach::foreach(), you can register a parallel backend before opening the shiny app, but you should pre-load the workers with the necessary packages and/or functions. See parallel::clusterEvalQ() and parallel::clusterExport(), as well as the examples below.

The range of window sizes is specified with a slider, and represents the size as a percentage of the shortest series' length. The step parameter indicates how spaced apart should the sizes be (parameter 'by' in base::seq()). A 0-size window should only be used if all series have the same length. If the series have different lengths, using small window sizes can be problematic if the length differences are very big, see the notes in dtw_basic().

A window.size should not be specified in the extra parameters, it will be replaced with the computed values based on the slider. Using dba() centroid is detected, and will use the same window sizes.

For partitional clusterings with many repetitions, and hierarchical clusterings with many linkage methods, the resulting partitions are aggregated by calling clue::cl_medoid() with the specified aggregation method.

By default, complexity of a constraint is calculated differently from what is suggested in Dau et al. (2016):

  • Allocate a logical flag vector with length equal to the number of tested window sizes.

  • For each window size, set the corresponding flag to TRUE if the constraint given by the user is fulfilled.

  • Calculate complexity as: (number of sign changes in the vector) / (number of window sizes - 1L) / (maximum number of contiguous TRUE flags).

You can provide your own function in the complexity parameter. It will receive the flag vector as only input, and a single number is expected as a result.

The complexity threshold can be specified in the app. Any constraint whose complexity is higher than the threshold will not be considered for the majority vote. Constraints with a complexity of 0 are also ignored. An infinite complexity means that the constraint is never fulfilled by any clustering.

Evaluate

This section provides numerical results for reference. The latest results can be saved in the global environment, which includes clustering results, constraints so far, and the suggested window size. Since this includes everything returned by compare_clusterings(), you could also use repeat_clustering() afterwards.

The constraint plots depict if the constraints are fulfilled or not for the given window sizes, where 1 means it was fulfilled and 0 means it wasn't. An error about a zero-dimension viewport indicates the plot height is too small to fit the plots, so please increase the height.

Note

The optimization mentioned in section 3.4 of Dau et al. (2016) is also implemented here.

Tracing is printed to the console.

Author(s)

Alexis Sarda-Espinosa

References

Dau, H. A., Begum, N., & Keogh, E. (2016). Semi-supervision dramatically improves time series clustering under dynamic time warping. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management (pp. 999-1008). ACM. https://sites.google.com/site/dtwclustering/

See Also

interactive_clustering(), compare_clusterings()

Examples


## Not run: 
require(doParallel)
workers <- makeCluster(detectCores())
clusterEvalQ(workers, {
    library(dtwclust)
    RcppParallel::setThreadOptions(1L)
})
registerDoParallel(workers)
ssdtwclust(reinterpolate(CharTraj[1L:20L], 150L))

## End(Not run)


dtwclust documentation built on Sept. 11, 2024, 9:07 p.m.