R/CLUSTERING-tadpole.R

Defines functions call_tadpole TADPole

Documented in TADPole

#' TADPole clustering
#'
#' Time-series Anytime Density Peaks Clustering as proposed by Begum et al. (2015).
#'
#' @export
#' @importFrom proxy dist
#'
#' @param data A matrix or data frame where each row is a time series, or a list where each element
#'   is a time series. Multivariate series are **not** supported.
#' @param window.size Window size constraint for DTW (Sakoe-Chiba). See details.
#' @param k The number of desired clusters. Can be a vector with several values.
#' @param dc The cutoff distance(s). Can be a vector with several values.
#' @template error-check
#' @param lb Which lower bound to use, "lbk" for [lb_keogh()] or "lbi" for [lb_improved()].
#' @param trace Logical flag. If `TRUE`, more output regarding the progress is printed to screen.
#'
#' @details
#'
#' This function can be called either directly or through [tsclust()].
#'
#' TADPole clustering adopts a relatively new clustering framework and adapts it to time series
#' clustering with DTW. See the cited article for the details of the algorithm.
#'
#' Because of the way the algorithm works, it can be considered a kind of Partitioning Around
#' Medoids (PAM). This means that the cluster centroids are always elements of the data. However,
#' this algorithm is deterministic, depending on the value of `dc`.
#'
#' The algorithm first uses the DTW's upper and lower bounds (Euclidean and LB_Keogh respectively)
#' to find series with many close neighbors (in DTW space). Anything below the cutoff distance
#' (`dc`) is considered a neighbor. Aided with this information, the algorithm then tries to prune
#' as many DTW calculations as possible in order to accelerate the clustering procedure. The series
#' that lie in dense areas (i.e. that have lots of neighbors) are taken as cluster centroids.
#'
#' The algorithm relies on the DTW bounds, which are only defined for univariate time series of
#' equal length.
#'
#' Parallelization is supported in the following way:
#'
#' - For multiple `dc` values, multi-processing with [foreach::foreach()].
#' - The internal distance calculations use multi-threading with [RcppParallel::RcppParallel].
#'
#' @template window
#'
#' @return
#'
#' A list with:
#'
#' - `cl`: Cluster indices.
#' - `centroids`: Indices of the centroids.
#' - `distCalcPercentage`: Percentage of distance calculations that were actually performed.
#'
#' For multiple `k`/`dc` values, a list of lists is returned, each internal list having the
#' aforementioned elements.
#'
#' @template rcpp-parallel
#'
#' @references
#'
#' Begum N, Ulanova L, Wang J and Keogh E (2015). ``Accelerating Dynamic Time Warping Clustering
#' with a Novel Admissible Pruning Strategy.'' In *Conference on Knowledge Discovery and Data
#' Mining*, series KDD '15. ISBN 978-1-4503-3664-2/15/08, \doi{10.1145/2783258.2783286}.
#'
TADPole <- function(data, k = 2L, dc, window.size, error.check = TRUE, lb = "lbk", trace = FALSE) {
    if (missing(window.size)) stop("Please provide a positive window size") # nocov start
    if (missing(dc)) stop("Please provide the 'dc' parameter")
    if (any(dc < 0)) stop("The cutoff distance 'dc' must be positive")
    x <- tslist(data)
    n <- length(x)
    if (n < 2L) stop("data should have more than one time series")
    if (any(k > n)) stop("Number of clusters should be less than the number of time series")
    lb <- match.arg(lb, c("lbk", "lbi"))
    if (trace) cat("Computing lower and upper bound matrices\n") # nocov end
    # Calculate matrices with bounds (error check in lbk/lbi)
    LBM <- proxy::dist(x, x,
                       method = lb,
                       window.size = window.size,
                       force.symmetry = TRUE,
                       norm = "L2",
                       error.check = error.check)
    # NOTE: Euclidean is only valid as upper bound if 'symmetric1' step pattern is used
    UBM <- proxy::dist(x, x, method = "L2")

    len <- max(lengths(data)) + 1L
    dtw_args <- list(window.size = window.size,
                     norm = 2,
                     step.pattern = 1,
                     backtrack = FALSE,
                     gcm = matrix(0, 2L, len),
                     is.multivariate = FALSE,
                     normalize = FALSE,
                     sqrt.dist = TRUE)

    RET <- foreach(
        dc = dc, .combine = c, .multicombine = TRUE,
        .packages = "dtwclust", .export = "call_tadpole") %op%
        {
            ret <- lapply(k, function(dummy) { list() }) # modified in place in C++
            call_tadpole(x, k, dc, dtw_args, LBM, UBM, trace, ret) # foreach can't see C objects
            # return
            ret
        }
    # Return
    if (length(RET) == 1L) RET[[1L]] else RET
}

# a parallel foreach loop will not see the C objects, so I need to pass a function from the
# dtwclust namespace (via the .export parameter)
call_tadpole <- function(x, k, dc, dtw_args, LBM, UBM, trace, ret) {
    # UTILS-utils.R
    num_threads <- get_nthreads()
    .Call(C_tadpole, x, k, dc, dtw_args, LBM, UBM, trace, ret, num_threads, PACKAGE = "dtwclust")
}

#' @rdname TADPole
#' @export
#'
tadpole <- TADPole
asardaes/dtwclust documentation built on March 3, 2023, 5:32 a.m.