R/CLUSTERING-tadpole.R In dtwclust: Time Series Clustering Along with Optimizations for the Dynamic Time Warping Distance

```#' TADPole clustering
#'
#' Time-series Anytime Density Peaks Clustering as proposed by Begum et al. (2015).
#'
#' @export
#' @importFrom proxy dist
#'
#' @param data A matrix or data frame where each row is a time series, or a list where each element
#'   is a time series. Multivariate series are **not** supported.
#' @param window.size Window size constraint for DTW (Sakoe-Chiba). See details.
#' @param k The number of desired clusters. Can be a vector with several values.
#' @param dc The cutoff distance(s). Can be a vector with several values.
#' @template error-check
#' @param lb Which lower bound to use, "lbk" for [lb_keogh()] or "lbi" for [lb_improved()].
#' @param trace Logical flag. If `TRUE`, more output regarding the progress is printed to screen.
#'
#' @details
#'
#' This function can be called either directly or through [tsclust()].
#'
#' clustering with DTW. See the cited article for the details of the algorithm.
#'
#' Because of the way the algorithm works, it can be considered a kind of Partitioning Around
#' Medoids (PAM). This means that the cluster centroids are always elements of the data. However,
#' this algorithm is deterministic, depending on the value of `dc`.
#'
#' The algorithm first uses the DTW's upper and lower bounds (Euclidean and LB_Keogh respectively)
#' to find series with many close neighbors (in DTW space). Anything below the cutoff distance
#' (`dc`) is considered a neighbor. Aided with this information, the algorithm then tries to prune
#' as many DTW calculations as possible in order to accelerate the clustering procedure. The series
#' that lie in dense areas (i.e. that have lots of neighbors) are taken as cluster centroids.
#'
#' The algorithm relies on the DTW bounds, which are only defined for univariate time series of
#' equal length.
#'
#' Parallelization is supported in the following way:
#'
#' - For multiple `dc` values, multi-processing with [foreach::foreach()].
#' - The internal distance calculations use multi-threading with [RcppParallel::RcppParallel].
#'
#' @template window
#'
#' @return
#'
#' A list with:
#'
#' - `cl`: Cluster indices.
#' - `centroids`: Indices of the centroids.
#' - `distCalcPercentage`: Percentage of distance calculations that were actually performed.
#'
#' For multiple `k`/`dc` values, a list of lists is returned, each internal list having the
#' aforementioned elements.
#'
#' @template rcpp-parallel
#'
#' @references
#'
#' Begum N, Ulanova L, Wang J and Keogh E (2015). ``Accelerating Dynamic Time Warping Clustering
#' with a Novel Admissible Pruning Strategy.'' In *Conference on Knowledge Discovery and Data
#' Mining*, series KDD '15. ISBN 978-1-4503-3664-2/15/08, \doi{10.1145/2783258.2783286}.
#'
TADPole <- function(data, k = 2L, dc, window.size, error.check = TRUE, lb = "lbk", trace = FALSE) {
if (missing(window.size)) stop("Please provide a positive window size") # nocov start
if (missing(dc)) stop("Please provide the 'dc' parameter")
if (any(dc < 0)) stop("The cutoff distance 'dc' must be positive")
x <- tslist(data)
n <- length(x)
if (n < 2L) stop("data should have more than one time series")
if (any(k > n)) stop("Number of clusters should be less than the number of time series")
lb <- match.arg(lb, c("lbk", "lbi"))
if (trace) cat("Computing lower and upper bound matrices\n") # nocov end
# Calculate matrices with bounds (error check in lbk/lbi)
LBM <- proxy::dist(x, x,
method = lb,
window.size = window.size,
force.symmetry = TRUE,
norm = "L2",
error.check = error.check)
# NOTE: Euclidean is only valid as upper bound if 'symmetric1' step pattern is used
UBM <- proxy::dist(x, x, method = "L2")

len <- max(lengths(data)) + 1L
dtw_args <- list(window.size = window.size,
norm = 2,
step.pattern = 1,
backtrack = FALSE,
gcm = matrix(0, 2L, len),
is.multivariate = FALSE,
normalize = FALSE,
sqrt.dist = TRUE)

RET <- foreach(
dc = dc, .combine = c, .multicombine = TRUE,
.packages = "dtwclust", .export = "call_tadpole") %op%
{
ret <- lapply(k, function(dummy) { list() }) # modified in place in C++
call_tadpole(x, k, dc, dtw_args, LBM, UBM, trace, ret) # foreach can't see C objects
# return
ret
}
# Return
if (length(RET) == 1L) RET[[1L]] else RET
}

# a parallel foreach loop will not see the C objects, so I need to pass a function from the
# dtwclust namespace (via the .export parameter)
call_tadpole <- function(x, k, dc, dtw_args, LBM, UBM, trace, ret) {
# UTILS-utils.R
.Call(C_tadpole, x, k, dc, dtw_args, LBM, UBM, trace, ret, num_threads, PACKAGE = "dtwclust")
}