SBD: Shape-based distance

Description

Distance based on coefficient-normalized cross-correlation as proposed by Paparrizos and Gravano (2015) for the k-Shape clustering algorithm.

Usage

SBD(x, y, znorm = FALSE, error.check = TRUE, return.shifted = TRUE)

sbd(x, y, znorm = FALSE, error.check = TRUE, return.shifted = TRUE)

Arguments

x, y

Univariate time series.

znorm

Logical. Should each series be z-normalized before calculating the distance?

error.check

Logical indicating whether the function should try to detect inconsistencies and give more informative error messages. Also used internally to avoid repeating checks.

return.shifted

Logical. Should the shifted version of y be returned? See details.

Details

This distance works best if the series are z-normalized. If they are not, they should at least have appropriate amplitudes, because the values of the signals affect the outcome.

If x and y do not have the same length, it is best to provide the longer series in y, because y is the one that is shifted to match x. After matching, the shifted series may be truncated, or extended and padded with zeros, as needed.

The output values lie between 0 and 2, with 0 indicating perfect similarity.
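
The statements above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: sbd_sketch and znorm_sketch are hypothetical helper names, and the cross-correlation is taken from stats::convolve(), which is computed with the FFT.

```r
# Hypothetical helpers for illustration only; they are not part of dtwclust.

# z-normalize a series: zero mean, unit standard deviation
znorm_sketch <- function(x) (x - mean(x)) / sd(x)

# SBD as one minus the maximum of the coefficient-normalized cross-correlation
sbd_sketch <- function(x, y) {
    den <- sqrt(sum(x^2)) * sqrt(sum(y^2))
    # cross-correlation of x and y at every lag (FFT-based)
    cc <- stats::convolve(x, y, type = "open")
    1 - max(cc) / den
}

x <- sin(seq(0, 2 * pi, length.out = 50))
y <- x + 3  # same shape, different offset

d_self <- sbd_sketch(x, x)                               # essentially 0
d_raw <- sbd_sketch(x, y)                                # the offset inflates the distance
d_znorm <- sbd_sketch(znorm_sketch(x), znorm_sketch(y))  # essentially 0 again
```

Because the normalized cross-correlation is bounded by -1 and 1 (Cauchy-Schwarz inequality), the sketched distance stays between 0 and 2. The offset in y changes the result only when the series are not z-normalized.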

Value

For return.shifted = FALSE, the numeric distance value, otherwise a list with:

  • dist: The shape-based distance between x and y.

  • yshift: A shifted version of y so that it optimally matches x (based on NCCc()).
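
How a shifted version of y can be obtained is sketched below in base R. The helper shift_to_match is hypothetical and only mirrors the behavior described in Details (shift, then truncate or zero-pad to the length of x); the actual dtwclust code may differ.

```r
# Hypothetical helper for illustration only; not the dtwclust implementation.
shift_to_match <- function(x, y) {
    nx <- length(x)
    ny <- length(y)
    cc <- stats::convolve(x, y, type = "open")
    # position ny of cc corresponds to lag 0
    lag <- which.max(cc) - ny
    shifted <- if (lag >= 0) c(rep(0, lag), y) else y[(1 - lag):ny]
    # truncate or zero-pad so the result has the same length as x
    length_diff <- nx - length(shifted)
    if (length_diff > 0) c(shifted, rep(0, length_diff)) else shifted[seq_len(nx)]
}

x <- c(0, 0, 1, 2, 3, 0, 0, 0)
y <- c(1, 2, 3)
y_aligned <- shift_to_match(x, y)  # y moved two positions right, then zero-padded
```

In this toy case the best lag is 2, so the aligned series coincides with x.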

Proxy version

The version registered with proxy::dist() is custom (loop = FALSE in proxy::pr_DB). The custom function handles multi-threaded parallelization directly with RcppParallel. It uses all available threads by default (see RcppParallel::defaultNumThreads()), but this can be changed by the user with RcppParallel::setThreadOptions().

An exception to the above is when it is called within a foreach parallel loop made by dtwclust. If the parallel workers do not have the number of threads explicitly specified, this function will default to 1 thread per worker. See the parallelization vignette for more information: browseVignettes("dtwclust").

It also includes symmetric optimizations to calculate only half a distance matrix when appropriate, that is, when only one list of series is provided in x. Starting with version 6.0.0, this optimization means that the function returns an array with the lower triangular values of the distance matrix, similar to what stats::dist() does; see DistmatLowerTriangular for a helper to access elements as if it were a normal matrix. If you want to avoid this optimization, call proxy::dist() by giving the same list of series in both x and y.

In some situations, e.g. for relatively small distance matrices, the overhead introduced by the logic that computes only half the distance matrix can be bigger than just calculating the whole matrix.
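
The lower-triangular storage mentioned above can be illustrated with stats::dist() from base R. This is only an analogy: lower_tri_index is a hypothetical helper, and the sketch does not use the DistmatLowerTriangular class from dtwclust.

```r
# Illustration with stats::dist(); lower_tri_index is a hypothetical helper,
# not the DistmatLowerTriangular class from dtwclust.
lower_tri_index <- function(i, j, n) {
    # position of element (i, j), with i > j, in the column-wise lower triangle
    n * (j - 1) - j * (j - 1) / 2 + i - j
}

pts <- matrix(c(0, 1, 3, 6, 10), ncol = 1)
d <- stats::dist(pts)   # stores only 5 * 4 / 2 = 10 values
full <- as.matrix(d)    # full 5 x 5 symmetric matrix

value_from_vector <- d[lower_tri_index(4, 2, n = 5)]
value_from_matrix <- full[4, 2]
```

Both lookups refer to the same pair of observations, but the first reads it from the compact vector without building the full matrix.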

Note

If you wish to calculate the distance between several time series, it would be better to use the version registered with the proxy package, since it includes some small optimizations. See the examples.

This distance is calculated with the help of the Fast Fourier Transform, so it can be sensitive to numerical precision. Thus, this function (and the functions that depend on it) might return different values in 32-bit installations compared to 64-bit ones.

References

Paparrizos J and Gravano L (2015). “k-Shape: Efficient and Accurate Clustering of Time Series.” In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, series SIGMOD '15, pp. 1855-1870. ISBN 978-1-4503-2758-9, doi:10.1145/2723372.2737793.

See Also

NCCc(), shape_extraction()

Examples


# load the package and data (uciCT provides CharTraj)
library(dtwclust)
data(uciCT)

# distance between series of different lengths
sbd <- SBD(CharTraj[[1]], CharTraj[[100]], znorm = TRUE)$dist

# cross-distance matrix for series subset (notice the two-list input)
sbD <- proxy::dist(CharTraj[1:10], CharTraj[1:10], method = "SBD", znorm = TRUE)


dtwclust documentation built on Sept. 11, 2024, 9:07 p.m.