View source: R/DISTANCES-sbd.R
SBD | R Documentation |
Distance based on coefficient-normalized cross-correlation as proposed by Paparrizos and Gravano (2015) for the k-Shape clustering algorithm.
SBD(x, y, znorm = FALSE, error.check = TRUE, return.shifted = TRUE)
sbd(x, y, znorm = FALSE, error.check = TRUE, return.shifted = TRUE)
x, y |
Univariate time series. |
znorm |
Logical. Should each series be z-normalized before calculating the distance? |
error.check |
Logical indicating whether the function should try to detect inconsistencies and give more informative error messages. Also used internally to avoid repeating checks. |
return.shifted |
Logical. Should the shifted version of y be returned? See details. |
This distance works best if the series are z-normalized. If not, at least they should have appropriate amplitudes, since the values of the signals do affect the outcome.
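To make the effect of amplitude and offset concrete, here is a minimal base-R sketch. The helpers `znorm_sketch` and `ncc0` are hypothetical names introduced only for this illustration (they are not part of dtwclust), and for brevity the normalized cross-correlation is evaluated at lag 0 only, whereas SBD considers all lags.

```r
# Hypothetical helper: z-normalize a series (zero mean, unit standard deviation)
znorm_sketch <- function(x) (x - mean(x)) / stats::sd(x)

# Hypothetical helper: coefficient-normalized cross-correlation at lag 0 only
ncc0 <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))

x <- sin(seq(0, 2 * pi, length.out = 50))
y <- x + 10  # same shape as x, but with a constant offset

raw_ncc <- ncc0(x, y)                              # offset dominates the result
zn_ncc  <- ncc0(znorm_sketch(x), znorm_sketch(y))  # offset removed first
```

In this sketch the raw value is far below 1 even though the shapes are identical, while the z-normalized value is 1 up to floating-point error.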
If x and y do not have the same length, it would be best if the longer sequence is provided in y, because it will be shifted to match x. After matching, the series may have to be truncated or extended and padded with zeros if needed.
The output values lie between 0 and 2, with 0 indicating perfect similarity.
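The following base-R sketch shows one way such a distance could be computed: the cross-correlation over all lags is obtained with the FFT, normalized by the product of the Euclidean norms, and the distance is one minus its maximum. The function name `sbd_sketch` is hypothetical and this is not the package's implementation; it assumes non-constant numeric vectors without missing values.

```r
# Hypothetical sketch of a shape-based distance (not dtwclust's code)
sbd_sketch <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  # pad to a power of two that can hold every lag without wrap-around
  n <- stats::nextn(nx + ny - 1L, 2L)
  fx <- stats::fft(c(x, numeric(n - nx)))
  fy <- stats::fft(c(y, numeric(n - ny)))
  # circular cross-correlation: cc[k + 1] = sum_t x[t + k] * y[t]
  cc <- Re(stats::fft(fx * Conj(fy), inverse = TRUE)) / n
  # negative lags are stored at the end, non-negative lags at the start;
  # entries in between are padding and must be ignored
  valid <- c(n - ny + 1L + seq_len(ny - 1L), seq_len(nx))
  1 - max(cc[valid]) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

x <- c(0, 0, 1, 2, 1, 0, 0, 0)
y <- c(0, 0, 0, 0, 1, 2, 1, 0)  # same shape as x, shifted by two positions
d_shifted <- sbd_sketch(x, y)   # close to 0: shifting does not matter

a <- c(1, 2, 3)
d_flipped <- sbd_sketch(a, -a)  # larger than 1: the shapes are opposed
```

Because the normalized cross-correlation lies in [-1, 1], the result of this sketch stays within [0, 2], matching the range stated above.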
For return.shifted = FALSE, the numeric distance value, otherwise a list with:
dist: The shape-based distance between x and y.
yshift: A shifted version of y so that it optimally matches x (based on NCCc()).
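As a rough illustration of what a shifted version of y could look like, the base-R sketch below finds the lag with the highest cross-correlation using a direct loop (not the FFT), then places y at that lag inside a zero vector of the same length as x, truncating whatever falls outside. The function `shift_to_match` is a hypothetical helper, and the package's exact padding and truncation rules may differ.

```r
# Hypothetical sketch: align y to x at the best lag, zero-padded or truncated
shift_to_match <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  idx <- seq_len(ny)
  lags <- seq(-(ny - 1L), nx - 1L)
  # cross-correlation at lag k: sum over t of x[t + k] * y[t]
  cc <- vapply(lags, function(k) {
    keep <- (idx + k) >= 1L & (idx + k) <= nx
    sum(x[idx[keep] + k] * y[idx[keep]])
  }, numeric(1))
  best <- lags[which.max(cc)]
  out <- numeric(nx)  # zeros act as padding
  keep <- (idx + best) >= 1L & (idx + best) <= nx
  out[idx[keep] + best] <- y[keep]  # values shifted outside x are dropped
  list(lag = best, yshift = out)
}

x <- c(0, 0, 1, 2, 1, 0, 0, 0)
y <- c(1, 2, 1, 0, 0, 0, 0, 0, 0, 0)  # longer series, pattern at the start
res <- shift_to_match(x, y)
```

Here the longer series is passed as y, as recommended above, and the result has the same length as x.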
The version registered with proxy::dist() is custom (loop = FALSE in proxy::pr_DB).
The custom function handles multi-threaded parallelization directly with RcppParallel. It uses all available threads by default (see RcppParallel::defaultNumThreads()), but this can be changed by the user with RcppParallel::setThreadOptions().
An exception to the above is when it is called within a foreach parallel loop made by dtwclust. If the parallel workers do not have the number of threads explicitly specified, this function will default to 1 thread per worker. See the parallelization vignette for more information: browseVignettes("dtwclust").
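The thread settings described above could be adjusted along these lines. This is a minimal sketch that assumes the RcppParallel package is installed; the value of 2 threads is arbitrary and chosen only for illustration.

```r
# Only runs if RcppParallel is available on this system
if (requireNamespace("RcppParallel", quietly = TRUE)) {
  # number of threads RcppParallel would use by default
  default_threads <- RcppParallel::defaultNumThreads()

  # limit subsequent RcppParallel-based computations to 2 threads
  RcppParallel::setThreadOptions(numThreads = 2L)
}
```

The setting applies to the current R session, so it should be made before calling proxy::dist().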
It also includes symmetric optimizations to calculate only half a distance matrix when appropriate. For this to apply, only one list of series should be provided in x.
Starting with version 6.0.0, this optimization means that the function returns an array with the lower triangular values of the distance matrix, similar to what stats::dist() does; see DistmatLowerTriangular for a helper to access elements as if it were a normal matrix.
If you want to avoid this optimization, call proxy::dist by giving the same list of series in both x and y.
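To show what a lower triangular layout means in practice, the sketch below uses stats::dist() itself, whose column-wise storage is documented in its help page. The helper `lower_tri_index` is hypothetical, and whether the array returned by this package follows exactly the same ordering is an assumption that should be checked against DistmatLowerTriangular.

```r
set.seed(1)
pts <- matrix(rnorm(10), nrow = 5)  # 5 toy observations
d <- stats::dist(pts)               # stores only the lower triangle
n <- attr(d, "Size")

# Hypothetical helper: position of element (i, j), i != j, in the vector
lower_tri_index <- function(i, j, n) {
  stopifnot(i != j)
  if (i < j) {  # the layout is symmetric, so order the pair
    tmp <- i
    i <- j
    j <- tmp
  }
  n * (j - 1) - j * (j - 1) / 2 + i - j
}

full <- as.matrix(d)                          # regular 5 x 5 matrix
from_vector <- d[lower_tri_index(4, 2, n)]    # element (4, 2) from the vector
from_matrix <- unname(full[4, 2])             # same element from the matrix
```

A vector of this kind holds n * (n - 1) / 2 values instead of n * n, which is where the memory saving comes from.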
In some situations, e.g. for relatively small distance matrices, the overhead introduced by the logic that computes only half the distance matrix can be bigger than just calculating the whole matrix.
If you wish to calculate the distance between several time series, it would be better to use the version registered with the proxy package, since it includes some small optimizations. See the examples.
This distance is calculated with the help of the Fast Fourier Transform, so it can be sensitive to numerical precision. Thus, this function (and the functions that depend on it) might return different values in 32-bit installations compared to 64-bit ones.
Paparrizos J and Gravano L (2015). “k-Shape: Efficient and Accurate Clustering of Time Series.” In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, series SIGMOD '15, pp. 1855-1870. ISBN 978-1-4503-2758-9, doi:10.1145/2723372.2737793.
NCCc(), shape_extraction()
# load data
data(uciCT)
# distance between series of different lengths
sbd <- SBD(CharTraj[[1]], CharTraj[[100]], znorm = TRUE)$dist
# cross-distance matrix for series subset (notice the two-list input)
sbD <- proxy::dist(CharTraj[1:10], CharTraj[1:10], method = "SBD", znorm = TRUE)