# Normalized Compression Distance

### Description

Computes the distance based on the sizes of the compressed time series.

### Usage

`diss.NCD(x, y, type)`

### Arguments

| Argument | Description |
| --- | --- |
| `x` | Numeric vector containing the first of the two time series. |
| `y` | Numeric vector containing the second of the two time series. |
| `type` | Character string, the type of compression. May be abbreviated to a single letter; defaults to the first of the alternatives. |

### Details

The compression-based dissimilarity follows the normalized compression distance of Cilibrasi and Vitányi (2005):

$$
d(x,y) = \frac{C(xy) - \min\{C(x), C(y)\}}{\max\{C(x), C(y)\}}
$$

where *C(x)* and *C(y)* are the sizes in bytes of the compressed series *x* and *y*, and *C(xy)* is the size in bytes of the compressed concatenation of *x* and *y*. The compression algorithm is chosen with `type`, which can be "gzip", "bzip2" or "xz" (see `memCompress`). The additional value "min" selects the best of these (the smallest compressed size) separately for `x`, `y` and the concatenation.
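
The calculation described above can be sketched in base R with `memCompress`. This is a minimal illustration, not the package's implementation: the helper names `compressed_size` and `ncd_sketch` are hypothetical, the formula used is the normalized compression distance of Cilibrasi and Vitányi (2005), and the text conversion via `as.character` is an assumption about how the series are serialized.

```r
# Hypothetical helper (not part of TSclust): compressed size in bytes of a series.
# With type = "min", the smallest size over the three methods is returned.
compressed_size <- function(s, type = c("gzip", "bzip2", "xz", "min")) {
  type <- match.arg(type)
  methods <- if (type == "min") c("gzip", "bzip2", "xz") else type
  txt <- as.character(s)  # text representation of the numeric series
  min(vapply(methods,
             function(m) length(memCompress(txt, type = m)),
             integer(1)))
}

# Hypothetical sketch of the dissimilarity: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
ncd_sketch <- function(x, y, type = c("gzip", "bzip2", "xz", "min")) {
  type <- match.arg(type)
  cx  <- compressed_size(x, type)
  cy  <- compressed_size(y, type)
  cxy <- compressed_size(c(x, y), type)  # concatenation in argument order
  (cxy - min(cx, cy)) / max(cx, cy)
}

set.seed(1)
x <- rnorm(50)
y <- cumsum(rnorm(50))
ncd_sketch(x, y, type = "gzip")
ncd_sketch(x, y, type = "min")
```

Note that with "min" each of the three sizes may come from a different compressor, which mirrors the "separately for `x`, `y` and the concatenation" wording above.
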
Since the compression methods are character-based, a symbolic representation can be used; see the Examples section for an example using SAX as the symbolic representation.
The series are transformed to a text representation prior to compression using `as.character`, so small numeric differences may produce significantly different text representations.
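
A small base R illustration of this effect, using made-up values: two series that are numerically almost identical can yield text of very different length, and a simple symbolization such as `round()` removes the difference.

```r
x  <- c(1, 2, 3)
x2 <- x + 1e-8   # numerically almost identical to x

as.character(x)   # short strings, one or two characters each
as.character(x2)  # much longer strings carrying the extra digits

# Total number of characters handed to the compressor in each case
sum(nchar(as.character(x)))
sum(nchar(as.character(x2)))

# Rounding first (a simple symbolization) makes the text identical again
identical(as.character(round(x2, 2)), as.character(x))
```
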
While this dissimilarity is asymptotically symmetric, for short series the differences between `diss.NCD(x,y)` and `diss.NCD(y,x)` may be noticeable.
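
In the normalized compression distance, the *C(x)* and *C(y)* terms do not depend on argument order, so any asymmetry has to come from the concatenation term. The sketch below compares the compressed size of both concatenation orders in base R; the helper `size` is hypothetical, and it assumes the concatenation is compressed in argument order. The two sizes may or may not differ for a given pair of series.

```r
set.seed(42)
x <- rnorm(20)
y <- cumsum(rnorm(20))

# Hypothetical helper: compressed size in bytes of the text representation
size <- function(s) length(memCompress(as.character(s), type = "gzip"))

# Compressed size of the two concatenation orders
c(xy = size(c(x, y)), yx = size(c(y, x)))
```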

### Value

The computed distance.

### Author(s)

Pablo Montero Manso, José Antonio Vilar.

### References

Cilibrasi, R., & Vitányi, P. M. (2005). Clustering by compression. *Information Theory, IEEE Transactions on*, **51(4)**, 1523-1545.

Keogh, E., Lonardi, S., & Ratanamahatana, C. A. (2004). Towards parameter-free data mining. *Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining* (pp. 206-215).

Montero, P., & Vilar, J. A. (2014). TSclust: An R Package for Time Series Clustering. *Journal of Statistical Software*, **62(1)**, 1-43. http://www.jstatsoft.org/v62/i01/.

### See Also

`memCompress`, `diss`

### Examples

```
library(TSclust)

n <- 50
# Generate sample series: white noise and a Wiener process
x <- rnorm(n)
y <- cumsum(rnorm(n))
diss.NCD(x, y)

z <- rnorm(n)
w <- cumsum(rnorm(n))
series <- rbind(x, y, z, w)
diss(series, "NCD", type = "bzip2")

################################################################
# Symbolic representation prior to compression, using SAX.
# A simpler symbolization, such as round(), could also be used.
################################################################
# Normalization function, required for SAX
z.normalize <- function(x) {
  (x - mean(x)) / sd(x)
}
sx <- convert.to.SAX.symbol(z.normalize(x), alpha = 4)
sy <- convert.to.SAX.symbol(z.normalize(y), alpha = 4)
sz <- convert.to.SAX.symbol(z.normalize(z), alpha = 4)
sw <- convert.to.SAX.symbol(z.normalize(w), alpha = 4)
diss(rbind(sx, sy, sz, sw), "NCD", type = "bzip2")
```