
Computes a metric between 0 and 1 of the amount of information lost about the underlying distribution of data for a given histogram.


`h` |
A |

`arrow.size.scale` |
specifies a size scaling factor for the arrow illustrating the point of Kolmogorov-Smirnov distance between the two e.c.d.fs |

`main` |
the main title for the plot |

`...` |
Any other arguments to pass to |

The `KSDCC` (Kolmogorov-Smirnov Distance of the Cumulative Curves)
function provides the Kolmogorov-Smirnov distance between the empirical
distribution functions of the smallest and largest datasets that could
be represented by the binned data in the provided histogram. This
quantity is also called the Maximum Displacement of the Cumulative
Curves (MDCC) in the computer science performance evaluation community (see
references).
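The construction described above can be sketched in base R. This is a minimal illustration, not the package's implementation: `ksdcc_sketch` is a hypothetical helper that takes bin boundaries and counts directly, builds the two extreme datasets, and compares their e.c.d.f.s at the bin boundaries.

```r
# Hypothetical helper for illustration; not part of the package's API.
ksdcc_sketch <- function(breaks, counts) {
  stopifnot(length(breaks) == length(counts) + 1, sum(counts) > 0)
  # Smallest dataset: every point sits on the left edge of its bin.
  x.min <- rep(head(breaks, -1), counts)
  # Largest dataset: every point sits on the right edge of its bin.
  x.max <- rep(tail(breaks, -1), counts)
  F.min <- ecdf(x.min)
  F.max <- ecdf(x.max)
  # Both e.c.d.f.s are step functions that only jump at bin boundaries,
  # so the largest vertical gap is attained at one of the breaks.
  max(abs(F.min(breaks) - F.max(breaks)))
}

# Three bins holding 2, 5 and 3 points respectively.
ksdcc_sketch(breaks = c(0, 1, 2, 4), counts = c(2, 5, 3))
```

Under this construction the gap between the two curves inside a bin is that bin's share of the data, so the result equals `max(counts) / sum(counts)` up to floating-point rounding.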

The `EMDCC` (Earth Mover's Distance of the Cumulative Curves)
function is like the Kolmogorov-Smirnov statistic, but uses an integral
to capture the difference across all points of the curve rather than
just the maximum difference. This is also known as Mallows distance, or
Wasserstein distance with $p=1$.
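The integral can be sketched the same way. Between two consecutive breaks the curves for the smallest and largest datasets differ by that bin's share of the data, so the area between them is a sum of rectangles. The helper `emdcc_sketch` below is hypothetical, and dividing by the total width of the histogram is an assumption made here so that the result falls between 0 and 1, in line with the description at the top of this page; the package's own normalization may differ.

```r
# Hypothetical helper for illustration; not part of the package's API.
emdcc_sketch <- function(breaks, counts) {
  stopifnot(length(breaks) == length(counts) + 1, sum(counts) > 0)
  widths <- diff(breaks)           # width of each bin
  gap <- counts / sum(counts)      # vertical gap between the curves in each bin
  area <- sum(widths * gap)        # area between the two cumulative curves
  # Assumed normalization: divide by the area of the bounding box
  # (total width times a height of 1).
  area / (max(breaks) - min(breaks))
}

# Three bins holding 2, 5 and 3 points respectively.
emdcc_sketch(breaks = c(0, 1, 2, 4), counts = c(2, 5, 3))
```

For this input the rectangles have areas 0.2, 0.5 and 0.6, which sum to 1.3 over a total width of 4.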

The `PlotKSDCC` and `PlotEMDCC` functions take a histogram and
generate a plot showing a geometric representation of the information
loss metrics for the provided histogram.

Murray Stokely mstokely@google.com

Douceur, John R., and William J. Bolosky. "A large-scale
study of file-system contents." *ACM SIGMETRICS Performance
Evaluation Review* **27.1** (1999): 59-70.

`histogramtools-package`, `ecdf`, and `hist`.

```
x <- rexp(1000)
h <- hist(x, breaks=c(0,1,2,3,4,8,16,32), plot=FALSE)
KSDCC(h)
# For small enough data sets we can construct the two extreme data sets
# that a histogram could represent: one assuming every data point is on
# the left boundary of its bucket, and another assuming every data point
# is on the right boundary of its bucket. Our KSDCC metric for
# histograms is equivalent to the ks.test statistic for these two
# extreme data sets.
x.min <- rep(head(h$breaks, -1), h$counts)
x.max <- rep(tail(h$breaks, -1), h$counts)
ks.test(x.min, x.max, exact=FALSE)
## Not run:
PlotKSDCC(h)
## End(Not run)
EMDCC(h)
## Not run:
PlotEMDCC(h)
## End(Not run)
```
