informationloss: Information Loss Metrics for Histograms
In HistogramTools: Utility Functions for R Histograms


Computes a metric between 0 and 1 that measures how much information about the underlying distribution of the data is lost by binning it into a given histogram.

KSDCC(h)
EMDCC(h)
PlotKSDCC(h, arrow.size.scale=1, main=paste("KSDCC =", KSDCC(h)), ...)
PlotEMDCC(h, main=paste("EMDCC =", EMDCC(h)), ...)

`h`	A `"histogram"` object (created by `hist`) representing a pre-binned dataset on which we'd like to calculate the information loss due to binning.
`arrow.size.scale`	A scaling factor for the size of the arrow that marks the point of maximum Kolmogorov-Smirnov distance between the two e.c.d.f.s.
`main`	The main title of the plot. By default, a string reporting the computed metric, e.g. `paste("KSDCC =", KSDCC(h))`.
`...`	Any other arguments to pass to `plot`

The KSDCC (Kolmogorov-Smirnov Distance of the Cumulative Curves) function provides the Kolmogorov-Smirnov distance between the empirical distribution functions of the smallest and largest datasets that could be represented by the binned data in the provided histogram. This quantity is also called the Maximum Displacement of the Cumulative Curves (MDCC) in the computer science performance evaluation community (see references).
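As a rough illustration of this definition, the distance can be reproduced in base R by building the two extreme data sets and comparing their e.c.d.f.s at the bin boundaries. This is a sketch only, not the package's implementation; the helper name `ks_of_cumulative_curves` is hypothetical.

```r
# Hypothetical helper (not part of HistogramTools): a base-R sketch of the
# Kolmogorov-Smirnov distance between the two extreme cumulative curves.
ks_of_cumulative_curves <- function(h) {
  stopifnot(inherits(h, "histogram"))
  # Smallest data set: every point on the left boundary of its bucket.
  x.min <- rep(head(h$breaks, -1), h$counts)
  # Largest data set: every point on the right boundary of its bucket.
  x.max <- rep(tail(h$breaks, -1), h$counts)
  # Both e.c.d.f.s are step functions that only jump at bin boundaries,
  # so the largest gap between them is attained at one of the breaks.
  max(ecdf(x.min)(h$breaks) - ecdf(x.max)(h$breaks))
}

# A small deterministic data set: bucket counts are 2, 5 and 3.
x <- c(0.5, 0.5, 1.5, 1.5, 1.5, 1.5, 1.5, 3, 3, 3)
h <- hist(x, breaks = c(0, 1, 2, 4), plot = FALSE)
ks_of_cumulative_curves(h)
```

In this sketch the result reduces to the largest bucket's share of the observations, `max(h$counts) / sum(h$counts)`, because within each bucket the two curves differ by exactly that bucket's relative frequency.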

The EMDCC (Earth Mover's Distance of the Cumulative Curves) function is similar to KSDCC, but integrates the difference between the two cumulative curves over their whole range rather than taking only the maximum difference. This is also known as Mallows distance, or Wasserstein distance with $p=1$.
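The integral can be sketched in base R as a sum of rectangle areas, one per bucket. This is an illustrative sketch rather than the package's implementation: the helper name `emd_of_cumulative_curves` is hypothetical, and dividing by the total range of the breaks is an assumption made here so that the result falls between 0 and 1.

```r
# Hypothetical helper (not part of HistogramTools): a base-R sketch of the
# area between the two extreme cumulative curves.
emd_of_cumulative_curves <- function(h) {
  stopifnot(inherits(h, "histogram"))
  widths  <- diff(h$breaks)             # width of each bucket
  weights <- h$counts / sum(h$counts)   # relative frequency of each bucket
  # Within a bucket the two curves differ by that bucket's relative
  # frequency, so the integral of the gap is a sum of rectangle areas.
  area <- sum(widths * weights)
  # Assumption: normalize by the total range so the result lies in [0, 1].
  area / (max(h$breaks) - min(h$breaks))
}

# A small deterministic data set: bucket counts are 2, 5 and 3.
x <- c(0.5, 0.5, 1.5, 1.5, 1.5, 1.5, 1.5, 3, 3, 3)
h <- hist(x, breaks = c(0, 1, 2, 4), plot = FALSE)
emd_of_cumulative_curves(h)
```

Before normalization, the area equals the difference between the means of the two extreme data sets (every point on the right boundary versus every point on the left boundary of its bucket), which gives a quick sanity check on the computation.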

The PlotKSDCC and PlotEMDCC functions take a histogram and generate a plot showing a geometric representation of the information loss metrics for the provided histogram.

Murray Stokely mstokely@google.com

Douceur, John R., and William J. Bolosky. "A large-scale study of file-system contents." ACM SIGMETRICS Performance Evaluation Review 27.1 (1999): 59-70.

histogramtools-package, ecdf, and hist.

x <- rexp(1000)
h <- hist(x, breaks=c(0,1,2,3,4,8,16,32), plot=FALSE)
KSDCC(h)

# For small enough data sets we can construct the two extreme data sets
# that are consistent with a histogram: one assuming every data point
# is on the left boundary of its bucket, and another assuming every data
# point is on the right boundary of its bucket.  Our KSDCC metric for
# histograms is equivalent to the ks.test statistic for these two
# extreme data sets.

x.min <- rep(head(h$breaks, -1), h$counts)
x.max <- rep(tail(h$breaks, -1), h$counts)
ks.test(x.min, x.max, exact=FALSE)

## Not run: 
PlotKSDCC(h)

## End(Not run)

EMDCC(h)
## Not run: 
PlotEMDCC(h)

## End(Not run)