merge_hist: Merge histogram bins
In paulemms/datamining: Data mining


Description

Quantize a variable by merging similar histogram bins.

Usage

merge_hist(x, b = NULL, n = b, trace = T)

Arguments

`x`	a numeric vector.
`b`	the starting number of bins, or a vector of starting break locations. If NULL, chosen automatically by `hist`.
`n`	the desired number of bins.
`trace`	logical; appears to control whether the merging trace is plotted (see Value).

Details

The desired number of bins is reached by repeatedly merging the two most similar histogram bins. The distance between two bins with heights f1, f2 and widths w1, w2 is measured by the chi-square statistic

w1*(f1-f)^2/f + w2*(f2-f)^2/f

where f is the height of the merged bin:

f = (f1*w1 + f2*w2)/(w1 + w2)
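The distance and the merging loop described above can be sketched in base R. This is only an illustration, not the package's implementation: the names `bin_distance` and `greedy_merge` are hypothetical, the sketch assumes that only adjacent bins are merged, that bin heights are counts per unit width (which makes the statistic the usual Pearson form, sum of (O-E)^2/E over the two bins), and that two empty bins have distance 0.

```r
# Chi-square distance between two bins with heights f1, f2 and widths w1, w2.
bin_distance <- function(f1, w1, f2, w2) {
  f <- (f1 * w1 + f2 * w2) / (w1 + w2)  # height of the merged bin
  if (f == 0) return(0)                 # assumption: two empty bins are identical
  w1 * (f1 - f)^2 / f + w2 * (f2 - f)^2 / f
}

# Hypothetical helper: merge the closest pair of adjacent bins until n remain.
# Returns the final breaks and the distance recorded at each merge.
greedy_merge <- function(x, breaks, n) {
  counts <- hist(x, breaks = breaks, plot = FALSE)$counts
  trace <- numeric(0)
  while (length(counts) > n) {
    w <- diff(breaks)   # current bin widths
    f <- counts / w     # heights as counts per unit width (assumption)
    d <- vapply(seq_len(length(counts) - 1),
                function(i) bin_distance(f[i], w[i], f[i + 1], w[i + 1]),
                numeric(1))
    i <- which.min(d)   # most similar adjacent pair
    trace <- c(trace, d[i])
    counts[i] <- counts[i] + counts[i + 1]  # merged bin keeps both counts
    counts <- counts[-(i + 1)]
    breaks <- breaks[-(i + 1)]              # drop the break between the two bins
  }
  list(breaks = breaks, trace = trace)
}

# Small usage example with values inside the range of the breaks.
res <- greedy_merge(c(0.1, 0.2, 0.15, 0.9, 0.95), seq(0, 1, by = 0.25), n = 2)
res$breaks
res$trace
```

In this toy run the two empty middle bins are merged first, since their distance is zero; the recorded distances play the role of the merging trace.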

Value

A vector of bin breaks, suitable for use in hist, bhist, or cut. Two plots are shown: a bhist using the returned bin breaks, and a merging trace. The trace shows, for each merge, the chi-square distance between the bins that were merged. This is useful for choosing the number of bins: an interesting number of bins is one that directly precedes a sudden jump in the chi-square distance.
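As a rough illustration of how a vector of breaks can be passed to `cut` and `hist`, the sketch below uses base R only. The `breaks` vector is a hypothetical stand-in for a value returned by merge_hist, not actual output, and the seed is fixed only to make the sketch reproducible.

```r
set.seed(1)  # fixed seed so the sketch is reproducible
x <- c(rnorm(100, -2, 0.5), rnorm(100, 2, 0.5))

# Hypothetical breaks standing in for a merge_hist() result;
# the outer breaks are widened to cover the whole range of x.
breaks <- c(floor(min(x)), -1, 1, ceiling(max(x)))

# Quantize x: each value is mapped to the bin (factor level) it falls in.
q <- cut(x, breaks = breaks, include.lowest = TRUE)
table(q)

# The same breaks define a histogram with unequal bin widths.
h <- hist(x, breaks = breaks, plot = FALSE)
h$counts
```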

Author(s)

Tom Minka

Examples


# Two well-separated Gaussian clusters
x <- c(rnorm(100, -2, 0.5), rnorm(100, 2, 0.5))
b <- seq(-4, 4, by = 0.25)
merge_hist(x, b, 10)
# According to the merging trace, n=5 and n=11 are most interesting.

# Uniform data
x <- runif(1000)
b <- seq(0, 1, by = 0.05)
merge_hist(x, b, 10)
# According to the merging trace, n=6 and n=9 are most interesting.
# Because the data is uniform, there should be only one bin,
# but chance deviations in density prevent this.
# A multiple-comparisons correction in merge_hist might fix this.

paulemms/datamining documentation built on March 1, 2023, 4:01 p.m.