$\renewcommand{\tr}[1]{{#1}^{\mkern-1.5mu\mathsf{T}}}$ $\renewcommand{\ve}[1]{\mathbf{#1}}$ $\renewcommand{\m}[1]{\mathbf{#1}}$ $\renewcommand{\sv}[1]{\boldsymbol{#1}}$ $\renewcommand{\pop}[1]{\mathcal{#1}}$ $\renewcommand{\samp}[1]{\mathcal{#1}}$ $\renewcommand{\imply}{\Longrightarrow}$ $\renewcommand{\leftgiven}{~\left\lvert~}$ $\renewcommand{\given}{~\vert~}$ $\renewcommand{\suchthat}{~:~}$ $\renewcommand{\widebar}[1]{\overline{#1}}$ $\renewcommand{\wig}[1]{\tilde{#1}}$ $\renewcommand{\bigwig}[1]{\widetilde{#1}}$ $\renewcommand{\field}[1]{\mathbb{#1}}$ $\renewcommand{\Reals}{\field{R}}$ $\renewcommand{\abs}[1]{\left\lvert ~{#1} ~\right\rvert}$ $\renewcommand{\size}[1]{\left\lvert {#1} \right\rvert}$ $\renewcommand{\tr}[1]{{#1}^{\mkern-1.5mu\mathsf{T}}}$ $\renewcommand{\norm}[1]{\left|\left|{#1}\right|\right|}$ $\renewcommand{\intersect}{\cap}$ $\renewcommand{\union}{\cup}$ $\renewcommand{\suchthat}{~:~}$ $\renewcommand{\st}{~:~}$
library(treebinr) library(ggplot2)
This vignette is designed to introduce the user to the implementation of the GapBin
methodology, which is described in detail in Rahman and Oldford (JCGS, 2018).
Typically, binning of data lacking a class label (i.e. unsupervised binning) is done via global binning techniques like Equal Width, Equal Frequency, or Hexagonal Binning. Of these, only equal frequency deals with the data in any way - both equal frequency and hexagonal binning simply tile over the range of the data. All of these methods also have difficulty scaling to higher dimensional space - hexagonal binning is not generalizable to arbitrary dimension, and tiling in high dimensions is incredibly costsly.
The GapBin
methodology attempts to improve on these methods by working directly with the data. It attempts to find a preferable split by computing a score between each pair of points based on the distance between them. Splitting in this way can be viewed as a tree like structure, in much the same style as classification and regression trees, lending itself to implementation in the treebin
style.
Unlike existing binning techniques, such as equal width binning which divides the data by placing cut-points at equally spaced intervals along each dimension of the data, GapBin
aims to find meaningful divisions in the data by computing a measure of dissimilarity between points. In doing so, unlike equal width, GapBin
is able to reduce the size of a point configuration in both low dimensional spaces and high dimensional spaces while retaining much of the underlying structure in the configuration.
To find a meaningful division of the data, as the name implies, GapBin
seeks to find the largest gap between points in a given dimension, subject to certain regularization to ensure density is preserved.
Define a data set of arbitrary size $n$, and arbitrary dimension $p$. Note that this is already an improvement over methods such as hexagonal binning, which restricts $p = 2$. Suppose this data set has already been subdivided into $m$ bins, labelled with the index $i$ to denote which bin each point belongs to. In bin $i$, containing $n_{i}$ points, define the vector $\ve{x}{i,k} = \ve{x}{i,k,(1)},...,\ve{x}{i,k,(n{i})}$ to be the sorted vector of point along dimension $k$.
The GapBin
criteria seeks to find the largest gap in each bin, relative to some other bin attributes. Define the maximum gap over bin $i$, dimension $k$ as
$$ g_{i,k} = max_{j}(x_{i,k,(j+1)} - x_{i,k,(j)}) $$
\noindent where $j = 1,...,n_{i}-1$. Next, define the bin score for bin $i$ as
$$ s_{i} = max_{k}(\frac{g_{i,k}~n_{i}^{\tau}}{x_{\cdot,k,(n)} - x_{\cdot,k,(1)}}) $$
Note that the denominator represents the range of the entire
dimension under consideration. Multiplying the bin score by the number of points prevents the algorithm from simply binning the outlying points individually, and dividing by the range of the dimension removes the effect of scale on the choice of bin. The default bin score uses a value of $tau = 1$, but this can be easily modified to change the effect of the bin count on the score.
This score is implemented for use in treebin
in the gapMeasure
function, which takes as arguments an object of class bin
and is passed the dimension ranges in a $2 \times p$ matrix containing the minimum values of each dimension in row 1, and the maximum values in row 2, using the named argument $dimRange$ in the $inputs$ list. The $\tau$ argument is also passed through the $inputs$ list using the named argument $tau$. The gapMeasure
function returns an $(n_{i}-1) ~ \times ~ p$ matrix containing all of the gap scores for bin $i$.
gapMeasure <- function(bin, inputs){ if(nrow(bin@contents) == 1){ return(matrix(NA, 1, ncol(bin@contents))) } #For Each Dimension, compute the gap statistic getGap <- function(binDim){ sortedBin <- sort(binDim) axis <- sortedBin[-length(sortedBin)] axis2 <- sortedBin[-1] gaps <- abs(axis - axis2) return(gaps) } gaps <- apply(bin@contents, 2, getGap) #Scale the gaps by the number of points in the bin and by tau n <- nrow(bin@contents) tau <- inputs$tau gaps <- gaps * n^tau #Scale the gaps by the dimension range dimRange <- matrix(inputs$dimRange, nrow=(n-1), ncol=ncol(bin@contents), byrow=TRUE) gaps <- gaps/dimRange return(gaps) }
In the case of GapBin
, given the measures computed in the previous section, choosing the bin to split is very easy. We seek the bin with the largest bin measure $s_{i}$, as defined in the previous section:
$$ Bin_{index} = argmax_{i}~~s_{i} $$
This is easily programmed in R
for use in treebin
in the function gapSelect
, taking as argument bins
, which is a list containing all current bin objects. While gapSelect
does not require any addition input parameters passed through inputs
, it must be there as an argument for the function to work within the general structure of treebin
.
gapSelect <- function(bins, inputs){ #Find the maximum gap in each bin findMax <- function(bin){ max <- max(bin@measure) return(max) } gapMax <- sapply(bins, findMax) #Find the bin with the largest of all gaps (in the event of a tie, take the first bin) binIndex <- which.max(gapMax) return(binIndex) }
With the bin to be split chosen in the previous section, actually splitting the bin is relatively simple. GapBin
chooses to split the bin along the dimension, say $k$, which produced the maximum bin score. Within this dimension, GapBin
splits the dimension at the midpoint of two points responsible for the maximum bin score, $\frac{x_{i,k,(j+1)} + x_{i,k,(j+1)}}{2}$.
The split establishes two bins. The first bin contains the points $x_{i,k,(1)},...,x_{i,k,(j)}$ and the second contains $x_{i,k,(j+1)},...,x_{i,k,(n_{i})}$. What remains is to update the boundaries of the newly formed bins. For the first bin, the maximum value in the $k^{th}$ dimension changes, while in the second bin, the minimum value in the $k^{th}$ dimension changes. All other boundaries remain constant in both bins. The new boundary in each case is the midpoint between the separating points:
$$ $\frac{x_{i,k,(j+1)} + x_{i,k,(j+1)}}{2}$ $$
This process is implemented in the function gapSplit
, which takes as input a bin object, which is the bin to be split, a user supplied function to compute the measure associated with the bin, and additional user supplied input values (again, in this case, there are no additional inputs necessary for this function, the function must take an additional inputs argument to be compatible with the general treebin
function). The gapSplit
functions returns two bin objects containing the newly formed bins.
gapSplit <- function(bin, binMeasure, inputs){ #Extract the bin information currentMeasure <- bin@measure currentBin <- bin@contents currentBoundary <- bin@info$binRange #Find the maximum measure in each dimension, and take the overall maximum as the dimension to be split whichAxis <- which.max(apply(currentMeasure,2,max)) #sort the bin along the chosen axis sortedBin <- currentBin[order(currentBin[,whichAxis]),] #Split the sorted bin at the maximum gap gapIndex <- which.max(currentMeasure[,whichAxis]) leftContents <- sortedBin[seq(1, gapIndex, 1),,drop=FALSE] rightContents <- sortedBin[seq(gapIndex + 1, nrow(currentBin), 1),,drop=FALSE] #Define the new bin boundaries newBoundary <- (max(leftContents[,whichAxis]) + min(rightContents[,whichAxis]))/2 #Split at the midpoint of the maximum gap leftBoundary <- currentBoundary leftBoundary[2,whichAxis] <- newBoundary #The boundary of the left leaf is the same as the parent, except a new maximum along the split dimension rightBoundary <- currentBoundary rightBoundary[1,whichAxis] <- newBoundary #The boundary of the right leaf is the same as the parent, except a new minimum along the split dimension #Put together the new bins in a binfo object leftBin <- bin(boundary = bin@boundary, contents = leftContents, measure = NULL, info = list(binRange = leftBoundary)) rightBin <- bin(boundary = bin@boundary, contents = rightContents, measure = NULL, info = list(binRange = rightBoundary)) #Compute the new bin measures leftBin@measure <- binMeasure(leftBin, inputs) rightBin@measure <- binMeasure(rightBin, inputs) #Return the new bins in a list return(list(leftBin, rightBin)) }
Given an out-of-sample point, the user must supply a function that returns TRUE
if the point belongs to the given bin, and FALSE
otherwise. The function must take three input parameters, the location of the point to be added, the list of all bin objects making up the configuration, and an additional inputs list containing any additional information the user requires.
Since the bins in GapBin
end up being rectangular, membership in any given bin can be easily determined by examining the boundaries of the bin, which are stored in the info
slot in each bin object (recall that these boundaries are updated in gapSplit
for instance, and we will see the initial boundaries are user supplied).
gapBoundaryTest <- function(point, bin, inputs){ point <- matrix(point,nrow=1) boundary <- bin@info$binRange indicator <- c() for(i in 1:ncol(point)){ indicator <- c(indicator, (point[i] < boundary[2,i] & point[i] > boundary[1,i])) } if(all(indicator)){ return(TRUE) }else{ return(FALSE) } }
Once the data has been reduced to the desired number of bins, the final step is to turn the contents of each bin into a single representative point. For GapBin
the representative point is the by-dimension mean of every original point present in the bin. This is implemented simply in R
in the function gapPoint
, which takes as input a bin object and additional user-supplied input parameters, and returns a representative point for that bin.
gapPoints <- function(bin, inputs){ point <- apply(bin@contents, 2, mean) return(point) }
For functions prune
, trim
, and trimTo
, we need to define a function to create a single bin
object from several bins (to be combined). For GapBin
, that means updating the boundary of the bins (by finding the respective maximum and minimum bin boundaries from those bins being combined), as well as computing the measure of the newly formed bin.
gapUpdate <- function(bins, binMeasure, inputs){ #Using pruned nodes, update information newContents <- c() newLowerBounds <- c() newUpperBounds <- c() for(i in 1:length(bins)){ newContents <- rbind(newContents, bins[[i]]@contents) newLowerBounds <- rbind(newLowerBounds, bins[[i]]@info$binRange[1,]) newUpperBounds <- rbind(newUpperBounds, bins[[i]]@info$binRange[2,]) } newBoundary <- rbind(apply(newLowerBounds, 2, min), apply(newUpperBounds, 2, max)) newBin <- bin(boundary = bins[[1]]@boundary, measure=NULL, contents = newContents, index = inputs$node, info=list(binRange = newBoundary)) newBin@measure <- binMeasure(newBin, inputs) return(newBin) }
To demonstrate the effectiveness of the GapBin
algorithm, and to introduce some additional functionality in t he treebinr
package, we consider the simple task of reducing the size of a bivariate normal point configuration from 100 points to 50 in two dimensions.
set.seed(123) data.norm <- data.frame(x=rnorm(100), y=rnorm(100)) ggplot(data.norm, aes(x=x, y=y)) + geom_point(alpha=.5) + xlim(c(min(data.norm[,1]), max(data.norm[,1]))) + ylim(c(min(data.norm[,2]), max(data.norm[,2]))) + theme_bw()
To bin the data, we make use of the treebin
function, which has the following form
treebin <- function(X, numbins, binMeasure, boundaryTest, selectBin, splitBin, makePoint, binInfo, inputs){ ... }
The inputs for binMeasure
, boundaryTest
, selectBin
, splitBin
, and makePoint
correspond to the functions described previously - gapMeasure
, gapBoundaryTest
, gapSelect
, gapSplit
, and gapPoints
respectively. The input parameters X
and numbins
are the data to be binned (given in a matrix), and the desired number of points for the reduced configuration.
The final input parameters binInfo
and inputs
are optional and deal with additional user supplied information required during the binning process. The binInfo
input parameter is used in the info slot of the original bin
object created during the binning process. For GapBin
, the range of each bin is stored in the info slot, so the binInfo
input is simply a $2\times p$ matrix containing the maximum and minimum values of the original, unbinned data in each dimension. The inputs
argument is a list containing any additional information required during the binning process (i.e. it contains additional input for functions such as binMeasure
, splitBin
, etc.). GapBin
requires an input parameter tau
, corresponding the the $\tau$ used in computing the GapBin
criteria, and also the range of each dimension.
Then, the data above can be binned using treebin
as follows
set.seed(1337) X <- matrix(rnorm(2000),ncol=2) #Bin to 500 Points numbins <- 500 nCols <- ncol(X) binMeasure <- gapMeasure selectBin <- gapSelect splitBin <- gapSplit boundaryTest <- gapBoundaryTest makePoint <- gapPoints dimRange <- sapply(1:nCols, FUN = function(j) {diff(range(X[,j]))}) inputs <- list(dimRange = dimRange, tau=1) binInfo <- list(binRange = sapply(1:nCols, FUN = function(j) {range(X[,j])})) out <- treebin(X, numbins, binMeasure, boundaryTest, selectBin, splitBin, makePoint, binInfo, inputs)
The resulting configuration has 500 points, which we can plot easily
ggplot(as.data.frame(out@points), aes(x=x, y=y)) + geom_point(alpha=.5) + xlim(c(min(data.norm[,1]), max(data.norm[,1]))) + ylim(c(min(data.norm[,2]), max(data.norm[,2]))) + theme_bw()
Using the stored boundary of each bin, the separating lines of each bin can be drawn.
points <- out@points counts <- out@counts bins <- out@bins numbins <- length(bins) #Create data frame for ggplot xstart <- numeric(numbins) ystart <- numeric(numbins) xend <- numeric(numbins) yend <- numeric(numbins) #Extract the bin boundary information stored in each bin for(i in 1:numbins){ binBoundary <- bins[[i]]@info$binRange xstart[i] <- binBoundary[1,1] ystart[i] <- binBoundary[1,2] xend[i] <- binBoundary[2,1] yend[i] <- binBoundary[2,2] } ggplot() + geom_rect(mapping=aes(xmin=xstart, xmax=xend, ymin=ystart, ymax=yend), fill="white", color="black", alpha=0.5) + geom_point(data=as.data.frame(out@points), mapping = aes(x=x, y=y), alpha=.5) + xlim(c(min(data.norm[,1]), max(data.norm[,1]))) + ylim(c(min(data.norm[,2]), max(data.norm[,2]))) + theme_bw()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.