In rwoldford/treebinr: Data Reduction via Tree-based Binning

$\renewcommand{\tr}[1]{{#1}^{\mkern-1.5mu\mathsf{T}}}$ $\renewcommand{\ve}[1]{\mathbf{#1}}$ $\renewcommand{\m}[1]{\mathbf{#1}}$ $\renewcommand{\sv}[1]{\boldsymbol{#1}}$ $\renewcommand{\pop}[1]{\mathcal{#1}}$ $\renewcommand{\samp}[1]{\mathcal{#1}}$ $\renewcommand{\imply}{\Longrightarrow}$ $\renewcommand{\leftgiven}{~\left\lvert~}$ $\renewcommand{\given}{~\vert~}$ $\renewcommand{\suchthat}{~:~}$ $\renewcommand{\widebar}[1]{\overline{#1}}$ $\renewcommand{\wig}[1]{\tilde{#1}}$ $\renewcommand{\bigwig}[1]{\widetilde{#1}}$ $\renewcommand{\field}[1]{\mathbb{#1}}$ $\renewcommand{\Reals}{\field{R}}$ $\renewcommand{\abs}[1]{\left\lvert ~{#1} ~\right\rvert}$ $\renewcommand{\size}[1]{\left\lvert {#1} \right\rvert}$ $\renewcommand{\tr}[1]{{#1}^{\mkern-1.5mu\mathsf{T}}}$ $\renewcommand{\norm}[1]{\left|\left|{#1}\right|\right|}$ $\renewcommand{\intersect}{\cap}$ $\renewcommand{\union}{\cup}$ $\renewcommand{\suchthat}{~:~}$ $\renewcommand{\st}{~:~}$

library(treebinr)
library(ggplot2)

Introduction

This vignette is designed to introduce the user to the implementation of the GapBin methodology, which is described in detail in Rahman and Oldford (JCGS, 2018).

Typically, binning of data lacking a class label (i.e. unsupervised binning) is done via global binning techniques like Equal Width, Equal Frequency, or Hexagonal Binning. Of these, only equal frequency deals with the data in any way - both equal frequency and hexagonal binning simply tile over the range of the data. All of these methods also have difficulty scaling to higher dimensional space - hexagonal binning is not generalizable to arbitrary dimension, and tiling in high dimensions is incredibly costsly.

The GapBin methodology attempts to improve on these methods by working directly with the data. It attempts to find a preferable split by computing a score between each pair of points based on the distance between them. Splitting in this way can be viewed as a tree like structure, in much the same style as classification and regression trees, lending itself to implementation in the treebin style.

Methodology

Unlike existing binning techniques, such as equal width binning which divides the data by placing cut-points at equally spaced intervals along each dimension of the data, GapBin aims to find meaningful divisions in the data by computing a measure of dissimilarity between points. In doing so, unlike equal width, GapBin is able to reduce the size of a point configuration in both low dimensional spaces and high dimensional spaces while retaining much of the underlying structure in the configuration.

Measure

To find a meaningful division of the data, as the name implies, GapBin seeks to find the largest gap between points in a given dimension, subject to certain regularization to ensure density is preserved.

Define a data set of arbitrary size $n$, and arbitrary dimension $p$. Note that this is already an improvement over methods such as hexagonal binning, which restricts $p = 2$. Suppose this data set has already been subdivided into $m$ bins, labelled with the index $i$ to denote which bin each point belongs to. In bin $i$, containing $n_{i}$ points, define the vector $\ve{x}{i,k} = \ve{x}{i,k,(1)},...,\ve{x}{i,k,(n{i})}$ to be the sorted vector of point along dimension $k$.

The GapBin criteria seeks to find the largest gap in each bin, relative to some other bin attributes. Define the maximum gap over bin $i$, dimension $k$ as

$$ g_{i,k} = max_{j}(x_{i,k,(j+1)} - x_{i,k,(j)}) $$

\noindent where $j = 1,...,n_{i}-1$. Next, define the bin score for bin $i$ as

$$ s_{i} = max_{k}(\frac{g_{i,k}~n_{i}^{\tau}}{x_{\cdot,k,(n)} - x_{\cdot,k,(1)}}) $$

Note that the denominator represents the range of the entire dimension under consideration. Multiplying the bin score by the number of points prevents the algorithm from simply binning the outlying points individually, and dividing by the range of the dimension removes the effect of scale on the choice of bin. The default bin score uses a value of $tau = 1$, but this can be easily modified to change the effect of the bin count on the score.

This score is implemented for use in treebin in the gapMeasure function, which takes as arguments an object of class bin and is passed the dimension ranges in a $2 \times p$ matrix containing the minimum values of each dimension in row 1, and the maximum values in row 2, using the named argument $dimRange$ in the $inputs$ list. The $\tau$ argument is also passed through the $inputs$ list using the named argument $tau$. The gapMeasure function returns an $(n_{i}-1) ~ \times ~ p$ matrix containing all of the gap scores for bin $i$.

gapMeasure <- function(bin, inputs){

  if(nrow(bin@contents) == 1){
    return(matrix(NA, 1, ncol(bin@contents)))
  }

  #For Each Dimension, compute the gap statistic
  getGap <- function(binDim){

    sortedBin <- sort(binDim)
    axis <- sortedBin[-length(sortedBin)]
    axis2 <- sortedBin[-1]

    gaps <- abs(axis - axis2)

    return(gaps)
  }

  gaps <- apply(bin@contents, 2, getGap)

  #Scale the gaps by the number of points in the bin and by tau
  n <- nrow(bin@contents)
  tau <- inputs$tau

  gaps <- gaps * n^tau

  #Scale the gaps by the dimension range
  dimRange <- matrix(inputs$dimRange, nrow=(n-1), ncol=ncol(bin@contents), byrow=TRUE)
  gaps <- gaps/dimRange

  return(gaps)
}

Bin Selection

In the case of GapBin, given the measures computed in the previous section, choosing the bin to split is very easy. We seek the bin with the largest bin measure $s_{i}$, as defined in the previous section:

$$ Bin_{index} = argmax_{i}~~s_{i} $$

This is easily programmed in R for use in treebin in the function gapSelect, taking as argument bins, which is a list containing all current bin objects. While gapSelect does not require any addition input parameters passed through inputs, it must be there as an argument for the function to work within the general structure of treebin.

gapSelect <- function(bins, inputs){

  #Find the maximum gap in each bin
  findMax <- function(bin){
    max <- max(bin@measure)
    return(max)
  }

  gapMax <- sapply(bins, findMax)

  #Find the bin with the largest of all gaps (in the event of a tie, take the first bin)
  binIndex <- which.max(gapMax)

  return(binIndex)
}

Splitting a Bin

With the bin to be split chosen in the previous section, actually splitting the bin is relatively simple. GapBin chooses to split the bin along the dimension, say $k$, which produced the maximum bin score. Within this dimension, GapBin splits the dimension at the midpoint of two points responsible for the maximum bin score, $\frac{x_{i,k,(j+1)} + x_{i,k,(j+1)}}{2}$.

The split establishes two bins. The first bin contains the points $x_{i,k,(1)},...,x_{i,k,(j)}$ and the second contains $x_{i,k,(j+1)},...,x_{i,k,(n_{i})}$. What remains is to update the boundaries of the newly formed bins. For the first bin, the maximum value in the $k^{th}$ dimension changes, while in the second bin, the minimum value in the $k^{th}$ dimension changes. All other boundaries remain constant in both bins. The new boundary in each case is the midpoint between the separating points:

$$ $\frac{x_{i,k,(j+1)} + x_{i,k,(j+1)}}{2}$ $$

This process is implemented in the function gapSplit, which takes as input a bin object, which is the bin to be split, a user supplied function to compute the measure associated with the bin, and additional user supplied input values (again, in this case, there are no additional inputs necessary for this function, the function must take an additional inputs argument to be compatible with the general treebin function). The gapSplit functions returns two bin objects containing the newly formed bins.

gapSplit <- function(bin, binMeasure, inputs){

  #Extract the bin information
  currentMeasure <- bin@measure
  currentBin <- bin@contents
  currentBoundary <- bin@info$binRange

  #Find the maximum measure in each dimension, and take the overall maximum as the dimension to be split
  whichAxis <- which.max(apply(currentMeasure,2,max))

  #sort the bin along the chosen axis
  sortedBin <- currentBin[order(currentBin[,whichAxis]),]

  #Split the sorted bin at the maximum gap
  gapIndex <- which.max(currentMeasure[,whichAxis])

  leftContents <- sortedBin[seq(1, gapIndex, 1),,drop=FALSE]
  rightContents <- sortedBin[seq(gapIndex + 1, nrow(currentBin), 1),,drop=FALSE]

  #Define the new bin boundaries
  newBoundary <- (max(leftContents[,whichAxis]) + min(rightContents[,whichAxis]))/2   #Split at the midpoint of the maximum gap

  leftBoundary <- currentBoundary
  leftBoundary[2,whichAxis] <- newBoundary #The boundary of the left leaf is the same as the parent, except a new maximum along the split dimension

  rightBoundary <- currentBoundary
  rightBoundary[1,whichAxis] <- newBoundary  #The boundary of the right leaf is the same as the parent, except a new minimum along the split dimension

  #Put together the new bins in a binfo object
  leftBin <- bin(boundary = bin@boundary, contents = leftContents, measure = NULL, info = list(binRange = leftBoundary))
  rightBin <- bin(boundary = bin@boundary, contents = rightContents, measure = NULL, info = list(binRange = rightBoundary))

  #Compute the new bin measures
  leftBin@measure <- binMeasure(leftBin, inputs)
  rightBin@measure <- binMeasure(rightBin, inputs)

  #Return the new bins in a list
  return(list(leftBin, rightBin))

}

Testing Bin Inclusion

Given an out-of-sample point, the user must supply a function that returns TRUE if the point belongs to the given bin, and FALSE otherwise. The function must take three input parameters, the location of the point to be added, the list of all bin objects making up the configuration, and an additional inputs list containing any additional information the user requires.

Since the bins in GapBin end up being rectangular, membership in any given bin can be easily determined by examining the boundaries of the bin, which are stored in the info slot in each bin object (recall that these boundaries are updated in gapSplit for instance, and we will see the initial boundaries are user supplied).

gapBoundaryTest <- function(point, bin, inputs){

  point <- matrix(point,nrow=1)
  boundary <- bin@info$binRange
  indicator <- c()

  for(i in 1:ncol(point)){
    indicator <- c(indicator, (point[i] < boundary[2,i] & point[i] > boundary[1,i]))
  }

  if(all(indicator)){
    return(TRUE)
  }else{
    return(FALSE)
  }
}

Creating a Point

Once the data has been reduced to the desired number of bins, the final step is to turn the contents of each bin into a single representative point. For GapBin the representative point is the by-dimension mean of every original point present in the bin. This is implemented simply in R in the function gapPoint, which takes as input a bin object and additional user-supplied input parameters, and returns a representative point for that bin.

gapPoints <- function(bin, inputs){
  point <- apply(bin@contents, 2, mean)
  return(point)
}

Updating a Bin

For functions prune, trim, and trimTo, we need to define a function to create a single bin object from several bins (to be combined). For GapBin, that means updating the boundary of the bins (by finding the respective maximum and minimum bin boundaries from those bins being combined), as well as computing the measure of the newly formed bin.

gapUpdate <- function(bins, binMeasure, inputs){

  #Using pruned nodes, update information
  newContents <- c()
  newLowerBounds <- c()
  newUpperBounds <- c()

  for(i in 1:length(bins)){
    newContents <- rbind(newContents, bins[[i]]@contents)
    newLowerBounds <- rbind(newLowerBounds, bins[[i]]@info$binRange[1,])
    newUpperBounds <- rbind(newUpperBounds, bins[[i]]@info$binRange[2,])
  }

  newBoundary <- rbind(apply(newLowerBounds, 2, min), apply(newUpperBounds, 2, max))

  newBin <- bin(boundary = bins[[1]]@boundary, 
                measure=NULL, 
                contents = newContents, 
                index = inputs$node, 
                info=list(binRange = newBoundary))

  newBin@measure <- binMeasure(newBin, inputs)

  return(newBin)
}

Example

To demonstrate the effectiveness of the GapBin algorithm, and to introduce some additional functionality in t he treebinr package, we consider the simple task of reducing the size of a bivariate normal point configuration from 100 points to 50 in two dimensions.

set.seed(123)
data.norm <- data.frame(x=rnorm(100), y=rnorm(100))

ggplot(data.norm, aes(x=x, y=y)) + 
  geom_point(alpha=.5) +
  xlim(c(min(data.norm[,1]), max(data.norm[,1]))) +
  ylim(c(min(data.norm[,2]), max(data.norm[,2]))) +
  theme_bw()

Initial Binning

To bin the data, we make use of the treebin function, which has the following form

treebin <- function(X, numbins, binMeasure, boundaryTest, selectBin, splitBin, makePoint, binInfo, inputs){
 ... 
}

The inputs for binMeasure, boundaryTest, selectBin, splitBin, and makePoint correspond to the functions described previously - gapMeasure, gapBoundaryTest, gapSelect, gapSplit, and gapPoints respectively. The input parameters X and numbins are the data to be binned (given in a matrix), and the desired number of points for the reduced configuration.

The final input parameters binInfo and inputs are optional and deal with additional user supplied information required during the binning process. The binInfo input parameter is used in the info slot of the original bin object created during the binning process. For GapBin, the range of each bin is stored in the info slot, so the binInfo input is simply a $2\times p$ matrix containing the maximum and minimum values of the original, unbinned data in each dimension. The inputs argument is a list containing any additional information required during the binning process (i.e. it contains additional input for functions such as binMeasure, splitBin, etc.). GapBin requires an input parameter tau, corresponding the the $\tau$ used in computing the GapBin criteria, and also the range of each dimension.

Then, the data above can be binned using treebin as follows

set.seed(1337)
X <- matrix(rnorm(2000),ncol=2)

#Bin to 500 Points
numbins <- 500
nCols <- ncol(X)
binMeasure <- gapMeasure
selectBin <- gapSelect
splitBin <- gapSplit
boundaryTest <- gapBoundaryTest
makePoint <- gapPoints
dimRange <- sapply(1:nCols, FUN = function(j) {diff(range(X[,j]))})
inputs <- list(dimRange = dimRange, tau=1)
binInfo <- list(binRange = sapply(1:nCols, FUN = function(j) {range(X[,j])}))

out <- treebin(X, numbins, binMeasure, boundaryTest, selectBin, splitBin, makePoint, binInfo, inputs)

The resulting configuration has 500 points, which we can plot easily

ggplot(as.data.frame(out@points), aes(x=x, y=y)) + 
  geom_point(alpha=.5) +
  xlim(c(min(data.norm[,1]), max(data.norm[,1]))) +
  ylim(c(min(data.norm[,2]), max(data.norm[,2]))) +
  theme_bw()

Using the stored boundary of each bin, the separating lines of each bin can be drawn.

points <- out@points
counts <- out@counts
bins <- out@bins
numbins <- length(bins)

#Create data frame for ggplot
xstart <- numeric(numbins)
ystart <- numeric(numbins)
xend <- numeric(numbins)
yend <- numeric(numbins)

#Extract the bin boundary information stored in each bin
for(i in 1:numbins){
  binBoundary <- bins[[i]]@info$binRange

  xstart[i] <- binBoundary[1,1]
  ystart[i] <- binBoundary[1,2]
  xend[i] <- binBoundary[2,1]
  yend[i] <- binBoundary[2,2]
}

ggplot() + 
  geom_rect(mapping=aes(xmin=xstart, xmax=xend, ymin=ystart, ymax=yend), fill="white", color="black", alpha=0.5) +
  geom_point(data=as.data.frame(out@points), mapping = aes(x=x, y=y), alpha=.5) +
  xlim(c(min(data.norm[,1]), max(data.norm[,1]))) +
  ylim(c(min(data.norm[,2]), max(data.norm[,2]))) +
  theme_bw()