In rwoldford/treebinr: Data Reduction via Tree-based Binning

$\renewcommand{\tr}[1]{{#1}^{\mkern-1.5mu\mathsf{T}}}$ $\renewcommand{\ve}[1]{\mathbf{#1}}$ $\renewcommand{\m}[1]{\mathbf{#1}}$ $\renewcommand{\sv}[1]{\boldsymbol{#1}}$ $\renewcommand{\pop}[1]{\mathcal{#1}}$ $\renewcommand{\samp}[1]{\mathcal{#1}}$ $\renewcommand{\imply}{\Longrightarrow}$ $\renewcommand{\leftgiven}{~\left\lvert~}$ $\renewcommand{\given}{~\vert~}$ $\renewcommand{\suchthat}{~:~}$ $\renewcommand{\widebar}[1]{\overline{#1}}$ $\renewcommand{\wig}[1]{\tilde{#1}}$ $\renewcommand{\bigwig}[1]{\widetilde{#1}}$ $\renewcommand{\field}[1]{\mathbb{#1}}$ $\renewcommand{\Reals}{\field{R}}$ $\renewcommand{\abs}[1]{\left\lvert ~{#1} ~\right\rvert}$ $\renewcommand{\size}[1]{\left\lvert {#1} \right\rvert}$ $\renewcommand{\tr}[1]{{#1}^{\mkern-1.5mu\mathsf{T}}}$ $\renewcommand{\norm}[1]{\left|\left|{#1}\right|\right|}$ $\renewcommand{\intersect}{\cap}$ $\renewcommand{\union}{\cup}$ $\renewcommand{\suchthat}{~:~}$ $\renewcommand{\st}{~:~}$

library(treebinr)
library(ggplot2)
library(grid)
library(gridExtra)

Introduction

In this vignette, we introduce the basic functionality of the treebinr package, by using the methodology of GapBin which is the default functionality. Technical details can be found in the accompanying GapBin vignette, and the accompanying JSS vignette.

Object Classes

The treebin package works with two object classes, bin and treebinr. The bin object class is used internally in treebin, and is also part of the output provided by the function. It is an S4 class with the following slots:

|Slot |Contents | |--------:|:--------| |boundary | A function to test the inclusion of a point in the bin (i.e. a boundaryTest function) | |contents | A matrix containing the points in the bin | |measure | The measure (as computed by binMeasure) associated with the bin | |index | The index relating the bin to the internal decision tree | |info | Additional information specified by the user |

The treebinr object is the final output provided by treebin. It is an S4 object with the following slots:

|Slot |Contents | |--------:|:--------| |points | A matrix containing the representative points of each bin | |counts | A numeric vector containing the number of points in each bin | |bins | A list containing the final bin objects created during the treebin process | |tree | A matrix representing an directed graph of the binning process|

The Main Function - treebin

The main function available in the treebinr package is treebin, which takes the following form:

treebin <- function(X, 
                    numbins, 
                    binMeasure = gapMeasure,
                    boundaryTest = gapBoundaryTest,
                    selectBin = gapSelect,
                    splitBin = gapSplit,
                    makePoint = gapPoint,
                    binInfo = list(binRange = sapply(1:ncol(X), FUN = function(j){range(X[,j])})),
                    inputs = list(dimRange = sapply(1:ncol(X), FUN = function(j){range(X[,j])}))){

The input variable X is a matrix object containing the $n \times p$ matrix of points to be binned, and numbins is a numeric variable specifying the desired number of points in the binned configuration. Although the remaining input variables have supplied defaults, treebin is set up to allow complete customization of the binning process. The following subsections will detail the requirements of each input variable, and demonstrate how a prospective user can create their own binning methodology using the treebinr framework.

The output of treebin is a treebinr object.

binMeasure

The binMeasure function takes the following form

binMeasure <- function(bin, 
                       inputs = NULL)

Here, the bin input variable is an object of class bin on which we would like to compute (or update) the measure slot of the object. The inputs variable is a named list containing all of the additional information required to execute the function, with a default value of NULL. For instance, in the built in gapMeasure function, the range of each dimension is required to compute the measure for each bin, as well as tau, a numeric input. These are passed to the binMeasure function as:

inputs <- list(dimRange = sapply(1:nCols, FUN = function(j) {diff(range(X[,j]))}),
               tau = 1)

The intended use of binMeasure is to act as a score (or be something that can be interpreted as a score) during the bin selection stage. The output of binMeasure, however, is entirely up to the user, but must be compatible with the selectBin function, which is also user defined. For instance, in the gapBin methodology, binMeasure returns a matrix containing the computed gaps between all points in the bin, and the selectBin function manipulates this matrix for each bin object, and chooses the bin to be split (see the GapBin vignette for more information).

A very simple example would be to assign each bin a score equivalent to the number of points it contains:

binMeasure <- function(bin, 
                       inputs = NULL){
  out <- nrow(bin@contents)
  return(out)
}

selectBin

The selectBin function takes the following form:

selectBin <- function(bins,
                      inputs = NULL)

The bins input variable is now a list object containing all of the bin objects to date, and the inputs variable is once again a named list containing any addition information required in the selectBin function.

The intended use of selectBin is to evaluate, or choose between, the measure slots of each bin object, to choose the bin that will be split in the current iteration. As with all of the other functions, selectBin is completely general, so the user can include any other steps required to find this bin. For example, in the GapBin methodology, selectBin first finds the maximum measure value from each bin object (recall the measure slot contains matrices), before determining which bin is to be chosen.

Continuing with the example from the previous section, suppose we wanted to simply split the bin that had the largest number of points. Since our measure slot in each bin is the bin counts, we need only choose the bin that has the largest measure:

selectBin <- function(bins,
                      inputs = NULL){
  out <- sapply(bins,
                FUN = function(bin){bin@measure})
  return(which.max(out))
}

splitBin

The splitBin function is used to split a single bin into several using a custom method specified by the user. It takes the following form:

splitBin <- function(bin, binMeasure, inputs)

Here, the bin input is a bin object to be split, binMeasure is the function used to determine the measure associated with the newly formed bins, and inputs is a named list containing any addtional information required in splitBin. The function must return a list containing newly constructed bin objects.

The function can be quite complex, as it requires new bin objects to be constructed. Suppose that we have chosen a bin, but we don't care on which dimension the split occurs. A proper function would take the following form.

gapSplit <- function(bin, binMeasure, inputs){

  #Extract the bin information
  currentMeasure <- bin@measure
  currentBin <- bin@contents
  currentBoundary <- bin@info$binRange

  #Find the maximum measure in each dimension, and take the overall maximum as the dimension to be split
  whichAxis <- sample(1:nrow(currenBin), 1)

  #sort the bin along the chosen axis
  sortedBin <- currentBin[order(currentBin[,whichAxis]),]

  #Split the sorted bin at the maximum gap
  gapIndex <- which.max(currentMeasure[,whichAxis])

  leftContents <- sortedBin[seq(1, gapIndex, 1),,drop=FALSE]
  rightContents <- sortedBin[seq(gapIndex + 1, nrow(currentBin), 1),,drop=FALSE]

  #Define the new bin boundaries
  newBoundary <- (max(leftContents[,whichAxis]) + min(rightContents[,whichAxis]))/2   #Split at the midpoint of the maximum gap

  leftBoundary <- currentBoundary
  leftBoundary[2,whichAxis] <- newBoundary #The boundary of the left leaf is the same as the parent, except a new maximum along the split dimension

  rightBoundary <- currentBoundary
  rightBoundary[1,whichAxis] <- newBoundary  #The boundary of the right leaf is the same as the parent, except a new minimum along the split dimension

  #Put together the new bins in a binfo object
  leftBin <- bin(boundary = bin@boundary, contents = leftContents, measure = NULL, info = list(binRange = leftBoundary))
  rightBin <- bin(boundary = bin@boundary, contents = rightContents, measure = NULL, info = list(binRange = rightBoundary))

  #Compute the new bin measures
  leftBin@measure <- binMeasure(leftBin, inputs)
  rightBin@measure <- binMeasure(rightBin, inputs)

  #Return the new bins in a list
  return(list(leftBin, rightBin))

}

makePoint

The makePoint function is used after all of the desired split have been determined (i.e. we have achieved numbins bins), and determines the representative point for each bin. It takes the following form:

makePoint <- function(bin,
                      inputs = NULL)

Again, the bin input is an object bin, and inputs is a named list variable containing additional inputs for makePoint. The function must return a vector objects containing the representative point.

A simple choice, and the one used in GapBin, is to simply take the by-dimension mean of the contents in the bin:

makePoint <- function(bin,
                      inputs = NULL){
  point <- apply(bin@contents, 2, mean)
  return(point)
}

binInfo and inputs

The binInfo object is simply a named list object containing any additional information that the user would like to include with additional bins. FOr instance, in the GapBin methodology, the maximum/minimum of each bin is stored for use in boundaryTest. Thus, at initialization of the algorithm, the initial bin needs to contain the range of the original data, so binInfo is constructed as:

binInfo <- list(binRange = sapply(1:nCols, FUN = function(j) {range(X[,j])}))

Similarly, the inputs variable contains any additional information that the user requires within treebin, but is not necessarily needed in each bin. For instance, in GapBin, to compute the bin measure, the algorithm requires the range of the original dimensions of the data and an additional input variable tau. As such, the inputs variable takes:

inputs <- list(dimRange = sapply(1:nCols, FUN = function(j) {diff(range(X[,j]))}),
               tau = 1)

boundaryTest

The boundaryTest function is required in the addPoint function (discussed in later sections). It is attached to each bin individually (stored in the boundary slot of the bin object), meaning that a user could (potentially) have a a unique test for individual bins. It has the following form:

boundaryTest <- function(point,
                            bin,
                            inputs = NULL)

Here, the point input is a vector containing the point to be tested, the bin input is a bin object for which we would like to test the point, and the inputs variable is a named list containing any additional inputs required.

For the case where the bins are rectangular, a test of inclusion simply involves checking if the point falls within the rectangular bounds of each bin. This is why, for GapBin, the bin maximum/minimum is carried along in the info slot of each bin object.

boundaryTest <- function(point, bin, inputs){

  point <- matrix(point,nrow=1)
  boundary <- bin@info$binRange
  indicator <- c()

  for(i in 1:ncol(point)){
    indicator <- c(indicator, (point[i] < boundary[2,i] & point[i] > boundary[1,i]))
  }

  if(all(indicator)){
    return(TRUE)
  }else{
    return(FALSE)
  }
}

An Example using the GapBin Methodology

All of the required input to execute the GapBin methodology (which is described in detail in the GapBin vignette) is implemented as the default behaviour of treebin. Consider decreasing the size of a bivariate normal point configuration of size 1000 to 500 using the GapBin methodology of treebin

set.seed(1337)
X <- matrix(rnorm(2000),ncol=2)
numbins <- 500

binMeasure <- gapMeasure
selectBin <- gapSelect
splitBin <- gapSplit
boundaryTest <- gapBoundaryTest
makePoint <- gapPoints
dimRange <- sapply(1:ncol(X), FUN = function(j) {diff(range(X[,j]))})
inputs <- list(dimRange = dimRange, tau=1)
binInfo <- list(binRange = sapply(1:ncol(X), FUN = function(j) {range(X[,j])}))

out <- treebin(X, numbins, binMeasure, boundaryTest, selectBin, splitBin, makePoint, binInfo, inputs)

p1 <- ggplot(as.data.frame(X), aes(x=V1, y=V2)) + 
  geom_point(alpha=.5) +
  xlab("") +
  ylab("") +
  theme_bw()

p2 <- ggplot(as.data.frame(out@points), aes(x=V1, y=V2)) + 
  geom_point(alpha=.5) +
  xlab("") +
  ylab("") +
  theme_bw()

grid.arrange(p1,p2,nrow=1)

Supplementary Functions

The treebinr package has a number of supplementary functions that act on the output provided by the treebin object.

addPoint

As the name suggests, the addPoint function adds a point to the already binned configuration by first determining the existing bin to which it belongs, and then updating the bin by adding the point to it. The function takes the following form:

addPoint(point,
         treebinr_obj,
         binMeasure,
         makePoint,
         inputs = NULL)

Here, point is a numeric vector containing the point to be added, treebinr_obj is an object of class treebinr to which the point will be added, binMeasure and makePoint are the user defined functions to compute the measure of the bin and create the representative point respectively, and inputs is a named list containing additional inputs to the function.

Continuing with the example above, consider adding a point to the configuration at $(-3,3)$. The image on the left is the binned configuration, and the image on the right is the binned configuration with a point added at $(-3,3)$ (and incorporated into the representative point for the bin in which it fell)

point <- c(-3,3)
treebinr <- out      #From the original bin done above
binMeasure <- gapMeasure
makePoint <- gapPoints
inputs <- list(tau = 1, dimRange = dimRange)

out2 <- addPoint(point, treebinr, binMeasure, makePoint,  inputs)

p1 <- ggplot(as.data.frame(out@points), aes(x=V1, y=V2)) + 
  geom_point(alpha=.5) +
  xlab("") +
  ylab("") +
  theme_bw()

p2 <- ggplot(as.data.frame(out2@points), aes(x=V1, y=V2)) + 
  geom_point(alpha=.5) +
  xlab("") +
  ylab("") +
  theme_bw()

grid.arrange(p1, p2, nrow=1)

doNextSplit

Suppose we were interested in seeing what the next split in the sequence would be if we had requested the algorithm to produce one additional bin. Alternatively, suppose we added a point in an obscure location (such as we did above), and we wanted to see if producing one additional point would seperate the added point.

The doNextSplit function accomplishes this task, and takes the following form:

doNextSplit(treebinr_obj, selectBin, splitBin, binMeasure, makePoint, inputs)

As before, the treebinr_obj is an object of class treebinr eminating from the treebin function, selectBin, splitBin, binMeasure, and makePoint are the user supplied functions to select, split, compute the measure, and compute the representative point of the bin respectively, and inputs is a named list of additional input variables.

As an example, consider performing the next split on the configuration with the added point from above. The image on the left is of the confguration with the additional point added, and the image on the right the same configuration after executing one additional split.

treebin_obj <- out2
selectBin <- gapSelect
splitBin <- gapSplit
binMeasure <- gapMeasure
inputs <- list(tau = 1, dimRange = dimRange)

out3 <- doNextSplit(treebin_obj, selectBin, splitBin, binMeasure, makePoint, inputs)

p1 <- ggplot(as.data.frame(out2@points), aes(x=V1, y=V2)) + 
  geom_point(alpha=.5) +
  xlab("") +
  ylab("") +
  theme_bw()

p2 <- ggplot(as.data.frame(out3@points), aes(x=V1, y=V2)) + 
  geom_point(alpha=.5) +
  xlab("") +
  ylab("") +
  theme_bw()

grid.arrange(p1,p2,nrow=1)

undoLastSplit

Suppose we weren't happy with the results of doNextSplit, or we wanted to explore having fewer bins in the configuration. For this purpose, we can use the undoLastSplit function, which works opposite the doNextSplit function. Using the stored tree, we are able to traverse it in reverse, undoing the last split performed by the treebin function.

The undoLastSplit function takes the following form

undoLastSplit(treebinr_obj, binMeasure, makePoint, updateBin, inputs)

Where as before, treebinr_obj is an object of class treebinr, binMeasure and makePoint are user supplied functions to compute the measure and representative point of the newly formed bin, and inputs is again a named list containing additional variables. The new input not seen before is updateBin, which is a user supplied function that takes a set of pre-determined bins (i.e. the bins that are being joined), and turns them into a single bin object. The function takes the following form:

updateBin(bins, binMeasure, inputs)

Here, bins is a list containing the bin objects to be combined, binMeasure is the function to compute the measure of the newly formed bin, and inputs is a named list containing additional variables.

As an example, suppose we wanted to undo the split we performed in the previous step:

treebinr_obj <- out3 
binMeasure <- gapMeasure 
makePoint <- gapPoints
updateBin <- gapUpdate 
inputs <- list(tau = 1, dimRange = dimRange)

out4 <- undoLastSplit(treebinr_obj, binMeasure, makePoint, updateBin, inputs)

p1 <- ggplot(as.data.frame(out3@points), aes(x=V1, y=V2)) + 
  geom_point(alpha=.5) +
  xlab("") +
  ylab("") +
  theme_bw()

p2 <- ggplot(as.data.frame(out4@points), aes(x=V1, y=V2)) + 
  geom_point(alpha=.5) +
  xlab("") +
  ylab("") +
  theme_bw()

grid.arrange(p1,p2,nrow=1)

split

Suppose instead of wanting to simply execute the next split as chosen by the treebin algorithm, we instead had a specifi bin that we wanted to split. For the purpose, we use split, which has the following form:

split(bin, treebinr_obj, binMeasure, splitBin, makePoint, inputs)

Here, bin is an object of class bin to be split, treebinr_obj is an object of class treebinr, binMeasure, splitBin, and makePoint are user defined functions to compute the measure, split, and compute the representative point of the newly formed bins respectively, and inputs is a named list object containing additional input variables.

As an example, we split a random node from the original binned configuration. The configuration on the left is the original binned configuration, and the figure on the right is the configuration with an additional split node.

target <- out@bins[[490]]

treebinr_obj <- out
binMeasure <- gapMeasure
makePoint <- gapPoints
inputs <- inputs

out5 <- split(target, out, binMeasure, gapSplit, gapPoints, inputs)

p1 <- ggplot(as.data.frame(out@points), aes(x=V1, y=V2)) + 
  geom_point(alpha=.5) +
  xlab("") +
  ylab("") +
  theme_bw()

p2 <- ggplot(as.data.frame(out5@points), aes(x=V1, y=V2)) + 
  geom_point(alpha=.5) +
  xlab("") +
  ylab("") +
  theme_bw()

grid.arrange(p1,p2,nrow=1)

prune

Suppose now we want to go in the opposite direction - instead of splitting a given node, we want to re-combine it with its sister nodes (an operation generally known as pruning). For this purpose, we have prune, which takes a bin object as input, and combines it with all of its sister nodes to create a single node. It takes the following form:

prune(bin, treebinr_obj, binMeasure, makePoint, updateBin, inputs)

where bin is the bin object to be pruned, treebinr_obj is an objcet of class treebinr, binMeasure, makePoint, and updateBin are user supplied functions to compute the bin measure, create the updated bin, and compute the representative point, and inputs is a named list containing any additional variables.

We consider a simple example of pruning an arbitrarily chosen bin. The image on the left is the original binned configuration, and the image on the right is the pruned configuration.

set.seed(1337)

node <- out@bins[[sample(1:500, 1)]]
treebinr_obj <- out
binMeasure <- gapMeasure
makePoint <- gapPoints
inputs <- inputs

out7 <- prune(node, treebinr_obj, binMeasure, makePoint, gapUpdate, inputs)


p1 <- ggplot(as.data.frame(out@points), aes(x=V1, y=V2)) + 
  geom_point(alpha=.5) +
  xlab("") +
  ylab("") +
  theme_bw()

p2 <- ggplot(as.data.frame(out7@points), aes(x=V1, y=V2)) + 
  geom_point(alpha=.5) +
  xlab("") +
  ylab("") +
  theme_bw()

grid.arrange(p1,p2,nrow=1)

trim

Once we have a binned configuration, we may want to combine bins that have a low measure value, an operation generally referred to as trimming. For this operation, we use the function trim, which takes the following form:

trim(val,
     treebinr_obj,
     getMeasure = function(bin, inputs){bin@measure},
     binMeasure,
     makePoint,
     updateBin,
     inputs = NULL)

where val is the maximum value of measure to retain (i.e. all nodes with a lower value of measure will be combined with their sister nodes), treebinr_obj is an object of class treebinr, binMeasure, updateBin, and makePoint are function to compute the bin measure, create the newly formed bin, and create its representative point respectively. The getMeasure input is a user supplied function to extract, and possibly alter, the measure values of each bin object. The default value is to simply extract the measure of each bin.

For our previous example, consider trimming all nodes with a meaure value below $0.01$:

val <- .01
treebinr_obj <- out 
getMeasure <- function(bin, inputs){
  max <- max(bin@measure)
  return(max)
}
binMeasure <- gapMeasure
makePoint <- gapPoints  
boundaryTest <- gapBoundaryTest
updateBin <- gapUpdate 
inputs <- inputs

out8 <- trim(val, treebinr_obj, getMeasure, binMeasure, makePoint, updateBin, inputs)

p1 <- ggplot(as.data.frame(out@points), aes(x=V1, y=V2)) + 
  geom_point(alpha=.5) +
  xlab("") +
  ylab("") +
  theme_bw()

p2 <- ggplot(as.data.frame(out8@points), aes(x=V1, y=V2)) + 
  geom_point(alpha=.5) +
  xlab("") +
  ylab("") +
  theme_bw()

grid.arrange(p1,p2,nrow=1)

trimTo

Suppose instead of trimming below a specific measure threshold, we instead wanted to trim the configuration to a given number of points. To do this, we use the trimTo function, which first computes the value of measure that would result in a trim to the desired bumber of points, before executing the trim function discussed above. trimTo takes the following form:

trimTo <- function(num, 
                   treebinr_obj,
                   getMeasure = function(bin, inputs){bin@measure},
                   binMeasure,
                   makePoint,
                   updateBin,
                   inputs = NULL)

where the only input that differs from those in trim is num, the number of desired points in the trimmed configuration.

Where we previously binned the configuration to 500 bins, suppose we wanted only 350. We can do this in one of two way: Binning the original configuration directly to 350 bins, or trim the previously binned configuration back to 350. The result is two interestingly different configurations (direct binning on the left, trimming on the right).

numbins <- 350

#Bin the original configuration directly to 350
out9 <- treebin(X, numbins, binMeasure, boundaryTest, selectBin, splitBin, makePoint, binInfo, inputs)

#Trim the 500 bin configuration to 350
treebinr_obj <- out 
getMeasure <- function(bin, inputs){
  max <- max(bin@measure)
  return(max)
}
binMeasure <- gapMeasure
makePoint <- gapPoints  
boundaryTest <- gapBoundaryTest
updateBin <- gapUpdate 
inputs <- inputs

out10 <- trimTo(numbins, treebinr_obj, getMeasure, binMeasure, makePoint, updateBin, inputs)

p1 <- ggplot(as.data.frame(out9@points), aes(x=V1, y=V2)) + 
  geom_point(alpha=.5) +
  xlab("") +
  ylab("") +
  theme_bw()

p2 <- ggplot(as.data.frame(out10@points), aes(x=V1, y=V2)) + 
  geom_point(alpha=.5) +
  xlab("") +
  ylab("") +
  theme_bw()

grid.arrange(p1,p2,nrow=1)

rwoldford/treebinr documentation built on May 12, 2019, 4:38 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

rwoldford/treebinr
Data Reduction via Tree-based Binning

In rwoldford/treebinr: Data Reduction via Tree-based Binning

Introduction

Object Classes

The Main Function - treebin

binMeasure

selectBin

splitBin

makePoint

binInfo and inputs

boundaryTest

An Example using the GapBin Methodology

Supplementary Functions

addPoint

doNextSplit

undoLastSplit

split

prune

trim

trimTo

R Package Documentation

Browse R Packages

We want your feedback!

rwoldford/treebinr Data Reduction via Tree-based Binning

In rwoldford/treebinr: Data Reduction via Tree-based Binning

Introduction

Object Classes

The Main Function - treebin

binMeasure

selectBin

splitBin

makePoint

binInfo and inputs

boundaryTest

An Example using the GapBin Methodology

Supplementary Functions

addPoint

doNextSplit

undoLastSplit

split

prune

trim

trimTo

R Package Documentation

Browse R Packages

We want your feedback!

rwoldford/treebinr
Data Reduction via Tree-based Binning