$\renewcommand{\tr}[1]{{#1}^{\mkern-1.5mu\mathsf{T}}}$ $\renewcommand{\ve}[1]{\mathbf{#1}}$ $\renewcommand{\m}[1]{\mathbf{#1}}$ $\renewcommand{\sv}[1]{\boldsymbol{#1}}$ $\renewcommand{\pop}[1]{\mathcal{#1}}$ $\renewcommand{\samp}[1]{\mathcal{#1}}$ $\renewcommand{\imply}{\Longrightarrow}$ $\renewcommand{\leftgiven}{~\left\lvert~}$ $\renewcommand{\given}{~\vert~}$ $\renewcommand{\suchthat}{~:~}$ $\renewcommand{\widebar}[1]{\overline{#1}}$ $\renewcommand{\wig}[1]{\tilde{#1}}$ $\renewcommand{\bigwig}[1]{\widetilde{#1}}$ $\renewcommand{\field}[1]{\mathbb{#1}}$ $\renewcommand{\Reals}{\field{R}}$ $\renewcommand{\abs}[1]{\left\lvert ~{#1} ~\right\rvert}$ $\renewcommand{\size}[1]{\left\lvert {#1} \right\rvert}$ $\renewcommand{\tr}[1]{{#1}^{\mkern-1.5mu\mathsf{T}}}$ $\renewcommand{\norm}[1]{\left|\left|{#1}\right|\right|}$ $\renewcommand{\intersect}{\cap}$ $\renewcommand{\union}{\cup}$ $\renewcommand{\suchthat}{~:~}$ $\renewcommand{\st}{~:~}$
library(treebinr) library(ggplot2) library(grid) library(gridExtra)
In this vignette, we introduce the basic functionality of the treebinr
package, by using the methodology of GapBin
which is the default functionality. Technical details can be found in the accompanying GapBin
vignette, and the accompanying JSS vignette.
The treebin
package works with two object classes, bin
and treebinr
. The bin
object class is used internally in treebin
, and is also part of the output provided by the function. It is an S4 class with the following slots:
|Slot |Contents |
|--------:|:--------|
|boundary | A function to test the inclusion of a point in the bin (i.e. a boundaryTest
function) |
|contents | A matrix containing the points in the bin |
|measure | The measure (as computed by binMeasure
) associated with the bin |
|index | The index relating the bin to the internal decision tree |
|info | Additional information specified by the user |
The treebinr
object is the final output provided by treebin
. It is an S4 object with the following slots:
|Slot |Contents |
|--------:|:--------|
|points | A matrix containing the representative points of each bin |
|counts | A numeric vector containing the number of points in each bin |
|bins | A list containing the final bin
objects created during the treebin
process |
|tree | A matrix representing an directed graph of the binning process|
The main function available in the treebinr
package is treebin
, which takes the following form:
treebin <- function(X, numbins, binMeasure = gapMeasure, boundaryTest = gapBoundaryTest, selectBin = gapSelect, splitBin = gapSplit, makePoint = gapPoint, binInfo = list(binRange = sapply(1:ncol(X), FUN = function(j){range(X[,j])})), inputs = list(dimRange = sapply(1:ncol(X), FUN = function(j){range(X[,j])}))){
The input variable X
is a matrix object containing the $n \times p$ matrix of points to be binned, and numbins
is a numeric variable specifying the desired number of points in the binned configuration. Although the remaining input variables have supplied defaults, treebin
is set up to allow complete customization of the binning process. The following subsections will detail the requirements of each input variable, and demonstrate how a prospective user can create their own binning methodology using the treebinr
framework.
The output of treebin
is a treebinr
object.
The binMeasure
function takes the following form
binMeasure <- function(bin, inputs = NULL)
Here, the bin
input variable is an object of class bin
on which we would like to compute (or update) the measure
slot of the object. The inputs
variable is a named list containing all of the additional information required to execute the function, with a default value of NULL. For instance, in the built in gapMeasure
function, the range of each dimension is required to compute the measure for each bin, as well as tau
, a numeric input. These are passed to the binMeasure
function as:
inputs <- list(dimRange = sapply(1:nCols, FUN = function(j) {diff(range(X[,j]))}), tau = 1)
The intended use of binMeasure
is to act as a score (or be something that can be interpreted as a score) during the bin selection stage. The output of binMeasure
, however, is entirely up to the user, but must be compatible with the selectBin
function, which is also user defined. For instance, in the gapBin
methodology, binMeasure
returns a matrix containing the computed gaps between all points in the bin, and the selectBin
function manipulates this matrix for each bin
object, and chooses the bin to be split (see the GapBin
vignette for more information).
A very simple example would be to assign each bin a score equivalent to the number of points it contains:
binMeasure <- function(bin, inputs = NULL){ out <- nrow(bin@contents) return(out) }
The selectBin
function takes the following form:
selectBin <- function(bins, inputs = NULL)
The bins
input variable is now a list object containing all of the bin
objects to date, and the inputs
variable is once again a named list containing any addition information required in the selectBin
function.
The intended use of selectBin
is to evaluate, or choose between, the measure
slots of each bin
object, to choose the bin that will be split in the current iteration. As with all of the other functions, selectBin
is completely general, so the user can include any other steps required to find this bin. For example, in the GapBin
methodology, selectBin
first finds the maximum measure
value from each bin
object (recall the measure
slot contains matrices), before determining which bin is to be chosen.
Continuing with the example from the previous section, suppose we wanted to simply split the bin that had the largest number of points. Since our measure
slot in each bin is the bin counts, we need only choose the bin that has the largest measure:
selectBin <- function(bins, inputs = NULL){ out <- sapply(bins, FUN = function(bin){bin@measure}) return(which.max(out)) }
The splitBin
function is used to split a single bin into several using a custom method specified by the user. It takes the following form:
splitBin <- function(bin, binMeasure, inputs)
Here, the bin
input is a bin
object to be split, binMeasure
is the function used to determine the measure associated with the newly formed bins, and inputs
is a named list containing any addtional information required in splitBin
. The function must return a list containing newly constructed bin
objects.
The function can be quite complex, as it requires new bin objects to be constructed. Suppose that we have chosen a bin, but we don't care on which dimension the split occurs. A proper function would take the following form.
gapSplit <- function(bin, binMeasure, inputs){ #Extract the bin information currentMeasure <- bin@measure currentBin <- bin@contents currentBoundary <- bin@info$binRange #Find the maximum measure in each dimension, and take the overall maximum as the dimension to be split whichAxis <- sample(1:nrow(currenBin), 1) #sort the bin along the chosen axis sortedBin <- currentBin[order(currentBin[,whichAxis]),] #Split the sorted bin at the maximum gap gapIndex <- which.max(currentMeasure[,whichAxis]) leftContents <- sortedBin[seq(1, gapIndex, 1),,drop=FALSE] rightContents <- sortedBin[seq(gapIndex + 1, nrow(currentBin), 1),,drop=FALSE] #Define the new bin boundaries newBoundary <- (max(leftContents[,whichAxis]) + min(rightContents[,whichAxis]))/2 #Split at the midpoint of the maximum gap leftBoundary <- currentBoundary leftBoundary[2,whichAxis] <- newBoundary #The boundary of the left leaf is the same as the parent, except a new maximum along the split dimension rightBoundary <- currentBoundary rightBoundary[1,whichAxis] <- newBoundary #The boundary of the right leaf is the same as the parent, except a new minimum along the split dimension #Put together the new bins in a binfo object leftBin <- bin(boundary = bin@boundary, contents = leftContents, measure = NULL, info = list(binRange = leftBoundary)) rightBin <- bin(boundary = bin@boundary, contents = rightContents, measure = NULL, info = list(binRange = rightBoundary)) #Compute the new bin measures leftBin@measure <- binMeasure(leftBin, inputs) rightBin@measure <- binMeasure(rightBin, inputs) #Return the new bins in a list return(list(leftBin, rightBin)) }
The makePoint
function is used after all of the desired split have been determined (i.e. we have achieved numbins
bins), and determines the representative point for each bin. It takes the following form:
makePoint <- function(bin, inputs = NULL)
Again, the bin
input is an object bin
, and inputs
is a named list variable containing additional inputs for makePoint
. The function must return a vector objects containing the representative point.
A simple choice, and the one used in GapBin
, is to simply take the by-dimension mean of the contents in the bin:
makePoint <- function(bin, inputs = NULL){ point <- apply(bin@contents, 2, mean) return(point) }
The binInfo
object is simply a named list object containing any additional information that the user would like to include with additional bins. FOr instance, in the GapBin
methodology, the maximum/minimum of each bin is stored for use in boundaryTest
. Thus, at initialization of the algorithm, the initial bin needs to contain the range of the original data, so binInfo
is constructed as:
binInfo <- list(binRange = sapply(1:nCols, FUN = function(j) {range(X[,j])}))
Similarly, the inputs
variable contains any additional information that the user requires within treebin
, but is not necessarily needed in each bin. For instance, in GapBin
, to compute the bin measure, the algorithm requires the range of the original dimensions of the data and an additional input variable tau
. As such, the inputs
variable takes:
inputs <- list(dimRange = sapply(1:nCols, FUN = function(j) {diff(range(X[,j]))}), tau = 1)
The boundaryTest
function is required in the addPoint
function (discussed in later sections). It is attached to each bin
individually (stored in the boundary
slot of the bin
object), meaning that a user could (potentially) have a a unique test for individual bins. It has the following form:
boundaryTest <- function(point, bin, inputs = NULL)
Here, the point
input is a vector containing the point to be tested, the bin
input is a bin
object for which we would like to test the point, and the inputs
variable is a named list containing any additional inputs required.
For the case where the bins are rectangular, a test of inclusion simply involves checking if the point falls within the rectangular bounds of each bin. This is why, for GapBin
, the bin maximum/minimum is carried along in the info
slot of each bin
object.
boundaryTest <- function(point, bin, inputs){ point <- matrix(point,nrow=1) boundary <- bin@info$binRange indicator <- c() for(i in 1:ncol(point)){ indicator <- c(indicator, (point[i] < boundary[2,i] & point[i] > boundary[1,i])) } if(all(indicator)){ return(TRUE) }else{ return(FALSE) } }
All of the required input to execute the GapBin
methodology (which is described in detail in the GapBin
vignette) is implemented as the default behaviour of treebin
. Consider decreasing the size of a bivariate normal point configuration of size 1000 to 500 using the GapBin
methodology of treebin
set.seed(1337) X <- matrix(rnorm(2000),ncol=2) numbins <- 500 binMeasure <- gapMeasure selectBin <- gapSelect splitBin <- gapSplit boundaryTest <- gapBoundaryTest makePoint <- gapPoints dimRange <- sapply(1:ncol(X), FUN = function(j) {diff(range(X[,j]))}) inputs <- list(dimRange = dimRange, tau=1) binInfo <- list(binRange = sapply(1:ncol(X), FUN = function(j) {range(X[,j])})) out <- treebin(X, numbins, binMeasure, boundaryTest, selectBin, splitBin, makePoint, binInfo, inputs) p1 <- ggplot(as.data.frame(X), aes(x=V1, y=V2)) + geom_point(alpha=.5) + xlab("") + ylab("") + theme_bw() p2 <- ggplot(as.data.frame(out@points), aes(x=V1, y=V2)) + geom_point(alpha=.5) + xlab("") + ylab("") + theme_bw() grid.arrange(p1,p2,nrow=1)
The treebinr
package has a number of supplementary functions that act on the output provided by the treebin
object.
As the name suggests, the addPoint
function adds a point to the already binned configuration by first determining the existing bin to which it belongs, and then updating the bin by adding the point to it. The function takes the following form:
addPoint(point, treebinr_obj, binMeasure, makePoint, inputs = NULL)
Here, point
is a numeric vector containing the point to be added, treebinr_obj
is an object of class treebinr
to which the point will be added, binMeasure
and makePoint
are the user defined functions to compute the measure of the bin and create the representative point respectively, and inputs
is a named list containing additional inputs to the function.
Continuing with the example above, consider adding a point to the configuration at $(-3,3)$. The image on the left is the binned configuration, and the image on the right is the binned configuration with a point added at $(-3,3)$ (and incorporated into the representative point for the bin in which it fell)
point <- c(-3,3) treebinr <- out #From the original bin done above binMeasure <- gapMeasure makePoint <- gapPoints inputs <- list(tau = 1, dimRange = dimRange) out2 <- addPoint(point, treebinr, binMeasure, makePoint, inputs) p1 <- ggplot(as.data.frame(out@points), aes(x=V1, y=V2)) + geom_point(alpha=.5) + xlab("") + ylab("") + theme_bw() p2 <- ggplot(as.data.frame(out2@points), aes(x=V1, y=V2)) + geom_point(alpha=.5) + xlab("") + ylab("") + theme_bw() grid.arrange(p1, p2, nrow=1)
Suppose we were interested in seeing what the next split in the sequence would be if we had requested the algorithm to produce one additional bin. Alternatively, suppose we added a point in an obscure location (such as we did above), and we wanted to see if producing one additional point would seperate the added point.
The doNextSplit
function accomplishes this task, and takes the following form:
doNextSplit(treebinr_obj, selectBin, splitBin, binMeasure, makePoint, inputs)
As before, the treebinr_obj
is an object of class treebinr
eminating from the treebin
function, selectBin
, splitBin
, binMeasure
, and makePoint
are the user supplied functions to select, split, compute the measure, and compute the representative point of the bin respectively, and inputs
is a named list of additional input variables.
As an example, consider performing the next split on the configuration with the added point from above. The image on the left is of the confguration with the additional point added, and the image on the right the same configuration after executing one additional split.
treebin_obj <- out2 selectBin <- gapSelect splitBin <- gapSplit binMeasure <- gapMeasure inputs <- list(tau = 1, dimRange = dimRange) out3 <- doNextSplit(treebin_obj, selectBin, splitBin, binMeasure, makePoint, inputs) p1 <- ggplot(as.data.frame(out2@points), aes(x=V1, y=V2)) + geom_point(alpha=.5) + xlab("") + ylab("") + theme_bw() p2 <- ggplot(as.data.frame(out3@points), aes(x=V1, y=V2)) + geom_point(alpha=.5) + xlab("") + ylab("") + theme_bw() grid.arrange(p1,p2,nrow=1)
Suppose we weren't happy with the results of doNextSplit
, or we wanted to explore having fewer bins in the configuration. For this purpose, we can use the undoLastSplit
function, which works opposite the doNextSplit
function. Using the stored tree, we are able to traverse it in reverse, undoing the last split performed by the treebin
function.
The undoLastSplit
function takes the following form
undoLastSplit(treebinr_obj, binMeasure, makePoint, updateBin, inputs)
Where as before, treebinr_obj
is an object of class treebinr
, binMeasure
and makePoint
are user supplied functions to compute the measure and representative point of the newly formed bin, and inputs
is again a named list containing additional variables. The new input not seen before is updateBin
, which is a user supplied function that takes a set of pre-determined bins (i.e. the bins that are being joined), and turns them into a single bin
object. The function takes the following form:
updateBin(bins, binMeasure, inputs)
Here, bins
is a list containing the bin
objects to be combined, binMeasure
is the function to compute the measure of the newly formed bin, and inputs
is a named list containing additional variables.
As an example, suppose we wanted to undo the split we performed in the previous step:
treebinr_obj <- out3 binMeasure <- gapMeasure makePoint <- gapPoints updateBin <- gapUpdate inputs <- list(tau = 1, dimRange = dimRange) out4 <- undoLastSplit(treebinr_obj, binMeasure, makePoint, updateBin, inputs) p1 <- ggplot(as.data.frame(out3@points), aes(x=V1, y=V2)) + geom_point(alpha=.5) + xlab("") + ylab("") + theme_bw() p2 <- ggplot(as.data.frame(out4@points), aes(x=V1, y=V2)) + geom_point(alpha=.5) + xlab("") + ylab("") + theme_bw() grid.arrange(p1,p2,nrow=1)
Suppose instead of wanting to simply execute the next split as chosen by the treebin
algorithm, we instead had a specifi bin that we wanted to split. For the purpose, we use split
, which has the following form:
split(bin, treebinr_obj, binMeasure, splitBin, makePoint, inputs)
Here, bin
is an object of class bin
to be split, treebinr_obj
is an object of class treebinr
, binMeasure
, splitBin
, and makePoint
are user defined functions to compute the measure, split, and compute the representative point of the newly formed bins respectively, and inputs
is a named list object containing additional input variables.
As an example, we split a random node from the original binned configuration. The configuration on the left is the original binned configuration, and the figure on the right is the configuration with an additional split node.
target <- out@bins[[490]] treebinr_obj <- out binMeasure <- gapMeasure makePoint <- gapPoints inputs <- inputs out5 <- split(target, out, binMeasure, gapSplit, gapPoints, inputs) p1 <- ggplot(as.data.frame(out@points), aes(x=V1, y=V2)) + geom_point(alpha=.5) + xlab("") + ylab("") + theme_bw() p2 <- ggplot(as.data.frame(out5@points), aes(x=V1, y=V2)) + geom_point(alpha=.5) + xlab("") + ylab("") + theme_bw() grid.arrange(p1,p2,nrow=1)
Suppose now we want to go in the opposite direction - instead of splitting a given node, we want to re-combine it with its sister nodes (an operation generally known as pruning). For this purpose, we have prune
, which takes a bin
object as input, and combines it with all of its sister nodes to create a single node. It takes the following form:
prune(bin, treebinr_obj, binMeasure, makePoint, updateBin, inputs)
where bin
is the bin object to be pruned, treebinr_obj
is an objcet of class treebinr
, binMeasure
, makePoint
, and updateBin
are user supplied functions to compute the bin measure, create the updated bin, and compute the representative point, and inputs
is a named list containing any additional variables.
We consider a simple example of pruning an arbitrarily chosen bin. The image on the left is the original binned configuration, and the image on the right is the pruned configuration.
set.seed(1337) node <- out@bins[[sample(1:500, 1)]] treebinr_obj <- out binMeasure <- gapMeasure makePoint <- gapPoints inputs <- inputs out7 <- prune(node, treebinr_obj, binMeasure, makePoint, gapUpdate, inputs) p1 <- ggplot(as.data.frame(out@points), aes(x=V1, y=V2)) + geom_point(alpha=.5) + xlab("") + ylab("") + theme_bw() p2 <- ggplot(as.data.frame(out7@points), aes(x=V1, y=V2)) + geom_point(alpha=.5) + xlab("") + ylab("") + theme_bw() grid.arrange(p1,p2,nrow=1)
Once we have a binned configuration, we may want to combine bins that have a low measure
value, an operation generally referred to as trimming. For this operation, we use the function trim
, which takes the following form:
trim(val, treebinr_obj, getMeasure = function(bin, inputs){bin@measure}, binMeasure, makePoint, updateBin, inputs = NULL)
where val
is the maximum value of measure
to retain (i.e. all nodes with a lower value of measure will be combined with their sister nodes), treebinr_obj
is an object of class treebinr
, binMeasure
, updateBin
, and makePoint
are function to compute the bin measure, create the newly formed bin, and create its representative point respectively. The getMeasure
input is a user supplied function to extract, and possibly alter, the measure values of each bin
object. The default value is to simply extract the measure of each bin.
For our previous example, consider trimming all nodes with a meaure value below $0.01$:
val <- .01 treebinr_obj <- out getMeasure <- function(bin, inputs){ max <- max(bin@measure) return(max) } binMeasure <- gapMeasure makePoint <- gapPoints boundaryTest <- gapBoundaryTest updateBin <- gapUpdate inputs <- inputs out8 <- trim(val, treebinr_obj, getMeasure, binMeasure, makePoint, updateBin, inputs) p1 <- ggplot(as.data.frame(out@points), aes(x=V1, y=V2)) + geom_point(alpha=.5) + xlab("") + ylab("") + theme_bw() p2 <- ggplot(as.data.frame(out8@points), aes(x=V1, y=V2)) + geom_point(alpha=.5) + xlab("") + ylab("") + theme_bw() grid.arrange(p1,p2,nrow=1)
Suppose instead of trimming below a specific measure threshold, we instead wanted to trim the configuration to a given number of points. To do this, we use the trimTo
function, which first computes the value of measure
that would result in a trim to the desired bumber of points, before executing the trim
function discussed above. trimTo
takes the following form:
trimTo <- function(num, treebinr_obj, getMeasure = function(bin, inputs){bin@measure}, binMeasure, makePoint, updateBin, inputs = NULL)
where the only input that differs from those in trim
is num
, the number of desired points in the trimmed configuration.
Where we previously binned the configuration to 500 bins, suppose we wanted only 350. We can do this in one of two way: Binning the original configuration directly to 350 bins, or trim the previously binned configuration back to 350. The result is two interestingly different configurations (direct binning on the left, trimming on the right).
numbins <- 350 #Bin the original configuration directly to 350 out9 <- treebin(X, numbins, binMeasure, boundaryTest, selectBin, splitBin, makePoint, binInfo, inputs) #Trim the 500 bin configuration to 350 treebinr_obj <- out getMeasure <- function(bin, inputs){ max <- max(bin@measure) return(max) } binMeasure <- gapMeasure makePoint <- gapPoints boundaryTest <- gapBoundaryTest updateBin <- gapUpdate inputs <- inputs out10 <- trimTo(numbins, treebinr_obj, getMeasure, binMeasure, makePoint, updateBin, inputs) p1 <- ggplot(as.data.frame(out9@points), aes(x=V1, y=V2)) + geom_point(alpha=.5) + xlab("") + ylab("") + theme_bw() p2 <- ggplot(as.data.frame(out10@points), aes(x=V1, y=V2)) + geom_point(alpha=.5) + xlab("") + ylab("") + theme_bw() grid.arrange(p1,p2,nrow=1)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.