Estimation Procedures, Reweighters, Aggregators, and Performance Analyzers


An unfortunate reality about the boostr framework is that it's a bit jargon heavy. To take full advantage of the modularity behind boostr you'll want to understand the following terms: "estimation procedure", "reweighter", "aggregator", and "performance analyzer".

This document will define each term and give examples. While the definitions stand on their own, certain examples will build off each other, so be warned!

Estimation Procedures

A few examples

At a high level, an estimation procedure is any black-box algorithm that learns from some data and spits out an estimator -- some function that can take data and return estimates. This may seem a bit convoluted, so let's look at two prototypical examples: $k$-NN and svm.
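
The two procedures discussed next could be sketched roughly as follows. This is a reconstruction under assumptions, not the package's own code: the (k, learningSet) signature and the calls to class::knn and e1071::svm come from the text, but the response-in-first-column layout and the (formula, cost, data) signature of svm_EstimationProcedure are illustrative choices.

```r
# Sketch only: assumes the response sits in the first column of learningSet.
kNN_EstimationProcedure <- function(k, learningSet) {
  learningInput <- learningSet[, -1]
  learningResponse <- learningSet[, 1]

  # No training step: the learning set itself plays the role of the model.
  function(newdata) {
    class::knn(train = learningInput, test = newdata,
               cl = learningResponse, k = k)
  }
}

# Sketch only: the (formula, cost, data) signature is an assumption.
svm_EstimationProcedure <- function(formula, cost, data) {
  # Training step: fit the model once.
  model <- e1071::svm(formula, data = data, cost = cost)

  # The returned closure keeps access to `model`; it is the estimator.
  function(newdata) {
    predict(model, newdata = newdata)
  }
}

# Usage with a built-in data set, reordered so the response comes first.
irisLearningSet <- iris[, c(5, 1:4)]
kNN_estimator <- kNN_EstimationProcedure(k = 3, learningSet = irisLearningSet)
head(kNN_estimator(irisLearningSet[, -1]))
```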


Now you may be thinking, "what's the big deal here? kNN_EstimationProcedure is just a wrapper around class::knn." To that I would say, "keen eye, my friend" -- I'll address that in a moment. However, things get a bit more interesting with svm_EstimationProcedure, where we see that our function involves a "training" step -- the call to e1071::svm -- and then returns a function (a closure) that has access to the trained model to perform prediction. Since an estimation procedure is supposed to be the thing that trains the model we want to make estimates from, it's very reasonable to consider e1071::svm an estimation procedure. However, it would be incorrect to consider predict, by itself, an estimator. Really, the wrapper around predict that gives it access to the model built by e1071::svm is the estimator, since this is the object we use to generate estimates.

Now, back to the $k$-NN example. How is this a demonstration of the estimation procedure-estimator setup we're trying to cultivate? Well, in this particular instance, the $k$-NN algorithm doesn't have a dedicated "training" step. The model built in the $k$-NN algorithm is the learning set. Thus, $k$-NN can skip the model building step we saw in the svm example and go straight to the prediction step. Hence, our estimation procedure is just a wrapper around class::knn that makes sure we're using the learning set.

Mathematical definition of an estimation procedure

For those of you who are more mathematically inclined, you can think of estimation procedures in the following way: suppose you had a learning set $\mathcal{L}_n = \left\{(x_1, y_1), \ldots, (x_n, y_n)\right\}$ of $n$ observations $(x_i, y_i)$, where $x_i \in \mathcal{X} \subseteq \mathbb{R}^{J}$ and $y_i \in \mathcal{Y} \subseteq \mathbb{R}$, and a mapping $\widehat{\Psi} : \mathcal{L}_n \to \left\{f \mid f : \mathcal{X} \to \mathcal{Y}\right\}$. We call $\widehat{\Psi}$ an estimation procedure and the function $\psi_n = \widehat{\Psi}(\mathcal{L}_n)$ an estimator. Note that since we're in the world of probability and statistics, the $x$'s and $y$'s are realizations of random variables, and so for a fixed $n$, your learning set $\mathcal{L}_n$ is also a realization of a random object. Hence, the estimation procedure is actually a function on the space of learning sets.

Technicalities aside, the most profitable way of thinking about estimation procedures ($\widehat{\Psi}$) is that they are black-box algorithms that spit out functions ($\psi_n$) which can take predictors like $x_i$ and spit out predictions, $\hat{y}_i$.
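
To make the notation concrete, here is a minimal sketch in which $\widehat{\Psi}$ is ordinary least squares. The name lm_EstimationProcedure and the column names x and y are made up for illustration; nothing here is part of boostr.

```r
# Psi-hat: takes a learning set and returns an estimator psi_n.
# Hypothetical example; assumes the learning set has columns `x` and `y`.
lm_EstimationProcedure <- function(data) {
  model <- stats::lm(y ~ x, data = data)  # the "black-box" training step

  # psi_n: maps predictors to predictions y-hat.
  function(newdata) {
    unname(stats::predict(model, newdata = newdata))
  }
}

learningSet <- data.frame(x = 1:10, y = 2 * (1:10) + 1)
psi_n <- lm_EstimationProcedure(learningSet)  # psi_n = Psi-hat(L_n)
psi_n(data.frame(x = c(11, 12)))              # predictions for new x's
```

Because the toy learning set lies exactly on the line $y = 2x + 1$, the predictions at $x = 11$ and $x = 12$ come out at 23 and 25, up to floating-point error.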

Estimation procedures in boostr

This is all well and good, but how does this apply to you, the boostr user? Well, boostr lets you use your own estimation procedures in boostr::boost. However, to do so, boostr::boost needs to make sure the object you're claiming to be an estimation procedure is, in fact, an estimation procedure.

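A priori, boostr assumes that all estimation procedures:

1. have a signature of the form (data, ...), where data represents the learning set,
2. return a function whose signature has the form (newdata, ...) and which returns predictions for newdata, and
3. inherit from the class estimationProcedure.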

The last requirement is just a minor detail; the first two are more important. Basically, if you can rewrite your estimation procedure's signature to match (data, ...), and its output's signature to match (newdata, ...), boostr::boost can Boost it. However, boostr::boost doesn't do this by black magic; it needs to know some information about your estimation procedure. Specifically, boostr::boost has an argument, metadata, which is a named list of arguments to pass to boostr's Wrapper Generators. These are functions written for the express purpose of taking things like your estimation procedure and creating objects whose signatures and output are compatible inside boostr.


For estimation procedures, the relevant Wrapper Generators are boostr::wrapProcedure and boostr::buildEstimationProcedure -- which one boostr::boost calls depends entirely on the x argument to boostr::boost. Ignoring this caveat for a moment, let's consider what we would have to do to turn kNN_EstimationProcedure in the $k$-NN example into a boostr-compatible estimation procedure. First, its signature is (k, learningSet), so we'd want a wrapper function(data, ...) where data corresponds to learningSet, and then have ... take care of k. boostr can build this for you if you include the entry learningSet="learningSet" in the metadata argument of boostr::boost and pass the value of k in as a named entry in .procArgs -- see the example below, where kNN_EstimationProcedure is boosted according to the arc-x4 algorithm. Since we're wrapping a whole procedure, and not a closure that combines the train-predict pattern (like in the svm example), the metadata arguments we'll want to use are the arguments corresponding to boostr::wrapProcedure. See the help page for the details on boostr::wrapProcedure's signature.
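
To see what such a wrapper amounts to, here is a hand-written adapter. It is not boostr's implementation of wrapProcedure: wrapKNN is a hypothetical name, and the definition of kNN_EstimationProcedure assumes the response sits in the first column of the learning set.

```r
# Assumed form of the procedure from the k-NN example (response in column 1).
kNN_EstimationProcedure <- function(k, learningSet) {
  function(newdata) {
    class::knn(train = learningSet[, -1], test = newdata,
               cl = learningSet[, 1], k = k)
  }
}

# Hand-rolled adapter (hypothetical): exposes the (data, ...) signature and
# returns an estimator with the (newdata, ...) signature.
wrapKNN <- function(proc) {
  function(data, ...) {
    estimator <- proc(learningSet = data, ...)  # `...` carries k
    function(newdata, ...) estimator(newdata)
  }
}

wrapped <- wrapKNN(kNN_EstimationProcedure)
estimator <- wrapped(data = iris[, c(5, 1:4)], k = 5)
head(estimator(iris[, 1:4]))
```

The adapter only renames arguments; all of the learning still happens inside the original procedure.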

The arc-x4 call described above looks like this:

```r
# Glass is assumed to be the glass-identification data from the mlbench package.
data(Glass, package = "mlbench")

boostr::boostWithArcX4(x = kNN_EstimationProcedure,
                       B = 3,
                       data = Glass,
                       metadata = list(learningSet = "learningSet"),
                       .procArgs = list(k = 5),
                       .boostBackendArgs = list(
                         .subsetFormula = formula(Type ~ .)))
```

Reweighters
========

Motivation
------------

The whole idea behind Boosting is to adaptively resample observations from the learning set, and train estimators on these (weighted) samples of learning set observations. Specifically, we want to be able to take the performance of a particular estimator and the weights we used to draw the set it was trained on, and come up with new weights. The formal mechanism for doing this is a "reweighter". That is, a reweighter looks at the weights an estimator was trained on and its performance on the *original* learning set, and spits out a new set of weights, suggesting where we may want to focus more attention during the training of our next estimator. (It may return additional output, but let's not get ahead of ourselves.)

Examples
---------------

`boostr` implements a few classic reweighters out of the box: `boostr::arcfsReweighter`, `boostr::arcx4Reweighter`, `boostr::adaboostReweighter`, and `boostr::vanillaBagger`.

Reweighters in `boostr`
----------------

You'll notice that all the implemented reweighters in `boostr` have the following in common:

1. Their signatures are of the form `(prediction, response, weights, ...)`; in this signature, `prediction` represents an estimator's prediction (vector), `response` represents the true response (which comes from the learning set), and `weights` holds the weights associated with the observations in `response`. Hence, all three arguments are meant to be vectors of the same length.
2. They output named lists that contain an entry named `weights`, and
3. They inherit from the class `reweighter`.

These are the requirements for any function to be compatible inside `boostr`.
Hence, to use your own reweighter in `boostr::boost` you can either write a function from scratch that satisfies these requirements or, if you already have one implemented, let `boostr::boost` build a wrapper around it using `boostr::wrapReweighter`. This is done by passing the appropriately named arguments to `boostr::wrapReweighter` through `boostr::boost`'s `metadata` argument. For instance, one could Boost an svm with a (rather silly) reweighter that permutes weights.
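
A from-scratch reweighter meeting the three requirements might look like the sketch below. The name `permutingReweighter` is hypothetical, and attaching the class by hand is an assumption about how such an object could be built; whether `boostr::boost` accepts it without going through `boostr::wrapReweighter` is not confirmed here.

```r
# Hypothetical reweighter: ignores the estimator's performance and simply
# shuffles the weights it was given.
permutingReweighter <- function(prediction, response, weights, ...) {
  # Index-based shuffle, so a length-one weight vector is handled correctly.
  list(weights = weights[sample.int(length(weights))])
}
# Class name taken from the requirements listed above.
class(permutingReweighter) <- c("reweighter", class(permutingReweighter))

set.seed(1)
out <- permutingReweighter(prediction = c("a", "b", "a"),
                           response   = c("a", "a", "a"),
                           weights    = c(0.2, 0.3, 0.5))
out$weights  # the same three weights, in a random order
```
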

Aggregators
=======

Motivation
------------

Once we're done building all these estimators, we're going to want to get a single estimate out of them. After all, you didn't have to go through all the trouble of downloading this package if all you wanted was a cacophony of estimates. This is where aggregators come in: an aggregator takes your ensemble of estimators and returns a single, aggregated estimator.

Examples
---------------

`boostr` implements a few classic aggregators out of the box: `boostr::arcfsAggregator`, `boostr::arcx4Aggregator`, `boostr::adaboostAggregator`, `boostr::weightedAggregator` and `boostr::vanillaAggregator`.

Aggregators in `boostr`
------------------------

You'll notice that all the implemented aggregators in `boostr` have the following in common:

1. Their signatures have the form `(estimators, ...)`, where `estimators` represents an ensemble of estimators,
2. They return a function of a single argument `newdata`, and
3. They inherit from the class `aggregator`.

These are the requirements for any function to be compatible inside `boostr`. Note that the `...`'s are necessary for an aggregator since `boostr::boostBackend` pipes the (named) reweighter output to the aggregator, so this allows aggregators to ignore irrelevant reweighter output.

Like with reweighters, you can use your own aggregator by letting `boostr::boost` build a wrapper using `boostr::wrapAggregator`. For example, one could Boost an svm with a contrived aggregator that only considers the second estimator. Consult `boostr::wrapAggregator`'s help documentation for the details on the arguments you need to pass to `metadata` to properly wrap your aggregator.
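
The contrived "second estimator only" aggregator can be sketched in base R as follows. The name `secondEstimatorAggregator` and the toy ensemble are hypothetical, and the class is attached by hand purely to mirror the third requirement.

```r
# Hypothetical aggregator: keeps only the second estimator in the ensemble.
secondEstimatorAggregator <- function(estimators, ...) {
  # `...` swallows any reweighter output piped in alongside the estimators.
  function(newdata) {
    estimators[[2]](newdata)
  }
}
class(secondEstimatorAggregator) <- c("aggregator",
                                      class(secondEstimatorAggregator))

# Toy ensemble of three constant estimators.
ensemble <- list(
  function(newdata) rep("a", nrow(newdata)),
  function(newdata) rep("b", nrow(newdata)),
  function(newdata) rep("c", nrow(newdata))
)

# The extra `weights` argument is ignored thanks to `...`.
aggregated <- secondEstimatorAggregator(ensemble, weights = c(0.1, 0.9))
aggregated(data.frame(x = 1:3))
# [1] "b" "b" "b"
```
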

Performance Analyzers
=======

The idea of a performance analyzer isn't really specific to boosting, or estimation, for that matter. These functions are just routines called once a new estimator has been trained, to calculate some performance statistics of the estimator. The default performance analyzer is `boostr::defaultOOBPerformanceAnalysis`, which calculates the out-of-bag performance of an estimator.

The only requirements that a `boostr`-compatible performance analyzer must meet are that

1. Its signature includes the arguments `prediction`, `response`, and `oobObs`, and
2. It inherits from the `performanceAnalyzer` class.

Any of its output is (appropriately) organized in the `estimatorPerformance` attribute of the `boostr` object returned from `boostr::boost`. To pass any additional arguments to a performance analyzer, put `.analyzePerformanceArgs = list(...)` inside the `.boostBackendArgs` argument of `boostr::boost`.
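
A minimal analyzer with that shape could look like this. It is a sketch under assumptions: `oobErrorAnalyzer` is a hypothetical name, and `oobObs` is assumed to be a vector of indices identifying the out-of-bag observations, which the text does not spell out.

```r
# Hypothetical analyzer: misclassification rate on out-of-bag observations.
# Assumes `oobObs` indexes into `prediction` and `response`.
oobErrorAnalyzer <- function(prediction, response, oobObs, ...) {
  list(oobErr = mean(prediction[oobObs] != response[oobObs]))
}
class(oobErrorAnalyzer) <- c("performanceAnalyzer", class(oobErrorAnalyzer))

perf <- oobErrorAnalyzer(prediction = c("a", "b", "a", "b"),
                         response   = c("a", "a", "a", "b"),
                         oobObs     = c(2, 4))
perf$oobErr  # one of the two out-of-bag predictions is wrong
# [1] 0.5
```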


boostr documentation built on May 2, 2019, 1:42 p.m.