An unfortunate reality about the boostr framework is that it's a bit jargon-heavy. To take full advantage of the modularity behind boostr, you'll want to understand the following terms: "estimation procedure", "reweighter", "aggregator", and "performance analyzer".
This document will define each term and give examples. While the definitions stand on their own, certain examples will build off each other, so be warned!
At a high level, an estimation procedure is any black-box algorithm that learns from some data and spits out an estimator -- some function that can take data and return estimates. This may seem a bit convoluted, so let's look at two prototypical examples: $k$-NN and SVM.
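The original code chunks for these examples aren't reproduced here, so the following is a minimal sketch of what the two procedures might look like; the argument names (k, learningInput, formula, cost, data) and the assumption that the response sits in the first column of the learning set are illustrative choices, not boostr's own code.

```r
library(class)  # for knn()
library(e1071)  # for svm()

# k-NN has no dedicated training step: the "model" is the learning set
# itself, so the closure just hands the learning set to class::knn().
kNN_EstimationProcedure <- function(k, learningInput) {
  function(newdata) {
    class::knn(train = learningInput[, -1], test = newdata,
               cl = learningInput[, 1], k = k)
  }
}

# SVM involves a genuine training step (the call to e1071::svm) and then
# returns a closure around predict() that has access to the trained model.
svm_EstimationProcedure <- function(formula, cost, data) {
  model <- e1071::svm(formula, cost = cost, data = data)
  function(newdata) {
    predict(model, newdata = newdata)
  }
}
```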
Now you may be thinking, "what's the big deal here? kNN_EstimationProcedure is just a wrapper around class::knn." To that I would say, "keen eye, my friend" -- I'll address that in a moment. However, things get a bit more interesting with svm_EstimationProcedure, where we see that our function involves a "training" step -- the call to e1071::svm -- and then returns a function (a closure) that has access to the trained model, to perform prediction. Since an estimation procedure is supposed to be the thing that trains the model we want to make estimates from, it's very reasonable to consider e1071::svm an estimation procedure. However, it would be incorrect to consider predict, by itself, an estimator. Really, the wrapper around predict that gives it access to the model built by e1071::svm is the estimator, since this is the object we use to generate estimates.
Now, back to the $k$-NN example. How is this a demonstration of the estimation procedure-estimator setup we're trying to cultivate? Well, in this particular instance, the $k$-NN algorithm doesn't have a dedicated "training" step: the model built in the $k$-NN algorithm is the learning set itself. Thus, $k$-NN can skip the model-building step we saw in the SVM example and go straight to the prediction step. Hence, our estimation procedure is just a wrapper around class::knn that makes sure we're using the learning set.
For those of you who are more mathematically inclined, you can think of estimation procedures in the following way: suppose you had a learning set $\mathcal{L}_n = \left\{(x_1, y_1), \ldots, (x_n, y_n)\right\}$ of $n$ observations $(x_i, y_i)$, where $x_i \in \mathcal{X} \subseteq \mathbb{R}^{J}$ and $y_i \in \mathcal{Y} \subseteq \mathbb{R}$, and a mapping $\widehat{\Psi} : \mathcal{L}_n \to \left\{f \mid f : \mathcal{X} \to \mathcal{Y}\right\}$. We call $\widehat{\Psi}$ an estimation procedure and the function $\psi_n = \widehat{\Psi}(\mathcal{L}_n)$ an estimator. Note that since we're in the world of probability and statistics, the $x$'s and $y$'s are realizations of random variables, and so for a fixed $n$, your learning set $\mathcal{L}_n$ is also a realization of a random object. Hence, the estimation procedure is actually a function on the space of learning sets.
Technicalities aside, the most profitable way of thinking about estimation procedures ($\widehat{\Psi}$) is that they are black-box algorithms that produce functions ($\psi_n$) which take predictors like $x_i$ and spit out predictions, $\hat{y}_i$.
This is all well and good, but how does this apply to you, the boostr user? Well, boostr lets you use your own estimation procedures in boostr::boost. However, to do so, boostr::boost needs to make sure the object you're claiming to be an estimation procedure is, in fact, an estimation procedure.
A priori, boostr assumes that all estimation procedures:

1. have the signature (data, ...), where data represents the learning set $\mathcal{L}_n$,
2. return an estimator with the signature (newdata, ...), where newdata represents the $x$'s whose $y$'s are to be predicted, and
3. are objects of class estimationProcedure.

The last requirement is just a minor detail; the first two are more important. Basically, if you can rewrite your estimation procedure's signature to match (data, ...), and its output's signature to match (newdata, ...), boostr::boost can boost it. However, boostr::boost doesn't do this with black magic; it needs to know information about your estimation procedure. Specifically, boostr::boost has an argument, metadata, which is a named list of arguments to pass to boostr Wrapper Generators -- functions written for the express purpose of taking things like your estimation procedure and creating objects whose signatures and output are compatible inside boostr.
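To make the two signature requirements concrete, here is a hedged sketch (again assuming the response lives in the first column of the learning set; none of this is boostr's own code) of the $k$-NN procedure rewritten by hand to satisfy them:

```r
# An estimation procedure matching boostr's assumed contract:
# it takes (data, ...) and returns an estimator taking (newdata, ...).
compatible_kNN <- function(data, ...) {
  dots <- list(...)  # extra arguments, e.g. k, arrive through '...'
  function(newdata, ...) {
    class::knn(train = data[, -1], test = newdata,
               cl = data[, 1], k = dots$k)
  }
}
```

Of course, you don't have to write this wrapper yourself -- the point of metadata and the Wrapper Generators is to have boostr build it for you.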
For estimation procedures, the relevant Wrapper Generators are boostr::wrapProcedure and boostr::buildEstimationProcedure -- which one boostr::boost calls depends entirely on the x argument to boostr::boost.
Ignoring this caveat for a moment, let's consider what we would have to do to turn kNN_EstimationProcedure from the $k$-NN example into a boostr-compatible estimation procedure. First, its signature is (k, learningInput), so we'd want a wrapper with signature (data, ...) where data corresponds to learningInput, and then have ... take care of k. boostr can build this for you, if you include the entry learningSet="learningInput" in the metadata argument of boostr::boost and pass the value of k in as a named entry in .procArgs -- see the example below, where kNN_EstimationProcedure is boosted according to the arc-x4 algorithm. Since we're wrapping a whole procedure, and not a closure that combines the train-predict pattern (like in the svm example), the metadata arguments we'll want to use are the arguments corresponding to boostr::wrapProcedure. See the help page for the details on boostr::wrapProcedure's signature.
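As described above, the metadata entry tells boostr::wrapProcedure that the procedure's learning set arrives through learningInput, and .procArgs supplies k. The original worked example isn't reproduced here, so the following sketch fills it in under stated assumptions: iris (with the response moved to the first column, per the sketch above) stands in for the learning set, B = 3 is an arbitrary number of boosting iterations, and arcx4Reweighter / arcx4Aggregator are placeholder names for boostr's arc-x4 reweighter and aggregator (check the package's help pages for the exported names).

```r
library(boostr)

# Response in column 1, predictors in the rest (matches the sketch above).
df <- data.frame(Species = iris$Species, iris[, -5])

boostedKNN <- boost(
  x = kNN_EstimationProcedure,
  B = 3,                                  # assumed iteration count
  reweighter = arcx4Reweighter,           # assumed name: arc-x4 reweighter
  aggregator = arcx4Aggregator,           # assumed name: arc-x4 aggregator
  data = df,
  metadata = list(learningSet = "learningInput"),
  .procArgs = list(k = 5)                 # k passed as a named entry
)
```

If the call goes through, boostedKNN plays the role of $\psi_n$ above: an object you can use to generate predictions from new data.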