knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(primr)
Consider the following example, where a response is linearly linked to two explanatory variables. We create an artifical bump that occurs when $x_1 > 0.8$ and $x_2 > 0.5$:
set.seed(12345) x <- matrix(runif(2000), ncol = 2, dimnames = list(NULL, c("x1", "x2"))) y <- 2 * x[,1] + 5 * x[,2] + 10 * (x[,1] >= .8 & x[,2] >= .5) + rnorm(1000) plot(x, col = hcl.colors(10)[cut(y, 10)], pch = 16)
The initial peeling can be carried out with the function peeling
. This
function allows to provide the peeling fraction alpha
, the stopping
criterion beta.stop
, as well as the objective function being maximized
and other parameters. See ?peeling
.
# Peels the dataset to maximize the box's mean peel_res <- peeling(y, x)
The next step is to examine the peeling trajectory in order to choose the best box. Several possibilities are available for the peeling trajectory. For instance, the plot below shows both the evolution of the mean and the relative difference of mean between each step.
par(mfrow = c(2,1), mar = c(4, 4, 3, 2) + .1) plot_trajectory(peel_res, type = "b", pch = 16, col = "cornflowerblue", support = 0.11, abline.pars = list(lwd = 2, col = "indianred"), xlab = "") plot_trajectory(peel_res, type = "b", pch = 16, col = "cornflowerblue", support = 0.11, abline.pars = list(lwd = 2, col = "indianred"), ytype = "rel.diff")
The peeling trajectory shows that the box's mean reaches a plateau when the support is lower than 0.11 (identified by the vertical line). This is also shown by a large relative difference in the bottom panel. This value of 0.11 is very close of the theoretical proportion of observations in the bump created ($0.2 \times 0.5 = 0.1$).
If we plot the corresponding box on the data, we see that it fits well the data.
plot_box(peel_res, pch = 16, ypalette = hcl.colors(10), support = 0.11, box.args = list(lwd = 2))
The final step is to refine the box's edges by pasting. Let's say we choose the final box with support equal to 0.01. By pasting, it is possible to slightly increase the size of the final box, as shown in the figure below. It means that including the pasted observations increases the box's average.
paste_res <- pasting(peel_res, support = 0.01) plot_box(paste_res, pch = 16, ypalette = hcl.colors(10), npaste = c(0, 2), box.args = list(lwd = 2, border = c("grey", "black"), lty = 1:2))
The above example shows the overall framework to perform the PRIM algorithm.
Here, we cover some additional features of the primr
package.
The most important part of the PRIM algorithm is the choice of the peeling's
stopping criterion, i.e. which box to keep after the peeling has been
carried out. The primr
package offers a few tools to help the user in this
choice.
The first one is to analyze the peeling trajectory as was done in the toy
example above. In this example, we chose the highest relative difference of
the trajectory. This can be automatically extracted by the function
jump.prim
. The example below shows that it allows finding the same box as
before.
peel_res <- peeling(y, x, beta.stop = 0.05) chosen <- jump.prim(peel_res) plot_box(peel_res, pch = 16, ypalette = hcl.colors(10), support = chosen$final.box$support, box.args = list(lwd = 2))
Alternatively, the stopping criterion can be chosen through cross-validation
with the function cv.trajectory
. For instance, on the toy example below,
the cross-validated trajectory also shows the same plateau as before for
support lower than 0.1 precisely. It also
reaches its maximum for the box with support
equal to 0.032, although there is an important uncertainty.
cv_res <- cv.trajectory(y, x) plot_trajectory(cv_res, type = "b", pch = 16, col = "cornflowerblue", support = 0.1, npeel = which.max(cv_res$yfun), abline.pars = list(lwd = 2, col = "indianred"), xlab = "", xlim = c(0, 0.2), ylim = c(10, 18))
Once the peeling has been carried out and the appropriate support chosen,
information about one or several boxes can be retrieved with the function
extract.box
. For instance, in the following it shows that the box with
support equal to 0.1 results in limits slightly above the true ones
(0.8 and 0.5).
extract.box(peel_res, support = 0.1)
The primr
package also supports categorical input variables. For instance,
in the toy example above, we add a third 'letter' variable with a bump
when it takes the value 'b'.
set.seed(54321) x <- data.frame(x1 = runif(1000), x2 = runif(1000), x3 = sample(letters[1:5], 1000, replace = TRUE)) y <- 2 * x[,1] + 5 * x[,2] + 10 * (x[,1] >= .8 & x[,2] >= .5) + 5 * (x[,3] == 'b') + rnorm(1000)
Below, we indeed find a box that roughly respects the bump limits, with the right support of 0.02 ($= 0.2 \times 0.5 \times 0.2$). Note that the peeling trajectory shows that the third variable was the first to be peeled with large chunks of data removed during the first four iterations.
peel_res <- peeling(y, x) chosen <- jump.prim(peel_res) chosen$final.box plot_trajectory(peel_res, type = "b", pch = 16, col = "cornflowerblue", support = chosen$final.box$support, abline.pars = list(lwd = 2, col = "indianred"), xlab = "")
Another feature of the primr
package is the possibility to direct the
peeling in a particular direction.
Let's say that both input variables now induce bumps at both their extremities.
set.seed(12345) x <- matrix(runif(2000), ncol = 2, dimnames = list(NULL, c("x1", "x2"))) y <- 10 * (x[,1] <= .2 & x[,2] <= .2) + 10 * (x[,1] >= .8 & x[,2] >= .8) + rnorm(1000) plot(x, col = hcl.colors(10)[cut(y, 10)], pch = 16)
We can force the algorithm to peel only the left of the box to work toward the bump on the upper right corner.
peel_left <- peeling(y, x, peeling.side = -1) chosen <- jump.prim(peel_left) plot_box(peel_left, pch = 16, ypalette = hcl.colors(10), support = chosen$final.box$support, box.args = list(lwd = 2), main = "Left peeling")
And conversely, we can direct the peeling to work toward the bump of the bottom left corner. Note that different input variables can be directed in different directions.
peel_right <- peeling(y, x, peeling.side = 1) chosen <- jump.prim(peel_right) plot_box(peel_right, pch = 16, ypalette = hcl.colors(10), support = chosen$final.box$support, box.args = list(lwd = 2), main = "Right peeling")
Although the mean is the most common objective function to maximize, other
functions can be of interest. This can be carried out throught the argument
obj.fun
in the peeling
function. For instance, in the following toy
example, we work toward the high variance area of the data.
set.seed(1111) x <- matrix(runif(2000), ncol = 2, dimnames = list(NULL, c("x1", "x2"))) y <- rnorm(1000, sd = 1 + 10 * (x[,1] >= .8 & x[,2] >= .8)) # Standard deviation peeling peel_res <- peeling(y, x, obj.fun = sd) par(mfrow = c(1,2)) plot_trajectory(peel_res, type = "b", pch = 16, col = "cornflowerblue", support = 0.036, abline.pars = list(lwd = 2, col = "indianred"), xlab = "", xlim = c(0, 0.5)) plot_box(peel_res, pch = 16, ypalette = hcl.colors(10), support = 0.036, box.args = list(lwd = 2))
The argument obj.fun
can also be used to minimize the objective function.
The trick is simply to consider the opposite objective function.
set.seed(3333) x <- matrix(runif(2000), ncol = 2, dimnames = list(NULL, c("x1", "x2"))) y <- - 10 * (x[,1] <= .2 & x[,2] <= .2) + 10 * (x[,1] >= .8 & x[,2] >= .8) + rnorm(1000) peel_res <- peeling(y, x, obj.fun = function(y) -mean(y)) chosen <- jump.prim(peel_res) plot_box(peel_res, pch = 16, ypalette = hcl.colors(10), support = chosen$final.box$support, box.args = list(lwd = 2))
Finally, more complex objective functions can also be passed to obj.fun
.
In more complex cases, the function should take two arguments: y
representing the response vector, x
which is the explanatory variable matrix
and inbox
which is a boolean
vector indicating the observations inside the current box.
In the example below, we use a function to maximize the estimated coefficient of a linear regression with a break. The peeling trajectory reaches a plateau for supports lower than 0.3.
set.seed(5555) x <- runif(500) ym <- 0.5 * x + 5 * (x - 0.7) * (x >= 0.7) y <- ym + rnorm(500, sd = 0.1) # Apply to find where the slope is the steepest peel_res <- peeling(y, x, beta.stop = 0.1, obj.fun = function(y, x, inbox){ dat <- data.frame(y, x) coef(lm(y ~ x, data = dat[inbox,]))[2] }) par(mfrow = c(1,2)) plot_trajectory(peel_res, type = "b", pch = 16, col = "cornflowerblue", support = 0.3, abline.pars = list(lwd = 2, col = "indianred")) plot_box(peel_res, pch = 16, ypalette = hcl.colors(10), support = 0.3, box.args = list(lwd = 2)) lines(sort(x), ym[order(x)], col = "red", lwd = 2)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.