In PierreMasselot/primr: Patient rule-induction method

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(primr)

A basic toy example

Consider the following example, where a response is linearly linked to two explanatory variables. We create an artifical bump that occurs when $x_1 > 0.8$ and $x_2 > 0.5$:

set.seed(12345)
x <- matrix(runif(2000), ncol = 2, dimnames = list(NULL, c("x1", "x2")))
y <- 2 * x[,1] + 5 * x[,2] + 10 * (x[,1] >= .8 & x[,2] >= .5) + rnorm(1000)

plot(x, col = hcl.colors(10)[cut(y, 10)], pch = 16)

Peeling

The initial peeling can be carried out with the function peeling. This function allows to provide the peeling fraction alpha, the stopping criterion beta.stop, as well as the objective function being maximized and other parameters. See ?peeling.

# Peels the dataset to maximize the box's mean
peel_res <- peeling(y, x)

Choosing the best box

The next step is to examine the peeling trajectory in order to choose the best box. Several possibilities are available for the peeling trajectory. For instance, the plot below shows both the evolution of the mean and the relative difference of mean between each step.

par(mfrow = c(2,1), mar = c(4, 4, 3, 2) + .1)
plot_trajectory(peel_res, type = "b", pch = 16, col = "cornflowerblue", 
  support = 0.11, abline.pars = list(lwd = 2, col = "indianred"), 
  xlab = "")
plot_trajectory(peel_res, type = "b", pch = 16, col = "cornflowerblue",
  support = 0.11, abline.pars = list(lwd = 2, col = "indianred"), 
  ytype = "rel.diff")

The peeling trajectory shows that the box's mean reaches a plateau when the support is lower than 0.11 (identified by the vertical line). This is also shown by a large relative difference in the bottom panel. This value of 0.11 is very close of the theoretical proportion of observations in the bump created ($0.2 \times 0.5 = 0.1$).

If we plot the corresponding box on the data, we see that it fits well the data.

plot_box(peel_res, pch = 16, ypalette = hcl.colors(10), support = 0.11, 
  box.args = list(lwd = 2))

Pasting

The final step is to refine the box's edges by pasting. Let's say we choose the final box with support equal to 0.01. By pasting, it is possible to slightly increase the size of the final box, as shown in the figure below. It means that including the pasted observations increases the box's average.

paste_res <- pasting(peel_res, support = 0.01)

plot_box(paste_res, pch = 16, ypalette = hcl.colors(10), npaste = c(0, 2), 
  box.args = list(lwd = 2, border = c("grey", "black"), lty = 1:2))

Some special topics

The above example shows the overall framework to perform the PRIM algorithm. Here, we cover some additional features of the primr package.

Select the box after peeling

The most important part of the PRIM algorithm is the choice of the peeling's stopping criterion, i.e. which box to keep after the peeling has been carried out. The primr package offers a few tools to help the user in this choice.

The first one is to analyze the peeling trajectory as was done in the toy example above. In this example, we chose the highest relative difference of the trajectory. This can be automatically extracted by the function jump.prim. The example below shows that it allows finding the same box as before.

peel_res <- peeling(y, x, beta.stop = 0.05)
chosen <- jump.prim(peel_res)

plot_box(peel_res, pch = 16, ypalette = hcl.colors(10), 
  support = chosen$final.box$support, box.args = list(lwd = 2))

Alternatively, the stopping criterion can be chosen through cross-validation with the function cv.trajectory. For instance, on the toy example below, the cross-validated trajectory also shows the same plateau as before for support lower than 0.1 precisely. It also reaches its maximum for the box with support equal to 0.032, although there is an important uncertainty.

cv_res <- cv.trajectory(y, x)

plot_trajectory(cv_res, type = "b", pch = 16, col = "cornflowerblue", 
  support = 0.1, npeel = which.max(cv_res$yfun), 
  abline.pars = list(lwd = 2, col = "indianred"), 
  xlab = "", xlim = c(0, 0.2), ylim = c(10, 18))

Once the peeling has been carried out and the appropriate support chosen, information about one or several boxes can be retrieved with the function extract.box. For instance, in the following it shows that the box with support equal to 0.1 results in limits slightly above the true ones (0.8 and 0.5).

extract.box(peel_res, support = 0.1)

Categorical variables

The primr package also supports categorical input variables. For instance, in the toy example above, we add a third 'letter' variable with a bump when it takes the value 'b'.

set.seed(54321)
x <- data.frame(x1 = runif(1000), x2 = runif(1000), 
  x3 = sample(letters[1:5], 1000, replace = TRUE))
y <- 2 * x[,1] + 5 * x[,2] + 10 * (x[,1] >= .8 & x[,2] >= .5) + 
  5 * (x[,3] == 'b') + rnorm(1000)

Below, we indeed find a box that roughly respects the bump limits, with the right support of 0.02 ($= 0.2 \times 0.5 \times 0.2$). Note that the peeling trajectory shows that the third variable was the first to be peeled with large chunks of data removed during the first four iterations.

peel_res <- peeling(y, x)

chosen <- jump.prim(peel_res)
chosen$final.box

plot_trajectory(peel_res, type = "b", pch = 16, col = "cornflowerblue", 
  support = chosen$final.box$support, 
  abline.pars = list(lwd = 2, col = "indianred"), xlab = "")

Directed peeling

Another feature of the primr package is the possibility to direct the peeling in a particular direction. Let's say that both input variables now induce bumps at both their extremities.

set.seed(12345)
x <- matrix(runif(2000), ncol = 2, dimnames = list(NULL, c("x1", "x2")))
y <- 10 * (x[,1] <= .2 & x[,2] <= .2) + 10 * (x[,1] >= .8 & x[,2] >= .8) +
  rnorm(1000)

plot(x, col = hcl.colors(10)[cut(y, 10)], pch = 16)

We can force the algorithm to peel only the left of the box to work toward the bump on the upper right corner.

peel_left <- peeling(y, x, peeling.side = -1)
chosen <- jump.prim(peel_left)

plot_box(peel_left, pch = 16, ypalette = hcl.colors(10), 
  support = chosen$final.box$support, box.args = list(lwd = 2),
  main = "Left peeling")

And conversely, we can direct the peeling to work toward the bump of the bottom left corner. Note that different input variables can be directed in different directions.

peel_right <- peeling(y, x, peeling.side = 1)
chosen <- jump.prim(peel_right)

plot_box(peel_right, pch = 16, ypalette = hcl.colors(10), 
  support = chosen$final.box$support, box.args = list(lwd = 2),
  main = "Right peeling")

Alternative objective functions

Although the mean is the most common objective function to maximize, other functions can be of interest. This can be carried out throught the argument obj.fun in the peeling function. For instance, in the following toy example, we work toward the high variance area of the data.

set.seed(1111)
x <- matrix(runif(2000), ncol = 2, dimnames = list(NULL, c("x1", "x2")))
y <- rnorm(1000, sd = 1 + 10 * (x[,1] >= .8 & x[,2] >= .8))

# Standard deviation peeling
peel_res <- peeling(y, x, obj.fun = sd)

par(mfrow = c(1,2))
plot_trajectory(peel_res, type = "b", pch = 16, col = "cornflowerblue", 
  support = 0.036, abline.pars = list(lwd = 2, col = "indianred"), xlab = "",
  xlim = c(0, 0.5))
plot_box(peel_res, pch = 16, ypalette = hcl.colors(10), 
  support = 0.036, box.args = list(lwd = 2))

The argument obj.fun can also be used to minimize the objective function. The trick is simply to consider the opposite objective function.

set.seed(3333)
x <- matrix(runif(2000), ncol = 2, dimnames = list(NULL, c("x1", "x2")))
y <- - 10 * (x[,1] <= .2 & x[,2] <= .2) + 10 * (x[,1] >= .8 & x[,2] >= .8) +
  rnorm(1000)

peel_res <- peeling(y, x, obj.fun = function(y) -mean(y))
chosen <- jump.prim(peel_res)

plot_box(peel_res, pch = 16, ypalette = hcl.colors(10), 
  support = chosen$final.box$support, box.args = list(lwd = 2))

Finally, more complex objective functions can also be passed to obj.fun. In more complex cases, the function should take two arguments: y representing the response vector, x which is the explanatory variable matrix and inbox which is a boolean vector indicating the observations inside the current box.

In the example below, we use a function to maximize the estimated coefficient of a linear regression with a break. The peeling trajectory reaches a plateau for supports lower than 0.3.

set.seed(5555)
x <- runif(500)
ym <- 0.5 * x + 5 * (x - 0.7) * (x >= 0.7)
y <- ym + rnorm(500, sd = 0.1)

# Apply to find where the slope is the steepest
peel_res <- peeling(y, x, beta.stop = 0.1, 
  obj.fun = function(y, x, inbox){
    dat <- data.frame(y, x)
    coef(lm(y ~ x, data = dat[inbox,]))[2]
})

par(mfrow = c(1,2))
plot_trajectory(peel_res, type = "b", pch = 16, col = "cornflowerblue", 
  support = 0.3, abline.pars = list(lwd = 2, col = "indianred"))
plot_box(peel_res, pch = 16, ypalette = hcl.colors(10), 
  support = 0.3, box.args = list(lwd = 2))
lines(sort(x), ym[order(x)], col = "red", lwd = 2)