peeling: Top-down peeling
In PierreMasselot/primr: Patient rule-induction method


View source: R/peeling.R

Iteratively peels a dataset for bump hunting.

peeling(y, x, alpha = 0.05, beta.stop = 0.01, obj.fun = mean, peeling.side = 0)

`y`	Numeric vector of response values.
`x`	Numeric or categorical data.frame of input values.
`alpha`	The peeling fraction of the algorithm. A value between 0 and 1 giving the proportion of peeled observations at each step.
`beta.stop`	The stopping support of the algorithm. A value between 0 and 1 giving the proportion of remaining data below which the algorithm stops.
`obj.fun`	The function of `y` to be maximized. Can be a user defined function (see details).
`peeling.side`	A numeric vector of side constraints on the peeling of each input variable. -1 indicates peeling only the 'left' of the box (i.e. increasing the lower limit only), 1 indicates peeling only the 'right' (i.e. decreasing the upper limit only) and 0 indicates no constraint.

The function peeling carries out the top-down peeling which is the first step of the PRIM algorithm. At each iteration it peels a proportion alpha of data from one side of the domain in order to increase the value of the function obj.fun applied to the response y. The algorithm iterates the peeling until the support of the box (i.e. the proportion of remaining observations) is below the value beta.stop.
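The loop described above can be sketched in base R. This is only an illustration under simplifying assumptions (numeric inputs, the mean as objective, quantile-based peels on both sides of every variable), not the code used by primr; peel_sketch is a hypothetical name introduced for this example.

```r
# Illustrative top-down peeling for numeric inputs (not the primr source code).
# peel_sketch is a hypothetical name used only for this example.
peel_sketch <- function(y, x, alpha = 0.05, beta.stop = 0.01) {
  x <- as.matrix(x)
  inbox <- rep(TRUE, nrow(x))          # the starting box holds every observation
  support <- 1
  yfun <- mean(y)
  while (mean(inbox) > beta.stop) {
    best_val <- -Inf
    best_keep <- NULL
    for (j in seq_len(ncol(x))) {
      xj <- x[inbox, j]
      cands <- list(
        inbox & x[, j] > quantile(xj, alpha),      # peel the "left" side
        inbox & x[, j] < quantile(xj, 1 - alpha)   # peel the "right" side
      )
      for (keep in cands) {
        # Skip candidates that remove nothing or everything
        if (sum(keep) == 0 || sum(keep) == sum(inbox)) next
        val <- mean(y[keep])
        if (val > best_val) {
          best_val <- val
          best_keep <- keep
        }
      }
    }
    if (is.null(best_keep)) break      # no valid peel left
    inbox <- best_keep
    support <- c(support, mean(inbox)) # proportion of remaining observations
    yfun <- c(yfun, best_val)
  }
  list(npeel = length(support) - 1, support = support, yfun = yfun)
}

set.seed(1)
x <- matrix(runif(400), ncol = 2)
y <- 10 * (x[, 1] >= 0.8 & x[, 2] >= 0.5) + rnorm(200)
res <- peel_sketch(y, x, beta.stop = 0.1)
```

Here res$support starts at 1 (the whole dataset) and decreases at each step until it reaches beta.stop, which mirrors the npeel + 1 length of the support and yfun elements described in the Value section.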

Many functions can be used as obj.fun, including user-defined functions. A user-defined function should take three arguments: y and x, representing the response and input variables, and inbox, a logical vector indicating which observations are inside the current box. A classical function such as mean, var or median can also be passed to obj.fun; in this case a function matching the above structure is created internally. For objectives more complicated than these basic ones, it is recommended that users write their own function as described above.
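To make the (y, x, inbox) signature concrete, here is a minimal sketch of how a one-argument summary function could be adapted to it. The helper wrap_obj_fun is hypothetical and only illustrates the idea; how primr builds its internal wrapper is not documented here.

```r
# Hypothetical helper: adapt a one-argument summary such as mean or median
# to the (y, x, inbox) signature.
wrap_obj_fun <- function(f) {
  function(y, x, inbox) f(y[inbox])
}

obj_mean <- wrap_obj_fun(mean)

y <- c(1, 2, 3, 10, 12)
x <- data.frame(x1 = c(0.1, 0.2, 0.3, 0.8, 0.9))
inbox <- x$x1 >= 0.5          # observations inside a candidate box

obj_mean(y, x, inbox)         # mean of y over the in-box observations: 11

# A custom objective written directly with the three arguments:
# the in-box correlation between y and the first input variable
obj_cor <- function(y, x, inbox) cor(y[inbox], x[inbox, 1])
```

Writing the objective directly with all three arguments, as in obj_cor, is what allows it to depend on the inputs as well as on the response.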

The function also allows directed peeling, i.e. constraining the peeling to occur on a single side of some input variables. Thus when peeling.side = -1, only the lower part of the variable is peeled (the "left" of the domain) and when peeling.side = 1, only the upper part of the variable is peeled (the "right" of the domain). Note that a vector can be passed, thus applying a different constraint to each input variable.
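The coding of peeling.side can be read as a per-variable list of permitted peels. The sketch below spells this out; allowed_peels is a hypothetical helper written for illustration, and the recycling of a single value over all variables is an assumption based on the examples, where a scalar is passed for two inputs.

```r
# Hypothetical helper listing which sides may be peeled for each input
# variable, given the -1 / 0 / 1 coding of peeling.side.
allowed_peels <- function(peeling.side, var.names) {
  # Assumption: a single value applies to every variable
  side <- rep_len(peeling.side, length(var.names))
  out <- lapply(side, function(s) {
    switch(as.character(s),
      "-1" = "left",                 # only the lower limit can increase
      "1"  = "right",                # only the upper limit can decrease
      "0"  = c("left", "right"),     # no constraint
      stop("peeling.side must be -1, 0 or 1"))
  })
  names(out) <- var.names
  out
}

# x1 may only be peeled from the left, x2 from either side
allowed_peels(c(-1, 0), c("x1", "x2"))
```

With the package itself, the corresponding call would pass the same vector, e.g. peeling(y, x, peeling.side = c(-1, 0)).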

A prim object which is a list with the following elements:

`npeel`	The number of peeling iterations performed.
`support`	A vector of length `npeel + 1` containing the support of each successive peeled box.
`yfun`	A vector of length `npeel + 1` containing the objective function value of each successive peeled box.
`limits`	A list of length `npeel + 1` containing the limits of each successive box. Each limit is a list with one element per input variable.
`x,y`	The input and response data used in the algorithm.
`numeric.vars`	A logical vector indicating, for each input variable, if it was considered as a numeric variable.
`alpha, peeling.side, obj.fun`	The values of the arguments used for peeling. Useful for prim methods.
`npaste`	The number of pasting iterations performed. Should be 0 here, but useful for `pasting`.

Note that the first box in a prim object is the starting box containing the whole dataset. This is why the limits, yfun and support elements have length npeel + 1.

Friedman, J.H., Fisher, N.I., 1999. Bump hunting in high-dimensional data. Statistics and Computing 9, 123-143. https://doi.org/10.1023/A:1008894516817

extract.box to extract information about a particular box in a prim object. plot_trajectory and plot_box to explore the peeling trajectory. jump.prim to automatically choose the best box. predict.prim to predict if new data falls into particular boxes. pasting to carry out the pasting refining the edges of the chosen box.

   # A simple bump
   set.seed(12345)
   x <- matrix(runif(2000), ncol = 2, dimnames = list(NULL, c("x1", "x2")))
   y <- 2 * x[,1] + 5 * x[,2] + 10 * (x[,1] >= .8 & x[,2] >= .5) + 
     rnorm(1000)
   # Peeling with alpha = 0.05 and beta.stop = 0.05
   peel_res <- peeling(y, x, beta.stop = 0.05)
   # Automatically choose the best box
   chosen <- jump.prim(peel_res)
   # Plot the resulting box
   plot_box(peel_res, pch = 16, ypalette = hcl.colors(10), 
     support = chosen$final.box$support, box.args = list(lwd = 2))

   # Examples of directed peeling
   set.seed(12345)
   x <- matrix(runif(2000), ncol = 2, dimnames = list(NULL, c("x1", "x2")))
   y <- 10 * (x[,1] <= .2 & x[,2] <= .2) + 10 * (x[,1] >= .8 & x[,2] >= .8) +
     rnorm(1000)
   # Left peeling
   peel_left <- peeling(y, x, peeling.side = -1)
   chosen <- jump.prim(peel_left)
   plot_box(peel_left, pch = 16, ypalette = hcl.colors(10), 
     support = chosen$final.box$support, box.args = list(lwd = 2),
     main = "Left peeling")
   # Right peeling
   peel_right <- peeling(y, x, peeling.side = 1)
   chosen <- jump.prim(peel_right)
   plot_box(peel_right, pch = 16, ypalette = hcl.colors(10), 
     support = chosen$final.box$support, box.args = list(lwd = 2),
     main = "Right peeling")

   # User-defined objective function to minimize the mean
   set.seed(3333)
   x <- matrix(runif(2000), ncol = 2, dimnames = list(NULL, c("x1", "x2")))
   y <- - 10 * (x[,1] <= .2 & x[,2] <= .2) + 10 * (x[,1] >= .8 & x[,2] >= .8) +
     rnorm(1000)
   peel_res <- peeling(y, x, obj.fun = function(x) -mean(x))
   chosen <- jump.prim(peel_res)
   plot_box(peel_res, pch = 16, ypalette = hcl.colors(10), 
     support = chosen$final.box$support, box.args = list(lwd = 2))

   # User-defined function maximizing the slope of a linear regression
   set.seed(5555)
   x <- runif(500)
   ym <- 0.5 * x + 5 * (x - 0.7) * (x >= 0.7)
   y <- ym + rnorm(500, sd = 0.1)    
   peel_res <- peeling(y, x, beta.stop = 0.1, 
     obj.fun = function(y, x, inbox){
       dat <- data.frame(y, x)
       coef(lm(y ~ x, data = dat[inbox,]))[2]
   })   
   par(mfrow = c(1,2))
   plot_trajectory(peel_res, type = "b", pch = 16, col = "cornflowerblue", 
     support = 0.3, abline.pars = list(lwd = 2, col = "indianred"))
   plot_box(peel_res, pch = 16, ypalette = hcl.colors(10), 
     support = 0.3, box.args = list(lwd = 2))
   lines(sort(x), ym[order(x)], col = "red", lwd = 2)