PRIM: Combined Function for the Patient Rule Induction Method...

Description

This function is an automated implementation of PRIM as suggested by Friedman and Fisher (1999). It includes multiple peeling (PRIM_peel_bs), pasting (PRIM_paste) and the covering strategy to find more than one box.

Usage

PRIM(formula, data, f_min, beta_min = 0.2, max_boxes = Inf,
  peel_alpha = seq(0.01, 0.4, 0.03), B = 0, target = mean,
  alter_crit = TRUE, use_NAs = TRUE, seed, print_position = FALSE,
  paste_alpha = 0.01, max_steps = 50, stop_by_dec = TRUE)

Arguments

formula

an object of class "formula" with a response but no interaction terms. It indicates the response over which the target function should be maximized and the covariates that are used for the later box definitions.

data

an object of class data.frame containing the variables named in the formula.

f_min

minimum target the final box must have. Among all boxes that fulfill this criterion, the one with the largest support is chosen after the peeling. If this argument is missing, the box with the largest target and a support of at least beta_min is chosen.

beta_min

minimum support that one box must have. This proportion always refers to the whole data set.

max_boxes

maximum number of boxes to be found.

peel_alpha

vector of a sequence of different alpha-fractions used for the peelings.

B

number of bootstrap samples to which the peeling is applied for each alpha. For B = 0 no bootstrap samples are drawn.

target

target function to be maximized. In most cases the mean is a useful target, although other functions, e.g. the median, are also possible.

alter_crit

logical. If TRUE the alternative criterion is used for peeling, i.e. "target/beta" is maximized during peeling instead of "target", so that large subboxes are not preferentially peeled off. This is especially important for nominal covariates.

use_NAs

logical. If TRUE observations with missing values are included in the analysis.

seed

seed to be set before the first iteration. Only useful for B > 0.

print_position

logical. If TRUE the current position of the algorithm is printed out.

paste_alpha

alpha-fraction that is pasted to the box at each pasting step.

max_steps

maximum number of pasting steps the function should make.

stop_by_dec

logical. If TRUE the pasting stops as soon as the target at one step is lower than the target of the previous step.

Details

This function repeats the peeling and pasting algorithm with the same settings of the metaparameters until a stop criterion is reached. After each iteration the observations already included in a box are removed from the data on which the next box is built. This strategy is called covering. The iteration stops if either max_boxes is reached or the target function of the "best" box is lower than the overall target.
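The covering strategy can be sketched in base R as follows. This is only an illustration under stated assumptions, not the package's implementation: find_box is a hypothetical stand-in for the peeling and pasting step (a naive threshold search on a single covariate), the "overall target" is taken to be the mean of the observations still available, and all names are invented for the example.

```r
# Toy data: the response is more likely to be 1 for large x.
set.seed(1)
n <- 300
x <- runif(n)
y <- rbinom(n, size = 1, prob = x)

# Hypothetical stand-in for peeling + pasting (NOT the PRIM package code):
# among the remaining observations, pick the threshold t whose box {x >= t}
# has the highest mean response while keeping a support of at least
# beta_min relative to the whole data set.
find_box <- function(x, y, remaining, n_total, beta_min) {
  best <- NULL
  best_f <- -Inf
  for (t in sort(unique(x[remaining]))) {
    inbox <- remaining & x >= t
    if (sum(inbox) / n_total < beta_min) next
    f <- mean(y[inbox])
    if (f > best_f) {
      best_f <- f
      best <- inbox
    }
  }
  best  # logical vector over all observations, or NULL if no box qualifies
}

covering <- function(x, y, beta_min = 0.1, max_boxes = 3) {
  remaining <- rep(TRUE, length(y))
  subsets <- list()
  while (length(subsets) < max_boxes) {
    inbox <- find_box(x, y, remaining, length(y), beta_min)
    # Assumption: stop when the best box is not strictly better than the
    # target of the observations that are still available.
    if (is.null(inbox) || mean(y[inbox]) <= mean(y[remaining])) break
    subsets[[length(subsets) + 1]] <- inbox
    remaining <- remaining & !inbox  # covering: drop boxed observations
  }
  # As in the documented return value, the last element of f and beta
  # describes the observations not lying in any box.
  list(
    f = c(vapply(subsets, function(s) mean(y[s]), numeric(1)),
          mean(y[remaining])),
    beta = c(vapply(subsets, mean, numeric(1)), mean(remaining)),
    subsets = subsets
  )
}

res <- covering(x, y)
res$f
res$beta
```

Because every box is built only from observations that are not yet covered, the boxes are disjoint and the supports in res$beta, including the remainder, add up to 1.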

In each iteration step this function performs a multiple peeling characterized by the sequence peel_alpha and B. From the peeling output the box defined by beta_min and f_min is chosen. After that the pasting function searches for boxes with larger supports and larger targets and takes the one with the highest target function within the box.
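A single peeling step in the sense of Friedman and Fisher (1999) can be sketched in base R as follows. The sketch handles numeric covariates only and uses the plain target criterion, not the alternative criterion selected by alter_crit. The function peel_once and its arguments are hypothetical names chosen for illustration and are not part of the package.

```r
# One peeling step (illustrative sketch, NOT the PRIM package code):
# for every numeric covariate, try removing the alpha-fraction of in-box
# observations with the smallest or the largest values, and keep the
# candidate whose remaining box has the highest target.
peel_once <- function(X, y, inbox, alpha = 0.05, target = mean) {
  best <- list(f = -Inf, inbox = inbox,
               var = NA_character_, side = NA_character_)
  for (v in names(X)) {
    xv <- X[[v]][inbox]
    lo <- quantile(xv, alpha, names = FALSE)
    hi <- quantile(xv, 1 - alpha, names = FALSE)
    cands <- list(
      lower = inbox & X[[v]] > lo,  # peel off the lower alpha-fraction
      upper = inbox & X[[v]] < hi   # peel off the upper alpha-fraction
    )
    for (side in names(cands)) {
      keep <- cands[[side]]
      if (!any(keep)) next
      f <- target(y[keep])
      if (f > best$f) {
        best <- list(f = f, inbox = keep, var = v, side = side)
      }
    }
  }
  best
}

# Toy data in which the response increases with x1 only.
set.seed(42)
X <- data.frame(x1 = runif(400), x2 = runif(400))
y <- X$x1 + rnorm(400, sd = 0.1)

step <- peel_once(X, y, inbox = rep(TRUE, nrow(X)))
step$var
step$side
sum(step$inbox)
```

Since the response in this toy data depends on x1 only, the step is expected to peel the lower end of x1. Repeating the step on step$inbox until the support falls below a minimum yields a peeling trajectory from which a box can then be selected.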

The function can also cope with survival outcomes (Surv-object). In that case the hazard rate is used as target function, as suggested in Ott and Hapfelmeier (2017), and the value of the input parameter target is ignored.

Value

PRIM returns an object of class "prim", which is a list containing the following components:

f

vector of the target functions evaluated on each box. The last element is the target of all observations not lying in a box.

beta

vector of the supports of each box. The last element is the fraction of observations not lying in a box.

box

a data.frame defining the borders of the boxes. Each row belongs to one box. The columns with "min." and "max." describe the lower and upper boundaries of the covariates that are at least ordinal. The value reported for a boundary is the last one that is not included in the current box.

For the nominal variables there are columns for every category they can take. If the category is removed from the box the value FALSE is taken. The names of these columns are structured like: <variable name>.<category>

For each variable with missing values (only if use_NAs = TRUE) there is also a column taking the value FALSE if the NAs of this variable are removed from the current box. The names of these columns are structured like: <variable name>.NA

box_metric, box_nom, box_na

easier to handle definitions of the boxes for other functions

subsets

list of logical vectors indicating the subsets (i.e. the observations that lie in each box)

fixboxes

list of all fixbox objects defining the final boxes.

data_orig

original dataset that is used.

References

Friedman, J. H. and Fisher, N. I., 'Bump hunting in high-dimensional data', Statistics and Computing 9 (2) (1999), 123-143

Ott, A. and Hapfelmeier, A., 'Nonparametric Subgroup Identification by PRIM and CART: A Simulation and Application Study', Computational and Mathematical Methods in Medicine, vol. 2017 (2017), 17 pages, Article ID 5271091

See Also

PRIM_peel_bs, PRIM_paste, define_fixbox

Examples

# generating random data:
set.seed(123)
n <- 500
x1 <- runif(n = n, min = -1)
x2 <- runif(n = n, min = -1)
x3 <- runif(n = n, min = -1)
cat <- as.factor(sample(c("a","b","c", "d"), size = n, replace = TRUE))
wsk <- (1-sqrt(x1^2+x2^2)/sqrt(2))
y <- as.logical(rbinom(n = n, prob = wsk, size = 1))
dat <- cbind.data.frame(y, x1, x2, x3, cat)
#plot(dat$x1, dat$x2, col=dat$y+1, pch=16)
remove(x1, x2, x3, y, wsk, cat, n)

# apply the PRIM function to find the best boxes with a support of at least 0.1:
p <- PRIM(y~., data=dat, beta_min = 0.1, max_boxes = 3)
p

ao90/PRIM documentation built on May 5, 2019, 8:01 p.m.