partysplit: Binary and Multiway Splits In partykit: A Toolkit for Recursive Partytioning

Description

A class for representing multiway splits and functions for computing on splits.

Usage

```r
partysplit(varid, breaks = NULL, index = NULL, right = TRUE,
    prob = NULL, info = NULL)
kidids_split(split, data, vmatch = 1:ncol(data), obs = NULL)
character_split(split, data = NULL,
    digits = getOption("digits") - 2)
varid_split(split)
breaks_split(split)
index_split(split)
right_split(split)
prob_split(split)
info_split(split)
```

Arguments

| Argument | Description |
|---|---|
| `varid` | an integer specifying the variable to split in, i.e., a column number in `data`. |
| `breaks` | a numeric vector of split points. |
| `index` | an integer vector containing a contiguous sequence from one to the number of kid nodes. May contain `NA`s. |
| `right` | a logical, indicating if the intervals defined by `breaks` should be closed on the right (and open on the left) or vice versa. |
| `prob` | a numeric vector representing a probability distribution over kid nodes. |
| `info` | additional information. |
| `split` | an object of class `partysplit`. |
| `data` | a `list` or `data.frame`. |
| `vmatch` | a permutation of the variable numbers in `data`. |
| `obs` | a logical or integer vector indicating a subset of the observations in `data`. |
| `digits` | minimal number of significant digits. |

Details

A split is essentially a function that maps data, more specifically a partitioning variable, to a set of integers indicating the kid nodes to which observations are sent. Objects of class `partysplit` describe such a function and can be set up via the `partysplit()` constructor. The variables are available in a `list` or `data.frame` (here called `data`), and `varid` specifies the partitioning variable, i.e., the variable or list element to split in. The constructor `partysplit()` does not have access to the actual data, i.e., it does not estimate splits.
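As a minimal sketch of the constructor and accessors described above (assuming the partykit package is installed; the variable number `1L` and the break point `5` are arbitrary illustration values), a split can be described without any data and then inspected:

```r
library("partykit")

## describe a binary split in variable number 1 at the value 5;
## no data is involved at this point
sp <- partysplit(varid = 1L, breaks = 5)

class(sp)          # "partysplit"
varid_split(sp)    # the variable number, 1
breaks_split(sp)   # the split point, 5
right_split(sp)    # TRUE: intervals are closed on the right
```

The variable number is written as the integer literal `1L` because `varid` is documented as an integer.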

`kidids_split(split, data)` partitions the data `data[obs, varid_split(split)]` and assigns an integer (the kid node number) to each observation. If `vmatch` is given, the variable `vmatch[varid_split(split)]` is used instead.
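The roles of `obs` and `vmatch` can be illustrated with a short sketch, assuming partykit is installed and using the `iris` data from the datasets package. The object names `sl5` and `iris_rev` are chosen here for illustration only.

```r
library("partykit")
data("iris", package = "datasets")

## Sepal.Length is column 1 of iris
sl5 <- partysplit(1L, breaks = 5)

## kid node numbers for the first six observations only
kidids_split(sl5, data = iris, obs = 1:6)
## 2 1 1 1 1 2

## same split applied to a data frame with reversed column order:
## vmatch[1] = 5 gives the position of Sepal.Length in iris_rev
iris_rev <- iris[, 5:1]
identical(kidids_split(sl5, data = iris),
          kidids_split(sl5, data = iris_rev, vmatch = 5:1))
## TRUE
```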

`character_split()` returns a character representation of its `split` argument. The remaining functions defined here are accessor functions for `partysplit` objects.

The numeric vector `breaks` defines how the range of the partitioning variable (after coercion to numeric via `as.numeric`) is divided into intervals (as in `cut`) and may be `NULL`. These intervals are represented by the numbers one to `length(breaks) + 1`.
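The interval numbering can be reproduced with base R alone. The following sketch uses `cut()` directly and only mirrors the description above; it is not taken from the package internals, and the vector `x` is made-up illustration data.

```r
x <- c(2.0, 3.0, 3.2, 3.5, 4.1)
breaks <- c(3, 3.5)

## intervals closed on the right: (-Inf, 3], (3, 3.5], (3.5, Inf]
right_closed <- cut(x, breaks = c(-Inf, breaks, Inf),
                    labels = FALSE, right = TRUE)
right_closed
## 1 1 2 2 3

## intervals closed on the left: [-Inf, 3), [3, 3.5), [3.5, Inf)
left_closed <- cut(x, breaks = c(-Inf, breaks, Inf),
                   labels = FALSE, right = FALSE)
left_closed
## 1 2 2 3 3
```

Two break points yield `length(breaks) + 1 = 3` intervals, and the `right` argument only changes where values exactly equal to a break point end up.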

`index` assigns each of these `length(breaks) + 1` intervals to one of at least two kid nodes. Thus, `index` is a vector of integers where each element corresponds to one element in a list `kids` containing `partynode` objects; see `partynode` for details. The vector `index` may contain `NA`s; in that case, the corresponding values of the splitting variable are treated as missing (for example, factor levels that are not present in the learning sample). Either `breaks` or `index` must be given. When `breaks` is `NULL`, it is assumed that the partitioning variable itself has storage mode `integer` (e.g., is a `factor`).
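The mapping from intervals (or factor levels) to kid nodes amounts to plain vector indexing. The base R sketch below illustrates the idea with made-up values; the variable names are hypothetical and not part of the package.

```r
## interval numbers for five observations (three intervals in total)
interval <- c(1L, 1L, 2L, 2L, 3L)

## reverse the order: interval 1 goes to kid 3, interval 3 to kid 1
index_rev <- 3:1
kids_numeric <- index_rev[interval]
kids_numeric
## 3 3 2 2 1

## factor without breaks: the integer level codes are used directly;
## level "b" is mapped to NA and is therefore treated as missing
f <- factor(c("a", "b", "c", "b"), levels = c("a", "b", "c"))
index_fac <- c(1L, NA, 2L)
kids_factor <- index_fac[as.integer(f)]
kids_factor
## 1 NA 2 NA
```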

`prob` defines a probability distribution over all kid nodes which is used for random splitting when a deterministic split isn't possible (due to missing values, for example).
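One way such a distribution could be used is sketched below in base R. This illustrates random assignment under `prob` only; it is not a copy of how partykit handles missing values internally, and the vectors `kid` and `prob` are made-up values.

```r
set.seed(1)

## kid node numbers, with NA where no deterministic split was possible
kid <- c(1L, NA, 2L, NA, 1L)

## probability distribution over the two kid nodes
prob <- c(0.25, 0.75)

## draw a kid node at random for each undetermined observation
miss <- is.na(kid)
kid[miss] <- sample(seq_along(prob), size = sum(miss),
                    replace = TRUE, prob = prob)
kid
```

Observations with a deterministic kid node keep it; only the `NA` entries are filled in by sampling.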

`info` takes arbitrary user-specified information.

Value

The constructor `partysplit()` returns an object of class `partysplit`:

| Component | Description |
|---|---|
| `varid` | an integer specifying the variable to split in, i.e., a column number in `data`, |
| `breaks` | a numeric vector of split points, |
| `index` | an integer vector containing a contiguous sequence from one to the number of kid nodes, |
| `right` | a logical, indicating if the intervals defined by `breaks` should be closed on the right (and open on the left) or vice versa, |
| `prob` | a numeric vector representing a probability distribution over kid nodes, |
| `info` | additional information. |

`kidids_split()` returns an integer vector describing the partition of the observations into kid nodes.

`character_split()` gives a character representation of the split and the remaining functions return the corresponding slots of `partysplit` objects.

References

Hothorn T, Zeileis A (2015). partykit: A Modular Toolkit for Recursive Partytioning in R. Journal of Machine Learning Research, 16, 3905–3909.

See Also

`cut`

Examples

```r
library("partykit")
data("iris", package = "datasets")

## binary split in numeric variable `Sepal.Length'
sl5 <- partysplit(which(names(iris) == "Sepal.Length"),
    breaks = 5)
character_split(sl5, data = iris)
table(kidids_split(sl5, data = iris), iris$Sepal.Length <= 5)

## multiway split in numeric variable `Sepal.Width',
## higher values go to the first kid, smallest values
## to the last kid
sw23 <- partysplit(which(names(iris) == "Sepal.Width"),
    breaks = c(3, 3.5), index = 3:1)
character_split(sw23, data = iris)
## cross-tabulate against the same break points as the split
table(kidids_split(sw23, data = iris),
    cut(iris$Sepal.Width, breaks = c(-Inf, 3, 3.5, Inf)))

## binary split in factor `Species'
sp <- partysplit(which(names(iris) == "Species"),
    index = c(1L, 1L, 2L))
character_split(sp, data = iris)
table(kidids_split(sp, data = iris), iris$Species)

## multiway split in factor `Species'
sp <- partysplit(which(names(iris) == "Species"), index = 1:3)
character_split(sp, data = iris)
table(kidids_split(sp, data = iris), iris$Species)

## multiway split in numeric variable `Sepal.Width'
sp <- partysplit(which(names(iris) == "Sepal.Width"),
    breaks = quantile(iris$Sepal.Width))
character_split(sp, data = iris)
```