Binary and Multiway Splits

Share:

Description

A class for representing multiway splits and functions for computing on splits.

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
partysplit(varid, breaks = NULL, index = NULL, right = TRUE, 
    prob = NULL, info = NULL)
kidids_split(split, data, vmatch = 1:ncol(data), obs = NULL)
character_split(split, data = NULL, 
    digits = getOption("digits") - 2)
varid_split(split)
breaks_split(split)
index_split(split)
right_split(split)
prob_split(split)
info_split(split)

Arguments

varid

an integer specifying the variable to split in, i.e., a column number in data.

breaks

a numeric vector of split points.

index

an integer vector containing a contiguous sequence from one to the number of kid nodes. May contain NAs.

right

a logical, indicating if the intervals defined by breaks should be closed on the right (and open on the left) or vice versa.

prob

a numeric vector representing a probability distribution over kid nodes.

info

additional information.

split

an object of class partysplit.

data

a list or data.frame.

vmatch

a permutation of the variable numbers in data.

obs

a logical or integer vector indicating a subset of the observations in data.

digits

minimal number of significant digits.

Details

A split is basically a function that maps data, more specifically a partitioning variable, to a set of integers indicating the kid nodes to send observations to. Objects of class partysplit describe such a function and can be set-up via the partysplit() constructor. The variables are available in a list or data.frame (here called data) and varid specifies the partitioning variable, i.e., the variable or list element to split in. The constructor partysplit() doesn't have access to the actual data, i.e., doesn't estimate splits.

kidids_split(split, data) actually partitions the data data[obs,varid_split(split)] and assigns an integer (giving the kid node number) to each observation. If vmatch is given, the variable vmatch[varid_split(split)] is used.

character_split() returns a character representation of its split argument. The remaining functions defined here are accessor functions for partysplit objects.

The numeric vector breaks defines how the range of the partitioning variable (after coercing to a numeric via as.numeric) is divided into intervals (like in cut) and may be NULL. These intervals are represented by the numbers one to length(breaks) + 1.

index assigns these length(breaks) + 1 intervals to one of at least two kid nodes. Thus, index is a vector of integers where each element corresponds to one element in a list kids containing partynode objects, see partynode for details. The vector index may contain NAs, in that case, the corresponding values of the splitting variable are treated as missings (for example factor levels that are not present in the learning sample). Either breaks or index must be given. When breaks is NULL, it is assumed that the partitioning variable itself has storage mode integer (e.g., is a factor).

prob defines a probability distribution over all kid nodes which is used for random splitting when a deterministic split isn't possible (due to missing values, for example).

info takes arbitrary user-specified information.

Value

The constructor partysplit() returns an object of class partysplit:

varid

an integer specifying the variable to split in, i.e., a column number in data,

breaks

a numeric vector of split points,

index

an integer vector containing a contiguous sequence from one to the number of kid nodes,

right

a logical, indicating if the intervals defined by breaks should be closed on the right (and open on the left) or vice versa

prob

a numeric vector representing a probability distribution over kid nodes,

info

additional information.

kidids_split() returns an integer vector describing the partition of the observations into kid nodes.

character_split() gives a character representation of the split and the remaining functions return the corresponding slots of partysplit objects.

References

Hothorn T, Zeileis A (2015). partykit: A Modular Toolkit for Recursive Partytioning in R. Journal of Machine Learning Research, 16, 3905–3909.

See Also

cut

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
data("iris", package = "datasets")

## binary split in numeric variable `Sepal.Length'
sl5 <- partysplit(which(names(iris) == "Sepal.Length"),
    breaks = 5)
character_split(sl5, data = iris)
table(kidids_split(sl5, data = iris), iris$Sepal.Length <= 5)

## multiway split in numeric variable `Sepal.Width', 
## higher values go to the first kid, smallest values 
## to the last kid
sw23 <- partysplit(which(names(iris) == "Sepal.Width"),    
    breaks = c(3, 3.5), index = 3:1)	
character_split(sw23, data = iris)    
table(kidids_split(sw23, data = iris), 
    cut(iris$Sepal.Width, breaks = c(-Inf, 2, 3, Inf)))   

## binary split in factor `Species'
sp <- partysplit(which(names(iris) == "Species"),
    index = c(1L, 1L, 2L))
character_split(sp, data = iris)
table(kidids_split(sp, data = iris), iris$Species)

## multiway split in factor `Species'
sp <- partysplit(which(names(iris) == "Species"), index = 1:3)
character_split(sp, data = iris)
table(kidids_split(sp, data = iris), iris$Species)

## multiway split in numeric variable `Sepal.Width'
sp <- partysplit(which(names(iris) == "Sepal.Width"), 
    breaks = quantile(iris$Sepal.Width))
character_split(sp, data = iris)