Recursive partitioning for continuous, censored, ordered, nominal and multivariate response variables in a conditional inference framework.
formula 
a symbolic description of the model to be fit. 
data 
a data frame containing the variables in the model. 
subset 
an optional vector specifying a subset of observations to be used in the fitting process. 
weights 
an optional vector of weights to be used in the fitting process. Only non-negative integer-valued weights are allowed. 
offset 
an optional vector of offset values. 
cluster 
an optional factor indicating independent clusters. Highly experimental, use at your own risk. 
na.action 
a function which indicates what should happen when the data contain missing values. 
control 
a list with control parameters, see ctree_control.
ytrafo 
an optional named list of functions to be applied to the response
variable(s) before testing their association with the explanatory
variables. Note that this transformation is only
performed once for the root node and does not take weights into account.
Alternatively, 
converged 
an optional function for checking user-defined criteria before splits are implemented. This is not to be used and is very likely to change. 
scores 
an optional named list of scores to be attached to ordered factors. 
... 
arguments passed to 
Function partykit::ctree is a reimplementation of (most of) party::ctree employing the new party infrastructure of the partykit package. Although the new code has been tested extensively, it is not yet as mature as the old code. If you notice differences in the structure or predictions of the resulting trees, please contact the package maintainers. See also vignette("ctree", package = "partykit") for some remarks about the internals of the different implementations.
Conditional inference trees estimate a regression relationship by binary recursive partitioning in a conditional inference framework. Roughly, the algorithm works as follows: 1) Test the global null hypothesis of independence between any of the input variables and the response (which may be multivariate as well). Stop if this hypothesis cannot be rejected. Otherwise select the input variable with the strongest association to the response. This association is measured by a p-value corresponding to a test for the partial null hypothesis of independence between a single input variable and the response. 2) Implement a binary split in the selected input variable. 3) Recursively repeat steps 1) and 2).
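Step 1) can be illustrated outside of ctree with the coin package. The following is only a rough sketch, not the internal code of ctree: it assumes that coin is installed, uses its independence_test function with a quadratic test statistic, and applies a plain Bonferroni adjustment via p.adjust, which may differ slightly from the adjustment ctree applies internally. The names airq, inputs, pvals and selected are chosen for illustration only.

```r
library("coin")

## complete cases of the response, as in the regression example
airq <- subset(airquality, !is.na(Ozone))
inputs <- setdiff(names(airq), "Ozone")

## one permutation test of independence per input variable
## (asymptotic p-values; rows with a missing input are dropped here)
pvals <- sapply(inputs, function(v) {
  dat <- airq[!is.na(airq[[v]]), c("Ozone", v)]
  pvalue(independence_test(as.formula(paste("Ozone ~", v)),
                           data = dat, teststat = "quadratic"))
})

## global test: reject if the smallest adjusted p-value is small enough
padj <- p.adjust(pvals, method = "bonferroni")
split_root <- min(padj) < 0.05

## variable selection: the input with the smallest univariate p-value
selected <- names(pvals)[which.min(pvals)]
```

If split_root is TRUE, a tree would split the root node in the variable stored in selected and repeat the procedure in each daughter node.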
The implementation utilizes a unified framework for conditional inference, or permutation tests, developed by Strasser and Weber (1999). The stop criterion in step 1) is based either on multiplicity-adjusted p-values (testtype = "Bonferroni" in ctree_control) or on the univariate p-values (testtype = "Univariate"). In both cases, the criterion is maximized, i.e., 1 - p-value is used. A split is implemented when the criterion exceeds the value given by mincriterion as specified in ctree_control. For example, when mincriterion = 0.95, the p-value must be smaller than 0.05 in order to split this node. This statistical approach ensures that the right-sized tree is grown without additional (post-)pruning or cross-validation.
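The effect of mincriterion can be seen by fitting the same data with two thresholds. This is a minimal sketch using the airquality data from the examples; the value 0.999 is an arbitrary choice for illustration, and the object names ct_default and ct_strict are not part of the package.

```r
library("partykit")

airq <- subset(airquality, !is.na(Ozone))

## default control: Bonferroni-adjusted p-values, split if 1 - p-value > 0.95
ct_default <- ctree(Ozone ~ ., data = airq)

## stricter threshold: split only if the adjusted p-value is below 0.001
ct_strict <- ctree(Ozone ~ ., data = airq,
                   control = ctree_control(testtype = "Bonferroni",
                                           mincriterion = 0.999))

## number of nodes in each tree
c(default = length(ct_default), strict = length(ct_strict))
```

A larger mincriterion demands stronger evidence before each split, so the stricter tree is expected to have at most as many nodes as the default one.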
The level of mincriterion can either be chosen to suit the size of the data set (0.95 is typically appropriate for small to moderately sized data sets) or be treated as a hyperparameter (see Section 3.4 in Hothorn, Hornik and Zeileis, 2006).
The selection of the input variable to split in is based on the univariate p-values, which avoids a variable selection bias towards input variables with many possible cutpoints. The test statistics in each of the nodes can be extracted with the sctest method. (Note that the generic is in the strucchange package, so either that package needs to be loaded or sctest.constparty has to be called directly.) In cases where splitting stops due to the sample size (e.g., minsplit or minbucket etc.), the test results may be empty.
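As a brief sketch of extracting the test results, the following calls the sctest generic from strucchange on a fitted tree. It assumes that strucchange is installed and that the method for constparty objects is found when the generic is called through the strucchange namespace; the node argument used to restrict the output to the root node is likewise an assumption about the method's interface.

```r
library("partykit")

airq <- subset(airquality, !is.na(Ozone))
airct <- ctree(Ozone ~ ., data = airq)

if (requireNamespace("strucchange", quietly = TRUE)) {
  ## test statistics and p-values for all input variables in the root node
  print(strucchange::sctest(airct, node = 1))
}
```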
Predictions can be computed using predict, which returns predicted means, predicted classes or median predicted survival times, as well as more information about the conditional distribution of the response, i.e., class probabilities or predicted Kaplan-Meier curves. For observations with zero weights, predictions are computed from the fitted tree when newdata = NULL.
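The different kinds of predictions can be requested through the type argument of predict. The sketch below uses the iris data from the examples; the three rows in new_obs are picked arbitrarily, one per species.

```r
library("partykit")

irisct <- ctree(Species ~ ., data = iris)
new_obs <- iris[c(1, 51, 101), ]

## predicted classes
resp <- predict(irisct, newdata = new_obs, type = "response")

## estimated class probabilities
prob <- predict(irisct, newdata = new_obs, type = "prob")

## ids of the terminal nodes the observations fall into
node <- predict(irisct, newdata = new_obs, type = "node")
```

Calling predict(irisct) without newdata returns predictions for the learning sample.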
By default, the scores for each ordinal factor x are 1:length(x); this may be changed for variables in the formula using scores = list(x = c(1, 5, 6)), for example.
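A small sketch of attaching scores to an ordered factor follows. The data frame d is simulated purely for illustration (it is not part of any package), and the scores c(1, 5, 6) are the ones from the text above.

```r
library("partykit")

## simulated data: an ordered factor x with three levels and a numeric response y
set.seed(1)
lev <- c("low", "mid", "high")
d <- data.frame(x = ordered(sample(lev, 200, replace = TRUE), levels = lev))
d$y <- rnorm(200, mean = c(0, 2, 2.5)[as.integer(d$x)])

## default scores 1, 2, 3 for the levels of x
ct_default <- ctree(y ~ x, data = d)

## user-defined scores: "mid" and "high" are treated as close to each other
ct_scored <- ctree(y ~ x, data = d, scores = list(x = c(1, 5, 6)))
```

The names of the scores list must match the names of ordered factors appearing in the formula.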
For a general description of the methodology see Hothorn, Hornik and Zeileis (2006) and Hothorn, Hornik, van de Wiel and Zeileis (2006).
An object of class party.
Hothorn T, Hornik K, Van de Wiel MA, Zeileis A (2006). A Lego System for Conditional Inference. The American Statistician, 60(3), 257–263.
Hothorn T, Hornik K, Zeileis A (2006). Unbiased Recursive Partitioning: A Conditional Inference Framework. Journal of Computational and Graphical Statistics, 15(3), 651–674.
Hothorn T, Zeileis A (2015). partykit: A Modular Toolkit for Recursive Partytioning in R. Journal of Machine Learning Research, 16, 3905–3909.
Strasser H, Weber C (1999). On the Asymptotic Theory of Permutation Statistics. Mathematical Methods of Statistics, 8, 220–250.
### regression
airq <- subset(airquality, !is.na(Ozone))
airct <- ctree(Ozone ~ ., data = airq)
airct
plot(airct)
### mean squared error on the learning sample
mean((airq$Ozone - predict(airct))^2)
### classification
irisct <- ctree(Species ~ ., data = iris)
irisct
plot(irisct)
table(predict(irisct), iris$Species)
### estimated class probabilities
tr <- predict(irisct, newdata = iris[1:10,], type = "prob")
### survival analysis
if (require("TH.data") && require("survival") &&
    require("coin") && require("Formula")) {
  data("GBSG2", package = "TH.data")
  (GBSG2ct <- ctree(Surv(time, cens) ~ ., data = GBSG2))
  predict(GBSG2ct, newdata = GBSG2[1:2,], type = "response")
  plot(GBSG2ct)
  ### with weight-dependent log-rank scores
  ### log-rank trafo for observations in this node only (= weights > 0)
  h <- function(formula, data, ...) {
    f <- Formula(formula)
    mf <- model.frame(formula = f, data = data)
    s <- model.part(f, data = mf, lhs = 1, rhs = 0)[[1]]
    weights <- 1:nrow(mf)
    return(function(subset, ...) {
      w <- as.integer(weights)
      ### zero weight for observations outside the current node
      w[-subset] <- 0L
      s <- logrank_trafo(s[w > 0,,drop = FALSE])
      r <- rep(0, nrow(mf))
      r[w > 0] <- s
      list(estfun = matrix(as.double(r), ncol = 1), converged = TRUE)
    })
  }
  ### very much the same tree
  (ctree(Surv(time, cens) ~ ., data = GBSG2, ytrafo = h))
}
### multivariate responses
airct2 <- ctree(Ozone + Temp ~ ., data = airq)
airct2
plot(airct2)
