pre  R Documentation 
Function pre derives a sparse ensemble of rules and/or linear functions for
prediction of a continuous, binary, count, multinomial, multivariate
continuous or survival response.
pre(formula, data, family = gaussian, use.grad = TRUE, weights, type = "both",
    sampfrac = 0.5, maxdepth = 3L, learnrate = 0.01, mtry = Inf, ntrees = 500,
    confirmatory = NULL, singleconditions = FALSE, winsfrac = 0.025,
    normalize = TRUE, standardize = FALSE, ordinal = TRUE, nfolds = 10L,
    tree.control, tree.unbiased = TRUE, removecomplements = TRUE,
    removeduplicates = TRUE, verbose = FALSE, par.init = FALSE,
    par.final = FALSE, sparse = FALSE, ...)
formula 
a symbolic description of the model to be fit of the form

data 

family 
specifies a glm family object. Can be a character string (i.e.,

use.grad 
logical. Should gradient boosting with regression trees be
employed when 
weights 
optional vector of observation weights to be used for deriving the ensemble. 
type 
character. Specifies type of base learners to include in the
ensemble. Defaults to 
sampfrac 
numeric value > 0 and ≤ 1. Specifies
the fraction of randomly selected training observations used to produce each
tree. Values < 1 will result in sampling without replacement (i.e.,
subsampling), a value of 1 will result in sampling with replacement
(i.e., bootstrap sampling). Alternatively, a sampling function may be supplied,
which should take arguments 
maxdepth 
positive integer. Maximum number of conditions in rules.
If 
learnrate 
numeric value > 0. Learning rate or boosting parameter. 
mtry 
positive integer. Number of randomly selected predictor variables for
creating each split in each tree. Ignored when 
ntrees 
positive integer value. Number of trees to generate for the initial ensemble. 
confirmatory 
character vector. Specifies one or more confirmatory terms
to be included in the final ensemble. Linear terms can be specified as the
name of a predictor variable included in 
singleconditions 

winsfrac 
numeric value ≥ 0 and ≤ 0.5. Quantiles of the data distribution to be used for winsorizing linear terms. If set to 0, no winsorizing is performed. Note that ordinal variables are included as linear terms in estimating the regression model and will also be winsorized. 
normalize 
logical. Normalize linear variables before estimating the
regression model? Normalizing gives linear terms the same a priori influence
as a typical rule, by dividing the (winsorized) linear term by 2.5 times its
SD. 
standardize 
logical. Should rules and linear terms be standardized to
have SD equal to 1 before estimating the regression model? This will also
standardize the dummified factors; users are advised to use the default

ordinal 
logical. Should ordinal variables (i.e., ordered factors) be
treated as continuous for generating rules? 
nfolds 
positive integer. Number of cross-validation folds to be used for selecting the optimal value of the penalty parameter λ in selecting the final ensemble. 
tree.control 
list with control parameters to be passed to the tree
fitting function, generated using 
tree.unbiased 
logical. Should an unbiased tree generation algorithm
be employed for rule generation? Defaults to 
removecomplements 
logical. Remove rules from the ensemble which are identical to (1 - an earlier rule)? 
removeduplicates 
logical. Remove rules from the ensemble which are identical to an earlier rule? 
verbose 
logical. Should progress be printed to the command line? 
par.init 
logical. Should parallel 
par.final 
logical. Should parallel 
sparse 
logical. Should sparse design matrices be used? May improve computation times for large datasets. 
... 
Further arguments to be passed to

Note: observations with missing values will be removed prior to analysis (and a warning printed).
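As a rough illustration of the winsfrac and normalize arguments described above, the following base R sketch winsorizes a numeric predictor at the winsfrac and 1 - winsfrac quantiles and then divides it by 2.5 times its SD. The helper names winsorize_linear and normalize_linear are hypothetical and not part of package pre; the package's internal implementation may differ in detail (for example, in how the quantiles are computed).

```r
# Hypothetical helpers for illustration; not functions from package pre.
winsorize_linear <- function(x, winsfrac = 0.025) {
  if (winsfrac == 0) return(x)  # winsfrac = 0: no winsorizing
  lims <- unname(quantile(x, probs = c(winsfrac, 1 - winsfrac)))
  pmin(pmax(x, lims[1]), lims[2])  # pull extreme values in to the limits
}

normalize_linear <- function(x) {
  x / (2.5 * sd(x))  # divide by 2.5 times the SD
}

set.seed(1)
x <- c(rnorm(98), -10, 10)  # two artificial outliers
x_wins <- winsorize_linear(x, winsfrac = 0.025)
x_norm <- normalize_linear(x_wins)

range(x)       # includes -10 and 10
range(x_wins)  # outliers pulled in to the 2.5% and 97.5% quantiles
sd(x_norm)     # 0.4 (up to floating point), i.e., 1 / 2.5
```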
In some cases, duplicated variable names may appear in the model. For example, suppose the first variable is a factor named 'V1' and there are also variables named 'V10' and/or 'V11' and/or 'V12' (etc.). Then, for the factor V1, dummy contrast variables will be created, named 'V10', 'V11', 'V12' (etc.). This yields duplicated variable names, which may cause problems later on, for example in the calculation of predictions and importances. This can be prevented by renaming factor variables with numbers in their name prior to analysis.
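The name clash described above can be reproduced with base R's model.matrix(), under the assumption that dummy variables are named by pasting the factor name and level together, as model.matrix() does with default treatment contrasts. The data and variable names below are made up for illustration.

```r
set.seed(1)
dat <- data.frame(
  V1  = factor(rep(0:2, times = 4)),  # factor with levels "0", "1", "2"
  V11 = rnorm(12),                    # numeric variable that clashes with dummy 'V11'
  y   = rnorm(12)
)

# Dummy coding of factor V1 yields columns 'V11' and 'V12';
# the numeric predictor V11 then appears under the same name.
nms <- colnames(model.matrix(y ~ ., data = dat))
nms
anyDuplicated(nms) > 0  # TRUE

# Renaming the factor prior to analysis avoids the clash.
names(dat)[names(dat) == "V1"] <- "group"
nms_renamed <- colnames(model.matrix(y ~ ., data = dat))
nms_renamed
anyDuplicated(nms_renamed) > 0  # FALSE
```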
The table below provides an overview of combinations of response
variable types, use.grad, tree.unbiased and learnrate settings that are
supported, and the tree induction algorithm that will be employed as a result:
use.grad  tree.unbiased  learnrate  family       tree alg.  Response variable format
TRUE      TRUE           0          gaussian     ctree      Single, numeric (non-integer)
TRUE      TRUE           0          mgaussian    ctree      Multiple, numeric (non-integer)
TRUE      TRUE           0          binomial     ctree      Single, factor with 2 levels
TRUE      TRUE           0          multinomial  ctree      Single, factor with >2 levels
TRUE      TRUE           0          poisson      ctree      Single, integer
TRUE      TRUE           0          cox          ctree      Object of class 'Surv'
TRUE      TRUE           >0         gaussian     ctree      Single, numeric (non-integer)
TRUE      TRUE           >0         mgaussian    ctree      Multiple, numeric (non-integer)
TRUE      TRUE           >0         binomial     ctree      Single, factor with 2 levels
TRUE      TRUE           >0         multinomial  ctree      Single, factor with >2 levels
TRUE      TRUE           >0         poisson      ctree      Single, integer
TRUE      TRUE           >0         cox          ctree      Object of class 'Surv'
FALSE     TRUE           0          gaussian     glmtree    Single, numeric (non-integer)
FALSE     TRUE           0          binomial     glmtree    Single, factor with 2 levels
FALSE     TRUE           0          poisson      glmtree    Single, integer
FALSE     TRUE           >0         gaussian     glmtree    Single, numeric (non-integer)
FALSE     TRUE           >0         binomial     glmtree    Single, factor with 2 levels
FALSE     TRUE           >0         poisson      glmtree    Single, integer
TRUE      FALSE          0          gaussian     rpart      Single, numeric (non-integer)
TRUE      FALSE          0          binomial     rpart      Single, factor with 2 levels
TRUE      FALSE          0          multinomial  rpart      Single, factor with >2 levels
TRUE      FALSE          0          poisson      rpart      Single, integer
TRUE      FALSE          0          cox          rpart      Object of class 'Surv'
TRUE      FALSE          >0         gaussian     rpart      Single, numeric (non-integer)
TRUE      FALSE          >0         binomial     rpart      Single, factor with 2 levels
TRUE      FALSE          >0         poisson      rpart      Single, integer
TRUE      FALSE          >0         cox          rpart      Object of class 'Surv'
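The table can also be read as a simple lookup. The sketch below encodes it in a hypothetical helper, expected_tree_alg() (not part of package pre), which returns the tree algorithm listed for a given combination of settings, or NA when the combination does not appear in the table.

```r
# Hypothetical helper that mirrors the table above; not part of package pre.
expected_tree_alg <- function(family, use.grad = TRUE, tree.unbiased = TRUE,
                              learnrate = 0.01) {
  if (use.grad && tree.unbiased) {
    alg <- "ctree"
    fams <- c("gaussian", "mgaussian", "binomial", "multinomial",
              "poisson", "cox")
  } else if (!use.grad && tree.unbiased) {
    alg <- "glmtree"
    fams <- c("gaussian", "binomial", "poisson")
  } else if (use.grad && !tree.unbiased) {
    alg <- "rpart"
    fams <- c("gaussian", "binomial", "poisson", "cox")
    # multinomial with rpart is listed only for learnrate = 0
    if (learnrate == 0) fams <- c(fams, "multinomial")
  } else {
    # use.grad = FALSE with tree.unbiased = FALSE is not listed
    return(NA_character_)
  }
  if (family %in% fams) alg else NA_character_
}

expected_tree_alg("gaussian")                            # "ctree"
expected_tree_alg("poisson", use.grad = FALSE)           # "glmtree"
expected_tree_alg("cox", tree.unbiased = FALSE)          # "rpart"
expected_tree_alg("multinomial", tree.unbiased = FALSE)  # NA (learnrate > 0)
expected_tree_alg("mgaussian", use.grad = FALSE)         # NA
```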
If an error along the lines of 'factor ... has new levels ...' is encountered,
consult ?rare_level_sampler for explanation and solutions.
An object of class pre. It contains the initial ensemble of
rules and/or linear terms and a range of possible final ensembles.
By default, the final ensemble employed by all other
methods and functions in package pre is selected using the 'minimum
cross-validated error plus 1 standard error' criterion. All functions and
methods for objects of class pre take a penalty.parameter.val
argument, which can be used to select a different criterion.
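As an illustration, a different final ensemble could be requested as sketched below. This assumes that package pre is installed and that penalty.parameter.val accepts the value "lambda.min" (the glmnet convention for the penalty with minimum cross-validated error); consult the documentation of the individual methods for the values they accept. ntrees is reduced here only to keep the example fast.

```r
if (requireNamespace("pre", quietly = TRUE)) {
  library("pre")
  airq <- airquality[complete.cases(airquality), ]
  set.seed(42)
  airq.ens <- pre(Ozone ~ ., data = airq, ntrees = 100)

  # Default criterion: minimum cross-validated error plus 1 standard error
  print(airq.ens)

  # Assumed alternative criterion: minimum cross-validated error
  print(airq.ens, penalty.parameter.val = "lambda.min")
}
```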
Parts of the code for deriving rules from the nodes of trees were copied
with permission from an internal function of the partykit package, written
by Achim Zeileis and Torsten Hothorn.
Fokkema, M. (2020). Fitting prediction rule ensembles with R package pre. Journal of Statistical Software, 92(12), 1-30. doi: 10.18637/jss.v092.i12
Fokkema, M. & Strobl, C. (2020). Fitting prediction rule ensembles to psychological research data: An introduction and tutorial. Psychological Methods, 25(5), 636-652. doi: 10.1037/met0000256, https://arxiv.org/abs/1907.05302
Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. The Annals of Statistics, 29(5), 1189-1232.
Friedman, J. H., & Popescu, B. E. (2008). Predictive learning via rule ensembles. The Annals of Applied Statistics, 2(3), 916-954. doi: 10.1214/07-AOAS148
Hothorn, T., & Zeileis, A. (2015). partykit: A modular toolkit for recursive partytioning in R. Journal of Machine Learning Research, 16, 3905-3909.
print.pre, plot.pre, coef.pre, importance.pre, predict.pre, interact, cvpre
## Fit pre to a continuous response:
airq <- airquality[complete.cases(airquality), ]
set.seed(42)
airq.ens <- pre(Ozone ~ ., data = airq)
airq.ens

## Fit pre to a binary response:
airq2 <- airquality[complete.cases(airquality), ]
airq2$Ozone <- factor(airq2$Ozone > median(airq2$Ozone))
set.seed(42)
airq.ens2 <- pre(Ozone ~ ., data = airq2, family = "binomial")
airq.ens2

## Fit pre to a multivariate continuous response:
airq3 <- airquality[complete.cases(airquality), ]
set.seed(42)
airq.ens3 <- pre(Ozone + Wind ~ ., data = airq3, family = "mgaussian")
airq.ens3

## Fit pre to a multinomial response:
set.seed(42)
iris.ens <- pre(Species ~ ., data = iris, family = "multinomial")
iris.ens

## Fit pre to a survival response:
library("survival")
lung <- lung[complete.cases(lung), ]
set.seed(42)
lung.ens <- pre(Surv(time, status) ~ ., data = lung, family = "cox")
lung.ens

## Fit pre to a count response:
## Generate random data (partly based on Dobson (1990) Page 93: Randomized
## Controlled Trial):
counts <- rep(as.integer(c(18, 17, 15, 20, 10, 20, 25, 13, 12)), times = 10)
outcome <- rep(gl(3, 1, 9), times = 10)
treatment <- rep(gl(3, 3), times = 10)
noise1 <- 1:90
set.seed(1)
noise2 <- rnorm(90)
countdata <- data.frame(treatment, outcome, counts, noise1, noise2)
set.seed(42)
count.ens <- pre(counts ~ ., data = countdata, family = "poisson")
count.ens