pre: Derive a prediction rule ensemble

View source: R/pre.R


Derive a prediction rule ensemble

Description

Function pre derives a sparse ensemble of rules and/or linear functions for prediction of a continuous, binary, count, multinomial, multivariate continuous or survival response.

Usage

pre(
  formula,
  data,
  family = gaussian,
  use.grad = TRUE,
  weights,
  type = "both",
  sampfrac = 0.5,
  maxdepth = 3L,
  learnrate = 0.01,
  mtry = Inf,
  ntrees = 500,
  confirmatory = NULL,
  singleconditions = FALSE,
  winsfrac = 0.025,
  normalize = TRUE,
  standardize = FALSE,
  ordinal = TRUE,
  nfolds = 10L,
  tree.control,
  tree.unbiased = TRUE,
  removecomplements = TRUE,
  removeduplicates = TRUE,
  verbose = FALSE,
  par.init = FALSE,
  par.final = FALSE,
  sparse = FALSE,
  ...
)

Arguments

formula

a symbolic description of the model to be fit of the form y ~ x1 + x2 + ... + xn. The response (left-hand side of the formula) should be of class numeric (for family = "gaussian" or "mgaussian"), integer (for family = "poisson"), or factor (for family = "binomial" or "multinomial"). See Examples below. Note that the minus sign (-) may not be used in the formula to omit the intercept or variables in data, and neither should + 0 be used to omit the intercept. To omit the intercept from the final ensemble, add intercept = FALSE to the call (although omitting the intercept from the final ensemble will only very rarely be appropriate). To omit variables from the final ensemble, make sure they are excluded from data.

data

data.frame containing the variables in the model. The response must be of class factor for classification, numeric for (count) regression, and Surv for survival regression. Input variables must be of class numeric, factor or ordered factor. Otherwise, pre will attempt to recode them.

family

specifies a glm family object. Can be a character string (i.e., "gaussian", "binomial", "poisson", "multinomial", "cox" or "mgaussian"), or a corresponding family object (e.g., gaussian, binomial or poisson, see family). Specification of argument family is strongly advised but not required. If family is not specified, the program will try to make an informed guess, based on the class of the response variable specified in formula. Also see Examples below.

use.grad

logical. Should gradient boosting with regression trees be employed when learnrate > 0? If TRUE, trees fitted by ctree or rpart are employed, as in Friedman (2001), but without the line search. If use.grad = FALSE, glmtree instead of ctree will be employed for rule induction, yielding longer computation times and higher complexity, but possibly higher predictive accuracy. See Details for supported combinations of family, use.grad and learnrate.

weights

optional vector of observation weights to be used for deriving the ensemble.

type

character. Specifies the type of base learners to include in the ensemble. Defaults to "both" (initial ensemble will include both rules and linear functions). Other options are "rules" (prediction rules only) or "linear" (linear functions only).

sampfrac

numeric value > 0 and ≤ 1. Specifies the fraction of randomly selected training observations used to produce each tree. Values < 1 will result in sampling without replacement (i.e., subsampling), a value of 1 will result in sampling with replacement (i.e., bootstrap sampling). Alternatively, a sampling function may be supplied, which should take arguments n (sample size) and weights.
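
For example, a user-supplied subsampling function could look as follows (a minimal sketch; it assumes, beyond what is stated above, that the function should return a vector of row indices, and the 75% subsampling fraction and object names are illustrative only):

## Illustrative subsampling function taking arguments n and weights and
## returning row indices (an assumption) for 75% of the observations:
subsample75 <- function(n, weights) {
  sample(seq_len(n), size = ceiling(0.75 * n), replace = FALSE, prob = weights)
}
airq <- airquality[complete.cases(airquality), ]
set.seed(42)
airq.ens.sub <- pre(Ozone ~ ., data = airq, sampfrac = subsample75)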

maxdepth

positive integer. Maximum number of conditions in rules. If length(maxdepth) == 1, it specifies the maximum depth of each tree grown. If length(maxdepth) == ntrees, it specifies the maximum depth of every consecutive tree grown. Alternatively, a random sampling function may be supplied, which takes argument ntrees and returns integer values. See also maxdepth_sampler.
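
For example (a sketch; airq is constructed as in the Examples below, and maxdepth_sampler is used here with its default settings):

airq <- airquality[complete.cases(airquality), ]
## Fixed maximum depth of 2 for every tree:
set.seed(42)
airq.ens.d2 <- pre(Ozone ~ ., data = airq, maxdepth = 2L)
## Randomly drawn maximum depths, using a sampling function:
set.seed(42)
airq.ens.rd <- pre(Ozone ~ ., data = airq, maxdepth = maxdepth_sampler())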

learnrate

numeric value ≥ 0. Learning rate or boosting parameter.

mtry

positive integer. Number of randomly selected predictor variables for creating each split in each tree. Ignored when tree.unbiased = FALSE.

ntrees

positive integer value. Number of trees to generate for the initial ensemble.

confirmatory

character vector. Specifies one or more confirmatory terms to be included in the final ensemble. Linear terms can be specified as the name of a predictor variable included in data, rules can be specified as, for example, "x1 > 6 & x2 <= 8", where x1 and x2 should be names of variables in data. Terms thus specified will be included in the final ensemble, as their coefficients will not be penalized in the estimation.
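
For example (a sketch; airq is constructed as in the Examples below, and the confirmatory terms shown are illustrative only):

airq <- airquality[complete.cases(airquality), ]
set.seed(42)
## Include a confirmatory linear term (Wind) and a confirmatory rule:
airq.ens.conf <- pre(Ozone ~ ., data = airq,
                     confirmatory = c("Wind", "Temp > 80 & Wind <= 10"))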

singleconditions

TRUE, FALSE or "only". Should rules with multiple conditions be disaggregated? Membership of all tree nodes except the root node is coded as a multi-condition rule. The conditions of these rules can be disaggregated to avoid selection of multi-condition rules. If FALSE (the default), all non-root nodes will be included as multi-condition rules in the initial ensemble. If TRUE, all nodes will additionally be included as single-condition rules. If "only", all nodes will be included as single-condition rules only.

winsfrac

numeric value > 0 and ≤ 0.5. Quantiles of data distribution to be used for winsorizing linear terms. If set to 0, no winsorizing is performed. Note that ordinal variables are included as linear terms in estimating the regression model and will also be winsorized.

normalize

logical. Normalize linear variables before estimating the regression model? Normalizing gives linear terms the same a priori influence as a typical rule, by dividing the (winsorized) linear term by 2.5 times its SD. normalize = FALSE will give more preference to linear terms for selection.

standardize

logical. Should rules and linear terms be standardized to have SD equal to 1 before estimating the regression model? This will also standardize the dummy-coded factors, so users are advised to use the default standardize = FALSE.

ordinal

logical. Should ordinal variables (i.e., ordered factors) be treated as continuous for generating rules? TRUE (the default) generally yields simpler rules, shorter computation times and better generalizability of the final ensemble.

nfolds

positive integer. Number of cross-validation folds to be used for selecting the optimal value of the penalty parameter λ for the final ensemble.

tree.control

list with control parameters to be passed to the tree fitting function, generated using ctree_control, mob_control (if use.grad = FALSE), or rpart.control (if tree.unbiased = FALSE).
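
For example (a sketch; with the default settings, rules are derived with ctree, so ctree_control from package partykit applies, and the control values shown are illustrative only):

library("partykit")
airq <- airquality[complete.cases(airquality), ]
set.seed(42)
## Pass custom control parameters to the ctree calls used for rule induction:
airq.ens.tc <- pre(Ozone ~ ., data = airq,
                   tree.control = ctree_control(maxdepth = 3L, mincriterion = 0.95))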

tree.unbiased

logical. Should an unbiased tree generation algorithm be employed for rule generation? Defaults to TRUE; if set to FALSE, rules will be generated employing the CART algorithm (which suffers from biased variable selection) as implemented in rpart. See Details below for possible combinations with family, use.grad and learnrate.

removecomplements

logical. Remove rules from the ensemble which are identical to (1 - an earlier rule)?

removeduplicates

logical. Remove rules from the ensemble which are identical to an earlier rule?

verbose

logical. Should progress be printed to the command line?

par.init

logical. Should parallel foreach be used to generate the initial ensemble? Only used when learnrate == 0. Note: a parallel backend must be registered beforehand, for example using package doMC. Furthermore, setting par.init = TRUE will likely only increase computation time for smaller datasets.

par.final

logical. Should parallel foreach be used to perform the cross validation for selecting the final ensemble? A parallel backend must be registered beforehand, for example using package doMC.
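
For example (a sketch; it assumes a foreach backend such as doParallel is installed, and the number of cores and object names are illustrative only):

## Not run: 
## Register a parallel backend, then request parallel cross validation:
library("doParallel")
registerDoParallel(cores = 2)
airq <- airquality[complete.cases(airquality), ]
set.seed(42)
airq.ens.par <- pre(Ozone ~ ., data = airq, par.final = TRUE)
## End(Not run)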

sparse

logical. Should sparse design matrices be used? May improve computation times for large datasets.

...

Further arguments to be passed to cv.glmnet.

Details

Note: observations with missing values will be removed prior to analysis (and a warning printed).

In some cases, duplicated variable names may appear in the model. For example, the first variable is a factor named 'V1' and there are also variables named 'V10' and/or 'V11' and/or 'V12' (etc.). Then, for the binary factor V1, dummy contrast variables will be created, named 'V10', 'V11', 'V12' (etc.). As should be clear from this example, this yields duplicated variable names, which may cause problems later on, for example in the calculation of predictions and importances. This can be prevented by renaming factor variables with numbers in their name prior to analysis.
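
For example (a sketch; the data frame dat and the new name are illustrative only):

## Rename factor 'V1' so its dummy variables cannot clash with 'V10', 'V11', etc.:
names(dat)[names(dat) == "V1"] <- "V1fac"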

The table below provides an overview of combinations of response variable types, use.grad, tree.unbiased and learnrate settings that are supported, and the tree induction algorithm that will be employed as a result:

use.grad  tree.unbiased  learnrate  family       tree alg.  Response variable format
TRUE      TRUE           0          gaussian     ctree      Single, numeric (non-integer)
TRUE      TRUE           0          mgaussian    ctree      Multiple, numeric (non-integer)
TRUE      TRUE           0          binomial     ctree      Single, factor with 2 levels
TRUE      TRUE           0          multinomial  ctree      Single, factor with >2 levels
TRUE      TRUE           0          poisson      ctree      Single, integer
TRUE      TRUE           0          cox          ctree      Object of class 'Surv'
TRUE      TRUE           >0         gaussian     ctree      Single, numeric (non-integer)
TRUE      TRUE           >0         mgaussian    ctree      Multiple, numeric (non-integer)
TRUE      TRUE           >0         binomial     ctree      Single, factor with 2 levels
TRUE      TRUE           >0         multinomial  ctree      Single, factor with >2 levels
TRUE      TRUE           >0         poisson      ctree      Single, integer
TRUE      TRUE           >0         cox          ctree      Object of class 'Surv'
FALSE     TRUE           0          gaussian     glmtree    Single, numeric (non-integer)
FALSE     TRUE           0          binomial     glmtree    Single, factor with 2 levels
FALSE     TRUE           0          poisson      glmtree    Single, integer
FALSE     TRUE           >0         gaussian     glmtree    Single, numeric (non-integer)
FALSE     TRUE           >0         binomial     glmtree    Single, factor with 2 levels
FALSE     TRUE           >0         poisson      glmtree    Single, integer
TRUE      FALSE          0          gaussian     rpart      Single, numeric (non-integer)
TRUE      FALSE          0          binomial     rpart      Single, factor with 2 levels
TRUE      FALSE          0          multinomial  rpart      Single, factor with >2 levels
TRUE      FALSE          0          poisson      rpart      Single, integer
TRUE      FALSE          0          cox          rpart      Object of class 'Surv'
TRUE      FALSE          >0         gaussian     rpart      Single, numeric (non-integer)
TRUE      FALSE          >0         binomial     rpart      Single, factor with 2 levels
TRUE      FALSE          >0         poisson      rpart      Single, integer
TRUE      FALSE          >0         cox          rpart      Object of class 'Surv'
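
For example, the following calls employ glmtree- and rpart-based rule induction, respectively (a sketch; airq is constructed as in the Examples below, and object names are illustrative only):

airq <- airquality[complete.cases(airquality), ]
## glmtree-based rule induction (use.grad = FALSE):
set.seed(42)
airq.ens.glmtree <- pre(Ozone ~ ., data = airq, use.grad = FALSE)
## rpart (CART)-based rule induction (tree.unbiased = FALSE):
set.seed(42)
airq.ens.rpart <- pre(Ozone ~ ., data = airq, tree.unbiased = FALSE)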

If an error along the lines of 'factor ... has new levels ...' is encountered, consult ?rare_level_sampler for explanation and solutions.

Value

An object of class pre. It contains the initial ensemble of rules and/or linear terms and a range of possible final ensembles. By default, the final ensemble employed by all other methods and functions in package pre is selected using the 'minimum cross validated error plus 1 standard error' criterion. All functions and methods for objects of class pre take a penalty.par.val argument, which can be used to select a different criterion.
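
For example (a sketch; airq.ens and airq are the ensemble and data from the Examples below, and "lambda.min" selects the minimum cross-validated error criterion):

## Not run: 
coef(airq.ens, penalty.par.val = "lambda.min")
predict(airq.ens, newdata = airq, penalty.par.val = "lambda.min")
## End(Not run)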

Note

Parts of the code for deriving rules from the nodes of trees were copied with permission from an internal function of the partykit package, written by Achim Zeileis and Torsten Hothorn.

References

Fokkema, M. (2020). Fitting prediction rule ensembles with R package pre. Journal of Statistical Software, 92(12), 1-30. doi: 10.18637/jss.v092.i12

Fokkema, M. & Strobl, C. (2020). Fitting prediction rule ensembles to psychological research data: An introduction and tutorial. Psychological Methods 25(5), 636-652. doi: 10.1037/met0000256, https://arxiv.org/abs/1907.05302

Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5), 1189-1232.

Friedman, J. H., & Popescu, B. E. (2008). Predictive learning via rule ensembles. The Annals of Applied Statistics, 2(3), 916-954, doi: 10.1214/07-AOAS148.

Hothorn, T., & Zeileis, A. (2015). partykit: A modular toolkit for recursive partytioning in R. Journal of Machine Learning Research, 16, 3905-3909.

See Also

print.pre, plot.pre, coef.pre, importance.pre, predict.pre, interact, cvpre

Examples

## Fit pre to a continuous response:
airq <- airquality[complete.cases(airquality), ]
set.seed(42)
airq.ens <- pre(Ozone ~ ., data = airq)
airq.ens

## Fit pre to a binary response:
airq2 <- airquality[complete.cases(airquality), ]
airq2$Ozone <- factor(airq2$Ozone > median(airq2$Ozone))
set.seed(42)
airq.ens2 <- pre(Ozone ~ ., data = airq2, family = "binomial")
airq.ens2

## Fit pre to a multivariate continuous response:
airq3 <- airquality[complete.cases(airquality), ] 
set.seed(42)
airq.ens3 <- pre(Ozone + Wind ~ ., data = airq3, family = "mgaussian")
airq.ens3

## Fit pre to a multinomial response:
set.seed(42)
iris.ens <- pre(Species ~ ., data = iris, family = "multinomial")
iris.ens

## Fit pre to a survival response:
library("survival")
lung <- lung[complete.cases(lung), ]
set.seed(42)
lung.ens <- pre(Surv(time, status) ~ ., data = lung, family = "cox")
lung.ens

## Fit pre to a count response:
## Generate random data (partly based on Dobson (1990) Page 93: Randomized 
## Controlled Trial):
counts <- rep(as.integer(c(18, 17, 15, 20, 10, 20, 25, 13, 12)), times = 10)
outcome <- rep(gl(3, 1, 9), times = 10)
treatment <- rep(gl(3, 3), times = 10)
noise1 <- 1:90
set.seed(1)
noise2 <- rnorm(90)
countdata <- data.frame(treatment, outcome, counts, noise1, noise2)
set.seed(42)
count.ens <- pre(counts ~ ., data = countdata, family = "poisson")
count.ens
