forestry: forestry


View source: R/forestry.R

Description

forestry is a fast implementation of a variety of tree-based estimators. Implemented estimators include CART trees, random forests, boosted trees and forests, and linear trees and forests. All estimators are implemented to scale well to very large datasets.

Usage

forestry(
  x,
  y,
  ntree = 500,
  replace = TRUE,
  sampsize = if (replace) nrow(x) else ceiling(0.632 * nrow(x)),
  sample.fraction = NULL,
  mtry = max(floor(ncol(x)/3), 1),
  nodesizeSpl = 3,
  nodesizeAvg = 3,
  nodesizeStrictSpl = 1,
  nodesizeStrictAvg = 1,
  minSplitGain = 0,
  maxDepth = round(nrow(x)/2) + 1,
  splitratio = 1,
  seed = as.integer(runif(1) * 1000),
  verbose = FALSE,
  nthread = 0,
  splitrule = "variance",
  middleSplit = FALSE,
  maxObs = length(y),
  maxProp = 1,
  linear = FALSE,
  splitFeats = 1:(ncol(x)),
  linFeats = 1:(ncol(x)),
  monotonicConstraints = rep(0, ncol(x)),
  sampleWeights = rep((1/ncol(x)), ncol(x)),
  overfitPenalty = 1,
  doubleTree = FALSE,
  reuseforestry = NULL,
  saveable = TRUE
)

Arguments

x

A data frame of all training predictors.

y

A vector of all training responses.

ntree

The number of trees to grow in the forest. The default value is 500.

replace

An indicator of whether sampling of training data is with replacement. The default value is TRUE.

sampsize

The total number of samples to draw for training. If sampling with replacement, the default value is the number of training observations. If sampling without replacement, the default value is ceiling(0.632 * nrow(x)), i.e. roughly 63.2% of the training data.

sample.fraction

If this is given, then sampsize is ignored and set to round(length(y) * sample.fraction). It must be a real number between 0 and 1.
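To make the interplay of replace, sampsize, and sample.fraction concrete, here is a small base-R sketch of the resolution logic described above. This is illustrative only, not forestry's internal code; the function name is made up for the example.

```r
# Resolve the effective per-tree sample size the way the argument
# documentation describes (illustrative sketch, not forestry internals).
resolve_sampsize <- function(n, replace = TRUE, sample.fraction = NULL) {
  if (!is.null(sample.fraction)) {
    stopifnot(sample.fraction > 0, sample.fraction <= 1)
    return(round(n * sample.fraction))    # sample.fraction overrides sampsize
  }
  if (replace) n else ceiling(0.632 * n)  # defaults from the Usage section
}

resolve_sampsize(150)                         # 150 (bootstrap, full size)
resolve_sampsize(150, replace = FALSE)        # 95, i.e. ceiling(0.632 * 150)
resolve_sampsize(150, sample.fraction = 0.5)  # 75
```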

mtry

The number of variables randomly sampled as split candidates at each split point. The default value is one-third of the total number of features of the training data.

nodesizeSpl

Minimum number of observations in terminal nodes of the splitting dataset. The default value is 3.

nodesizeAvg

Minimum size of terminal nodes for averaging dataset. The default value is 3.

nodesizeStrictSpl

Minimum number of observations in terminal nodes of the splitting dataset, enforced strictly. The default value is 1.

nodesizeStrictAvg

Minimum size of terminal nodes for the averaging dataset, enforced strictly. The default value is 1.

minSplitGain

Minimum loss reduction required to split a node further in a tree. Specifically, this is the percentage increase in R-squared that each potential split must achieve in order to be considered. The default value is 0.

maxDepth

Maximum depth of a tree. The default value is round(nrow(x)/2) + 1.

splitratio

Proportion of the training data used as the splitting dataset, a ratio between 0 and 1. If the ratio is 1, the splitting dataset is the entire sampled set and the averaging dataset is empty. If the ratio is 0, the splitting dataset is empty and all the data is used for the averaging dataset (this is not useful in practice, however, since no data would be available for splitting).
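The splitting/averaging partition can be sketched as follows. The helper below is hypothetical and only illustrates the description above; forestry performs this partitioning internally.

```r
# Illustrative: partition a tree's sampled indices into a splitting set and
# an averaging set according to splitratio (1 = everything goes to splitting).
partition_sample <- function(idx, splitratio = 1) {
  n_split <- floor(splitratio * length(idx))
  list(splitting = idx[seq_len(n_split)],
       averaging = idx[setdiff(seq_along(idx), seq_len(n_split))])
}

p <- partition_sample(1:10, splitratio = 0.5)
p$splitting  # 1 2 3 4 5
p$averaging  # 6 7 8 9 10
```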

seed

Seed for random number generator.

verbose

Flag to indicate if training process is verbose.

nthread

Number of threads used to train and predict the forest. The default value is 0, which uses all available cores.

splitrule

The loss function according to which the splits of the random forest are made. Only "variance" is implemented at this point.

middleSplit

Flag to indicate whether the split value is taken as the average of two adjacent feature values. If FALSE, the split point is drawn from a uniform distribution between the two feature values. The default value is FALSE.

maxObs

The maximum number of observations to split on. If set to a number less than nrow(x), at each split point maxObs candidate split points are randomly sampled instead of testing every feature value (the default).

maxProp

A complementary option to 'maxObs', 'maxProp' allows one to specify the proportion of possible split points that are downsampled at each node when testing potential splits. For example, a value of 0.35 will randomly select 35% of the splitting points at each split. If values of 'maxProp' and 'maxObs' are both supplied, the value of 'maxProp' takes precedence. At the lower levels of the tree, max('maxProp' * n, nodesizeSpl) splitting observations are selected.
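The resulting number of candidate split points can be sketched in base R. The helper name is invented for illustration and only mirrors the rule stated above, not forestry's internal code:

```r
# Illustrative: number of candidate split points evaluated at a node with
# n_node observations, following the maxObs / maxProp description above.
n_split_candidates <- function(n_node, maxObs = Inf, maxProp = 1,
                               nodesizeSpl = 3) {
  if (maxProp < 1) {
    # maxProp takes precedence; never test fewer than nodesizeSpl points
    max(floor(maxProp * n_node), nodesizeSpl)
  } else {
    min(n_node, maxObs)
  }
}

n_split_candidates(100, maxProp = 0.35)  # 35, i.e. 35% of 100 points
n_split_candidates(100, maxObs = 20)     # 20
n_split_candidates(5,   maxProp = 0.35)  # 3 (floor(1.75) < nodesizeSpl)
```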

linear

Fit the model with a split function optimizing for a linear aggregation function instead of a constant aggregation function. The default value is FALSE.

splitFeats

Specify which features to split on when creating a tree (defaults to use all features).

linFeats

Specify which features to fit linearly in the leaves when using linear (defaults to all numerical features).

monotonicConstraints

Specifies monotonic relationships between the continuous features and the outcome. Supplied as a vector of length p with entries in {1, 0, -1}, where 1 indicates an increasing monotonic relationship, -1 a decreasing monotonic relationship, and 0 no constraint. Constraints supplied for categorical features are ignored.
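As a sketch of how such a constraint vector might be assembled, consider a hypothetical dataset (the feature names and the assumed relationships below are invented for illustration):

```r
# Hypothetical example: suppose we believe the outcome is increasing in
# `area`, decreasing in `age`, and unconstrained in `rooms`.
x <- data.frame(area  = runif(50, 50, 200),
                age   = runif(50, 0, 100),
                rooms = sample(1:6, 50, replace = TRUE))

constraints <- c(area = 1, age = -1, rooms = 0)
stopifnot(length(constraints) == ncol(x))  # must be length p, in column order

# Then, given a response y, the constrained forest could be fit as:
# forest <- forestry(x, y, monotonicConstraints = unname(constraints))
```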

sampleWeights

Specify weights for the weighted uniform distribution used to randomly sample features. The default gives all features equal weight.

overfitPenalty

The ridge-regression penalty determining how much the magnitude of coefficients is penalized when using linear. The default value is 1.

doubleTree

Indicates whether the number of trees is doubled, as the averaging and splitting datasets can be exchanged to create decorrelated trees. The default value is FALSE.

reuseforestry

Pass in a 'forestry' object whose preprocessed data frame will be recycled by the new object. This saves some space when working on the same dataset.

saveable

If TRUE (the default), the random forest is created in such a way that it can be saved and loaded using save(...) and load(...). This will, however, take longer and use more memory. When training many random forests, it makes sense to set this to FALSE to save time and memory.

Details

For linear random forests, set the linear option to TRUE and specify the ridge-regression lambda with the overfitPenalty parameter. For gradient boosting and gradient boosted forests, see multilayer-forestry.

Value

A 'forestry' object.

See Also

predict.forestry

multilayer-forestry

predict-multilayer-forestry

getVI

getOOB

make_savable

Examples

set.seed(292315)
library(forestry)
test_idx <- sample(nrow(iris), 3)
x_train <- iris[-test_idx, -1]
y_train <- iris[-test_idx, 1]
x_test <- iris[test_idx, -1]

rf <- forestry(x = x_train, y = y_train)
# Retrieve the forest's weight matrix; aggregating y_train with it
# reproduces the usual predictions.
weights <- predict(rf, x_test, aggregation = "weightMatrix")$weightMatrix

weights %*% y_train
predict(rf, x_test)

set.seed(49)
library(forestry)

n <- 100
a <- rnorm(n)
b <- rnorm(n)
c <- rnorm(n)
y <- 4*a + 5.5*b - .78*c
x <- data.frame(a,b,c)

forest <- forestry(
          x,
          y,
          ntree = 10,
          replace = TRUE,
          nodesizeStrictSpl = 5,
          nodesizeStrictAvg = 5,
          linear = TRUE
          )

predict(forest, x)

soerenkuenzel/forestry documentation built on April 25, 2021, 10:02 a.m.