train: Train Random Forests


View source: R/train.R

Description

Trains the random forest. The type of response the random forest can be trained on varies depending on the splitFinder, nodeResponseCombiner, and the forestResponseCombiner parameters. Make sure these are compatible with each other, and with the response you plug in. splitFinder should work on the responses you are providing; nodeResponseCombiner should combine these responses into some intermediate product, and forestResponseCombiner combines these intermediate products into the final output product. Note that nodeResponseCombiner and forestResponseCombiner can be inferred from the data (so feel free to not specify them), and splitFinder can be inferred but you might want to change its default.

Usage

train(formula, data, splitFinder = NULL, nodeResponseCombiner = NULL,
  forestResponseCombiner = NULL, ntree, numberOfSplits, mtry, nodeSize,
  maxNodeDepth = 1e+05, na.penalty = TRUE, splitPureNodes = TRUE,
  savePath = NULL, savePath.overwrite = c("warn", "delete", "merge"),
  forest.output = c("online", "offline"), cores = getCores(),
  randomSeed = NULL, displayProgress = TRUE)

Arguments

formula

You may specify the response and covariates as a formula instead; make sure the response in the formula is still properly constructed.

data

A data.frame containing the columns of the predictors and responses.

splitFinder

A split finder that's used to score splits in the random forest training algorithm. See CompetingRiskSplitFinders or WeightedVarianceSplitFinder. If you don't specify one, this function tries to pick one based on the response. For CR_Response without censor times, it will pick a LogRankSplitFinder; while if censor times were provided it will pick GrayLogRankSplitFinder; for integer or numeric responses it picks a WeightedVarianceSplitFinder.
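For instance, to override the inferred default and supply a split finder explicitly (a sketch; assumes a numeric response `y` and predictors `x1`, `x2` in `data`):

```r
# Sketch: explicitly supplying a split finder rather than relying on inference.
library(largeRCRF)
forest <- train(y ~ x1 + x2, data,
                splitFinder = WeightedVarianceSplitFinder(),
                ntree = 100, numberOfSplits = 5, mtry = 1, nodeSize = 5)
```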

nodeResponseCombiner

A response combiner that's used to combine responses for each terminal node in a tree (regression example: averaging the observations in each terminal node into a single number). See CR_ResponseCombiner or MeanResponseCombiner. If you don't specify one, this function tries to pick one based on the response. For CR_Response it picks a CR_ResponseCombiner; for integer or numeric responses it picks a MeanResponseCombiner.

forestResponseCombiner

A response combiner that's used to combine predictions across trees into one final result (regression example: averaging the predictions of the trees into a single number). See CR_FunctionCombiner or MeanResponseCombiner. If you don't specify one, this function tries to pick one based on the response. For CR_Response it picks a CR_FunctionCombiner; for integer or numeric responses it picks a MeanResponseCombiner.

ntree

An integer that specifies how many trees should be trained.

numberOfSplits

A tuning parameter specifying how many random splits should be tried for a covariate; a value of 0 means all splits will be tried (with an exception for factors, which might have too many possible splits to feasibly compute).

mtry

A tuning parameter specifying how many covariates will be randomly chosen to be tried in the splitting process. This value must be at least 1.

nodeSize

The algorithm will not attempt to split a node that has fewer than 2*nodeSize observations. This guarantees that any two sibling terminal nodes together average at least nodeSize observations; it does not guarantee that every individual node is at least as large as nodeSize.

maxNodeDepth

This parameter is analogous to nodeSize in that it limits tree depth; by default maxNodeDepth is an extremely large number, so in practice tree depth is controlled by nodeSize.

na.penalty

This parameter controls whether predictor variables with NAs should be penalized when being considered for a best split. Best splits (and the associated scores) are determined on only non-NA data; the penalty takes the best split identified, randomly assigns any NAs left or right (according to the proportion of data split in each direction), and then recalculates the split score, which is then compared with the other candidate variables. This penalty adds some computational time, so it may be disabled for some variables. na.penalty may be specified as a vector of logicals indicating, for each predictor variable, whether the penalty should be applied to that variable; if it's length 1 then it applies to all variables. Alternatively, a single numeric value may be provided as a threshold, so that the penalty is activated only if the proportion of NAs for that variable in the training set exceeds the threshold.
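The three accepted forms can be sketched as follows (assuming a model with two predictors, `x1` and `x2`):

```r
# Sketch: the three forms of na.penalty (logical, logical vector, numeric).
forest <- train(y ~ x1 + x2, data, ntree = 100, numberOfSplits = 5,
                mtry = 1, nodeSize = 5,
                na.penalty = TRUE)           # penalize NAs for all predictors
# na.penalty = c(TRUE, FALSE)  # per-predictor: penalize x1 but not x2
# na.penalty = 0.1             # penalize only predictors with > 10% NAs
```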

splitPureNodes

This parameter determines whether the algorithm will split a pure node. If set to FALSE, then before every split it will check whether every response is the same, and if so, not split. If set to TRUE it forgoes that check and splits the node anyway. Prediction accuracy won't change under any sensible nodeResponseCombiner, since all terminal nodes descending from a split pure node give the same prediction; this parameter only affects performance. If your response is continuous you'll likely see faster training times by leaving it set to TRUE. Default value is TRUE.

savePath

If set, each tree of the random forest is saved to this directory as the forest is trained. Use this parameter if you need to conserve memory while training. See also loadForest.

savePath.overwrite

This parameter controls the behaviour when savePath points to an existing directory. If set to warn (the default) then train refuses to proceed. If set to delete then all the contents in that folder are deleted so the new forest can be trained; note that all contents are deleted, even files not related to largeRCRF, so use this only if you're sure it's safe. If set to merge, the files describing the forest (such as its parameters) are overwritten but the saved trees are not; the algorithm assumes (without checking) that the existing trees are from a previous run and resumes from where it left off. This option is useful for recovering from a crash.

forest.output

This parameter only applies if savePath has been set. If set to 'online' (the default), the saved forest is loaded into memory after training. If set to 'offline', the forest is not loaded into memory but can still be used in a memory-light manner.
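Putting savePath, savePath.overwrite, and forest.output together, a memory-conscious training run might look like the following sketch (the directory name "trees/" is hypothetical):

```r
# Sketch: train to disk, resume after interruption, and keep the forest
# offline so the trees are never all held in memory at once.
forest <- train(y ~ x1 + x2, data, ntree = 500, numberOfSplits = 5,
                mtry = 1, nodeSize = 5,
                savePath = "trees/",
                savePath.overwrite = "merge",  # resume a previous partial run
                forest.output = "offline")     # don't load trees into memory
predictions <- predict(forest, newData)        # still usable for prediction
```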

cores

This parameter specifies how many trees will be trained simultaneously. By default the package uses the parallel package to detect how many cores you have and uses all of them. You may specify a lower number if you wish. It is not recommended to specify a number greater than the number of available cores, as this will hurt performance with no benefit.

randomSeed

This parameter specifies a random seed if reproducible, deterministic forests are desired.

displayProgress

A logical indicating whether the progress should be displayed to console; default is TRUE. Useful to set to FALSE in some automated situations.

Value

A JRandomForest object. You may call predict or print on it.

Note

If saving memory is a concern, you can replace covariateData or data with an environment containing one element called data that holds the actual dataset. After the data has been imported into Java, but before forest training begins, the dataset in the environment is deleted, freeing up memory in R.
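The trick described above can be sketched as follows:

```r
# Sketch: wrap the dataset in an environment so the R-side copy can be
# freed once it has been imported into Java.
env <- new.env()
env$data <- data.frame(x1, x2, y)   # the actual dataset
forest <- train(y ~ x1 + x2, env, ntree = 100, numberOfSplits = 5,
                mtry = 1, nodeSize = 5)
# After import, env$data is deleted, releasing the memory in R.
```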

See Also

predict.JRandomForest

Examples

# Regression Example
x1 <- rnorm(1000)
x2 <- rnorm(1000)
y <- 1 + x1 + x2 + rnorm(1000)

data <- data.frame(x1, x2, y)
forest <- train(y ~ x1 + x2, data, WeightedVarianceSplitFinder(),
  MeanResponseCombiner(), MeanResponseCombiner(), ntree=100,
  numberOfSplits = 5, mtry = 1, nodeSize = 5)

# Fix x2 to be 0
newData <- data.frame(x1 = seq(from=-2, to=2, by=0.5), x2 = 0)
ypred <- predict(forest, newData)

plot(ypred ~ newData$x1, type="l")

# Competing Risk Example
x1 <- abs(rnorm(1000))
x2 <- abs(rnorm(1000))

T1 <- rexp(1000, rate=x1)
T2 <- rweibull(1000, shape=x1, scale=x2)
C <- rexp(1000)
u <- pmin(T1, T2, C)
delta <- ifelse(u==T1, 1, ifelse(u==T2, 2, 0))

data <- data.frame(x1, x2)

forest <- train(CR_Response(delta, u) ~ x1 + x2, data,
   LogRankSplitFinder(1:2), CR_ResponseCombiner(1:2),
   CR_FunctionCombiner(1:2), ntree=100, numberOfSplits=5,
   mtry=1, nodeSize=10)
newData <- data.frame(x1 = c(-1, 0, 1), x2 = 0)
ypred <- predict(forest, newData)

jatherrien/largeRCRF documentation built on Nov. 15, 2019, 7:16 a.m.