Description Usage Arguments Value Note See Also Examples
Description

Trains the random forest. The type of response the random forest can be trained on varies depending on the splitFinder, nodeResponseCombiner, and forestResponseCombiner parameters. Make sure these are compatible with each other, and with the response you plug in. splitFinder should work on the responses you are providing; nodeResponseCombiner should combine these responses into some intermediate product, and forestResponseCombiner combines these intermediate products into the final output product. Note that nodeResponseCombiner and forestResponseCombiner can be inferred from the data (so feel free to not specify them), and splitFinder can also be inferred, although you might want to change its default.
Usage

train(formula, data, splitFinder = NULL, nodeResponseCombiner = NULL,
  forestResponseCombiner = NULL, ntree, numberOfSplits, mtry, nodeSize,
  maxNodeDepth = 1e+05, na.penalty = TRUE, splitPureNodes = TRUE,
  savePath = NULL, savePath.overwrite = c("warn", "delete", "merge"),
  forest.output = c("online", "offline"), cores = getCores(),
  randomSeed = NULL, displayProgress = TRUE)
Arguments

formula: You may specify the response and covariates as a formula instead; make sure the response in the formula is still properly constructed.

data: A data.frame containing the columns of the predictors and responses.

splitFinder: A split finder used to score splits in the random forest training algorithm.

nodeResponseCombiner: A response combiner used to combine responses for each terminal node in a tree (regression example: average the observations in each node into a single number).

forestResponseCombiner: A response combiner used to combine predictions across trees into one final result (regression example: average the prediction of each tree into a single number).

ntree: An integer that specifies how many trees should be trained.

numberOfSplits: A tuning parameter specifying how many random splits should be tried for a covariate; a value of 0 means all splits will be tried (with an exception for factors, which might have too many splits to feasibly compute).

mtry: A tuning parameter specifying how many covariates will be randomly chosen to be tried in the splitting process. This value must be at least 1.
nodeSize: The algorithm will not attempt to split a node that has fewer than 2*nodeSize observations.

maxNodeDepth: This parameter is analogous to nodeSize in that it limits tree growth, except it limits the depth of a tree; the default of 1e+05 effectively imposes no limit.

na.penalty: This parameter controls whether predictor variables with NAs should be penalized when being considered for a best split. Best splits (and the associated score) are determined on only non-NA data; the penalty is to take the best split identified, randomly assign any NAs (according to the proportion of data split left and right), and then recalculate the corresponding split score, which is then compared with the other split candidate variables. This penalty adds some computational time, so it may be disabled for some variables.
splitPureNodes: This parameter determines whether the algorithm will split a pure node. If set to FALSE, then before every split it will check that every response is the same, and if so, not split. If set to TRUE it forgoes that check and splits the node. Prediction accuracy won't change under any sensible response combiner, and leaving it as TRUE skips the check and runs slightly faster.

savePath: If set, this parameter will save each tree of the random forest in this directory as the forest is trained. Use this parameter if you need to save memory while training.

savePath.overwrite: This parameter controls the behaviour for what happens if savePath points to an existing directory; the options are "warn" (the default), "delete", and "merge".

forest.output: This parameter only applies if savePath has been set; "online" (the default) keeps the trained forest loaded in memory, while "offline" leaves the trees on disk.

cores: This parameter specifies how many trees will be simultaneously trained. By default the package attempts to detect how many cores you have by using the getCores() function.

randomSeed: This parameter specifies a random seed if reproducible, deterministic forests are desired.

displayProgress: A logical indicating whether the progress should be displayed to the console; default is TRUE.
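Several of the tuning parameters above (numberOfSplits, mtry, nodeSize, na.penalty) interact during split selection. The following base-R sketch is an illustrative reconstruction of that step under a weighted-variance criterion, not the package's actual (Java-side) implementation; all function and variable names here are made up for illustration.

```r
# Illustrative sketch (not largeRCRF's internals) of one split search:
# try `mtry` random covariates and `numberOfSplits` random cut points each,
# scoring by weighted within-child variance. NAs are handled per the
# na.penalty idea: score on non-NA rows, then randomly assign NA rows
# left/right in proportion to the split and rescore.
set.seed(1)
n <- 60
covariates <- data.frame(x1 = runif(n), x2 = runif(n))
covariates$x1[1:6] <- NA                 # introduce some missing values
y <- ifelse(is.na(covariates$x1), 0.5, 2 * covariates$x1) + rnorm(n, sd = 0.1)

weighted_var <- function(y, left, nodeSize = 5) {
  nL <- sum(left); nR <- sum(!left)
  if (nL < nodeSize || nR < nodeSize) return(Inf)  # child node too small
  (nL * var(y[left]) + nR * var(y[!left])) / length(y)
}

score_split <- function(xj, y, cut, na.penalty = TRUE) {
  ok <- !is.na(xj)
  left <- xj[ok] <= cut
  score <- weighted_var(y[ok], left)
  if (na.penalty && any(!ok) && is.finite(score)) {
    # penalty: distribute NA rows by the observed left/right proportion
    assign_left <- rep(NA, length(xj))
    assign_left[ok]  <- left
    assign_left[!ok] <- runif(sum(!ok)) < mean(left)
    score <- weighted_var(y, assign_left)
  }
  score
}

mtry <- 2; numberOfSplits <- 5
best <- list(score = Inf)
for (j in sample(names(covariates), mtry)) {
  xj <- covariates[[j]]
  for (cut in sample(xj[!is.na(xj)], numberOfSplits)) {
    s <- score_split(xj, y, cut)
    if (s < best$score) best <- list(score = s, covariate = j, cut = cut)
  }
}
best
```

Because the NA rows are rescored against the full response vector, a covariate with many NAs typically ends up with a worse (larger) score than it would get on its complete cases alone, which is exactly the penalty the na.penalty argument toggles.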
Value

A JRandomForest object. You may call predict or print on it.
Note

If saving memory is a concern, you can replace covariateData or data with an environment containing one element called data holding the actual dataset. After the data has been imported into Java, but before the forest training begins, the dataset in the environment is deleted, freeing up memory in R.
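That pattern can be sketched as follows; the train() call is left as a comment since this sketch only illustrates how to construct the environment (the formula and tuning values mirror the regression example below).

```r
# Wrap the dataset in an environment with a single element named `data`,
# so that train() can delete it once the data has been imported into Java.
dataEnv <- new.env()
dataEnv$data <- data.frame(x1 = rnorm(1000), x2 = rnorm(1000))
dataEnv$data$y <- with(dataEnv$data, 1 + x1 + x2 + rnorm(1000))

# forest <- train(y ~ x1 + x2, dataEnv, ntree = 100, numberOfSplits = 5,
#                 mtry = 1, nodeSize = 5)
# Once training starts, dataEnv$data is removed, freeing memory in R.
```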
Examples

# Regression Example
x1 <- rnorm(1000)
x2 <- rnorm(1000)
y <- 1 + x1 + x2 + rnorm(1000)
data <- data.frame(x1, x2, y)
forest <- train(y ~ x1 + x2, data, WeightedVarianceSplitFinder(),
MeanResponseCombiner(), MeanResponseCombiner(), ntree=100,
numberOfSplits = 5, mtry = 1, nodeSize = 5)
# Fix x2 to be 0
newData <- data.frame(x1 = seq(from=-2, to=2, by=0.5), x2 = 0)
ypred <- predict(forest, newData)
plot(ypred ~ newData$x1, type="l")
# Competing Risk Example
x1 <- abs(rnorm(1000))
x2 <- abs(rnorm(1000))
T1 <- rexp(1000, rate=x1)
T2 <- rweibull(1000, shape=x1, scale=x2)
C <- rexp(1000)
u <- pmin(T1, T2, C)
delta <- ifelse(u==T1, 1, ifelse(u==T2, 2, 0))
data <- data.frame(x1, x2)
forest <- train(CR_Response(delta, u) ~ x1 + x2, data,
LogRankSplitFinder(1:2), CR_ResponseCombiner(1:2),
CR_FunctionCombiner(1:2), ntree=100, numberOfSplits=5,
mtry=1, nodeSize=10)
newData <- data.frame(x1 = c(-1, 0, 1), x2 = 0)
ypred <- predict(forest, newData)