RandForest                                                R Documentation

Description:

RandForest implements a version of Breiman's random forest algorithm
for classification and regression.
Usage:

RandForest(formula, data, subset, verbose=interactive(),
           weights, na.action,
           method='rf.fit',
           rf.mode=c('auto', 'classification', 'regression'),
           contrasts=NULL, ...)

## S3 method for class 'RandForest'
predict(object, newdata=NULL,
        na.action=na.pass, ...)

## Called internally by `RandForest`
RandForest.fit(x, y=NULL,
               verbose=interactive(), ntree=10,
               mtry=floor(sqrt(ncol(x))),
               weights=NULL, replace=TRUE,
               sampsize=if(replace) nrow(x) else ceiling(0.632*nrow(x)),
               nodesize=1L, max_depth=NULL,
               method=NULL,
               terms=NULL, ...)
Arguments:

formula
    an object of class "formula" (or one that can be coerced to that
    class): a symbolic description of the model to be fitted.

data
    an optional data frame, list, or environment (or object coercible
    by as.data.frame to a data frame) containing the variables in the
    model.

subset
    an optional vector specifying a subset of observations to be used
    in the fitting process.

weights
    an optional vector of weights to be used in the fitting process.
    Should be NULL or a numeric vector.

na.action
    a function which indicates what should happen when the data
    contain NAs.

method
    currently unused.

rf.mode
    one of "auto", "classification", or "regression"; determines
    whether the forest is trained for classification or regression
    (see "Details").

contrasts
    currently experimental; see the contrasts.arg argument of
    model.matrix.default.

...
    further arguments passed to RandForest.fit.

object
    an object of class 'RandForest'.

newdata
    new data to predict on, typically provided as a data.frame.

verbose
    logical: should progress be displayed?

ntree
    number of decision trees to grow.

mtry
    number of variables to try at each split.

replace
    logical; should data be sampled with replacement during training?

sampsize
    number of datapoints to sample for training each component
    decision tree.

nodesize
    minimum number of datapoints a node must receive to be split
    further (see "Details").

max_depth
    maximum depth of component decision trees.

x
    used internally by RandForest.

y
    used internally by RandForest.

terms
    used internally by RandForest.
Details:

RandForest implements a version of Breiman's original algorithm to
train a random forest model for classification or regression. A random
forest consists of a set of decision trees, each of which is trained
on a subset of the available data. These trees are individually worse
predictors than a single decision tree trained on the entire dataset.
However, averaging predictions across the ensemble of trees yields a
model that is often more accurate than a single decision tree while
being less susceptible to overfitting.

Random forests can be trained for either classification or regression.
Classification forests consist of trees that assign inputs to a
specific class; the output prediction is a vector giving the
proportion of trees in the forest that assigned the datapoint to each
available class. Regression forests consist of trees that assign each
datapoint to a single continuous value; the output prediction is the
mean prediction across all component trees. When rf.mode="auto", the
random forest is trained in classification mode for a response of type
"factor", and in regression mode for a response of type "numeric".
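For instance, under this rule the response type alone selects the mode
(a minimal sketch; the data frames below are illustrative and not part
of the package):

## Illustrative only: response type selects the mode when rf.mode="auto"
d_num <- data.frame(x=rnorm(50), y=rnorm(50))
RandForest(y~x, data=d_num, rf.mode="auto")   # numeric y: regression

d_fac <- data.frame(x=rnorm(50), y=factor(rep(c("a","b"), 25)))
RandForest(y~x, data=d_fac, rf.mode="auto")   # factor y: classification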
Several parameters exist to tune the behavior of random forests. The
ntree argument controls how many decision trees are trained. At each
decision point, the decision trees consider a random subset of the
available variables; the number of variables to sample is controlled
by mtry. Each decision tree sees only a subset of the available data,
which reduces its risk of overfitting. This subset consists of
sampsize datapoints, sampled with or without replacement according to
the replace argument.
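These parameters can be combined freely; for example (a sketch with
arbitrary values, reusing train_data from the "Examples" section):

fit <- RandForest(y~., data=train_data,
                  ntree=100,       # grow 100 trees
                  mtry=2,          # consider 2 variables at each split
                  replace=FALSE,   # subsample without replacement...
                  sampsize=60)     # ...using 60 datapoints per tree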
Finally, the default behavior is to grow decision trees until they
have fully classified all the data they see during training. However,
this may lead to overfitting. Decision trees can be limited to smaller
sizes by specifying the max_depth or nodesize arguments. max_depth
refers to the depth of the decision tree: setting this value to n
means that every path from the root node to a leaf node will have
length at most n. nodesize can instead be used to stop growing trees
based on the size of the data to be partitioned at each node: if
nodesize=n, a decision point that receives fewer than n samples will
stop trying to split the data further.
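Either stopping rule can be set directly (a sketch reusing train_data
from the "Examples" section):

## Trees stop splitting at depth 3, or when a node receives fewer
## than 5 samples
small_trees <- RandForest(y~., data=train_data,
                          max_depth=3L, nodesize=5L)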
Classification forests are trained by maximizing the Gini gain at each
interior node. Split points are determined by exhaustive search for
small data sizes, or by simulated annealing for larger sizes.
Regression forests are trained by maximizing the decrease in the sum
of squared error (SSE) obtained when all points in each partition are
assigned their mean output value. Nodes stop splitting when either no
partition improves the maximization metric (Gini gain or decrease in
SSE) or when the criteria specified by nodesize / max_depth are met.
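As a rough illustration of these split criteria (a sketch only, not
the package's internal implementation), the quantities being maximized
can be computed as follows for a candidate binary split given as a
logical mask:

gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)                       # Gini impurity of one partition
}
gini_gain <- function(labels, left) {
  wl <- mean(left)                   # fraction of points sent left
  gini(labels) - wl*gini(labels[left]) - (1-wl)*gini(labels[!left])
}
sse <- function(v) sum((v - mean(v))^2)
sse_decrease <- function(v, left) {
  ## decrease in SSE if each side predicts its own mean
  sse(v) - sse(v[left]) - sse(v[!left])
}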
Some of the arguments provided are for consistency with the base lm
function. Use caution when changing any values referred to as
"experimental" above. NA values may cause unintended behavior.
Value:

An object of class 'RandForest', which itself contains a number of
objects of class 'DecisionTree' that can be used for prediction with
predict.RandForest.
Note:

Generating a single decision tree model is possible by setting ntree=1
and sampsize=nrow(data). 'DecisionTree' objects do not currently
support prediction.
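Following that note, a single-tree model would be fit like this
(sketch; d is any training data frame with response y):

single_tree_forest <- RandForest(y~., data=d, ntree=1,
                                 sampsize=nrow(d))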
Author(s):

Aidan Lakshman <ahl27@pitt.edu>
References:

Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5-32.
See Also:

DecisionTree class
Examples:

## Simulate a small regression dataset: y is a linear combination of
## three covariates plus noise
set.seed(199L)
n_samp <- 100L
AA <- rnorm(n_samp, mean=1, sd=5)
BB <- rnorm(n_samp, mean=2, sd=3)
CC <- rgamma(n_samp, shape=1, rate=2)
err <- rnorm(n_samp, sd=0.5)
y <- AA + BB + 2*CC + err
d <- data.frame(AA, BB, CC, y)

## Hold out the last 10 rows for testing
train_i <- 1:90
test_i <- 91:100
train_data <- d[train_i,]
test_data <- d[test_i,]
## regression
rf_regr <- RandForest(y~., data=train_data, rf.mode="regression",
                      max_depth=5L)
if(interactive()){
# Visualize one of the decision trees
plot(rf_regr[[1]])
}
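## Predict on the held-out test set using the S3 predict method from
## "Usage"; regression mode yields one numeric prediction per row
preds <- predict(rf_regr, test_data)
head(preds)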
## classification: bin the response into four classes
y1 <- y < -5
y2 <- y < 0 & y >= -5
y3 <- y < 5 & y >= 0
y4 <- y >= 5
y_cl <- rep(0L, length(y))
y_cl[y1] <- 1L
y_cl[y2] <- 2L
y_cl[y3] <- 3L
y_cl[y4] <- 4L
d$y <- as.factor(y_cl)
train_data <- d[train_i,]
test_data <- d[test_i,]
rf_classif <- RandForest(y~., data=train_data, rf.mode="classification", max_depth=5L)
if(interactive()){
# Visualize one of the decision trees for classification
plot(rf_classif[[1]])
}
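## Predict on the held-out test set; per "Details", classification
## forests return the proportion of trees voting for each class
class_preds <- predict(rf_classif, test_data)
head(class_preds)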