beset_rf: Beset Random Forest

View source: R/beset_rf.R

beset_rf    R Documentation

Beset Random Forest

Description

beset_rf is a wrapper around randomForest that estimates the predictive performance of the random forest using repeated k-fold cross-validation. beset_rf ensures that the correct arguments are passed to randomForest and that enough information is retained for compatibility with beset methods such as variable importance and partial dependence.

Usage

beset_rf(
  form,
  data,
  n_trees = 500,
  sample_rate = 1 - exp(-1),
  mtry = NULL,
  min_obs_in_node = NULL,
  n_folds = 10,
  n_reps = 10,
  seed = 42,
  class_wt = NULL,
  cutoff = NULL,
  strata = NULL,
  parallel_type = NULL,
  n_cores = NULL,
  cl = NULL
)

## S3 method for class 'beset_rf'
plot(x, metric = c("auto", "mse", "rsq", "err.rate"), ...)

Arguments

form

A model formula.

data

Either a data_partition object containing data sets to be used for both model training and testing, or a single data frame that will be used for model training and cross-validation.

n_trees

Number of trees. Defaults to 500.

sample_rate

Row sample rate per tree (from 0 to 1). Defaults to 1 - exp(-1), or ~ 0.632.
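As a quick check of where this default comes from, the sketch below evaluates 1 - exp(-1) and compares it with the expected share of distinct rows in a bootstrap sample of size n, which tends to the same limit. The value n = 1000 is an arbitrary illustration, and nothing here shows how beset_rf applies the rate internally.

```r
# Default sample rate as written in the Usage section
sample_rate <- 1 - exp(-1)

# Expected fraction of distinct rows when n rows are drawn with
# replacement from n rows: 1 - (1 - 1/n)^n (n is an arbitrary choice)
n <- 1000
expected_unique <- 1 - (1 - 1 / n)^n

round(c(sample_rate = sample_rate, expected_unique = expected_unique), 3)
```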

mtry

(Optional) integer number of variables randomly sampled as candidates at each split. If omitted, defaults to the square root of the number of predictors for classification and one-third the number of predictors for regression.
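The two defaults can be illustrated with a little arithmetic. This sketch assumes the values are floored and bounded below by 1, as randomForest does for its own mtry default; p = 20 is a hypothetical predictor count.

```r
# p is a hypothetical number of predictors
p <- 20

# Classification: square root of the number of predictors, floored
mtry_classification <- floor(sqrt(p))

# Regression: one-third of the number of predictors, floored, at least 1
mtry_regression <- max(floor(p / 3), 1)

c(classification = mtry_classification, regression = mtry_regression)
```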

min_obs_in_node

(Optional) integer number specifying the fewest allowed observations in a terminal node. If omitted, defaults to 1 for classification and 5 for regression.

n_folds

Integer indicating the number of folds to use for cross-validation.

n_reps

Integer indicating the number of times cross-validation should be repeated (with different randomized fold assignments).
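To make the interplay of n_folds and n_reps concrete, here is a minimal base-R sketch of repeated k-fold assignment. It is an illustration only, not the routine beset uses internally; n_obs and the small fold and repetition counts are hypothetical.

```r
n_obs   <- 25  # hypothetical number of rows
n_folds <- 5
n_reps  <- 3
set.seed(42)

# One column of fold labels per repetition; each column is a fresh shuffle
# of balanced fold labels, so every repetition partitions the rows anew
fold_ids <- replicate(n_reps, sample(rep_len(seq_len(n_folds), n_obs)))

dim(fold_ids)         # n_obs rows, n_reps columns
table(fold_ids[, 1])  # fold sizes in the first repetition
```

With the defaults n_folds = 10 and n_reps = 10, this scheme would yield 100 train/holdout splits.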

seed

Integer used to seed the random number generator when assigning observations to folds.

class_wt

Priors of the classes. Ignored for regression.

cutoff

(Classification only) A vector of length equal to number of classes. The ‘winning’ class for an observation is the one with the maximum ratio of proportion of votes to cutoff. Default is 1/k where k is the number of classes (i.e., majority vote wins).
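The ratio rule can be worked through by hand. In the sketch below the vote proportions and the custom cutoff vector are made-up values for a three-class problem; only the rule itself (largest ratio of votes to cutoff wins) comes from the description above.

```r
# Hypothetical vote proportions for one observation across three classes
votes <- c(a = 0.40, b = 0.35, c = 0.25)

# Default cutoff 1/k: dividing by a constant leaves the vote ordering intact
default_cutoff <- rep(1 / length(votes), length(votes))
winner_default <- names(votes)[which.max(votes / default_cutoff)]

# A lower cutoff for class "b" makes it easier for "b" to win
custom_cutoff <- c(0.5, 0.2, 0.3)
winner_custom <- names(votes)[which.max(votes / custom_cutoff)]

c(default = winner_default, custom = winner_custom)
```

Here class "a" wins under the default, while the custom cutoff hands the observation to class "b" because 0.35 / 0.2 is the largest ratio.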

strata

A (factor) variable that is used for stratified sampling.

parallel_type

(Optional) character string indicating the type of parallel operation to be used, either "fork" or "sock". If omitted and n_cores > 1, the default is "sock" for Windows and otherwise either "fork" or "sock" depending on which process is being run.

n_cores

Integer value indicating the number of workers to run in parallel during subset search and cross-validation. By default, this will be set to one fewer than the maximum number of physical cores you have available, as indicated by detectCores. Set to 1 to disable parallel processing.
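The default described above can be approximated as follows. This is a sketch of the stated rule rather than the package's own code; the fallback to a single core when detectCores returns NA is an added assumption.

```r
library(parallel)

# detectCores() can return NA on some platforms, so fall back to 1 core
physical <- detectCores(logical = FALSE)
if (is.na(physical)) physical <- 1L

# One fewer than the physical core count, but never below 1
n_cores <- max(physical - 1L, 1L)
n_cores
```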

cl

(Optional) parallel or snow cluster for use if parallel_type = "sock". If not supplied, a cluster on the local machine is automatically created.

x

A "beset_rf" object to plot.

metric

Prediction metric to plot. Options are mean squared error ("mse") or R-squared ("rsq") for regression, and misclassification error ("err.rate") for classification. Default "auto" plots MSE for regression and error rate for classification.
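For readers unfamiliar with the three metrics, the sketch below computes each one from small made-up vectors. The R-squared shown is one common definition (1 minus MSE over the mean squared deviation of the outcome) and is assumed here for illustration; the exact formula used by the plot method is not stated on this page.

```r
# Hypothetical observed and predicted values for a regression problem
y     <- c(3.1, 2.4, 5.0, 4.2, 3.7)
y_hat <- c(2.9, 2.8, 4.6, 4.0, 3.9)

# Mean squared error
mse <- mean((y - y_hat)^2)

# One common definition of R-squared: 1 - MSE / mean squared deviation of y
rsq <- 1 - mse / mean((y - mean(y))^2)

# Hypothetical class labels for a classification problem
truth <- factor(c("yes", "no", "yes", "yes", "no"))
pred  <- factor(c("yes", "no", "no", "yes", "no"), levels = levels(truth))

# Misclassification error: share of predictions that differ from the truth
err_rate <- mean(pred != truth)

c(mse = mse, rsq = rsq, err.rate = err_rate)
```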

...

Optional parameters to be passed to the low-level function randomForest.default.

Value

A "beset_rf" object with the following components:

forests

list of "randomForest" objects for each fold and repetition

stats

a "cross_valid" object giving cross-validation metrics

data

the data frame used to train the random forest

Methods (by generic)

  • plot(beset_rf): Plot out-of-bag (OOB) and holdout MSE, R-squared, or error rate as a function of the number of trees in the forest

Examples

# Using default 10 X 10 repeated k-fold cross-validation
data("prostate", package = "beset")
rf <- beset_rf(tumor ~ ., data = prostate)
summary(rf)
plot(rf)

# Using a single independent test set instead of cross-validation
set.seed(42)  # make the random train/test split reproducible
inTrain <- sample.int(nrow(prostate), nrow(prostate) / 2)
data <- data_partition(
  train = prostate[inTrain,], test = prostate[-inTrain,], y = "tumor"
)
rf <- beset_rf(tumor ~ ., data = data)
summary(rf)
plot(rf)

# Example with continuous outcome
rf <- beset_rf(gleason ~ ., data = data)
summary(rf)
plot(rf)

jashu/beset documentation built on April 20, 2023, 5:28 a.m.