beset_rf: Beset Random Forest
In jashu/beset: Best Subset Predictive Modeling

Description

beset_rf is a wrapper around randomForest that estimates the predictive performance of the random forest using repeated k-fold cross-validation. beset_rf ensures that the correct arguments are passed to randomForest and that enough information is retained for compatibility with beset methods such as variable importance and partial dependence.
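
As a rough illustration of what repeated k-fold cross-validation involves, the base-R sketch below assigns observations to folds several times over. It is not beset's internal code: `make_repeated_folds()` is a hypothetical helper, and the 103 observations are an arbitrary number chosen for the example.

```r
# Illustrative sketch only; make_repeated_folds() is a hypothetical helper,
# not a function exported by beset.
make_repeated_folds <- function(n_obs, n_folds = 10, n_reps = 10, seed = 42) {
  set.seed(seed)
  # rep_len() yields balanced fold labels; sample() shuffles them.
  # replicate() returns one column of fold labels per repetition.
  replicate(n_reps, sample(rep_len(seq_len(n_folds), n_obs)))
}

folds <- make_repeated_folds(n_obs = 103, n_folds = 10, n_reps = 10)
dim(folds)         # rows are observations, columns are repetitions
table(folds[, 1])  # fold sizes within a repetition differ by at most one

# Fold 3 of repetition 1: these rows are held out, the rest train the forest.
test_rows  <- which(folds[, 1] == 3)
train_rows <- which(folds[, 1] != 3)
```

With `n_folds = 10` and `n_reps = 10`, a scheme like this fits 100 forests in total, which is why reducing either value is the simplest way to shorten run time.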

Usage

beset_rf(
  form,
  data,
  n_trees = 500,
  sample_rate = 1 - exp(-1),
  mtry = NULL,
  min_obs_in_node = NULL,
  n_folds = 10,
  n_reps = 10,
  seed = 42,
  class_wt = NULL,
  cutoff = NULL,
  strata = NULL,
  parallel_type = NULL,
  n_cores = NULL,
  cl = NULL
)

## S3 method for class 'beset_rf'
plot(x, metric = c("auto", "mse", "rsq", "err.rate"), ...)

Arguments

`form`	A model `formula`.
`data`	Either a `data_partition` object containing data sets to be used for both model training and testing, or a single data frame that will be used for model training and cross-validation.
`n_trees`	Number of trees. Defaults to 500.
`sample_rate`	Row sample rate per tree (from `0` to `1`). Defaults to `1 - exp(-1)`, or approximately `0.632`.
`mtry`	(Optional) `integer` number of variables randomly sampled as candidates at each split. If omitted, defaults to the square root of the number of predictors for classification and one-third the number of predictors for regression.
`min_obs_in_node`	(Optional) `integer` number specifying the fewest allowed observations in a terminal node. If omitted, defaults to 1 for classification and 5 for regression.
`n_folds`	`Integer` indicating the number of folds to use for cross-validation.
`n_reps`	`Integer` indicating the number of times cross-validation should be repeated (with different randomized fold assignments).
`seed`	`Integer` used to seed the random number generator when assigning observations to folds.
`class_wt`	Priors of the classes. Ignored for regression.
`cutoff`	(Classification only) A vector of length equal to the number of classes. The ‘winning’ class for an observation is the one with the maximum ratio of proportion of votes to cutoff. Default is 1/k where k is the number of classes (i.e., the class with the most votes wins).
`strata`	A (factor) variable that is used for stratified sampling.
`parallel_type`	(Optional) character string indicating the type of parallel operation to be used, either `"fork"` or `"sock"`. If omitted and `n_cores > 1`, the default is `"sock"` for Windows and otherwise either `"fork"` or `"sock"` depending on which process is being run.
`n_cores`	Integer value indicating the number of workers to run in parallel during subset search and cross-validation. By default, this will be set to one fewer than the maximum number of physical cores you have available, as indicated by `detectCores`. Set to 1 to disable parallel processing.
`cl`	(Optional) `parallel` or `snow` cluster for use if `parallel_type = "sock"`. If not supplied, a cluster on the local machine is automatically created.
`x`	A `"beset_rf"` object to plot.
`metric`	Prediction metric to plot. Options are mean squared error (`"mse"`) or R-squared (`"rsq"`) for regression, and misclassification error (`"err.rate"`) for classification. Default `"auto"` plots MSE for regression and error rate for classification.
`...`	Optional parameters to be passed to the low-level function `randomForest.default`.
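
The defaults described for `sample_rate`, `mtry`, and `cutoff` can be checked with a few lines of base R. This is a sketch under stated assumptions: `default_mtry()` and `pick_class()` are hypothetical helpers written for illustration, the rounding in `default_mtry()` assumes the convention used by randomForest (round down, with a minimum of 1 for regression), and the class names and vote proportions are made up.

```r
# sample_rate: the default equals the limiting fraction of distinct rows
# in a bootstrap sample, 1 - (1 - 1/n)^n, as n grows.
default_rate <- 1 - exp(-1)
round(default_rate, 3)
n <- 1000
1 - (1 - 1 / n)^n

# mtry: hypothetical helper mirroring the documented defaults
# (rounding behaviour is an assumption, see the lead-in).
default_mtry <- function(n_predictors, type = c("classification", "regression")) {
  type <- match.arg(type)
  if (type == "classification") {
    floor(sqrt(n_predictors))
  } else {
    max(floor(n_predictors / 3), 1)
  }
}
default_mtry(9, "classification")
default_mtry(9, "regression")

# cutoff: the winning class maximizes (proportion of votes) / cutoff.
pick_class <- function(votes, cutoff = rep(1 / length(votes), length(votes))) {
  names(votes)[which.max(votes / cutoff)]
}
votes <- c(benign = 0.55, malignant = 0.45)  # made-up vote proportions
pick_class(votes)                 # default 1/k cutoff: most votes wins
pick_class(votes, c(0.7, 0.3))    # a lower cutoff favours the second class
```

Lowering a class's cutoff makes that class easier to predict, which can be useful when one kind of misclassification is more costly than the other.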

Value

A "beset_rf" object with the following components:

forests: list of "randomForest" objects for each fold and repetition
stats: a "cross_valid" object giving cross-validation metrics
data: the data frame used to train the random forest
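
The components listed above can be inspected with ordinary list access. The following is a minimal sketch, assuming the beset package is installed and that its `prostate` data set contains the `gleason` column used in the examples on this page; the reduced `n_trees`, `n_folds`, and `n_reps` values are arbitrary choices to keep the run short.

```r
# Sketch only: runs if beset is installed, otherwise does nothing.
if (requireNamespace("beset", quietly = TRUE)) {
  library(beset)
  data("prostate", package = "beset")

  # Smaller settings than the defaults, chosen only to keep this quick.
  rf_small <- beset_rf(
    gleason ~ ., data = prostate,
    n_trees = 200, n_folds = 5, n_reps = 2,
    seed = 1, n_cores = 1
  )

  length(rf_small$forests)  # number of stored "randomForest" objects
  class(rf_small$stats)     # cross-validation metrics
  dim(rf_small$data)        # data used for training

  # R-squared is one of the documented metrics for a continuous outcome.
  plot(rf_small, metric = "rsq")
}
```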

Methods (by generic)

plot(beset_rf): Plot OOB and holdout MSE, R-squared, or error rate as a function of number of trees in forest

Examples

# Using default 10 X 10 repeated k-fold cross-validation
data("prostate", package = "beset")
rf <- beset_rf(tumor ~ ., data = prostate)
summary(rf)
plot(rf)

# Using a single independent test set instead of cross-validation
set.seed(42)  # make the random train/test split reproducible
inTrain <- sample.int(nrow(prostate), nrow(prostate)/2)
data <- data_partition(
  train = prostate[inTrain,], test = prostate[-inTrain,], y = "tumor"
)
rf <- beset_rf(tumor ~ ., data = data)
summary(rf)
plot(rf)

# Example with continuous outcome
rf <- beset_rf(gleason ~ ., data = data)
summary(rf)
plot(rf)

jashu/beset documentation built on April 20, 2023, 5:28 a.m.