s_Ranger: Random Forest Classification and Regression (C, R)

View source: R/s_Ranger.R


Random Forest Classification and Regression (C, R)

Description

Train a Random Forest for regression or classification using ranger.

Usage

s_Ranger(
  x,
  y = NULL,
  x.test = NULL,
  y.test = NULL,
  x.name = NULL,
  y.name = NULL,
  n.trees = 1000,
  weights = NULL,
  ifw = TRUE,
  ifw.type = 2,
  ifw.case.weights = TRUE,
  ifw.class.weights = FALSE,
  upsample = FALSE,
  downsample = FALSE,
  resample.seed = NULL,
  autotune = FALSE,
  classwt = NULL,
  n.trees.try = 500,
  stepFactor = 2,
  mtry = NULL,
  mtryStart = NULL,
  inbag.resample = NULL,
  stratify.on.y = FALSE,
  grid.resample.params = setup.resample("kfold", 5),
  gridsearch.type = c("exhaustive", "randomized"),
  gridsearch.randomized.p = 0.1,
  metric = NULL,
  maximize = NULL,
  probability = NULL,
  importance = "impurity",
  local.importance = FALSE,
  replace = TRUE,
  min.node.size = NULL,
  splitrule = NULL,
  strata = NULL,
  sampsize = if (replace) nrow(x) else ceiling(0.632 * nrow(x)),
  tune.do.trace = FALSE,
  imetrics = FALSE,
  n.cores = rtCores,
  print.tune.plot = FALSE,
  print.plot = FALSE,
  plot.fitted = NULL,
  plot.predicted = NULL,
  plot.theme = rtTheme,
  question = NULL,
  grid.verbose = verbose,
  verbose = TRUE,
  outdir = NULL,
  save.mod = ifelse(!is.null(outdir), TRUE, FALSE),
  ...
)

Arguments

x

Numeric vector or matrix / data frame of features i.e. independent variables

y

Numeric vector of outcome, i.e. dependent variable

x.test

Numeric vector or matrix / data frame of testing set features. Columns must correspond to columns in x

y.test

Numeric vector of testing set outcome

x.name

Character: Name for feature set

y.name

Character: Name for outcome

n.trees

Integer: Number of trees to grow. Default = 1000

weights

Numeric vector: Weights for cases. For classification, weights takes precedence over ifw: if weights are provided, ifw is not used. Leave NULL if setting ifw = TRUE.

ifw

Logical: If TRUE, apply inverse frequency weighting (for Classification only). Note: If weights are provided, ifw is not used.

ifw.type

Integer, 0, 1, or 2. 0: inverse class frequency weights; 1: class.weights as in 0, divided by min(class.weights); 2: class.weights as in 0, divided by max(class.weights)
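The normalization implied by ifw.type can be sketched in base R as follows (a hypothetical helper, not part of rtemis; it assumes type 0 means raw inverse class frequencies):

```r
# Hypothetical helper illustrating ifw.type normalization.
# Assumes type 0 = raw inverse class frequencies (1 / class count).
ifw_weights <- function(y, type = 2) {
  freq <- table(y)
  w <- 1 / as.numeric(freq)        # type 0: inverse class frequency
  names(w) <- names(freq)
  if (type == 1) w <- w / min(w)   # type 1: smallest weight becomes 1
  if (type == 2) w <- w / max(w)   # type 2: largest weight becomes 1
  w
}

y <- factor(c(rep("a", 90), rep("b", 10)))
ifw_weights(y, type = 2)  # majority class weighted down relative to minority
```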

ifw.case.weights

Logical: If TRUE, define ranger's case.weights using IPW. Default = TRUE. Note: case.weights cannot be used together with stratify.on.y or inbag.resample

ifw.class.weights

Logical: If TRUE, define ranger's class.weights using IPW. Default = FALSE

upsample

Logical: If TRUE, upsample training set cases not belonging to the majority outcome class

downsample

Logical: If TRUE, downsample majority class to match size of minority class

resample.seed

Integer: If provided, will be used to set the seed during upsampling. Default = NULL (random seed)

autotune

Logical: If TRUE, use randomForest::tuneRF to determine mtry

classwt

Vector, Float: Priors of the classes for randomForest::tuneRF if autotune = TRUE. For classification only; need not add up to 1

n.trees.try

Integer: Number of trees to train for tuning, if autotune = TRUE

stepFactor

Float: If autotune = TRUE, at each tuning iteration, mtry is multiplied or divided by this value. Default = 2

mtry

[gS] Integer: Number of features sampled randomly at each split. Defaults to the square root of the number of features for classification, and one third of the number of features for regression.

mtryStart

Integer: If autotune = TRUE, start at this value for mtry

inbag.resample

List, length n.trees: Output of setup.resample defining the resamples used for each tree. Default = NULL

stratify.on.y

Logical: If TRUE, overrides inbag.resample to use stratified bootstraps for each tree. This can help improve test set performance in imbalanced datasets. Default = FALSE. Note: Cannot be used with ifw.case.weights

grid.resample.params

List: Output of setup.resample defining grid search parameters.

gridsearch.type

Character: Type of grid search to perform: "exhaustive" or "randomized".

gridsearch.randomized.p

Float (0, 1): If gridsearch.type = "randomized", randomly test this proportion of combinations.

metric

Character: Metric to minimize, or maximize if maximize = TRUE, during grid search. Default = NULL, which results in "Balanced Accuracy" for Classification, "MSE" for Regression, and "Concordance" for Survival Analysis.

maximize

Logical: If TRUE, metric will be maximized if grid search is run.

probability

Logical: If TRUE, grow a probability forest. See ranger::ranger. Default = FALSE

importance

Character: "none", "impurity", "impurity_corrected", or "permutation". Default = "impurity"

local.importance

Logical: If TRUE, return local importance values. Only applicable if importance is set to "permutation".

replace

Logical: If TRUE, sample cases with replacement during training.

min.node.size

[gS] Integer: Minimum node size

splitrule

Character: For classification: "gini" (Default) or "extratrees". For regression: "variance" (Default), "extratrees", or "maxstat". For survival: "logrank" (Default), "extratrees", "C", or "maxstat".

strata

Vector, Factor: Will be used for stratified sampling

sampsize

Integer: Size of sample to draw. In Classification, if strata is defined, this can be a vector of the same length, in which case, corresponding values determine how many cases are drawn from the strata.

tune.do.trace

Logical: Passed as do.trace to randomForest::tuneRF when autotune = TRUE

imetrics

Logical: If TRUE, calculate interpretability metrics (N of trees and N of nodes) and save under the extra field of rtMod

n.cores

Integer: Number of cores to use.

print.tune.plot

Logical: passed to randomForest::tuneRF.

print.plot

Logical: if TRUE, produce plot using mplot3. Takes precedence over plot.fitted and plot.predicted.

plot.fitted

Logical: if TRUE, plot True (y) vs Fitted

plot.predicted

Logical: if TRUE, plot True (y.test) vs Predicted. Requires x.test and y.test

plot.theme

Character: "zero", "dark", "box", "darkbox"

question

Character: the question you are attempting to answer with this model, in plain language.

grid.verbose

Logical: Passed to gridSearchLearn

verbose

Logical: If TRUE, print summary to screen.

outdir

String, Optional: Path to directory to save output

save.mod

Logical: If TRUE, save all output to an RDS file in outdir. save.mod is TRUE by default if an outdir is defined. If set to TRUE and no outdir is defined, outdir defaults to paste0("./s.", mod.name)

...

Additional arguments to be passed to ranger::ranger

Details

You should consider, or try, setting mtry to NCOL(x), especially for a small number of features. By default, mtry is set to NCOL(x) when NCOL(x) <= 20. For imbalanced datasets, setting stratify.on.y = TRUE should improve test set performance. If autotune = TRUE, randomForest::tuneRF will be run to determine the best mtry value. [gS]: The indicated parameter will be tuned by grid search if more than one value is passed.
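A minimal usage sketch, assuming rtemis and ranger are installed, using a small synthetic classification problem (the data and parameter values below are illustrative only):

```r
library(rtemis)

set.seed(2024)
x <- data.frame(matrix(rnorm(200 * 10), nrow = 200))
y <- factor(ifelse(x$X1 + rnorm(200) > 0, "pos", "neg"))

# [gS] parameters (mtry, min.node.size) accept vectors,
# which triggers grid search with grid.resample.params
mod <- s_Ranger(
  x, y,
  n.trees = 500,
  mtry = c(3, 5, 10),          # tuned by grid search
  min.node.size = c(1, 5),     # tuned by grid search
  stratify.on.y = TRUE,        # stratified bootstraps per tree
  ifw.case.weights = FALSE,    # cannot combine with stratify.on.y
  print.plot = FALSE
)
```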

See Tech Report comparing balanced (ifw.case.weights = TRUE) and weighted (ifw.class.weights = TRUE) Random Forests.
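The balanced and weighted variants compared in the Tech Report differ only in which IPW argument is enabled. A hedged sketch, assuming x and y are a training feature set and a factor outcome already in scope:

```r
# Balanced RF: inverse-frequency case weights (the default behavior)
mod_balanced <- s_Ranger(x, y,
                         ifw = TRUE,
                         ifw.case.weights = TRUE,
                         ifw.class.weights = FALSE)

# Weighted RF: inverse-frequency class weights instead of case weights
mod_weighted <- s_Ranger(x, y,
                         ifw = TRUE,
                         ifw.case.weights = FALSE,
                         ifw.class.weights = TRUE)
```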

Value

rtMod object

Author(s)

E.D. Gennatas

See Also

train_cv for external cross-validation

Other Supervised Learning: s_AdaBoost(), s_AddTree(), s_BART(), s_BRUTO(), s_BayesGLM(), s_C50(), s_CART(), s_CTree(), s_EVTree(), s_GAM(), s_GBM(), s_GLM(), s_GLMNET(), s_GLMTree(), s_GLS(), s_H2ODL(), s_H2OGBM(), s_H2ORF(), s_HAL(), s_KNN(), s_LDA(), s_LM(), s_LMTree(), s_LightCART(), s_LightGBM(), s_MARS(), s_MLRF(), s_NBayes(), s_NLA(), s_NLS(), s_NW(), s_PPR(), s_PolyMARS(), s_QDA(), s_QRNN(), s_RF(), s_RFSRC(), s_SDA(), s_SGD(), s_SPLS(), s_SVM(), s_TFN(), s_XGBoost(), s_XRF()

Other Tree-based methods: s_AdaBoost(), s_AddTree(), s_BART(), s_C50(), s_CART(), s_CTree(), s_EVTree(), s_GBM(), s_GLMTree(), s_H2OGBM(), s_H2ORF(), s_LMTree(), s_LightCART(), s_LightGBM(), s_MLRF(), s_RF(), s_RFSRC(), s_XGBoost(), s_XRF()

Other Ensembles: s_AdaBoost(), s_GBM(), s_RF()


egenn/rtemis documentation built on Nov. 22, 2024, 4:12 a.m.