s_MLRF: Spark MLlib Random Forest (C, R)

View source: R/s_MLRF.R

s_MLRF R Documentation

Spark MLlib Random Forest (C, R)

Description

Train an MLlib Random Forest model on Spark

Usage

s_MLRF(
  x,
  y = NULL,
  x.test = NULL,
  y.test = NULL,
  upsample = FALSE,
  downsample = FALSE,
  resample.seed = NULL,
  n.trees = 500L,
  max.depth = 30L,
  subsampling.rate = 1,
  min.instances.per.node = 1,
  feature.subset.strategy = "auto",
  max.bins = 32L,
  x.name = NULL,
  y.name = NULL,
  spark.master = "local",
  print.plot = FALSE,
  plot.fitted = NULL,
  plot.predicted = NULL,
  plot.theme = rtTheme,
  question = NULL,
  verbose = TRUE,
  trace = 0,
  outdir = NULL,
  save.mod = ifelse(!is.null(outdir), TRUE, FALSE),
  ...
)

Arguments

x

vector, matrix or dataframe of training set features

y

vector of outcomes

x.test

vector, matrix or dataframe of testing set features

y.test

vector of testing set outcomes

upsample

Logical: If TRUE, upsample cases to balance outcome classes (for Classification only). Note: upsample will randomly sample with replacement if the length of the majority class is more than double the length of the class you are upsampling, thereby introducing randomness.

downsample

Logical: If TRUE, downsample majority class to match size of minority class

resample.seed

Integer: If provided, will be used to set the seed during upsampling. Default = NULL (random seed)

n.trees

Integer: Number of trees to train

max.depth

Integer: Max depth of each tree

subsampling.rate

Numeric: Fraction of cases to use for training each tree

min.instances.per.node

Integer: Min N of cases per node.

feature.subset.strategy

Character: The number of features to consider for splits at each tree node. Supported options: "auto" (choose automatically based on the task: if numTrees == 1, set to "all"; if numTrees > 1, i.e. a forest, set to "sqrt" for Classification and to "onethird" for Regression), "all" (use all features), "onethird" (use 1/3 of the features), "sqrt" (use sqrt(number of features)), "log2" (use log2(number of features)), or "n", where n in (0, 1] means use n * number of features and n in (1, number of features) means use n features. Default = "auto". See the arithmetic sketch after this argument list.

max.bins

Integer: Max N of bins used for discretizing continuous features and for choosing how to split on features at each node. More bins give higher granularity.

x.name

Character: Name for feature set

y.name

Character: Name for outcome

spark.master

Spark cluster URL or "local"

print.plot

Logical: if TRUE, produce plot using mplot3. Takes precedence over plot.fitted and plot.predicted.

plot.fitted

Logical: if TRUE, plot True (y) vs Fitted

plot.predicted

Logical: if TRUE, plot True (y.test) vs Predicted. Requires x.test and y.test

plot.theme

Character: "zero", "dark", "box", "darkbox"

question

Character: the question you are attempting to answer with this model, in plain language.

verbose

Logical: If TRUE, print summary to screen.

trace

Integer: If higher than 0, will print more information to the console.

outdir

Character: Path to output directory. If defined, will save the Predicted vs. True plot, if available, as well as the full model output, if save.mod is TRUE.

save.mod

Logical: If TRUE, save all output to an RDS file in outdir. save.mod is TRUE by default if an outdir is defined. If set to TRUE and no outdir is defined, outdir defaults to paste0("./s.", mod.name).

...

Additional arguments
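
To make the "n" option of feature.subset.strategy concrete, here is a small arithmetic sketch in plain R (no Spark call; the feature count of 20 is illustrative, and the exact rounding applied is an internal Spark detail):

p <- 20            # illustrative number of features
floor(sqrt(p))     # "sqrt":     consider about 4 features per split
floor(p / 3)       # "onethird": consider about 6 features per split
floor(log2(p))     # "log2":     consider about 4 features per split
floor(0.5 * p)     # "0.5":      n in (0, 1] is a fraction of features -> about 10
8                  # "8":        n in (1, p) is an absolute count -> 8 features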

Details

The overhead incurred by Spark means this is best suited to large datasets on a Spark cluster.

See also: Spark MLlib documentation
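
A minimal usage sketch, assuming a working local Spark installation accessible from R; the simulated data and hyperparameter values below are illustrative only:

# Simulated regression data
set.seed(2024)
x <- as.data.frame(matrix(rnorm(200 * 10), nrow = 200))
y <- x[[1]] + x[[2]]^2 + rnorm(200)

# Train on a local Spark instance; replace spark.master with a cluster URL
# (e.g. "spark://host:7077") to train on a cluster
mod <- s_MLRF(x, y,
              n.trees = 200L,
              max.depth = 10L,
              spark.master = "local")

Pass x.test and y.test to also obtain held-out predictions, or use train_cv for external cross-validation (see "See Also" below).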

Value

rtMod object

Author(s)

E.D. Gennatas

See Also

train_cv for external cross-validation

Other Supervised Learning: s_AdaBoost(), s_AddTree(), s_BART(), s_BRUTO(), s_BayesGLM(), s_C50(), s_CART(), s_CTree(), s_EVTree(), s_GAM(), s_GAM.default(), s_GAM.formula(), s_GBM(), s_GLM(), s_GLMNET(), s_GLMTree(), s_GLS(), s_H2ODL(), s_H2OGBM(), s_H2ORF(), s_HAL(), s_KNN(), s_LDA(), s_LM(), s_LMTree(), s_LightCART(), s_LightGBM(), s_MARS(), s_NBayes(), s_NLA(), s_NLS(), s_NW(), s_PPR(), s_PolyMARS(), s_QDA(), s_QRNN(), s_RF(), s_RFSRC(), s_Ranger(), s_SDA(), s_SGD(), s_SPLS(), s_SVM(), s_TFN(), s_XGBoost(), s_XRF()

Other Tree-based methods: s_AdaBoost(), s_AddTree(), s_BART(), s_C50(), s_CART(), s_CTree(), s_EVTree(), s_GBM(), s_GLMTree(), s_H2OGBM(), s_H2ORF(), s_LMTree(), s_LightCART(), s_LightGBM(), s_RF(), s_RFSRC(), s_Ranger(), s_XGBoost(), s_XRF()

