cuml_rand_forest: Train a random forest model.
In cuml4r: R Interface for the RAPIDS cuML Suite of Libraries


View source: R/rand_forest.R

Train a random forest model for classification or regression tasks.

cuml_rand_forest(
  x,
  y = NULL,
  formula = NULL,
  mode = c("classification", "regression"),
  mtry = NULL,
  trees = NULL,
  min_n = NULL,
  bootstrap = TRUE,
  max_depth = 16,
  max_leaves = -1,
  max_predictors_per_note_split = NULL,
  n_bins = 128,
  min_samples_leaf = 1,
  split_criterion = NULL,
  min_impurity_decrease = 0,
  max_batch_size = 128,
  n_streams = 8,
  cuml_log_level = c("off", "critical", "error", "warn", "info", "debug", "trace")
)

`x`	The input matrix or dataframe. Each data point should be a row and should consist of numeric values only.
`y`	A numeric vector of desired responses.
`formula`	If 'x' is a dataframe, then an R formula of the form '<response col> ~ .' or '<response col> ~ <predictor 1> + <predictor 2> + ...' may be used to specify the response column and the predictor column(s).
`mode`	Type of task to perform. Should be either "classification" or "regression".
`mtry`	The number of predictors that will be randomly sampled at each split when creating the tree models. Default: the square root of the total number of predictors.
`trees`	An integer for the number of trees contained in the ensemble. Default: 100.
`min_n`	An integer for the minimum number of data points in a node that are required for the node to be split further. Default: 2.
`bootstrap`	Whether to perform bootstrapping. If TRUE, each tree in the forest is built on a bootstrapped sample (drawn with replacement). If FALSE, the whole dataset is used to build each tree.
`max_depth`	Maximum tree depth. Default: 16.
`max_leaves`	Maximum leaf nodes per tree. Soft constraint. Default: -1 (unlimited).
`max_predictors_per_note_split`	Number of predictors to consider per node split. Default: square root of the total number of predictors.
`n_bins`	Number of bins used by the split algorithm. Default: 128.
`min_samples_leaf`	The minimum number of data points in each leaf node. Default: 1.
`split_criterion`	The criterion used to split nodes. Can be "gini" or "entropy" for classification, and "mse" or "mae" for regression. Default: "gini" for classification; "mse" for regression.
`min_impurity_decrease`	Minimum decrease in impurity required for a node to be split. Default: 0.
`max_batch_size`	Maximum number of nodes that can be processed in a given batch. Default: 128.
`n_streams`	Number of CUDA streams to use for building trees. Default: 8.
`cuml_log_level`	Log level within cuML library functions. Must be one of "off", "critical", "error", "warn", "info", "debug", "trace". Default: "off".
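
To make `split_criterion` and `min_impurity_decrease` more concrete, the base R sketch below computes Gini impurity, entropy, and the impurity decrease of one candidate split on `iris`. The helper names (`gini_impurity`, `entropy_impurity`, `impurity_decrease`) are hypothetical and are not part of cuml4r. The size-weighted decrease is the textbook definition; cuML's internal computation may differ in detail.

```r
# Hypothetical helpers for illustration only; not part of cuml4r.

# Gini impurity: 1 - sum(p_k^2) over the class proportions p_k.
gini_impurity <- function(labels) {
  p <- as.numeric(table(labels)) / length(labels)
  1 - sum(p^2)
}

# Entropy: -sum(p_k * log2(p_k)), skipping classes with zero count.
entropy_impurity <- function(labels) {
  p <- as.numeric(table(labels)) / length(labels)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Impurity decrease of a binary split: parent impurity minus the
# size-weighted impurity of the two (non-empty) child nodes.
impurity_decrease <- function(labels, go_left, impurity = gini_impurity) {
  left <- labels[go_left]
  right <- labels[!go_left]
  n <- length(labels)
  impurity(labels) -
    (length(left) / n) * impurity(left) -
    (length(right) / n) * impurity(right)
}

# Candidate split: Petal.Length < 2.5 separates setosa from the rest.
go_left <- iris$Petal.Length < 2.5
gain <- impurity_decrease(iris$Species, go_left)
print(gain)
```

Under this reading, a candidate split would only be accepted when its decrease is at least `min_impurity_decrease`; with the default of 0, any split that does not increase impurity qualifies.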

A random forest classifier / regressor object that can be used with the 'predict' S3 generic to make predictions on new data points.

library(cuml4r)

# Classification

model <- cuml_rand_forest(
  iris,
  formula = Species ~ .,
  mode = "classification",
  trees = 100
)

predictions <- predict(model, iris)

print(predictions)

cat(
  "Number of correct predictions: ",
  sum(predictions == iris[, "Species"]),
  "\n"
)

# Regression

model <- cuml_rand_forest(
  iris,
  formula = Species ~ .,
  mode = "regression",
  trees = 100
)

predictions <- predict(model, iris)

print(predictions)
print(round(predictions))

cat(
  "Number of correct predictions: ",
  sum(as.integer(round(predictions)) == as.integer(iris[, "Species"])),
  "\n"
)
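
The examples above predict on the same rows used for training, so the counts they print are training-set figures. A possible variation, sketched below, holds out part of `iris` for evaluation. The 80/20 split and the seed are arbitrary choices made for this sketch, and the `requireNamespace()` guard is only there so the snippet can be sourced on a machine without cuml4r. Training itself is assumed to need a working cuml4r installation with a CUDA-capable GPU.

```r
# Hold-out evaluation sketch. The split logic is base R; the model calls
# reuse the cuml_rand_forest() / predict() pattern from the examples above.
set.seed(42)  # arbitrary seed so the split is reproducible
n <- nrow(iris)
train_idx <- sample(n, size = floor(0.8 * n))
train <- iris[train_idx, ]
test <- iris[-train_idx, ]

if (requireNamespace("cuml4r", quietly = TRUE)) {
  model <- cuml4r::cuml_rand_forest(
    train,
    formula = Species ~ .,
    mode = "classification",
    trees = 100
  )
  predictions <- predict(model, test)
  cat(
    "Held-out accuracy: ",
    mean(predictions == test[, "Species"]),
    "\n"
  )
} else {
  message("cuml4r is not installed; skipping model training.")
}
```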