cuml_rand_forest: Train a random forest model.


View source: R/rand_forest.R

Description

Train a random forest model for classification or regression tasks.

Usage

cuml_rand_forest(
  x,
  y = NULL,
  formula = NULL,
  mode = c("classification", "regression"),
  mtry = NULL,
  trees = NULL,
  min_n = NULL,
  bootstrap = TRUE,
  max_depth = 16,
  max_leaves = -1,
  max_predictors_per_note_split = NULL,
  n_bins = 128,
  min_samples_leaf = 1,
  split_criterion = NULL,
  min_impurity_decrease = 0,
  max_batch_size = 128,
  n_streams = 8,
  cuml_log_level = c("off", "critical", "error", "warn", "info", "debug", "trace")
)

Arguments

x

The input matrix or dataframe. Each data point should be a row and should consist of numeric values only.

y

A numeric vector of desired responses.

formula

If 'x' is a dataframe, an R formula of the form '<response col> ~ .' or '<response col> ~ <predictor 1> + <predictor 2> + ...' may be used to specify the response column and the predictor column(s).

mode

Type of task to perform. Should be either "classification" or "regression".

mtry

The number of predictors that will be randomly sampled at each split when creating the tree models. Default: the square root of the total number of predictors.

trees

An integer for the number of trees contained in the ensemble. Default: 100.

min_n

An integer for the minimum number of data points in a node that are required for the node to be split further. Default: 2.

bootstrap

Whether to perform bootstrap. If TRUE, each tree in the forest is built on a bootstrapped sample with replacement. If FALSE, the whole dataset is used to build each tree.

max_depth

Maximum tree depth. Default: 16.

max_leaves

Maximum leaf nodes per tree. Soft constraint. Default: -1 (unlimited).

max_predictors_per_note_split

Number of predictors to consider per node split. Default: the square root of the total number of predictors.

n_bins

Number of bins used by the split algorithm. Default: 128.

min_samples_leaf

The minimum number of data points in each leaf node. Default: 1.

split_criterion

The criterion used to split nodes. Can be "gini" or "entropy" for classification, and "mse" or "mae" for regression. Default: "gini" for classification; "mse" for regression.

min_impurity_decrease

Minimum decrease in impurity required for a node to be split. Default: 0.

max_batch_size

Maximum number of nodes that can be processed in a given batch. Default: 128.

n_streams

Number of CUDA streams to use for building trees. Default: 8.

cuml_log_level

Log level within cuML library functions. Must be one of "off", "critical", "error", "warn", "info", "debug", "trace". Default: "off".
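
The sampling and impurity arguments above ('bootstrap', 'mtry', 'split_criterion', 'min_impurity_decrease') can be illustrated in base R. This is a conceptual sketch only and does not reflect how cuML implements these steps internally. The helper names 'gini', 'entropy' and 'impurity_decrease' are hypothetical, the rounding of the square-root default with floor() is an assumption, and the weighted form of the impurity decrease is the textbook definition rather than a confirmed cuML formula.

```r
# Base R illustration only; not cuML's implementation.
set.seed(42)

n <- nrow(iris)
predictors <- setdiff(names(iris), "Species")

# bootstrap = TRUE: each tree is built on n rows drawn with replacement.
boot_rows <- sample(n, size = n, replace = TRUE)

# mtry: number of predictors sampled at a split.
# Assumption: the square-root default is rounded down.
mtry <- floor(sqrt(length(predictors)))
split_candidates <- sample(predictors, size = mtry)

# split_criterion = "gini": Gini impurity of the labels in a node.
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# split_criterion = "entropy": Shannon entropy (in bits) of the labels.
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]  # drop empty classes so 0 * log2(0) is never evaluated
  -sum(p * log2(p))
}

# min_impurity_decrease: impurity of the parent node minus the
# size-weighted impurity of the two children of a candidate split.
impurity_decrease <- function(labels, go_left, impurity = gini) {
  left <- labels[go_left]
  right <- labels[!go_left]
  children <- (length(left) * impurity(left) +
                 length(right) * impurity(right)) / length(labels)
  impurity(labels) - children
}

# Example: splitting iris on Petal.Length < 2.5 isolates one species.
decrease <- impurity_decrease(iris$Species, iris$Petal.Length < 2.5)
print(decrease)
```

A candidate split whose decrease falls below 'min_impurity_decrease' would not be made under this definition.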

Value

A random forest classifier / regressor object that can be used with the 'predict' S3 generic to make predictions on new data points.
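
As a complement to the formula interface, the following sketch passes a numeric matrix 'x' and a numeric vector 'y' directly, then calls 'predict' on the returned object. It assumes that 'predict' accepts the same matrix used for training and returns a numeric vector in regression mode. The call is guarded so that the script still runs where cuml4r is not installed.

```r
# Build a numeric predictor matrix and a numeric response from iris.
x <- as.matrix(iris[, c("Sepal.Width", "Petal.Length", "Petal.Width")])
y <- iris$Sepal.Length

# Train and predict only if cuml4r is available on this machine.
if (requireNamespace("cuml4r", quietly = TRUE)) {
  model <- cuml4r::cuml_rand_forest(x, y, mode = "regression", trees = 100)
  predictions <- predict(model, x)
  # Root-mean-square error on the training data (optimistic by construction).
  cat("Training RMSE:", sqrt(mean((predictions - y)^2)), "\n")
}
```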

Examples

library(cuml4r)

# Classification

model <- cuml_rand_forest(
  iris,
  formula = Species ~ .,
  mode = "classification",
  trees = 100
)

predictions <- predict(model, iris)

print(predictions)

cat(
  "Number of correct predictions: ",
  sum(predictions == iris[, "Species"]),
  "\n"
)

# Regression

model <- cuml_rand_forest(
  iris,
  formula = Species ~ .,
  mode = "regression",
  trees = 100
)

predictions <- predict(model, iris)

print(predictions)
print(round(predictions))

cat(
  "Number of correct predictions: ",
  sum(as.integer(round(predictions)) == as.integer(iris[, "Species"])),
  "\n"
)
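
# Hold-out evaluation

The examples above score the model on the same rows it was trained on, which tends to overstate accuracy. A minimal sketch of a hold-out evaluation follows. The 70/30 split and the 'accuracy' helper are illustrative choices rather than part of cuml4r, and the sketch assumes 'predict' returns labels comparable to 'Species', as in the classification example above.

```r
set.seed(1)

# Hold out 30% of the rows for evaluation.
test_rows <- sample(nrow(iris), size = round(0.3 * nrow(iris)))
train <- iris[-test_rows, ]
test <- iris[test_rows, ]

# Fraction of predictions that match the true labels (hypothetical helper).
accuracy <- function(predicted, actual) {
  mean(as.character(predicted) == as.character(actual))
}

# Train and evaluate only if cuml4r is available on this machine.
if (requireNamespace("cuml4r", quietly = TRUE)) {
  model <- cuml4r::cuml_rand_forest(
    train,
    formula = Species ~ .,
    mode = "classification",
    trees = 100
  )
  predictions <- predict(model, test)
  cat("Hold-out accuracy:", accuracy(predictions, test$Species), "\n")
  # Confusion matrix of predicted versus actual species.
  print(table(predicted = predictions, actual = test$Species))
}
```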

cuml4r documentation built on July 26, 2021, 9:06 a.m.