train_forest: Train a Recforest Model

View source: R/forest.R

train_forest    R Documentation

Description

This function trains a recforest model using the provided data and parameters.

Usage

train_forest(
  data,
  id_var,
  covariates,
  event,
  time_vars = c("t.start", "t.stop"),
  death_var = NULL,
  n_trees,
  n_bootstrap = NULL,
  seed = NULL,
  mtry,
  minsplit,
  nodesize,
  method,
  min_score,
  max_nodes,
  parallel = FALSE,
  verbose = TRUE
)

Arguments

data

A data frame containing the dataset to be used for training the model.

id_var

The name of the column containing the unique identifier for each subject.

covariates

A character vector containing the names of the columns to be used as predictors in the model.

event

The name of the column containing the recurrent event indicator.

time_vars

A length-2 character vector containing the names of the columns representing the start and stop times (default "t.start" and "t.stop").

death_var

The name of the column containing the death indicator or any other terminal event (optional).

n_trees

The number of trees to be trained in the recforest model.

n_bootstrap

The number of bootstrap samples to be used for training each tree (in-bag sample). If not provided, it is set to 2/3 of the sample size (in terms of the number of unique id_var values).

seed

An optional seed value used for reproducibility (NULL by default).

mtry

The number of candidate variables randomly drawn at each node of the trees. This parameter should be tuned by minimizing the OOB error.

minsplit

The minimal number of events required to split the node. Cannot be smaller than 2.

nodesize

The minimal number of subjects required in each child node for a split to be made. Cannot be smaller than 1.

method

The method to be used for training the model. Two methods are currently supported: "NAa", the Nelson-Aalen method, for data with no terminal event and no longitudinal time-dependent features; and "GL", the Ghosh-Lin modeling step, for data with a terminal event and/or at least one longitudinal time-dependent feature.

min_score

The minimum score required to split a node. This parameter is used only when the method is set to "NAa".

max_nodes

The maximum number of nodes per tree.

parallel

A logical value indicating whether to use parallel processing for training the trees.

verbose

A logical value indicating whether to print progress messages.
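
The default and the lower bounds documented for the arguments above can be sketched in base R. This is only an illustration, not the package's internal code: the helper name check_forest_args is hypothetical, and converting the 2/3 fraction to an integer with round() is an assumption, since the rounding rule is not stated here.

```r
# Hypothetical helper (not part of recforest): mirrors the documented
# default for n_bootstrap and the documented lower bounds.
check_forest_args <- function(data, id_var, n_bootstrap = NULL,
                              minsplit, nodesize) {
  # Sample size is counted in unique subjects, not in rows.
  n_subjects <- length(unique(data[[id_var]]))
  if (is.null(n_bootstrap)) {
    # Assumption: 2/3 of the unique subjects, rounded to an integer.
    n_bootstrap <- round(2 / 3 * n_subjects)
  }
  # Documented bounds: minsplit >= 2 and nodesize >= 1.
  stopifnot(minsplit >= 2, nodesize >= 1)
  list(n_subjects = n_subjects, n_bootstrap = n_bootstrap)
}

# Toy data: 6 subjects with 2 rows (time intervals) each.
toy_ids <- data.frame(id = rep(1:6, each = 2))
args_checked <- check_forest_args(toy_ids, "id", minsplit = 3, nodesize = 1)
print(args_checked$n_bootstrap)
```

Counting unique identifiers rather than rows matters because each subject usually contributes several (t.start, t.stop) intervals.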

Details

The recforest model aggregates predictions over an ensemble of trees, each constructed using a set of decision nodes based on specific splitting rules. At each node, a subset of predictors is randomly selected, and an optimal split is determined using an appropriate statistical test. Depending on the specified method, the algorithm employs different statistical tests to find the best split:

  • For standard recurrent event data, the pseudo-score test statistic is used to compare two Nelson-Aalen estimates of the mean cumulative function.

  • In the presence of terminal events and/or longitudinal variables, the Ghosh-Lin model is utilized to obtain the Wald test statistic, which provides a more accurate assessment of the split.

The trees grow until they meet the stopping criteria, which include a minimum number of events (minsplit) and a minimum number of individuals in terminal nodes (nodesize). The final model is an ensemble of these trees, which helps to reduce overfitting and improve predictive performance by averaging the results on the out-of-bag sample.
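
To make the first case more concrete, the Nelson-Aalen estimate of the mean cumulative function can be sketched in base R for counting-process data. This is a minimal illustration under stated assumptions, not the code recforest uses: the function name nelson_aalen_mcf and the toy values are hypothetical, and a row is treated as at risk at time t when t.start < t <= t.stop.

```r
# Nelson-Aalen estimate of the mean cumulative function (MCF) for
# counting-process data with one row per (t.start, t.stop] interval.
nelson_aalen_mcf <- function(t_start, t_stop, event) {
  event_times <- sort(unique(t_stop[event == 1]))
  increments <- vapply(event_times, function(t) {
    at_risk  <- sum(t_start < t & t <= t_stop)  # rows under observation at t
    n_events <- sum(t_stop == t & event == 1)   # events occurring at t
    n_events / at_risk
  }, numeric(1))
  data.frame(time = event_times, mcf = cumsum(increments))
}

# Toy data (hypothetical values): 3 subjects with recurrent events.
toy <- data.frame(
  id      = c(1, 1, 1, 2, 2, 3),
  t.start = c(0, 2, 5, 0, 3, 0),
  t.stop  = c(2, 5, 8, 3, 8, 6),
  event   = c(1, 1, 0, 1, 0, 0)
)
mcf <- nelson_aalen_mcf(toy$t.start, toy$t.stop, toy$event)
print(mcf)  # MCF is 1/3, 2/3 and 1 at times 2, 3 and 5
```

A candidate split would compare two such curves, one per child node. The pseudo-score statistic used for that comparison is not reproduced here.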

Value

A list containing the following elements:

trees

A list of trained trees.

tree_metrics

A list of metrics for each tree.

metrics

A summary of the metrics for all trees.

columns

A list of column names used in the training.

params

A list of parameters used to set the model.

n_indiv

Number of individuals in the dataset.

n_predictors

Number of predictors used in the model.

n_trees

Number of trees trained.

n_bootstrap

Number of bootstrap samples used to grow each tree.

time

Computation time used to train the model.
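
As a sketch of how the elements listed above might be read back, the following fits a small forest and inspects a few of them with the `$` operator. It assumes the recforest package is installed and that the returned object can be indexed like an ordinary list. The object name fit is arbitrary, and the argument values are copied from the package example.

```r
# Runs only when the recforest package is available.
if (requireNamespace("recforest", quietly = TRUE)) {
  data("bladder1_recforest", package = "recforest")
  fit <- recforest::train_forest(
    data = bladder1_recforest,
    id_var = "id",
    covariates = c("treatment", "number", "size"),
    event = "event",
    time_vars = c("t.start", "t.stop"),
    death_var = "death",
    n_trees = 2,
    n_bootstrap = 70,
    seed = 111,
    mtry = 2,
    minsplit = 3,
    nodesize = 15,
    method = "NAa",
    min_score = 5,
    max_nodes = 20,
    parallel = FALSE,
    verbose = FALSE
  )
  print(names(fit))      # element names, as listed under Value
  print(fit$n_trees)     # number of trees trained
  print(fit$n_indiv)     # number of individuals in the dataset
  print(fit$time)        # computation time
}
```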

References

Cook, R. J., & Lawless, J. F. (1997). Marginal analysis of recurrent events and a terminating event. Statistics in Medicine, 16(8), 911-924.

Ghosh, D., & Lin, D. Y. (2002). Marginal regression models for recurrent and terminal events. Statistica Sinica, 663-688.

Ishwaran, H., Kogalur, U. B., Blackstone, E. H., & Lauer, M. S. (2008). Random survival forests. The Annals of Applied Statistics, 2(3), 841-860.

Examples

data("bladder1_recforest")
trained_forest <- train_forest(
  data = bladder1_recforest,
  id_var = "id",
  covariates = c("treatment", "number", "size"),
  time_vars = c("t.start", "t.stop"),
  death_var = "death",
  event = "event",
  n_trees = 2,
  n_bootstrap = 70,
  mtry = 2,
  minsplit = 3,
  nodesize = 15,
  method = "NAa",
  min_score = 5,
  max_nodes = 20,
  seed = 111,
  parallel = FALSE,
  verbose = FALSE
)
print(trained_forest)
summary(trained_forest)

recforest documentation built on April 12, 2025, 9:17 a.m.