train: Train models with forester

train    R Documentation

Train models with forester

Description

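The 'train()' function is the core function of this package. The only obligatory arguments are 'data' and a target: either 'y' for classification and regression tasks, or both 'time' and 'status' for survival analysis. Setting the other arguments changes, among other things, the model validation strategy and the model families that are tested.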

Usage

train(
  data,
  y = NULL,
  time = NULL,
  status = NULL,
  type = "auto",
  engine = c("ranger", "xgboost", "decision_tree", "lightgbm"),
  verbose = TRUE,
  check_correlation = TRUE,
  train_test_split = c(0.6, 0.2, 0.2),
  split_seed = NULL,
  bayes_iter = 10,
  bayes_info = list(verbose = 0, plotProgress = FALSE),
  random_evals = 10,
  parallel = TRUE,
  metrics = "auto",
  sort_by = "auto",
  metric_function = NULL,
  metric_function_name = NULL,
  metric_function_decreasing = TRUE,
  best_model_number = 5,
  custom_preprocessing = NULL
)

Arguments

data

A 'data.frame' or 'matrix' - the data used to build the models. By default, models are trained on all columns in 'data'.

y

The name (a character string) of the column in 'data' that contains the target variable for classification and regression tasks. By default set to NULL. If you use 'y', do not use 'time' or 'status', which are reserved for survival analysis.

time

The name (a character string) of the column in 'data' that holds the time values for a survival analysis task. By default set to NULL. 'time' and 'status' have to be used together. If you use them, you cannot use 'y', as it is reserved for classification and regression tasks.

status

The name (a character string) of the column in 'data' that holds the status values for a survival analysis task. By default set to NULL. 'time' and 'status' have to be used together. If you use them, you cannot use 'y', as it is reserved for classification and regression tasks.
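The rules for 'y', 'time', and 'status' described above can be sketched in base R. The helper `check_target_args` is hypothetical and written only for illustration; it is not part of forester, and the package's own checks and messages may differ.

```r
# Hypothetical helper (not part of forester): mirrors the documented rules
# for combining 'y', 'time', and 'status'.
check_target_args <- function(y = NULL, time = NULL, status = NULL) {
  has_y      <- !is.null(y)
  has_time   <- !is.null(time)
  has_status <- !is.null(status)

  # 'time' and 'status' must be supplied together.
  if (has_time != has_status) {
    stop("'time' and 'status' must be supplied together.")
  }
  # 'y' is reserved for classification/regression, so it excludes survival targets.
  if (has_y && has_time) {
    stop("Use either 'y' or both 'time' and 'status', not all of them.")
  }
  # At least one kind of target is required.
  if (!has_y && !has_time) {
    stop("Supply 'y' or both 'time' and 'status'.")
  }

  if (has_y) "classification or regression" else "survival"
}

check_target_args(y = 'Price')
check_target_args(time = 'ttodead', status = 'died')
```

The column names 'Price', 'ttodead', and 'died' are taken from the Examples section of this page.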

type

A character, one of 'binary_clf'/'regression'/'survival'/'auto'/'multiclass' that sets the type of the task. If 'auto' (the default option) then forester will figure out 'type' based on the number of unique values in the 'y' variable, or the presence of 'time'/'status' columns.

engine

A vector of tree-based models that shall be tested. Possible values are: 'ranger', 'xgboost', 'decision_tree', 'lightgbm', 'catboost'. All models from this vector will be trained and the best one will be returned. This argument has no effect for survival analysis.

verbose

A logical value. If set to TRUE, all information about the training process is printed; if FALSE, none is.

check_correlation

A logical value. If set to TRUE, the data check reports the correlations between pairs of numeric variables and pairs of categorical variables. Available only when 'verbose' is set to TRUE. Default value is TRUE.

train_test_split

A numeric vector of 3 values describing the proportions of the original data set assigned to the train, test, and validation subsets. Default values are: c(0.6, 0.2, 0.2).

split_seed

An integer value used as the seed for the split into train, test, and validation datasets. The default value is NULL, which means no seed is set and the split differs between runs.
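As a rough illustration of what a seeded three-way split involves, the base R sketch below partitions row indices according to the given proportions. `split_indices` is a hypothetical helper written for this note, not forester's internal implementation; the package may assign rows differently, for example when balancing classes.

```r
# Hypothetical helper (not part of forester): split row indices 1..n into
# train, test, and validation subsets using the given proportions.
split_indices <- function(n, proportions = c(0.6, 0.2, 0.2), seed = NULL) {
  stopifnot(length(proportions) == 3, abs(sum(proportions) - 1) < 1e-8)

  # A fixed seed makes the shuffle, and therefore the split, reproducible.
  if (!is.null(seed)) set.seed(seed)
  shuffled <- sample.int(n)

  n_train <- floor(proportions[1] * n)
  n_test  <- floor(proportions[2] * n)
  n_valid <- n - n_train - n_test  # the remainder goes to validation

  list(
    train = shuffled[seq_len(n_train)],
    test  = shuffled[n_train + seq_len(n_test)],
    valid = shuffled[n_train + n_test + seq_len(n_valid)]
  )
}

idx <- split_indices(100, seed = 123)
lengths(idx)
```

Calling the helper twice with the same seed gives identical subsets, which is the behaviour 'split_seed' is meant to provide.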

bayes_iter

An integer value giving the number of optimization rounds used by the Bayesian optimization. Setting it to 0 turns this method off.

bayes_info

A list with two values that determine the verbosity of the Bayesian optimization process. The first value is 'verbose' with 3 levels: 0 - no output; 1 - describes what is happening, and if we can reach local optimum; 2 - additionally provides information about recent, and the best scores. The second value is 'plotProgress', a logical value indicating whether the progress of the Bayesian optimization should be plotted. WARNING: a plot is created after each step, so it might be computationally expensive. Both arguments come from the 'ParBayesianOptimization' package. They only matter if the global 'verbose' is set to TRUE. Default values are: list(verbose = 0, plotProgress = FALSE).

random_evals

An integer value giving the number of models trained with different parameters by random search. Setting it to 0 turns this method off.

parallel

A logical value indicating whether the parallel method should be used for random search and Bayesian optimization. Unfortunately, it works properly for ranger and xgboost models only. By default it is set to TRUE.

metrics

A vector of metric names. By default it is set to 'auto', which returns the most important metrics. For 'all', all metrics are returned. For NULL, no metrics are returned, but the models are still sorted by 'sort_by'.

sort_by

A string with the name of the metric to sort by. For 'auto', models are sorted by 'mse' for regression and 'f1' for classification.

metric_function

A user-defined function. It should have the form name(predictions, observed) and return a numeric value. If 'metrics' is set to a value other than 'auto' or 'all', the value 'metric_function' has to be included in 'metrics' for this metric to appear in the report. If 'sort_by' is equal to 'auto', models are sorted by 'metric_function'.

metric_function_name

The name of the column with values of 'metric_function' parameter. By default 'metric_function_name' is 'metric_function'.

metric_function_decreasing

A logical value indicating whether the values of 'metric_function' should be sorted in decreasing order. 'TRUE' by default.
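A minimal sketch of a function with the documented name(predictions, observed) form is shown below, using mean absolute error as the metric. The name `mae` is chosen for this example only. The commented-out call shows how it might be passed to 'train()'; using FALSE for 'metric_function_decreasing' with an error metric is an assumption based on the argument name, not confirmed behaviour.

```r
# Example custom metric (name chosen for illustration): mean absolute error.
# It takes predictions and observed values and returns a single numeric value.
mae <- function(predictions, observed) {
  mean(abs(predictions - observed))
}

# Quick check on toy vectors before passing the function to train().
mae(c(2.5, 0.0, 2.0), c(3.0, -0.5, 2.0))

# Possible usage with forester (not run here; requires the package and data):
# train_output <- train(
#   lisbon, 'Price',
#   metric_function            = mae,
#   metric_function_name       = 'mae',
#   metric_function_decreasing = FALSE  # assumption: lower error should rank first
# )
```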

best_model_number

The number of best models returned as a separate element of the output. All trained models are also returned, in a different element of the output.

custom_preprocessing

An object returned by the 'custom_preprocessing()' function. By default it is set to NULL, which means that the basic preprocessing inside 'train()' is executed. The basic preprocessing, however, performs only the steps necessary for 'train()' to work properly.

Value

A list of all necessary objects for other functions. It contains:

  • `data` The original data.

  • `y` The original target column name.

  • `time` The original column name describing time for survival analysis task.

  • `status` The original column name describing status for survival analysis task.

  • `type` The type of the ML task. If the user did not specify a type in the input parameters, the algorithm detects the type, uses it, and returns it. It could be 'binary_clf', 'regression', 'survival', or 'multiclass'.

  • `deleted_columns` Column names from the original data frame that have been removed in the data preprocessing process, e.g. due to too high correlation with other columns.

  • `preprocessed_data` The data frame after the preprocessing process - that means: removing columns with one value for all rows, binarizing the target column, managing missing values and in advanced preprocessing: deleting correlated values, deleting columns that are ID-like columns and performing Boruta algorithm for selecting most important features.

  • `bin_labels` Labels of binarized target value - 1 or 2 for binary classification and NULL for regression.

  • `deleted_rows` The indexes of rows deleted during the preprocessing, if none were removed the value is NULL.

  • `models_list` The list of all trained models.

  • `check_report` Data check report held as a list of strings. It is used by the 'report()' function.

  • `outliers` The vector of possible outliers detected by the 'check_data()'.

  • `best_models_on_valid` The object containing the best performing models on the validation dataset.

  • `engine` The list of names of all types of trained models. Possible values: 'ranger', 'xgboost', 'decision_tree', 'lightgbm', 'catboost'.

  • `raw_train` Another form of the training dataset (useful for creating VS plot and predicting on training dataset for catboost and lightgbm models).

  • `train_data` The training dataset - the part of the source dataset after preprocessing, balancing and splitting into the training, test and validation datasets.

  • `test_data` The test dataset - the part of the source dataset after preprocessing, balancing and splitting into the training, test and validation datasets.

  • `valid_data` The validation dataset - the part of the source dataset after preprocessing, balancing and splitting into the training, test and validation datasets.

  • `train_inds` The vector of integers describing the observation indexes from the original data frame that went to the training set.

  • `test_inds` The vector of integers describing the observation indexes from the original data frame that went to the testing set.

  • `valid_inds` The vector of integers describing the observation indexes from the original data frame that went to the validation set.

  • `predictions_train` Predictions for all trained models on a train dataset.

  • `predictions_test` Predictions for all trained models on a test dataset.

  • `predictions_valid` Predictions for all trained models on a validation dataset.

  • `predictions_train_labels` Predictions for all trained models on a train dataset with human readable labels (for classification tasks only).

  • `predictions_test_labels` Predictions for all trained models on a test dataset with human readable labels (for classification tasks only).

  • `predictions_valid_labels` Predictions for all trained models on a validation dataset with human readable labels (for classification tasks only).

  • `predictions_best_train` Predictions for best trained models on a train dataset.

  • `predictions_best_test` Predictions for best trained models on a test dataset.

  • `predictions_best_valid` Predictions for best trained models on a validation dataset.

  • `predictions_best_train_labels` Predictions for best trained models on a train dataset with human readable labels (for classification tasks only).

  • `predictions_best_test_labels` Predictions for best trained models on a test dataset with human readable labels (for classification tasks only).

  • `predictions_best_valid_labels` Predictions for best trained models on a validation dataset with human readable labels (for classification tasks only).

  • `score_train` The list of metrics for all trained models calculated on a train dataset.

  • `score_test` The list of metrics for all trained models calculated on a test dataset.

  • `score_valid` The list of metrics for all trained models calculated on a validation dataset.

  • `test_observed` Values of y column from the test dataset.

  • `train_observed` Values of y column from the training dataset.

  • `valid_observed` Values of y column from the validation dataset.

  • `test_observed_labels` Values of y column from the test dataset as text labels (for classification tasks only).

  • `train_observed_labels` Values of y column from the training dataset as text labels (for classification tasks only).

  • `valid_observed_labels` Values of y column from the validation dataset as text labels (for classification tasks only).

Examples

## Not run: 
# Regression task example.
library(forester)
data('lisbon')
train_output <- train(lisbon, 'Price')
train_output$score_valid

# Survival analysis example
data('peakVO2')
train_output <- train(peakVO2, time = 'ttodead', status = 'died')
train_output$score_valid

## End(Not run)

ModelOriented/forester documentation built on June 6, 2024, 7:29 a.m.