View source: R/modelTrainingTuningFittingTesting.R
tune_and_train_rf_model | R Documentation
This function uses scikit-learn's Python-based GridSearchCV to perform hyperparameter tuning and training of a RandomForestClassifier. It accepts a customizable parameter grid and applies one-hot encoding and scaling as preprocessing steps. The function selects the best hyperparameters according to the chosen scoring method. See the scikit-learn GridSearchCV documentation for the full description of options; the defaults provided here are intended to be comprehensive.
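For readers unfamiliar with the underlying scikit-learn machinery, the following is a minimal Python sketch of the kind of pipeline this function wraps: a preprocessing step followed by a RandomForestClassifier, tuned with GridSearchCV. The synthetic data, grid values, and step names here are illustrative only, not the package's defaults (the package also one-hot encodes categorical columns, which the all-numeric toy data below does not need).

```python
# Illustrative sketch of GridSearchCV tuning a RandomForestClassifier
# inside a preprocessing pipeline (hypothetical data and grid values).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy binary-classification data standing in for a feature matrix X and target y
X, y = make_classification(n_samples=120, n_features=8, random_state=4)

pipe = Pipeline([
    ("scale", StandardScaler()),                     # scaling preprocessing step
    ("rf", RandomForestClassifier(random_state=4)),  # the classifier being tuned
])

grid = GridSearchCV(
    pipe,
    # Pipeline parameters are addressed as "<step>__<param>"
    param_grid={"rf__max_depth": [5, 10], "rf__n_estimators": [10, 20]},
    cv=StratifiedKFold(n_splits=5),  # stratified folds, as this function uses
    scoring="roc_auc",               # matches the default scoring_method
)
grid.fit(X, y)
print(grid.best_params_)  # the best hyperparameter combination found
```

The R wrapper exposes the same moving parts (cv_folds, scoring_method, param_grid) and returns the fitted GridSearchCV object, so attributes such as best_score_ remain accessible from R.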
tune_and_train_rf_model(
X,
y,
cv_folds = 5,
scoring_method = "roc_auc",
seed = 4,
param_grid = NULL,
n_jobs = 1,
n_cores = -2
)
X: The features for the model (data frame or matrix). Usually obtained from the create_feature_matrix function.

y: The target variable for the model (vector). Usually obtained from the create_feature_matrix function.

cv_folds: The number of splits in StratifiedKFold cross-validation (default: 5).

scoring_method: The scoring method to use (default: "roc_auc"). Options include 'accuracy', 'precision', 'recall', 'roc_auc', and 'f1'; see the scikit-learn GridSearchCV documentation for more info.

seed: The random seed for reproducibility (default: 4).

param_grid: An optional list of parameters for tuning the model. If NULL, a default set of parameters is used. The list should follow the format expected by GridSearchCV, with integer-valued parameters suffixed with 'L' (e.g., 10L) to ensure compatibility when being passed from R to Python. The default param_grid is:

    param_grid <- list(
      bootstrap = list(TRUE),
      class_weight = list(NULL),
      max_depth = list(5L, 10L, 15L, 20L, NULL),
      n_estimators = as.integer(seq(10, 100, 10)),
      max_features = list("sqrt", "log2", 0.1, 0.2),
      criterion = list("gini"),
      warm_start = list(FALSE),
      min_samples_leaf = list(1L, 2L, 5L, 10L, 20L, 50L),
      min_samples_split = list(2L, 10L, 20L, 50L, 100L, 200L)
    )

n_jobs: An optional number of jobs to specify for parallel processing (default: 1).

n_cores: An optional number of cores to specify for parallel processing. Default is -2; following the joblib convention, negative values count back from the total number of cores, so -2 uses all but one available core.
A list containing the best hyperparameters for the model, cross-validation scores on training set, and the fitted GridSearchCV object.
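As a sketch of how a custom param_grid might be constructed (the parameter names mirror the defaults above; the values are illustrative only), note the 'L' suffix on integer-valued entries:

```r
# Hypothetical custom grid: integers carry the 'L' suffix so that
# reticulate passes them to Python as integers rather than doubles.
custom_grid <- list(
  n_estimators     = list(50L, 100L),   # integer-valued: note the 'L'
  max_depth        = list(10L, NULL),   # NULL lets trees grow unrestricted
  max_features     = list("sqrt", 0.2), # strings and doubles need no suffix
  min_samples_leaf = list(1L, 5L)
)
```

Any parameter omitted from a custom grid is not tuned; GridSearchCV falls back to the RandomForestClassifier default for it.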
library(Rf2pval)
library(reticulate)

# Load the conda environment, which ensures the correct version of Python
# and the necessary Python packages can be loaded. See the vignette for details.
use_condaenv("rf2pval-conda-arm64mac", required = TRUE)

# Load the demo data
data(demo_rnaseq_data)

# Prepare the sample data into a format ingestible by the ML algorithm
processed_training_data <- create_feature_matrix(demo_rnaseq_data$training_data, "training")

# Model training (Warning: may take a long time if the dataset is large
# and param_grid has many options)
tuning_results <- tune_and_train_rf_model(
  processed_training_data$X_training_mat,
  processed_training_data$y_training_vector,
  cv_folds = 5,
  seed = 123,
  param_grid = list(max_depth = list(10L, 20L))
)
print(tuning_results$best_params)
print(tuning_results$grid_search$best_score_)