knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
This vignette serves as a quick-start guide for determining the optimal number of trees in a random forest using the optRF package. The optRF package models the relationship between the number of trees and the resulting stability of the random forest, and uses this model to determine the optimal number of trees, beyond which adding further trees would increase the computation time while only marginally increasing stability. The optRF package depends on ranger for its efficient random forest implementation. To demonstrate its functionality, we will use the SNPdata data set included in the optRF package. The first column of SNPdata contains the yield of 250 wheat individuals, followed by 5,000 genomic markers (so-called SNP markers) that can take either the value 0 or 2:
library(ranger)
library(optRF)
SNPdata[1:5, 1:5]
The opt_prediction function
To determine the optimal number of trees for random forest predictions, the opt_prediction function from the optRF package can be used. For a detailed description of the opt_prediction function, please refer to its vignette.
As an example, let's use the first 200 rows of the SNPdata data set as the training data and the last 50 rows as the test data:
Training = SNPdata[1:200,] # Rows 1 to 200 as training data
Test = SNPdata[201:250,-1] # Rows 201 to 250 as test data, excluding the response column (column 1)
To make the code reproducible, we will set a seed of 123 before using the opt_prediction function. This function identifies the optimal number of trees required for random forest to predict the response variable in the test data set. The response variable from the training data set is passed to the y argument, the predictor variables from the training data set are passed to the X argument, and the predictor variables from the test data set are specified in the X_Test argument. Let's also assume that we aim to select the 5 top-performing individuals from the test data set, which corresponds to the top 10% and is specified via the alpha argument:
set.seed(123) # Set a seed for reproducibility
optRF_result = opt_prediction(y=Training[,1], X=Training[,-1], X_Test=Test, alpha=0.1)
After running the opt_prediction function, the recommended number of trees and the expected stability of the random forest model can be easily accessed using the summary function:
summary(optRF_result)
Here, the optimal number of trees to obtain stable predictions is 19,000. Using random forest with this number of trees, one can expect a prediction stability of 0.98, which describes the correlation of the predicted responses across repeated runs of random forest.
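If only individual elements are needed, they can also be pulled directly from the result object; the element names below are the ones used later in this vignette:
optRF_result$recommendation # Recommended number of trees
optRF_result$expected_RF_stability # Expected stability with the recommended number of trees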
With the optimal number of trees identified, random forest can now be used to predict the response variable in the test data set. For this, the ranger function can be used:
RF_model = ranger(y=Training[,1], x=Training[,-1], write.forest=TRUE, num.trees=optRF_result$recommendation)
predictions = predict(RF_model, data=Test)$predictions
predicted_Test_data = data.frame(ID = row.names(Test), predicted_response = predictions)
The predictions for the test data set are now stored in the predicted_Test_data data frame, which includes the IDs and their corresponding predicted responses.
Predicted values from random forest can support decision-making, such as selecting the top-performing individuals in a data set. To proceed, it is essential to define the number of top-performing individuals to be selected. For instance, suppose we aim to select the top 5 performers from a test data set containing 50 individuals.
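As a minimal sketch of such a selection step, using the predicted_Test_data data frame created above, the 5 individuals with the highest predicted response could be identified as follows:
# Order the test individuals by their predicted response (highest first)
ranked_Test_data = predicted_Test_data[order(predicted_Test_data$predicted_response, decreasing=TRUE),]
# Select the 5 top-performing individuals
selected_individuals = ranked_Test_data$ID[1:5]
selected_individuals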
To evaluate the stability of random forest in supporting this selection decision, repeated runs of the model are performed. In each run, the top individuals are classified as selected, while the remaining individuals are classified as rejected. The selection stability is then quantified using Fleiss' Kappa, which measures the consistency of these classifications across the repeated runs.
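To illustrate the idea behind this measure (a simplified sketch only, not the internal implementation of optRF), one could repeat the prediction a few times, record which individuals would be selected in each run, and compute Fleiss' Kappa with the kappam.fleiss function from the irr package, assuming that package is installed:
library(irr) # Provides kappam.fleiss for Fleiss' Kappa
n_runs = 5
selection_matrix = sapply(1:n_runs, function(i) {
  # Train a random forest and predict the response in the test data
  model = ranger(y=Training[,1], x=Training[,-1], write.forest=TRUE, num.trees=500)
  preds = predict(model, data=Test)$predictions
  # Classify the 5 individuals with the highest predicted response as "selected"
  ifelse(rank(-preds) <= 5, "selected", "rejected")
})
# Rows are individuals, columns are repeated runs; Kappa measures agreement across runs
kappam.fleiss(selection_matrix)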
To determine the optimal number of trees for selection, the recommendation argument has to be set to "selection":
set.seed(123) # Set a seed for reproducibility
optRF_result_2 = opt_prediction(y=Training[,1], X=Training[,-1], X_Test=Test, alpha=0.1, recommendation="selection")
The recommended number of trees and the expected stability can be accessed as above:
summary(optRF_result_2)
One can see that the recommended number of trees increased to 73,000. Random forest can now be run with this number of trees to ensure stable predictions with a prediction stability of 0.993 and prediction-based decisions with a selection stability of 0.837. To demonstrate the robustness of the predictions and the decision-making process, decisions derived from random forest should always be published together with the prediction and selection stability, which can be seen in the summary output above or accessed directly via optRF_result_2$expected_RF_stability.
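To complete the workflow, the selection itself can then be repeated with the selection-optimised number of trees; a minimal sketch reusing the objects defined above:
# Train random forest with the number of trees recommended for stable selections
RF_model_sel = ranger(y=Training[,1], x=Training[,-1], write.forest=TRUE, num.trees=optRF_result_2$recommendation)
predictions_sel = predict(RF_model_sel, data=Test)$predictions
# Select the 5 test individuals with the highest predicted response
top_individuals = row.names(Test)[order(predictions_sel, decreasing=TRUE)[1:5]]
top_individuals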
The opt_importance function
In addition to predictions, random forest can be used to estimate variable importance and select the most relevant variables in a data set. For variable selection, the opt_importance function from the optRF package can help determine the optimal number of trees for random forest. For a detailed explanation of the opt_importance function, please refer to its vignette.
To make the code reproducible, we will set a seed of 123 before using the opt_importance function. This function identifies the optimal number of trees required to reliably estimate the importance of variables in a data set:
set.seed(123) # Set a seed for reproducibility
optRF_result = opt_importance(y=SNPdata[,1], X=SNPdata[,-1])
Once the opt_importance function is executed, the recommended number of trees and the stability metrics can be easily accessed:
summary(optRF_result)
The results indicate that the optimal number of trees for estimating variable importance is 40,000. With this configuration, the variable importance stability is 0.96, which reflects the correlation of variable importance estimates across repeated runs of random forest.
With the optimal number of trees determined, random forest can now be used to estimate variable importance. This can be achieved using the ranger function with permutation importance:
RF_model = ranger(y=SNPdata[,1], x=SNPdata[,-1], num.trees=optRF_result$recommendation, write.forest=TRUE, importance="permutation")
D_VI = data.frame(variable = names(SNPdata)[-1], importance = RF_model$variable.importance)
The variable importance results are now saved in the data frame D_VI, which contains the variable names and the corresponding importance estimates.
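Before deciding how many variables to select, it can be helpful to take a quick look at the variables with the largest importance estimates; a minimal example using the D_VI data frame created above:
# Show the six variables with the highest estimated importance
head(D_VI[order(D_VI$importance, decreasing=TRUE),])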
Now that we have obtained stable estimates of variable importance, we can use these estimates to select the most important variables from the data set. However, before proceeding, it is necessary to estimate how many variables in the data set significantly affect the response variable.
One approach to achieve this is by visualising the distribution of variable importance estimates. A histogram provides an intuitive way to identify patterns such as a clear separation between important and less important variables:
hist(D_VI$importance, xlim=c(-10, 50), main="Histogram of variable importances", xlab="")
From the histogram, it is apparent that the majority of variables have importance values between -5 and 5. However, a small subset of variables exceeds an importance threshold of 5. These variables are likely to have a genuine impact on the response and should be selected for further analysis.
Based on the histogram, we calculate the number of variables with importance values greater than 5:
selection_size = sum(RF_model$variable.importance>5)
Using this number of selected variables, the optimal number of trees for random forest can be determined with the opt_importance function:
set.seed(123) # Set a seed for reproducibility
optRF_result_2 = opt_importance(y=SNPdata[,1], X=SNPdata[,-1], recommendation = "selection", alpha = selection_size)
With the recommended number of trees, we re-estimate variable importance and select the top variables:
RF_model_2 = ranger(y=SNPdata[,1], x=SNPdata[,-1], num.trees=optRF_result_2$recommendation, write.forest=TRUE, importance="permutation")
D_VI_2 = data.frame(variable = names(SNPdata)[-1], importance = RF_model_2$variable.importance)
D_VI_2 = D_VI_2[order(D_VI_2$importance, decreasing=TRUE),]
selected_variables = D_VI_2[1:selection_size, 1]
The vector selected_variables now contains the names of the most important variables in the data set.
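As an illustration of a possible downstream step (this goes beyond the optRF workflow itself), the data set could be reduced to the response and the selected markers for further analysis:
# Keep only the response (first column) and the selected markers
SNPdata_selected = SNPdata[, c(names(SNPdata)[1], selected_variables)]
dim(SNPdata_selected)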
To enhance reproducibility and reliability, the selection results should be published along with the variable importance stability and selection stability metrics. These can again be easily accessed:
summary(optRF_result_2)
The measure_stability function
While optimising the number of trees is generally recommended when making predictions or estimating variable importance with random forest, the optRF package also allows users to assess the stability of a random forest model with a fixed number of trees. This can be done using the measure_stability function. Here, it must be indicated whether random forest is used for predictions or to estimate variable importance by setting the method argument either to "prediction" or "importance".
For instance, if random forest is set to run with 5,000 trees for prediction, stability can be evaluated as follows:
set.seed(123)
stability_prediction = measure_stability(y = Training[,1], X=Training[,-1], X_Test=Test, num.trees=5000, method="prediction")
stability_prediction
In this case, the prediction stability is 0.948 and the selection stability is 0.798.
If random forest is set to run with 5,000 trees to estimate the variable importance, stability can be evaluated as:
set.seed(123)
stability_importance = measure_stability(y = Training[,1], X=Training[,-1], X_Test=Test, num.trees=5000, method="importance")
stability_importance
In this case, the variable importance stability is 0.732 and the selection stability is 0.37. The stability of random forest should always be published alongside the results.
Random forest can be effectively used to predict the response variable in a test data set where only the predictor variables are known, and the predicted response values can be used to identify the top-performing individuals. Similarly, random forest can estimate the importance of variables in a data set and facilitate the selection of the most important variables. However, it must be noted that random forest is a non-deterministic method, meaning it can yield different results even when run on the same data set. To ensure the reproducibility and reliability of the results, it is essential to determine the optimal number of trees for random forest and use this to guide decision-making. Researchers are encouraged to evaluate the stability of their chosen tree count to ensure reliable and reproducible results when using random forest.