Code and manual of the endoR R-package. A Python version is currently a work in progress and I hope to release it soon - any help welcome ;)
Article: Albane Ruaud, Niklas A Pfister, Ruth E Ley, Nicholas D Youngblut. Interpreting tree ensemble machine learning models with endoR. PLoS Computational Biology (2022). DOI: 10.1371/journal.pcbi.1010714
Motivation: Tree ensemble machine learning models are increasingly used in microbiome science to explore associations between microbes and their environment, as they are compatible with the compositional, high-dimensional, and sparse structure of sequence-based microbiome data. The complex structure of such models enables efficient capture of higher-order interactions to improve predictions, but often makes the final model difficult to interpret. Thus, while tree ensembles are often the most accurate models for microbiome data, they yield limited insight into how microbial taxa or genomic content may be associated with host phenotypes.
Procedure: endoR is a method that extracts and visualizes how predictive variables contribute to tree ensemble model accuracy. The fitted model is simplified into a decision ensemble and then reduced via regularization and bootstrapping to retain only the most essential aspects. Information about predictive variables and their pairwise interactions is extracted from the decision ensemble and displayed as a network for straightforward interpretation. The network and importance scores derived with endoR help users understand how tree ensemble models make predictions from variables and from interactions between variables.
Installation (using dependencies = FALSE avoids updating all installed packages): devtools::install_github(repo = "aruaud/endoR", dependencies = FALSE)
Before using endoR: select and fit a machine learning model (regression or classification), e.g., with
- random forest R-packages: randomForest, ranger
- gradient boosted model R-packages: gbm, xgboost
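For instance, a minimal sketch of fitting a random forest with the ranger R-package; `data` and its `target` column are placeholder names:

```r
library(ranger)

# `data` is your data.frame with the response in a column named `target`
# (a factor for classification, numeric for regression)
fit <- ranger(target ~ ., data = data, num.trees = 500)
```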
Main steps:
1. Transform the fitted model into a decision ensemble: model2DE(). To use resampling, call model2DE_resampling(), or preCluster() followed by model2DE_cluster() iterated on partitions in the Q() function of the clustermq R-package.
2. Perform stability selection: stabilitySelection(). Alternatively, use evaluateAlpha() to perform stability selection with various values of alpha and pick the decision ensemble that can predict as many samples as possible for the lowest alpha.
3. Plot the results: plotFeatures() and plotNetwork() (created with the ggplot2, ggraph and igraph R-packages, arranged with the ggpubr R-package).

A sketch of this workflow is given below.
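A minimal sketch of the workflow, continuing the ranger example above. The argument names passed to model2DE() and the plotting calls are assumptions based on typical usage, not the exact interface; check the function help pages (e.g., ?model2DE):

```r
library(endoR)
library(data.table)  # load explicitly: endoR can fail otherwise (see common issues)
library(inTrees)

# Step 1: extract a decision ensemble from the fitted model
# (argument names here are assumptions; see ?model2DE)
de <- model2DE(model = fit, model_type = "ranger",
               data = data, target = data$target)

# Step 2 would apply stabilitySelection() or evaluateAlpha() to a
# resampled ensemble (model2DE_resampling() / preCluster())

# Step 3: visualize the decision ensemble
plotFeatures(de)
plotNetwork(de)
```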
Some common issues:
- If the installation doesn't work, you may need to install R version 4.0.3 and use the inTrees R-package version 1.3
- endoR uses text patterns: make sure to remove all special characters from your column names before using endoR: colnames(data) <- compatibleNames(colnames(data))
- Sometimes endoR fails if the data.table and inTrees packages are not explicitly loaded
- When using the caret package for model training, be sure to feed the caret-modified training data to endoR: data_caret <- model.matrix(target ~ ., data = data). Caret performs dummification of multiclass variables before training, which creates one variable per level of the original factor. The trees are then fitted on these 0-1 encoded variables, so endoR needs the same encoding to read them properly ;) A sketch follows this list.
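A hedged sketch of that caret workflow; object names are placeholders, and the fitted model that endoR should read sits in fit_caret$finalModel:

```r
library(caret)

# Reproduce caret's 0-1 dummy encoding of factors
data_caret <- model.matrix(target ~ ., data = data)

# Train on the encoded matrix
fit_caret <- train(x = data_caret, y = data$target, method = "ranger")

# Pass the encoded data (data_caret), not the original data, to endoR,
# together with the fitted model in fit_caret$finalModel
```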
From tree ensembles to decision ensembles
A decision consists of a full branch of a decision tree:
- all splits along the branch form the rule of the decision;
- the average response variable in the subset of observations following the rule corresponds to the prediction of the decision;
- the sample support is the set of observations that follow the rule;
- the error is the average deviation of the prediction from the observed response variable.
A toy illustration follows this list.
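As a toy illustration (all variable names and thresholds are invented), the components of a decision can be computed directly:

```r
# Hypothetical rule from one branch, combining two splits
rule <- data$Bacteroides > 0.1 & data$age <= 50

support    <- which(rule)                 # sample support of the decision
prediction <- mean(data$target[support])  # prediction (regression case)
error      <- mean(abs(data$target[support] - prediction))  # average deviation
```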
Decision importance
We measure the decision importance by comparing the error of the decision to what would be expected from random guessing, i.e., predicting the mean response variable across all samples. For regressions and classifications respectively, this corresponds to calculating the coefficient of determination R² and Cohen's kappa. We weight decisions by their importance when computing the feature and interaction measures below.
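For the regression case, a sketch of this comparison, continuing the toy decision above (endoR's exact computation may differ):

```r
# Coefficient of determination over the decision's support: compare the
# decision's squared error to that of always guessing the overall mean
sse <- sum((data$target[support] - prediction)^2)
sst <- sum((data$target[support] - mean(data$target))^2)
decision_importance <- 1 - sse / sst  # 1 = perfect; <= 0 = no better than the mean
```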
Feature and interaction importance = how much a feature, or pair of features, contributes to the accuracy of predictions
Feature and interaction influence = how a feature, or pair of features, changes predictions (e.g., increases the predicted age or shifts the prediction towards the "disease" vs. "healthy" state)
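Conceptually, these measures aggregate over the decision ensemble. A simplified sketch, assuming a data.frame `de` of decisions with hypothetical columns `rule` and `importance` (not endoR's actual internal representation):

```r
# Sum the importance of all decisions whose rule mentions the feature
# (simplified illustration only; not endoR's exact computation)
feature_importance <- function(de, feature) {
  sum(de$importance[grepl(feature, de$rule, fixed = TRUE)])
}

feature_importance(de, "Bacteroides")
```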
Regularization to obtain a stable, generalizable decision ensemble:
- decision pruning = remove unimportant variables from decision rules (i.e., variables whose removal increases the decision error by less than a threshold);
- bootstrapping and stability selection = keep decisions measured as important across bootstrap resamples (i.e., across bootstraps of observations/samples).
A sketch of the pruning idea follows.
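A sketch of the pruning idea only, not endoR's implementation; decision_error() is a hypothetical helper returning the error of the decision defined by a set of split conditions:

```r
prune_rule <- function(rule_vars, decision_error, threshold = 0.05) {
  # rule_vars: character vector of split conditions in the rule
  base_err <- decision_error(rule_vars)
  for (v in rule_vars) {
    err_without <- decision_error(setdiff(rule_vars, v))
    # drop the variable if removing it barely increases the error
    if (err_without - base_err <= threshold) {
      rule_vars <- setdiff(rule_vars, v)
      base_err  <- decision_error(rule_vars)
    }
  }
  rule_vars
}
```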