
Decision Tree Tuning Analysis

'DecisionTreeTuningAnalysis' is an R project that automates the graphical analysis for our paper 'Better Trees: An empirical study on hyperparameter tuning of classification decision trees' [01]. The analysis handles data generated by our hyperparameter tuning project (HpTuning), but may be easily extended. Its main features cover the hyperparameter profiles of the decision tree induction algorithms studied in the paper.

Installation

Installation is done via git clone. Run the following command in your terminal:

git clone https://github.com/rgmantovani/DecisionTreeTuningAnalysis

General instructions

The classification algorithms analyzed must follow the 'mlr' R package implementation [02]. A complete list of the available learners may be found here. The code provides results for two decision tree induction algorithms: J48 (classif.J48) and CART (classif.rpart).
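As a quick illustration of the mlr interface these learner names refer to, the snippet below (a sketch, assuming mlr and the algorithms' backing packages, RWeka and rpart, are installed) constructs the two learners and inspects their tunable hyperparameters:

```r
library(mlr)

# construct the two decision tree learners analyzed in this project
j48  <- makeLearner("classif.J48")    # WEKA's J48, via the RWeka package
cart <- makeLearner("classif.rpart")  # CART, via the rpart package

# inspect each learner's parameter set, i.e., the hyperparameter
# space that the tuning jobs explore
print(getParamSet(j48))
print(getParamSet(cart))
```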

Hyperparameter tuning results should be placed in the data/hptuning_full_space/<algorithm.name>/results sub-directory. We did not upload the raw results, since they amount to more than 50GB of data (but you can download them from here). Instead, we developed scripts to extract the useful information from the executed jobs; they are located in the scripts folder. The automated analysis will only work if these scripts have been run beforehand. The code checks for this and, if needed, returns instructions to the user on how to proceed. There are four auxiliary scripts:

All extraction scripts take the algorithm's name as a parameter (<algorithm.name>). The scripts may be run in any order, but all of them must be executed. The files they generate will later be read, aggregated as data.frame objects, and used by the automated analysis.
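For reference, reading the extracted files back and binding them into a single data.frame can be sketched as below. This is only an illustration: the file extension (.rds) and the flat layout under the results directory are assumptions, so adjust the pattern to whatever the extraction scripts actually produce.

```r
# Hypothetical sketch: file names and extensions are assumptions,
# not the actual output format of the extraction scripts.
algo    <- "classif.rpart"
res.dir <- file.path("data", "hptuning_full_space", algo, "results")

# list every result file found under the results sub-directory
files <- list.files(path = res.dir, pattern = "\\.rds$",
                    recursive = TRUE, full.names = TRUE)

# read each file and row-bind everything into one data.frame
results <- do.call(rbind, lapply(files, readRDS))
```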

A - Extracting main results

cd scripts
Rscript 01_extractRepResults.R --algo=<algorithm.name> &

# examples:
# Rscript 01_extractRepResults.R --algo="classif.J48" &
# Rscript 01_extractRepResults.R --algo="classif.rpart" &

B - Extracting optimization paths

cd scripts
Rscript 02_extractOptPaths.R --algo=<algorithm.name> &

# examples:
# Rscript 02_extractOptPaths.R --algo="classif.J48" &
# Rscript 02_extractOptPaths.R --algo="classif.rpart" &

C - Extracting models' statistics

cd scripts
Rscript 03_extractModelStats.R --algo=<algorithm.name> &

# examples:
# Rscript 03_extractModelStats.R --algo="classif.J48" &
# Rscript 03_extractModelStats.R --algo="classif.rpart" &

D - FAnova hyperparameter marginal predictions

FAnova marginal predictions are obtained with an external project [03]. Our script generates input files in the format required by the FAnova Python script. To run it:

cd scripts
Rscript 04_createFanovaInputs.R --algo=<algorithm.name> &

# examples:
# Rscript 04_createFanovaInputs.R --algo="classif.J48" &
# Rscript 04_createFanovaInputs.R --algo="classif.rpart" &

The output will be placed in a folder named data/hptuning_full_space/<algorithm.name>/fanova_input, with one file per dataset. Provide these files to the external project, which will generate one corresponding output file per dataset. These new files should be placed in the data/hptuning_full_space/<algorithm.name>/fanova_output sub-directory.

Running the code

To run the main analysis, use the following command:

 Rscript 01_mainAnalysis.R --algo=<algorithm.name> &

 # examples:
 # Rscript 01_mainAnalysis.R --algo="classif.rpart" &
 # Rscript 01_mainAnalysis.R --algo="classif.J48"   &

Meta-level results are independent and can be generated by:

 Rscript 02_metaAnalysis.R &


Contact

Rafael Gomes Mantovani (rgmantovani@gmail.com / rafaelmantovani@utfpr.edu.br), Federal Technology University - Paraná (UTFPR) - Apucarana - PR, Brazil.

References

[01] Rafael Gomes Mantovani, Tomas Horvath, André L. D. Rossi, Ricardo Cerri, Sylvio Barbon Junior, Joaquin Vanschoren, André C. P. L. F. Carvalho. Better Trees: An empirical study on hyperparameter tuning of classification decision trees. Data Min Knowl Disc (2024). https://doi.org/10.1007/s10618-024-01002-5.

[02] Bernd Bischl, Michel Lang, Lars Kotthoff, Julia Schiffner, Jakob Richter, Erich Studerus, Giuseppe Casalicchio, Zachary Jones. mlr: Machine Learning in R. Journal of Machine Learning Research, v. 17, n. 170, 2016, pp. 1-5.

[03] F. Hutter, H. Hoos, K. Leyton-Brown. An Efficient Approach for Assessing Hyperparameter Importance. In: Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 2014, pp. 754-762.



rgmantovani/TuningAnalysis documentation built on Feb. 11, 2024, 6:07 p.m.