jbryer/mldash: Evaluating Machine Learning Models Across Many Datasets

library(dplyr)

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
options(digits = 2)

library(mldash)
library(DT)

# Sys.setenv("RETICULATE_PYTHON" = "~/miniforge3/envs/mldash/bin/python")

`mldash`: Machine Learning Dashboard

Contact: Jason Bryer, Ph.D.
Website: https://jbryer.github.io/mldash/

The goal of mldash is to provide a framework for evaluating the performance of many predictive models across many datasets. The package includes common predictive modeling procedures and datasets. Details on how to contribute additional datasets and models are outlined below. Both datasets and models are defined in the Debian Control File (DCF) format. This format conveniently stores both metadata about the datasets and models and the R code snippets for retrieving data, training models, and getting predictions. The run_models() function executes each model on each dataset appropriate to the model type (classification or regression), splits the data into training and validation sets, and calculates the desired performance metrics using the yardstick package.
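To make the DCF idea concrete, here is a minimal sketch using only base R's read.dcf(). The field names (name, type, description, train) and the file contents are illustrative assumptions, not necessarily the fields mldash requires. The point is only that one plain-text file can hold metadata alongside an R snippet that is later parsed and evaluated.

```r
# Hypothetical model definition; field names and contents are assumptions
# made for illustration, not the schema mldash actually requires.
dcf_text <- c(
  "name: Linear regression",
  "type: regression",
  "description: Ordinary least squares via stats::lm.",
  "train: function(formula, data) {",
  "    stats::lm(formula, data = data)",
  "  }"
)
dcf_file <- tempfile(fileext = ".dcf")
writeLines(dcf_text, dcf_file)

# read.dcf() returns a character matrix with one column per field.
# Continuation lines (those starting with whitespace) stay in the same field.
model_def <- read.dcf(dcf_file, keep.white = "train")

# Turn the stored code snippet back into an R function and call it.
train_fun <- eval(parse(text = model_def[1, "train"]))
fit <- train_fun(mpg ~ wt, data = mtcars)
class(fit)
```

Because the code lives in a text field, adding a model or dataset amounts to adding a file rather than changing package code.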

Installation

You can install the development version of mldash using the remotes package like so:

remotes::install_github('jbryer/mldash')

The mldash package makes use of predictive models implemented in R, Python, and Java. As a result, there are numerous system requirements necessary to run all the models. We have included instructions in the installation vignette:

vignette('installation', package = 'mldash')

Running Predictive Models

To begin, we read in the datasets using the read_ml_datasets() function. There are two parameters:

dir is the directory containing the metadata files. The default is to look in the package's installation directory.
cache_dir is the directory where datasets can be stored locally.

The following reads in the datasets currently included in the package.

ml_datasets <- mldash::read_ml_datasets(dir = 'inst/datasets',
                                        cache_dir = 'inst/datasets')
# head(ml_datasets, n = 4)

Similarly, the read_ml_models() function reads in the models. The dir parameter defines where to look for model files.

ml_models <- mldash::read_ml_models(dir = 'inst/models')
# head(ml_models, n = 4)

Once the datasets and models have been loaded, the run_models() function trains and evaluates each model on each dataset, as appropriate for the model type.

ml_results <- mldash::run_models(datasets = ml_datasets, 
                                 models = ml_models, 
                                 seed = 2112)

knitr::kable(ml_results[,c('dataset', 'model', 'type', 'time_elapsed', 'base_accuracy', 'accuracy', 'rsq')],
             row.names = FALSE)
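Since the results come back as one row per dataset and model, a common next step is to pick the best model for each dataset. The sketch below uses a toy data frame with made-up values in place of ml_results; only the column names (dataset, model, type, accuracy) are taken from the table above.

```r
# Toy stand-in for ml_results; every value here is made up for illustration.
toy_results <- data.frame(
  dataset  = c("data_a", "data_a", "data_b", "data_b"),
  model    = c("model_x", "model_y", "model_x", "model_y"),
  type     = "classification",
  accuracy = c(0.80, 0.85, 0.70, 0.65),
  stringsAsFactors = FALSE
)

# Sort by dataset, then by descending accuracy, and keep the first
# (highest-accuracy) row for each dataset.
ranked <- toy_results[order(toy_results$dataset, -toy_results$accuracy), ]
best <- ranked[!duplicated(ranked$dataset), ]
best[, c("dataset", "model", "accuracy")]
```

For regression models the same pattern should apply with rsq in place of accuracy.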

The metrics parameter to run_models() takes a list of metrics from the yardstick package (Kuhn & Vaughan, 2021). The full list of metrics is available at https://yardstick.tidymodels.org/articles/metric-types.html
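As a rough illustration of what a list of yardstick metrics can look like, the sketch below builds a named list of yardstick's vector metric functions and applies each one to made-up predictions. Whether run_models() expects exactly this shape is an assumption; see ?run_models for the format it accepts.

```r
library(yardstick)

# Made-up truth and predictions for a two-class problem.
truth    <- factor(c("yes", "no", "yes", "yes", "no"), levels = c("yes", "no"))
estimate <- factor(c("yes", "no", "no",  "yes", "no"), levels = c("yes", "no"))

# A named list of class metrics; the list names label the results.
# Passing a list shaped like this to run_models() is an assumption.
class_metrics <- list(
  accuracy    = yardstick::accuracy_vec,
  sensitivity = yardstick::sens_vec,
  specificity = yardstick::spec_vec
)

# Apply every metric to the same truth/estimate pair.
scores <- sapply(class_metrics, function(metric) metric(truth, estimate))
scores
```

Numeric metrics such as yardstick::rsq_vec and yardstick::rmse_vec take the same (truth, estimate) arguments but expect numeric vectors, so regression and classification metrics need to be applied to the matching model type.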

Available Datasets

There are `r nrow(ml_datasets)` datasets included in the mldash package. You can view the datasets in the datasets vignette.

vignette('datasets', package = 'mldash')

for(i in seq_len(nrow(ml_datasets))) {
    cat(paste0('* [', ml_datasets[i,]$name, '](https://github.com/jbryer/mldash/blob/master/inst/datasets/', ml_datasets[i,]$id, '.dcf) - ', ml_datasets[i,]$description, '\n'))
}

Available Models

Each model is defined in the Debian Control File (DCF) format, the details of which are described below. The models included in the mldash package are listed below. Models whose names begin with tm_ are implemented with the tidymodels R package; models whose names begin with weka_ are implemented with the RWeka package, a wrapper for the Weka collection of machine learning algorithms.

There are `r nrow(ml_models)` models included in the mldash package. You can view the models in the models vignette.

vignette('models', package = 'mldash')

for(i in seq_len(nrow(ml_models))) {
    cat(paste0('* [', ml_models[i,]$name, '](https://github.com/jbryer/mldash/blob/master/inst/models/', row.names(ml_models)[i], '.dcf) - ', ml_models[i,]$description, '\n'))
}