README.md
In pmlbr: Interface to the Penn Machine Learning Benchmarks Data Repository

pmlbr

pmlbr is an R interface to the Penn Machine Learning Benchmarks (PMLB) data repository, a large collection of curated benchmark datasets for evaluating and comparing supervised machine learning algorithms. These datasets cover a broad range of applications including binary/multi-class classification and regression problems as well as combinations of categorical, ordinal, and continuous features.

This repository is originally forked from makeyourownmaker/pmlblite. We thank the pmlblite’s author for releasing the source code under the GPL-2 license so that others could reuse the software.

This package works for any recent version of R.

You can install the released version of pmlbr from CRAN with:

install.packages("pmlbr")

Or the development version from GitHub with remotes:

# install.packages('remotes') # uncomment to install remotes
library(remotes)
remotes::install_github("EpistasisLab/pmlbr")

The core function of this package is fetch_data that allows us to download data from the PMLB repository. For example:

library(pmlbr)

# Download features and labels for penguins dataset in single data frame
penguins <- fetch_data("penguins")

## Download successful.

str(penguins)

## 'data.frame':    333 obs. of  8 variables:
##  $ island           : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ bill_length_mm   : num  39.1 39.5 40.3 36.7 39.3 38.9 39.2 41.1 38.6 34.6 ...
##  $ bill_depth_mm    : num  18.7 17.4 18 19.3 20.6 17.8 19.6 17.6 21.2 21.1 ...
##  $ flipper_length_mm: int  181 186 195 193 190 181 195 182 191 198 ...
##  $ body_mass_g      : int  3750 3800 3250 3450 3650 3625 4675 3200 3800 4400 ...
##  $ sex              : int  1 0 0 0 1 0 1 0 1 1 ...
##  $ year             : int  2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
##  $ target           : int  0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "na.action")= 'omit' Named int [1:11] 4 9 10 11 12 48 179 219 257 269 ...
##   ..- attr(*, "names")= chr [1:11] "4" "9" "10" "11" ...

# Download features and labels for penguins dataset in separate data structures
penguins <- fetch_data("penguins", return_X_y = TRUE)

## Download successful.

head(penguins$x) # data frame

##   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
## 1      2           39.1          18.7               181        3750   1 2007
## 2      2           39.5          17.4               186        3800   0 2007
## 3      2           40.3          18.0               195        3250   0 2007
## 4      2             NA            NA                NA          NA  NA 2007
## 5      2           36.7          19.3               193        3450   0 2007
## 6      2           39.3          20.6               190        3650   1 2007

head(penguins$y) # vector

## [1] 0 0 0 0 0 0

Let’s check other available datasets and their summary statistics:

# Dataset names
sample(classification_datasets(), 9)

## [1] "heart_disease_hungarian"       "fars"                         
## [3] "allrep"                        "_deprecated_colic"            
## [5] "Hill_Valley_without_noise"     "_deprecated_german"           
## [7] "sleep"                         "_deprecated_cleveland_nominal"
## [9] "analcatdata_happiness"

sample(regression_datasets(), 9)

## [1] "527_analcatdata_election2000" "1089_USCrime"                
## [3] "feynman_III_8_54"             "225_puma8NH"                 
## [5] "657_fri_c2_250_10"            "strogatz_glider2"            
## [7] "611_fri_c3_100_5"             "586_fri_c3_1000_25"          
## [9] "650_fri_c0_500_50"

# Dataset summaries
sum_stats <- summary_stats()
head(sum_stats)

##                dataset n_instances n_features n_binary_features
## 1             1027_ESL         488          4                 0
## 2             1028_SWD        1000         10                 1
## 3             1029_LEV        1000          4                 0
## 4             1030_ERA        1000          4                 0
## 5         1089_USCrime          47         13                 1
## 6 1096_FacultySalaries          50          4                 1
##   n_categorical_features n_continuous_features endpoint_type n_classes
## 1                      4                     0    continuous         9
## 2                      9                     0    continuous         4
## 3                      4                     0    continuous         5
## 4                      0                     4    continuous         9
## 5                      0                    12    continuous        42
## 6                      0                     3    continuous        39
##     imbalance       task
## 1 0.099363200 regression
## 2 0.108290667 regression
## 3 0.111245000 regression
## 4 0.031251250 regression
## 5 0.002970111 regression
## 6 0.004063158 regression

Selecting a subset of datasets that satisfy certain conditions is straight forward with dplyr. For example, if we need datasets with fewer than 100 observations for a classification task:

library(dplyr)
sum_stats %>%
  filter(n_instances < 100, task == "classification") %>%
  pull(dataset)

##  [1] "analcatdata_aids"           "analcatdata_asbestos"      
##  [3] "analcatdata_bankruptcy"     "analcatdata_cyyoung8092"   
##  [5] "analcatdata_cyyoung9302"    "analcatdata_fraud"         
##  [7] "analcatdata_happiness"      "analcatdata_japansolvent"  
##  [9] "confidence"                 "labor"                     
## [11] "lupus"                      "parity5"                   
## [13] "postoperative_patient_data"

All data sets are stored in a common format:

First row is the column names
Each following row corresponds to an individual observation
The target column is named target
All columns are tab (\t) separated
All files are compressed with gzip to conserve space

This R library includes summaries of the classification and regression data sets but does not store any of the PMLB data sets. The data sets can be downloaded using the fetch_data function which is similar to the corresponding PMLB python function.

Further info:

?fetch_data
?summary_stats

If you use PMLB in a scientific publication, please consider citing one of the following papers:

Joseph D. Romano, Le, Trang T., William La Cava, John T. Gregg, Daniel J. Goldberg, Praneel Chakraborty, Natasha L. Ray, Daniel Himmelstein, Weixuan Fu, and Jason H. Moore. PMLB v1.0: an open source dataset collection for benchmarking machine learning methods. arXiv preprint arXiv:2012.00058 (2020).

Randal S. Olson, William La Cava, Patryk Orzechowski, Ryan J. Urbanowicz, and Jason H. Moore (2017). PMLB: a large benchmark suite for machine learning evaluation and comparison. BioData Mining 10, page 36.