```r
library(mlr3oml)
logger = lgr::get_logger("mlr3oml")
logger$set_threshold("warn")
options(mlr3oml.cache = FALSE)
```
This tutorial will give you a quick overview of the main features of mlr3oml.
If you are not familiar with OpenML, we recommend reading its documentation first, as we will not explain the OpenML concepts in detail here.
Further coverage of selected mlr3oml features can be found in the Large-Scale Benchmarking chapter of the mlr3book.
Note that mlr3oml primarily supports downloading objects from OpenML; uploading is only partially supported (see the section on publishing below) and can alternatively be achieved through the website.
First, we will briefly cover the different OpenML objects that can be downloaded using mlr3oml. Then we will show how to find objects with certain properties on OpenML. Finally, we will discuss some further aspects of mlr3oml, including caching, file formats, laziness, the logger, and the API key.
mlr3oml supports five different types of OpenML objects, which are listed below. All objects can be converted to their corresponding mlr3 counterpart.
- OMLData represents an OpenML dataset. These are (usually tabular) datasets with additional meta-data, which includes e.g. a description of the dataset or a license. The most similar mlr3 class is the mlr3::DataBackend.
- OMLTask represents an OpenML task. This is a concrete problem specification on top of an OpenML dataset. While being similar to mlr3::Task objects, a major difference is that the OpenML task also contains the resampling splits and can therefore also be converted to an mlr3::Resampling.
- OMLFlow represents an OpenML flow. This is a reusable and executable representation of a machine learning pipeline or workflow. The closest mlr3 class is the Learner.
- OMLRun represents an OpenML run. An OpenML run refers to the execution of a specific machine learning flow on a particular task, recording all relevant information such as hyperparameters, performance metrics, and intermediate results. This is similar to an mlr3::ResampleResult object.
- OMLCollection represents an OpenML collection, which can either be a run collection or a task collection. These are container objects that allow bundling tasks (resulting in benchmarking suites) or runs (which can be used to represent benchmark experiments). There is no mlr3 pendant for the former (other than a list of tasks), while the latter would correspond to an mlr3::BenchmarkResult.

Each object on OpenML has a unique identifier, by which it can be retrieved. We will now briefly show how to access and work with these objects.
Below, we retrieve the dataset with ID 31, which is the credit-g dataset and can be viewed on the OpenML website.
As in other mlr3 packages, sugar functions exist for the construction of R6 classes. We always show both ways to construct the objects.
```r
library(mlr3oml)
library(mlr3)

oml_data = OMLData$new(id = 31)
# is the same as
oml_data = odt(id = 31)
oml_data
```
The full meta-data can be accessed using the $desc field. Some fields, such as the number of rows and columns, can be accessed directly.
```r
# the usage licence
oml_data$desc$licence
# the data dimensions
c(n_rows = oml_data$nrow, n_cols = oml_data$ncol)
```
Information about the features can be accessed through the $features field. This includes the data types, the number of missing values, and whether a feature should be ignored for learning or serves as the row identifier.
```r
head(oml_data$features)
```
The data itself can be accessed using the $data field. We only show a subset of the data here for readability.
```r
oml_data$data[1:5, 1:3]
```
We can convert this object to an mlr3::DataBackend using the as_data_backend() function.
```r
backend = as_data_backend(oml_data)
backend
```
Because this specific dataset has a default target in its meta-data, we can also directly convert it to an mlr3::Task.
```r
# the default target
oml_data$target_names
# convert the OpenML data to an mlr3 task
task = as_task(oml_data)
```
With either the backend or the task, we are now in mlr3 land again and can work with the objects as usual:
```r
rr = resample(task, lrn("classif.rpart"), rsmp("holdout"))
```
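For instance, the resample result can be scored with any mlr3 measure. A minimal self-contained sketch (assuming network access to OpenML; "classif.ce" is the standard mlr3 misclassification error):

```r
library(mlr3)
library(mlr3oml)

# recreate the task and resample result from above
task = as_task(odt(id = 31))
rr = resample(task, lrn("classif.rpart"), rsmp("holdout"))

# aggregated misclassification error over the resampling iterations
rr$aggregate(msr("classif.ce"))
```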
Below, we access the OpenML task with ID 261, which is a classification task built on top of the credit-g data used above. Its associated resampling is a 2/3 holdout split.
```r
oml_task = OMLTask$new(id = 261)
# is the same as
oml_task = otsk(id = 261)
oml_task
```
The OpenML data that the task is built on can be accessed through $data.
```r
oml_task$data
```
We can also access the target columns and the features. Note that this target can differ from the default target shown in the previous section.
```r
oml_task$target_names
oml_task$feature_names
```
The associated resampling splits can be accessed using $task_splits.
```r
oml_task$task_splits
```
The conversion to an mlr3::Task is possible using the as_task() converter.
```r
# convert OpenML task to mlr3 task
task = as_task(oml_task)
task
```
The associated resampling can be obtained by calling as_resampling().
```r
# convert OpenML task to mlr3 resampling
resampling = as_resampling(oml_task)
resampling
```
To simplify this, there exist "oml" tasks and resamplings:
```r
tsk("oml", task_id = 261)
rsmp("oml", task_id = 261)
```
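These shorthands also make it easy to evaluate a learner locally on the official OpenML split. A sketch (assuming network access; the resampling returned by rsmp("oml") already carries the 2/3 holdout split defined by task 261):

```r
library(mlr3)
library(mlr3oml)

task = tsk("oml", task_id = 261)
resampling = rsmp("oml", task_id = 261)

# evaluate a decision tree on the split defined by the OpenML task
rr = resample(task, lrn("classif.rpart"), resampling)
rr
```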
We can access the flow with ID 1068 as shown below:
```r
flow = OMLFlow$new(id = 1068)
# is the same as
flow = oflw(id = 1068)
flow
```
Flows themselves only become interesting once they are applied to a task, the result of which is an OpenML run.
For example, the run with ID 169061 contains the result of applying the above flow to task 261:
```r
run = OMLRun$new(id = 169061)
# is the same as
run = orn(id = 169061)
run

# the corresponding flow and task can be accessed directly
run$flow
run$task
```
The results of this experiment are the predictions, as well as the evaluation of these predictions.
```r
head(run$prediction)
head(run$evaluation)
```
OpenML runs can be converted to mlr3::ResampleResults using the as_resample_result() function.
```r
rr = as_resample_result(run)
rr
```
Below, we access the OpenML-CC18, which is a curated collection of 72 OpenML classification tasks, i.e. a task collection.
```r
cc18 = OMLCollection$new(id = 99)
# is the same as
cc18 = ocl(id = 99)
```
The IDs of the tasks and datasets contained in this benchmarking suite can be accessed through the fields $task_ids and $data_ids, respectively.
```r
# the first 10 task ids
cc18$task_ids[1:10]
# the first 10 data ids
cc18$data_ids[1:10]
```
We can, e.g., create an mlr3::Task from the first of these tasks as follows:
```r
task1 = tsk("oml", task_id = cc18$task_ids[1])
task1
```
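Because tsk("oml") returns ordinary mlr3 tasks, the suite can feed directly into an mlr3 benchmark design. A sketch using only the first two tasks (assuming network access; the learner and resampling choices are arbitrary):

```r
library(mlr3)
library(mlr3oml)

cc18 = ocl(id = 99)
tasks = lapply(cc18$task_ids[1:2], function(id) tsk("oml", task_id = id))

# cross-product of tasks, learners and resamplings
design = benchmark_grid(tasks, lrn("classif.rpart"), rsmp("cv", folds = 3))
design
# bmr = benchmark(design)  # would run all experiments
```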
So far we have shown how to work with objects with known IDs; another important question is how to find the relevant IDs.
This can either be achieved through the OpenML website or through the REST API.
To access the latter, mlr3oml provides the following listing functions:

- list_oml_data() to find datasets,
- list_oml_tasks() to find tasks,
- list_oml_flows() to find flows, and
- list_oml_runs() to find runs.

As an example, we will only show the usage of one of these functions, but the others work analogously.
We can, for example, subset the datasets contained in the CC-18 even further. Below, we only select datasets that have between 0 and 10 features.
```r
cc18_filtered = list_oml_tasks(
  data_id = cc18$data_ids,
  number_features = c(0, 10)
)
cc18_filtered[1:5, c("task_id", "name")]
```
Note that not all possible property specifications can be directly queried on OpenML.
As the resulting tables are data.tables containing information about the datasets, they can be further filtered using the usual data.table syntax.
You can currently upload datasets to OpenML or create tasks and collections using the following functions:

- publish_data() to upload a dataset,
- publish_task() to create a task, and
- publish_collection() to create a collection.

For this, you need an API key.
All download operations supported by this package work without an API key, but you might get rate limited without one. For uploading to OpenML, an API key is required. The API key can be specified via the option mlr3oml.api_key or the environment variable OPENMLAPIKEY (where the former takes precedence over the latter).
To obtain an API key, you must create an account on OpenML.
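Both configuration routes can be set from within R; a sketch (the key string is a placeholder, not a real key):

```r
# via the R option (takes precedence)
options(mlr3oml.api_key = "YOUR-OPENML-API-KEY")  # placeholder key

# or via the environment variable
Sys.setenv(OPENMLAPIKEY = "YOUR-OPENML-API-KEY")  # placeholder key
```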
mlr3oml has its own logger, which can be accessed using lgr::get_logger("mlr3oml").
For more information about logging in general (such as changing the logging threshold), we refer to the corresponding section in the mlr3book.
All objects accessed through mlr3oml must be downloaded from the OpenML server.
This is done lazily, which means that data is only downloaded when it is actually accessed.
To show this, we change the logging level, which was previously set to "warn" (to keep the output clean), to "info".
```r
logger = lgr::get_logger("mlr3oml")
logger$set_threshold("info")

oml_data = odt(31)
# to print the object, some meta data must be downloaded
oml_data
```
To download all information associated with an object, the $download() method can be called. This can be useful to ensure that all information is available offline. In this case, only the actual underlying data is downloaded, as everything else was already implicitly accessed above.
```r
oml_data$download()
```
Caching of OpenML objects can be enabled by setting the mlr3oml.cache option to either TRUE or FALSE (the default), or to a specific folder to be used as the cache directory.
When this is enabled, many OpenML objects are also available offline.
Note that OpenML collections are not cached, as IDs can be added or removed.
```r
# set a temporary directory as the cache folder
cache_dir = tempfile()
options(mlr3oml.cache = cache_dir)

odata = odt(31)
odata

# when accessing the data again, nothing has to be downloaded
# because the information is loaded from the cache
odata_again = odt(31)
odata_again

# set back the logger
logger$set_threshold("warn")
```
The datasets on OpenML are available in two different formats, namely arff and parquet. The former is used by default, but this can be changed by setting the mlr3oml.parquet option to TRUE. It is also possible to specify the format during construction of a specific OpenML object. While the parquet format is more efficient, arff was the original format and might therefore be considered more stable. Moreover, minor differences between the two formats can occur for a given data ID, e.g. regarding data types.
When converting an OMLData object to an mlr3::DataBackend using the parquet file type, the resulting backend is an mlr3db::DataBackendDuckDB object. For the arff file format, the resulting backend is an mlr3::DataBackendDataTable.
```r
library(mlr3db)

odata_pq = odt(id = 31, parquet = TRUE)
backend_pq = as_data_backend(odata_pq)
class(backend_pq)

# compare with arff
odata_arff = odt(id = 31, parquet = FALSE)
backend_arff = as_data_backend(odata_arff)
class(backend_arff)
```
For more information on data backends, see the corresponding section in the mlr3book.