library(mlr3oml)
logger = lgr::get_logger("mlr3oml")
logger$set_threshold("warn")
options(mlr3oml.cache = FALSE)

mlr3oml

This tutorial will give you a quick overview of the main features of mlr3oml. If you are not familiar with OpenML, we recommend to read its documentation first, as we will not explain the OpenML concepts in detail here. Further coverage of some selected mlr3oml features can be found in the Large-Scale Benchmarking chapter of the mlr3book. Note that mlr3oml currently only supports downloading objects from OpenML. Uploading can for example be achieved through the website.

First, we will briefly cover the different OpenML objects that can be downloaded using mlr3oml. Then we will show how to find objects with certain properties on OpenML. Finally, we will quickly discuss some further aspects of mlr3oml, which includes caching, file formats, laziness, the logger, and the API key.

OpenML Objects {#sec-objects}

mlr3oml supports five different types of OpenML objects that are listed below. All objects can be converted to their corresponding mlr3 pandeaunt.

Each object on OpenML has a unique identifier, by which it can be retrieved. We will now briefly show how to access and work with these objects.

Data

Below, we retrieve the dataset with ID 31, which is the credit-g data and can be viewed online here. Like in other mlr3 packages, sugar functions exist for the construction of R6 classes. We always show both ways to construct the objects.

library(mlr3oml)
library(mlr3)

oml_data = OMLData$new(id = 31)
# is the same as
oml_data = odt(id = 31)
oml_data

The full meta data can be accessed using the $desc field. Some fields, such as the number of rows and columns can be accessed directly.

# the usage licence
oml_data$desc$licence

# the data dimension
c(n_rows = oml_data$nrow, n_cols = oml_data$ncol)

Information about the features can be accessed through the $features field. This includes information regarding the data types, missing values, whether they should be ignored for learning or whether they are the row identifier.

head(oml_data$features)

The data itself can be accessed using the $data field. We only show a subset of the data here for readability.

oml_data$data[1:5, 1:3]

We can convert this object to an mlr3::DataBackend using the as_data_backend() function.

backend = as_data_backend(oml_data)
backend

Because this specific dataset has a default target in its meta data, we can also directly convert it to an mlr3::Task.

# the default target
oml_data$target_names

# convert the OpenML data to an mlr3 task
task = as_task(oml_data)

With either the backend or the task, we are now in mlr3 land again, and can work with the objects as usual:

rr = resample(task, lrn("classif.rpart"), rsmp("holdout"))

Task

Below, we access the OpenML task with ID 261, which is a classification task built on top of the credit-g data used above. Its associated resampling is a 2/3 holdout split.

oml_task = OMLTask$new(id = 261)
# is the same as
oml_task = otsk(id = 261)
oml_task

The OpenML data that the task is built on top of can be accessed through $data.

oml_task$data

We can also access the target columns and the features. Note that this target can differ from the default target shown in the previous section.

oml_task$target_names
oml_task$feature_names

The associated resampling splits can be accessed using $task_splits.

oml_task$task_splits

The conversion to an mlr3::Task is possible using the as_task() converter.

# Convert OpenML task to mlr3 task
task = as_task(oml_task)
task

The associated resampling can be obtained by calling as_resampling().

# Convert OpenML task to mlr3 resampling
resampling = as_resampling(oml_task)
resampling

To simplify this, there exist "oml" tasks and resamplings:

tsk("oml", task_id = 261)
rsmp("oml", task_id = 261)

Flows and Runs

We can access the flow with ID 1068 as shown below:

flow = OMLFlow$new(id = 1068)
# is the same as
flow = oflw(id = 1068)
flow

Flows themself only become interesting once they are applied to a task, the result of which is an OpenML run.

For example, the run with ID 169061 contains the result of applying the above flow to task 261 from above:

run = OMLRun$new(id = 169061)
# is the same as
run = orn(id = 169061)
run

# the corresponding flow and and task can be accessed directly
run$flow
run$task

The result of this experiment are the predictions, as well as the evaluation of these predictions.

head(run$prediction)
head(run$evaluation)

OpenML runs can be converted to mlr3::ResampleResults using the as_resample_result() function.

rr = as_resample_result(run)
rr

Collection

Below, we access the OpenML-CC18, which is a curated collection of 72 OpenML classification tasks, i.e. a task collection.

cc18 = OMLCollection$new(id = 99)
# is the same as
cc18 = ocl(id = 99)

The ids of the tasks and datasets contained in this benchmarking suite can be accessed through the fields $task_ids and $data_ids respectively.

# the first 10 task ids
cc18$task_ids[1:10]
# the first 10 data ids
cc18$data_ids[1:10]

We can, e.g., create an mlr3::Task from the first of these tasks as follows:

task1 = tsk("oml", task_id = cc18$task_ids[1])
task1

Listing {#sec-listing}

While we showed how to work with objects with known IDs, another important question is how to find the relevant IDs. This can either be achieved through the OpenML website or through the REST API. To access the latter, mlr3oml provides the following listing functions:

As an example, we will only show the usage of the first function, but all others work analogously.

We can, for example, subset the datasets contained in the CC-18 even further. Below, we only select datasets that have between 0 and 10 features.

cc18_filtered = list_oml_tasks(
  data_id = cc18$data_ids,
  number_features = c(0, 10)
)
cc18_filtered[1:5, c("task_id", "name")]

Note that not all possible property specifications can be directly queried on OpenML. As the resulting tables are data.tables containing information about the datasets, they can be further filtered using the usual data.table syntax.

Uploading

You can currently upload datasets to OpenML or create tasks and collections using the functions:

For this, you need an API key.

API Key

All download operations supported by this package work without an API key, but you might get rate limited without an API key. For uploading to OpenML, you need an API key. The API key can be specified via the option mlr3oml.api_key or the environment variable OPENMLAPIKEY (where the former has precedence over the latter). To obtain an API key, you must create an account on OpenML.

Further Aspects

Logging

mlr3oml has its own logger, which can be accessed using lgr::get_logger("mlr3oml"). For more information about logging in general (such as chaning the logging threshold), we refer to the corresponding section in the mlr3book.

Laziness

All objects accessed through mlr3oml must be downloaded from the OpenML server. This is done lazily, which means that data is only downloaded when it is actually accessed. To show this, we change the logging level, which was previously set to "warn" (to keep the output clean), to "info".

logger = lgr::get_logger("mlr3oml")
logger$set_threshold("info")

oml_data = odt(31)
# to print the object, some meta data must be downloaded
oml_data

To download all information associated with an object, the $download() method can be called. This can be useful to ensure that all information is available offline. In this case, only the actual underlying data is downloaded, as everything else was already implicityly accessed above.

oml_data$download()

Caching

Caching of OpenML objects can be enabled by setting the mlr3oml.cache option to either TRUE or FALSE (default), or to a specific folder to be used as the cache directory. When this is enabled, many OpenML objects are also available offline. Note that OpenML collections are not cached, as IDs can be added or removed.

# Set a temporary directy as the cache folder
cache_dir = tempfile()
options(mlr3oml.cache = cache_dir)

odata = odt(31)
odata

# When accessing the data again, nothing has to be downloaded
# because the information is loaded from the cache
odata_again = odt(31)
odata_again

# set back the logger
logger$set_threshold("warn")

Data Types

The datasets on OpenML are available in two different formats, namely arff and parquet. The former is used by default, but this default can be changed by setting the mlr3oml.parquet option to TRUE. It is also possible to specify this during construction of a specific OpenML object.

While the parquet format is more efficient, arff was the original format and might therefore considered to be more stable. Moreover, minor differences for the two different formats for a given data ID can occur, e.g. regarding the data type.

When converting an OMLData object to an mlr3::DataBackend using the parquet file type, the resulting backend is an mlr3db::DataBackendDuckDB object. For the arff file format, the resulting backend is a mlr3::DataBackendDataTable.

library(mlr3db)
odata_pq = odt(id = 31, parquet = TRUE)
backend_pq = as_data_backend(odata_pq)
class(backend_pq)

# compare with arff
odata_arff = odt(id = 31, parquet = FALSE)
backend_arff = as_data_backend(odata_arff)
class(backend_arff)

For more information on data backends, see the corresponding section in the mlr3book.



mlr-org/mlr3oml documentation built on Aug. 12, 2024, 7:32 p.m.