extension: Extending the emil framework with user-defined methods
In Molmed/emil: Evaluation of Modeling without Information Leakage

Description Fitting models Making predictions Calculating feature importance scores Calculating performance Resampling schemes Pre-processing functions Code style guidelines Author(s) See Also

This page describes how to implement custom methods compatible with the functions of the emil framework, most notably fit, tune, and evaluate. Pre-processing and resampling is not covered here, but in the entries pre_process and resample.

To write and use custom model fitting functions with the emil framework, it must take the the following inputs. Optional

function(x, y, p1, p2, p3, ..., .verbose)

x

The features (or variables) of the observations you want to train the model on. This is typically a matrix or data frame where each row corresponds to an observation. In case it is more natural to characterize your observations some other way, maybe as character vectors of varying length for some document classification method, x can be of any form you like as long as the fitting function knows how to handle it. In that case you will also need supply you own pre-processing function (see pre_process that can extract training and test sets from the entire data set.

See the functions pre_pamr and fit_pamr for an example of a function that does not take its data in the default way.

y

A response vector. This is the outcome you want to model, e.g. the feature of interest in a regression, class label in a classification problem, or anything else that a fitted model will produce when given data to make predictions from.

p1, p2, p3, ...

(Optional) Method-specific model parameters. These will all be tunable with the tune and evaluate functions. Note that you can give them any name you want, the names used here are just an example.

.verbose

(Optional) Indentation level of log messages. Feed this to log_message.

The function must return everything necessary to make future predictions, but it can take any form you like. In the simplest case it is just a number of fitted parameter values, like in a least squares regression, but it could also be some big and complex structure holding an ensemble of multiple sub-models.

Once a model is fitted it can be used to make predictions with a prediction function, defined as such

function(object, x, ...)

object: A fitted model produced by the model fitting function described above.
x: Observations to make predictions on (describing features only).
...: Parameters to the prediction functions. These are ignored by tune and evaluate, but could be convenient if the user wants to work with it manually.

The output of the prediction function must be an object that can be compared to the true response, by an error function (see below). It is typically a list with elements named "pred" for "predictions" or "risk" for estimated risks. It can also be on an arbitrary form as long as a compatible error function is used.

Estimating the importance of each feature (or variable) can often be as important as making predictions. Functions for calculating or extracting feature importance scores from fitted models should be defined as follows:

function(object, ...)

object: A fitted model produced by the model fitting function described above.
...: Parameters to the prediction functions. These are ignored by tune and evaluate, but could be convenient if the user wants to work with it manually.

The function should return a vector of length p or a p-by-c data frame where p is the number of features in the data set and c is the number of classes.

See error_fun.

See resample.

See pre_process.

Names of functions, arguments and variables should be written in underscore separated lower case, singular form, unabbreviated, and American English. Users are encouraged to also follow use this style when wrining extensions. However, the guidelines may be violated in cases where they break the consistency with an incorporated well established package, see for example fit_randomForest which according to the guidelines should be fit_randomforest or fit_random_forest.

A few exceptions to the rule against abbreviations exists, namely ‘fun’ and ‘function’ and ‘dir’ for ‘directory’. These are only used for arguments to indicate the type of value that is accepted.

Christofer Bäcklin

emil, error_fun, pre_process, resample

Molmed/emil documentation built on May 7, 2019, 4:58 p.m.