Special Tasks {#special-tasks}

This chapter explores the different functions of mlr3 when dealing with specific data sets that require further statistical modification to undertake sensible analysis. Following topics are discussed:

Survival Analysis

This sub-chapter explains how to conduct sound survival analysis in mlr3. Survival analysis is used to monitor the period of time until a specific event takes places. This specific event could be e.g. death, transmission of a disease, marriage or divorce. Two considerations are important when conducting survival analysis:

In summary, this sub-chapter explains how to account for these considerations and conduct survival analysis using the mlr3proba extension package.

Spatial Analysis

Spatial analysis data observations entail reference information about spatial characteristics. One of the largest shortcomings of spatial data analysis is the inevitable auto-correlation in spatial data. Auto-correlation is especially severe in data with marginal spatial variation. The sub-chapter on spatial analysis provides instructions on how to handle the problems associated with spatial data accordingly.

Ordinal Analysis

This is work in progress. See r gh_pkg("mlr-org/mlr3ordinal") for the current state.

Functional Analysis

Functional analysis contains data that consists of curves varying over a continuum e.g. time, frequency or wavelength. This type of analysis is frequently used when examining measurements over various time points. Steps on how to accommodate functional data structures in mlr3 are explained in the functional analysis sub-chapter.

Multilabel Classification

Multilabel classification deals with objects that can belong to more than one category at the same time. Numerous target labels are attributed to a single observation. Working with multilabel data requires one to use modified algorithms, to accommodate data specific characteristics. Two approaches to multilabel classification are prominently used:

Instructions on how to deal with multilabel classification in mlr3 can be found in this sub-chapter.

Cost Sensitive Classification

This sub-chapter deals with the implementation of cost-sensitive classification. Regular classification aims to minimize the misclassification rate and thus all types of misclassification errors are deemed equally severe. Cost-sensitive classification is a setting where the costs caused by different kinds of errors are not assumed to be equal. The objective is to minimize the expected costs.

Analytical data for a big credit institution is used as a use case to illustrate the different features. Firstly, the sub-chapter provides guidance on how to implement a first model. Subsequently, the sub-chapter contains instructions on how to modify cost sensitivity measures, thresholding and threshold tuning.

Survival Analysis {#survival}

Survival analysis examines data on whether a specific event of interest takes place and how long it takes till this event occurs. One cannot use ordinary regression analysis when dealing with survival analysis data sets. Firstly, survival data contains solely positive values and therefore needs to be transformed to avoid biases. Secondly, ordinary regression analysis cannot deal with censored observations accordingly. Censored observations are observations in which the event of interest has not occurred, yet. Survival analysis allows the user to handle censored data with limited time frames that sometimes do not entail the event of interest. Note that survival analysis accounts for both censored and uncensored observations while adjusting respective model parameters.

The package r mlr_pkg("mlr3proba") extends r mlr_pkg("mlr3") with the following objects for survival analysis:

In this example we demonstrate the basic functionality of the package on the r ref("survival::rats", text = "rats") data from the r cran_pkg("survival") package. This task ships as pre-defined r ref("TaskSurv") with r mlr_pkg("mlr3proba").

library(mlr3proba)
task = tsk("rats")
print(task)

# the target column is a survival object:
head(task$truth())

# kaplan-meier plot
library(mlr3viz)
autoplot(task)

Now, we conduct a small benchmark study on the r ref("mlr_tasks_rats", text = "rats") task using some of the integrated survival learners:

# some integrated learners
learners = lapply(c("surv.coxph", "surv.kaplan", "surv.ranger"), lrn)
print(learners)

# Uno's C-Index for survival
measure = msr("surv.unoC")
print(measure)

set.seed(1)
bmr = benchmark(benchmark_grid(task, learners, rsmp("cv", folds = 3)))
bmr$aggregate(measure)
autoplot(bmr, measure = measure)

Spatial Analysis {#spatial}

Spatial data observations entail reference information about spatial characteristics. This information is frequently stored as coordinates named 'x' and 'y'. Treating spatial data using non-spatial data methods could lead to over-optimistic treatment. This is due to the underlying auto-correlation in spatial data.

See r gh_pkg("mlr-org/mlr3spatiotemporal") for the current state of the implementation.

Ordinal Analysis {#ordinal}

This is work in progress. See r gh_pkg("mlr-org/mlr3ordinal") for the current state of the implementation.

Functional Analysis {#functional}

Functional data is data containing an ordering on the dimensions. This implies that functional data consists of curves varying over a continuum, such as time, frequency, or wavelength.

How to model functional data?

There are two ways to model functional data:

More following soon!

Multilabel Classification {#multilabel}

Multilabel deals with objects that can belong to more than one category at the same time.

More following soon!

Cost-Sensitive Classification {#cost-sens}

In regular classification the aim is to minimize the misclassification rate and thus all types of misclassification errors are deemed equally severe. A more general setting is cost-sensitive classification. Cost sensitive classification does not assume that the costs caused by different kinds of errors are equal. The objective of cost sensitive classification is to minimize the expected costs.

Imagine you are an analyst for a big credit institution. Let's also assume that a correct decision of the bank would result in 35% of the profit at the end of a specific period. A correct decision means that the bank predicts that a customer will pay their bills (hence would obtain a loan), and the customer indeed has good credit. On the other hand, a wrong decision means that the bank predicts that the customer's credit is in good standing, but the opposite is true. This would result in a loss of 100% of the given loan.

| | Good Customer (truth) | Bad Customer (truth) | | :-----------------------: | :-------------------------: | :------------------------: | | Good Customer (predicted) | + 0.35 | - 1.0 | | Bad Customer (predicted) | 0 | 0 |

Expressed as costs (instead of profit), we can write down the cost-matrix as follows:

costs = matrix(c(-0.35, 0, 1, 0), nrow = 2)
dimnames(costs) = list(response = c("good", "bad"), truth = c("good", "bad"))
print(costs)

An exemplary data set for such a problem is the r ref("mlr_tasks_german_credit", text = "German Credit") task:

library(mlr3)
task = tsk("german_credit")
table(task$truth())

The data has 70% customers who are able to pay back their credit, and 30% bad customers who default on the debt. A manager, who doesn't have any model, could decide to give either everybody a credit or to give nobody a credit. The resulting costs for the German credit data are:

# nobody:
(700 * costs[2, 1] + 300 * costs[2, 2]) / 1000

# everybody
(700 * costs[1, 1] + 300 * costs[1, 2]) / 1000

If the average loan is $20,000, the credit institute would lose more than one million dollar if it would grant everybody a credit:

# average profit * average loan * number of customers
0.055 * 20000 * 1000

Our goal is to find a model which minimizes the costs (and thereby maximizes the expected profit).

A First Model

For our first model, we choose an ordinary logistic regression (implemented in the add-on package r mlr_pkg("mlr3learners")). We first create a classification task, then resample the model using a 10-fold cross validation and extract the resulting confusion matrix:

library(mlr3learners)
learner = lrn("classif.log_reg")
rr = resample(task, learner, rsmp("cv"))

confusion = rr$prediction()$confusion
print(confusion)

To calculate the average costs like above, we can simply multiply the elements of the confusion matrix with the elements of the previously introduced cost matrix, and sum the values of the resulting matrix:

avg_costs = sum(confusion * costs) / 1000
print(avg_costs)

With an average loan of \$20,000, the logistic regression yields the following costs:

avg_costs * 20000 * 1000

Instead of losing over \$1,000,000, the credit institute now can expect a profit of more than \$1,000,000.

Cost-sensitive Measure

Our natural next step would be to further improve the modeling step in order to maximize the profit. For this purpose we first create a cost-sensitive classification measure which calculates the costs based on our cost matrix. This allows us to conveniently quantify and compare modeling decisions. Fortunately, there already is a predefined measure r ref("Measure") for this purpose: r ref("MeasureClassifCosts"):

cost_measure = msr("classif.costs", costs = costs)
print(cost_measure)

If we now call r ref("resample()") or r ref("benchmark()"), the cost-sensitive measures will be evaluated. We compare the logistic regression to a simple featureless learner and to a random forest from package r cran_pkg("ranger") :

learners = list(
  lrn("classif.log_reg"),
  lrn("classif.featureless"),
  lrn("classif.ranger")
)
cv3 = rsmp("cv", folds = 3)
bmr = benchmark(benchmark_grid(task, learners, cv3))
bmr$aggregate(cost_measure)

As expected, the featureless learner is performing comparably bad. The logistic regression and the random forest work equally well.

Thresholding

Although we now correctly evaluate the models in a cost-sensitive fashion, the models themselves are unaware of the classification costs. They assume the same costs for both wrong classification decisions (false positives and false negatives). Some learners natively support cost-sensitive classification (e.g., XXX). However, we will concentrate on a more generic approach which works for all models which can predict probabilities for class labels: thresholding.

Most learners can calculate the probability $p$ for the positive class. If $p$ exceeds the threshold $0.5$, they predict the positive class, and the negative class otherwise.

For our binary classification case of the credit data, the we primarily want to minimize the errors where the model predicts "good", but truth is "bad" (i.e., the number of false positives) as this is the more expensive error. If we now increase the threshold to values $> 0.5$, we reduce the number of false negatives. Note that we increase the number of false positives simultaneously, or, in other words, we are trading false positives for false negatives.

# fit models with probability prediction
learner = lrn("classif.log_reg", predict_type = "prob")
rr = resample(task, learner, rsmp("cv"))
p = rr$prediction()
print(p)

# helper function to try different threshold values interactively
with_threshold = function(p, th) {
  p$set_threshold(th)
  list(confusion = p$confusion, costs = p$score(measures = cost_measure, task = task))
}

with_threshold(p, 0.5)
with_threshold(p, 0.75)
with_threshold(p, 1.0)

# TODO: include plot of threshold vs performance

Instead of manually trying different threshold values, one uses use r ref("optimize()") to find a good threshold value w.r.t. our performance measure:

# simple wrapper function which takes a threshold and returns the resulting model performance
# this wrapper is passed to optimize() to find its minimum for thresholds in [0.5, 1]
f = function(th) {
  with_threshold(p, th)$costs
}
best = optimize(f, c(0.5, 1))
print(best)

# optimized confusion matrix:
with_threshold(p, best$minimum)$confusion

{block, type = "warning"} The function `optimize()` is intended for unimodal functions and therefore may converge to a local optimum here. See below for better alternatives to find good threshold values.

Threshold Tuning

More following soon!

Upcoming sections will entail:

More following soon!



nguyenngocbinh/mlr3_book_vi documentation built on Jan. 23, 2020, 12:28 p.m.