boostingDEA"

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(boostingDEA)
set.seed(1234)

The production theory background

This vignette explains the main functions of the boostingDEA package, in which techniques from the field of machine learning are incorporated to solve problems in the production theory context. Specifically, two adaptations of the Gradient Boosting technique are introduced: EATBoost and MARSBoost. Gradient Boosting is a variant of ensemble methods in which multiple weak models are created and combined to achieve better performance as a whole. As a consequence, Gradient Boosting yields a prediction model in the form of an ensemble of weak prediction models: at each step, a new weak model is trained to predict the "error" of the current strong model. In this package, EATBoost uses an adaptation of regression trees known as Efficiency Analysis Trees (EAT) as the weak model, while MARSBoost uses an adaptation of Multivariate Adaptive Regression Splines (MARS).
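In generic terms, the boosting scheme can be sketched as follows: starting from an initial prediction $f_0$, at each iteration $k$ a weak learner $h_k$ is fitted to the residuals of the current ensemble and added with a shrinkage factor, the learning rate $\nu$:

\begin{equation} f_{k}(\textbf{x}) = f_{k-1}(\textbf{x}) + \nu \cdot h_{k}(\textbf{x}), \quad 0 < \nu \leq 1 \end{equation}

where $h_k$ is an Efficiency Analysis Tree in EATBoost and an adapted MARS model in MARSBoost.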

As previously said, we are dealing with a production theory context. Let us consider $n$ Decision Making Units (DMUs) to be evaluated. $DMU_i$ consumes $\textbf{x}_i = (x_{1i}, \ldots, x_{mi}) \in R^{m}_{+}$ amount of inputs for the production of $\textbf{y}_i = (y_{1i}, \ldots, y_{si}) \in R^{s}_{+}$ amount of outputs. The relative efficiency of each DMU in the sample is assessed with respect to the so-called production possibility set or technology, which is the set of technically feasible combinations of $(\textbf{x}, \textbf{y})$. It is defined in general terms as:

\begin{equation} \Psi = \{(\textbf{x}, \textbf{y}) \in R^{m+s}_{+}: \textbf{x} \text{ can produce } \textbf{y}\} \end{equation}

Monotonicity (free disposability) of inputs and outputs is assumed, meaning that if $(\textbf{x}, \textbf{y}) \in \Psi$, then $(\textbf{x}', \textbf{y}') \in \Psi$ as soon as $\textbf{x}' \geq \textbf{x}$ and $\textbf{y}' \leq \textbf{y}$. Often, convexity of $\Psi$ is also assumed. The efficient frontier of $\Psi$ may be defined as $\partial(\Psi) := \{(\textbf{x}, \textbf{y}) \in \Psi: \hat{\textbf{x}} < \textbf{x}, \hat{\textbf{y}} > \textbf{y} \Rightarrow (\hat{\textbf{x}}, \hat{\textbf{y}}) \notin \Psi\}$.

The banks' database

The banks database is included as a data object in the boostingDEA library and is employed to exemplify the package functions. The data correspond to 31 Taiwanese banks for the year 2010. The dataset was first obtained by Juo et al. (2015) from the “Condition and Performance of Domestic Banks” report published by the Central Bank of China (Taiwan) and from the Taiwan Economic Journal.

The following variables are collected for all banks: three inputs (columns 1 to 3 of the dataset) and the outputs Financial.investments and Loans (columns 4 and 5), together with Revenue (column 6).

Revenue can be interpreted as a combination of the Financial.investments and Loans variables, and is used as the target variable in the mono-output scenario, while Financial.investments and Loans are used in the multi-output scenario.

data(banks)
banks

Efficiency models

Standard techniques: DEA and FDH

Data Envelopment Analysis (DEA) is the standard nonparametric method for the estimation of production frontiers. In this context, the technology is estimated under the assumptions of free disposability, convexity, determinism and minimal extrapolation.
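Under the assumption of variable returns to scale, the standard DEA estimate of the technology takes the well-known form

\begin{equation} \hat{\Psi}_{DEA} = \left\{(\textbf{x}, \textbf{y}) \in R^{m+s}_{+}: \textbf{x} \geq \sum_{i=1}^{n} \lambda_i \textbf{x}_i, \; \textbf{y} \leq \sum_{i=1}^{n} \lambda_i \textbf{y}_i, \; \sum_{i=1}^{n} \lambda_i = 1, \; \lambda_i \geq 0 \right\} \end{equation}

that is, the smallest convex, free disposal set containing the observed DMUs.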

The radial input DEA model can be computed using the DEA(data, x, y) function. Furthermore, in the mono-output scenario, the ideal output value that each DMU should produce in order to be efficient is also calculated.

x <- 1:3 # input indexes
y <- 6   # output index (Revenue, mono-output)
DEA_model <- DEA(banks,x,y)
pred_DEA <- predict(DEA_model, banks, x, y)
pred_DEA

Similarly, the Free Disposal Hull (FDH) also estimates production frontiers, but it is based upon only two axioms: free disposability and determinism. Therefore, it can be considered as the skeleton of DEA, since the convex hull of the FDH estimate coincides with the DEA estimate of the technology.
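For reference, the FDH estimate of the technology is simply the free disposal hull of the observed DMUs:

\begin{equation} \hat{\Psi}_{FDH} = \bigcup_{i=1}^{n} \{(\textbf{x}, \textbf{y}) \in R^{m+s}_{+}: \textbf{x} \geq \textbf{x}_i, \; \textbf{y} \leq \textbf{y}_i\} \end{equation}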

In the same fashion, the radial input FDH model can be computed in R using the FDH(data, x, y) function, where the ideal output is calculated as well in the mono-output case.

x <- 1:3 # input indexes
y <- 6   # output index (Revenue, mono-output)
FDH_model <- FDH(banks,x,y)
pred_FDH <- predict(FDH_model, banks, x, y)
pred_FDH

EATBoost

The EATBoost algorithm is an adaptation of the Gradient Tree Boosting algorithm to estimate production technologies. However, unlike the standard Gradient Tree Boosting algorithm, which uses regression trees as base learners, the EATBoost technique uses Efficiency Analysis Trees (EAT). Further modifications are made to satisfy the required theoretical conditions. In particular, the algorithm was modified to deal with the axiom of free disposability in inputs and outputs and to provide estimates that envelop the data cloud from above. These same two postulates are also key in the definition of the standard FDH estimator of a technology. Therefore, this new approach shares similarities with the FDH methodology, but with the advantage that it avoids the typical problem of overfitting.
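Formally, writing the final ensemble estimate of the output levels as $\hat{f}_K$, these two postulates amount to requiring monotonicity and envelopment from above:

\begin{equation} \hat{f}_K(\textbf{x}) \geq \hat{f}_K(\textbf{x}') \text{ whenever } \textbf{x} \geq \textbf{x}', \qquad \hat{f}_K(\textbf{x}_i) \geq \textbf{y}_i, \quad i = 1, \ldots, n \end{equation}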

The EATBoost function receives as arguments the data (data) containing the study variables, the indexes of the predictor variables or inputs (x), and the indexes of the predicted variables or outputs (y). Moreover, the hyperparameters num.iterations (the number of boosting iterations), learning.rate (the shrinkage factor) and num.leaves (the number of terminal leaves of each tree) are compulsory.

The function returns an EATBoost object.

x <- 1:3 # input indexes
y <- 4:5 # output indexes (Financial.investments, Loans)
EATBoost_model <- EATBoost(banks, x, y, 
                           num.iterations = 4, 
                           num.leaves = 4,
                           learning.rate = 0.6)

To find the best hyperparameters, we can resort to a grid of parameter values evaluated through training and test samples split in a user-specified proportion. In the package, this can be done through the function bestEATBoost. Instead of receiving a single value for each hyperparameter, this function receives a vector of candidate values and evaluates each possible combination in the grid through the Mean Squared Error (MSE) and the Root Mean Squared Error (RMSE). Finally, it returns a data.frame with each possible combination, ordered by RMSE.

N <- nrow(banks)
x <- 1:3
y <- 4:5
selected <- sample(1:N, N * 0.8) # Training indexes
training <- banks[selected, ] # Training set
test <- banks[- selected, ] # Test set
grid_EATBoost <- bestEATBoost(training, test, x, y,
                             num.iterations = c(5,6,7),
                             learning.rate = c(0.4, 0.5, 0.6),
                             num.leaves = c(6,7,8),
                             verbose = FALSE)
head(grid_EATBoost)
EATboost_model_tuned <- EATBoost(banks, x, y,
                                 num.iterations = grid_EATBoost[1, "num.iterations"],
                                 learning.rate = grid_EATBoost[1, "learning.rate"],
                                 num.leaves = grid_EATBoost[1, "num.leaves"])
pred_EATBoost <- predict(EATboost_model_tuned, banks, x)
pred_EATBoost

MARSBoost

The MARSBoost algorithm is an adaptation of the LS-Boosting algorithm to estimate production technologies. In this case, the base learner used in the algorithm is an adaptation of the MARS model. MARS essentially builds flexible models by fitting piecewise linear regressions; that is, the non-linearity of a model is approximated through separate regression slopes in distinct intervals of the predictor variable space. The resulting models do not have a continuous first derivative, which leads to sharp changes in slope. For this reason, a smoothing procedure can be applied. Thus, the estimator obtained without the smoothing procedure presents similarities with the one obtained by DEA, while the smoothed estimate resembles more the well-known functional forms typical of production theory, such as Cobb-Douglas, CES or Translog.
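As a reminder, MARS builds its approximation from hinge (truncated linear) basis functions of the form

\begin{equation} \max(0, x_j - t) \quad \text{and} \quad \max(0, t - x_j) \end{equation}

where $t$ is a knot placed on predictor $x_j$; it is precisely the kinks at these knots that the smoothing procedure removes.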

The MARSBoost function works similarly to the EATBoost one. It receives as arguments the data (data) containing the study variables, the indexes of the predictor variables or inputs (x), the indexes of the predicted variables or outputs (y) and a set of compulsory hyperparameters: num.iterations, learning.rate and num.terms.

The function returns a MARSBoost object. Note that MARSBoost can only be used in mono-output scenarios.

x <- 1:3 # input indexes
y <- 6   # output index (Revenue, mono-output)
MARSBoost_model <- MARSBoost(banks, x, y, 
                             num.iterations = 4, 
                             learning.rate = 0.6,
                             num.terms = 4)

In this case, to find the best hyperparameters, we can resort to the bestMARSBoost function. Here, we can create a grid of hyperparameters to find the optimal value for num.iterations, learning.rate and num.terms.

N <- nrow(banks)
x <- 1:3
y <- 6
selected <- sample(1:N, N * 0.8) # Training indexes
training <- banks[selected, ] # Training set
test <- banks[- selected, ] # Test set
grid_MARSBoost <- bestMARSBoost(training, test, x, y,
                              num.iterations = c(5,6,7),
                              learning.rate = c(0.4, 0.5, 0.6),
                              num.terms = c(6,7,8),
                              verbose = FALSE)
head(grid_MARSBoost)
MARSBoost_model_tuned <- MARSBoost(banks, x, y,
                                   num.iterations = grid_MARSBoost[1, "num.iterations"],
                                   learning.rate = grid_MARSBoost[1, "learning.rate"],
                                   num.terms = grid_MARSBoost[1, "num.terms"])
pred_MARSBoost <- predict(MARSBoost_model_tuned, banks, x)
pred_MARSBoost

Measuring technical efficiency

Technical inefficiency is defined as the distance from a point that belongs to $\Psi$ to the production frontier $\partial(\Psi)$. For a point located inside $\Psi$, it is evident that there are many possible paths to the frontier, each associated with a different technical inefficiency measure.
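For example, the radial input measure contracts all inputs proportionally as much as the technology allows, while the radial output measure expands all outputs proportionally:

\begin{equation} \theta(\textbf{x}, \textbf{y}) = \min\{\theta: (\theta \textbf{x}, \textbf{y}) \in \Psi\}, \qquad \phi(\textbf{x}, \textbf{y}) = \max\{\phi: (\textbf{x}, \phi \textbf{y}) \in \Psi\} \end{equation}

A score of $1$ in either case indicates that the DMU lies on the frontier in the corresponding direction.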

The function efficiency calculates the efficiency scores corresponding to a given model using a given measure. A dataset (data) and the corresponding indexes of input(s) (x) and output(s) (y) must be entered. It is recommended that the dataset with the DMUs whose efficiency is to be calculated coincide with the one used to estimate the frontier. Besides the model object, the main arguments of this function are the measure to be used (for instance, "rad.in", "rad.out", "Russell.out" or "WAM") together with measure-specific options such as weights (for "WAM") and heuristic (for EATBoost models), as illustrated in the examples below.

For this section, the previously created models are used.

Let's first see an example using the standard DEA and FDH techniques. For both techniques, all measures can be calculated.

x <- 1:3
y <- 6
efficiency(DEA_model, 
           measure = "rad.in",   # radial input measure
           banks, x, y)
efficiency(FDH_model, 
           measure = "WAM",      # weighted additive model
           weights = "RAM",      # range adjusted measure weights
           banks, x, y)

In the case of the EATBoost algorithm, all measures can be calculated as well.

x <- 1:3
y <- 4:5
efficiency(EATboost_model_tuned, 
           measure = "Russell.out",
           heuristic = FALSE,   # exact (computationally expensive) approach
           banks, x, y)

However, due to the high computational cost of the exact approach, the heuristic argument can be specified to resort to a simpler, less time-consuming heuristic approach. In fact, the heuristic approach is the default for EATBoost models.

efficiency(EATboost_model_tuned, 
           measure = "Russell.out",
           banks, x, y,
           heuristic = TRUE)   # heuristic (default) approach

Finally, for the MARSBoost algorithm, only the radial output measure can be calculated.

efficiency(MARSBoost_model_tuned, "rad.out", banks, x, 6)

