knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "",
  warning = FALSE,
  message = FALSE
)

A brief introduction to the field of production theory

This vignette introduces the main functions of the eat package. Efficiency Analysis Trees is an algorithm that estimates a production frontier in a data-driven environment by adapting regression trees. In this way, techniques from the field of machine learning are incorporated into solving problems in the field of production theory. From the latter, the following terminology is introduced.

Let us consider $n$ Decision Making Units (DMUs) to be evaluated. $DMU_i$ consumes $\textbf{x}_i = (x_{1i}, ...,x_{mi}) \in R^{m}_{+}$ amount of inputs for the production of $\textbf{y}_i = (y_{1i}, ...,y_{si}) \in R^{s}_{+}$ amount of outputs. The relative efficiency of each DMU in the sample is assessed with reference to the so-called production possibility set or technology, which is the set of technically feasible combinations of $(\textbf{x, y})$. It is defined in general terms as:

\begin{equation} \Psi = \{(\textbf{x, y}) \in R^{m+s}_{+}: \textbf{x} \text{ can produce } \textbf{y}\} \end{equation}

Monotonicity (free disposability) of inputs and outputs is assumed, meaning that if $(\textbf{x, y}) \in \Psi$, then $(\textbf{x', y'}) \in \Psi$, as soon as $\textbf{x'} \geq \textbf{x}$ and $\textbf{y'} \leq \textbf{y}$. Often convexity of $\Psi$ is also assumed. The efficient frontier of $\Psi$ may be defined as $\partial(\boldsymbol{\Psi}) := \{(\boldsymbol{x,y}) \in \boldsymbol{\Psi}: \boldsymbol{\hat{x}} < \boldsymbol{x}, \boldsymbol{\hat{y}} > \boldsymbol{y} \Rightarrow (\boldsymbol{\hat{x},\hat{y}}) \notin \boldsymbol{\Psi} \}$. Technical inefficiency is defined as the distance from a point that belongs to $\Psi$ to the production frontier $\partial(\Psi)$. For a point located inside $\Psi$, it is evident that there are many possible paths to the frontier, each associated with a different technical inefficiency measure.
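Free disposability can be made concrete with a small sketch. It uses base R only, made-up observations and a hypothetical helper, in_fdh(), which is not part of the eat package. Under this assumption alone (no convexity), a point is feasible when some observed DMU uses no more of every input and produces no less of every output:

```r
# Hypothetical helper (not part of the eat package): checks whether a
# candidate point (x_new, y_new) is feasible under free disposability,
# i.e. whether some observed DMU uses no more input and produces no less output.
in_fdh <- function(X, Y, x_new, y_new) {
  X <- as.matrix(X)
  Y <- as.matrix(Y)
  covered <- vapply(seq_len(nrow(X)), function(i) {
    all(X[i, ] <= x_new) && all(Y[i, ] >= y_new)
  }, logical(1))
  any(covered)
}

# Toy sample: three DMUs, one input and one output (made-up values)
X <- matrix(c(2, 4, 6), ncol = 1)
Y <- matrix(c(1, 3, 4), ncol = 1)

in_fdh(X, Y, x_new = 5, y_new = 2)  # TRUE: the DMU (4, 3) uses less input and produces more
in_fdh(X, Y, x_new = 3, y_new = 2)  # FALSE: no DMU with x <= 3 produces y >= 2
```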

Summary of eat functions

In this section, an EAT model, a RFEAT model, a FDH model and a DEA model refer to models fitted with the Efficiency Analysis Trees technique, the Random Forest + Efficiency Analysis Trees technique, the Free Disposal Hull method and the Data Envelopment Analysis method, respectively. Additionally, a CEAT model refers to a convex EAT model. The functions developed in the eat library are always oriented to one of the four previous models (EAT, RFEAT, FDH or DEA) and can be divided into eight categories depending on their purpose:

library(dplyr)

functions <- data.frame("Purpose" = c(rep("Model", 2),
                                      rep("Summarize", 5),
                                      rep("Tune", 2), 
                                      rep("Graph", 3),
                                      rep("Calculate efficiency scores", 3), 
                                      rep("Graph efficiency scores", 2),
                                      rep("Predict", 2), 
                                      rep("Rank", 2)), 
                        "Function name" = c("EAT", "RFEAT",
                                            "print", "summary", "size", "frontier.levels", "descrEAT",
                                            "bestEAT", "bestRFEAT", 
                                            "frontier", "plotEAT", "plotRFEAT",
                                            "efficiencyEAT", "efficiencyCEAT", "efficiencyRFEAT",
                                            "efficiencyDensity", "efficiencyJitter",
                                            "predictEAT", "predictRFEAT",
                                            "rankingEAT", "rankingRFEAT"), 
                        "Usage" = c("Apply Efficiency Analysis Trees technique to a data set. Return an EAT object.",
                                    "Apply Random Forest + Efficiency Analysis Trees technique to a data set. Return a RFEAT object.",
                                    "For an EAT object: print the tree structure of an EAT model. 
                                    For a RFEAT object: print a brief summary of a RFEAT model.",
                                    "For an EAT object: return a summary for the leaf nodes, general information about the model and the error and
                                    threshold for each split and surrogate split.",
                                    "Return the number of leaf nodes for an EAT model.",
                                    "Return the frontier output levels at the leaf nodes for an EAT model.",
                                    "Return measures of centralization and dispersion with respect to the outputs for the nodes of an EAT model.",
                                    "Tune an EAT model.",
                                    "Tune a RFEAT model.",
                                    "Plot the estimated frontier through an EAT model in a low dimensional scenario
                                    (FDH estimated frontier is optional).",
                                    "Plot the tree structure of an EAT model.",
                                    "Shows a line graph with the OOB error on the y-axis, calculated from a forest made up of k trees (x-axis).",
                                    "Calculate DMU efficiency scores through an EAT (and optionally through a FDH) model.",
                                    "Calculate DMU efficiency scores through a convex EAT (and optionally through a DEA) model.",
                                    "Calculate DMU efficiency scores through a RFEAT (and optionally through a FDH) model.",
                                    "Graph a density plot for a data frame of efficiency scores (EAT, FDH, CEAT, DEA and RFEAT are available).",
                                    "Graph a jitter plot for a vector of efficiency scores calculated through an EAT model 
                                    (EAT or CEAT scores are accepted).",
                                    "Predict the output through an EAT model.",
                                    "Predict the output through a RFEAT model.",
                                    "Calculate variable importance scores through an EAT model.",
                                    "Calculate variable importance scores through a RFEAT model.")
)

kableExtra::kable(functions) %>%
  kableExtra::kable_styling("striped", full_width = F) %>%
  kableExtra::collapse_rows(columns = 1, valign = "middle") %>%
  kableExtra::row_spec(c(1:2, 8:9, 13:15, 18:19), background = "#DBFFD6") %>%
  kableExtra::row_spec(c(3:7, 10:12, 16:17, 20:21), background = "#FFFFD1") 

The PISAindex database

The PISAindex database is included as a data object in the eat library and is employed to exemplify the package functions. On the one hand, the inputs correspond to 13 variables that define the socioeconomic context of a country, by means of a score in the range [1-100], except for the Gross Domestic Product by Purchasing Power Parity, which is measured in thousands of dollars. All of them have been obtained from the Social Progress Index. On the other hand, the performance of each country in the PISA exams is measured by the average score of its schools in the disciplines of Science, Reading and Mathematics, which have been collected from the PISA 2018 Results.

The following variables are collected for 72 countries that take the PISA exam:

The eat package is applied with the following purposes: (1) to create homogeneous groups of countries in terms of their socioeconomic characteristics (Basic Human Needs, Foundations of Well-being, Opportunity and GDP PPP per capita) and subsequently to know what maximum PISA score is expected in one or more specific disciplines for each of these groups; (2) to know which countries exercise best practices and which of them do not obtain a performance according to their socioeconomic level; and (3) to know what variables are more relevant in obtaining efficient levels of output.

# We save the seed for reproducibility of the results
set.seed(120)
library(eat)
data("PISAindex")

Modeling a scenario with an input and an output. Plotting the frontier

EAT()

The EAT function is the centerpiece of the eat library. EAT fits a regression tree based on the CART methodology under a new approach that guarantees a frontier estimator satisfying the property of free disposability. This technique is called Efficiency Analysis Trees. The functions in the eat library have been designed so that even R novices can use the library easily. The minimum arguments of the function are the data (data) containing the study variables, the indexes of the predictor variables or inputs (x) and the indexes of the predicted variables or outputs (y). Additionally, the numStop, fold, max.depth and max.leaves arguments are included for users with more experience in machine learning and tree-based models. Modifying these four arguments yields different frontiers, so the user can select the one that best suits the needs of the analysis.

Note that setting the max.depth or max.leaves hyperparameters reduces the computation time by eliminating the pruning procedure. If both are included at the same time, a warning message is displayed and only max.depth is used.

The error of a given node $t$ is measured as the prediction error at the node $t$ over the total number of observations: \begin{equation} R(t) = \frac{n(t)}{N} \cdot MSE(t) = \frac{1}{N} \cdot \sum_{(x_i,y_i)\in t}(y_i - \hat{y}(t))^2 \end{equation}

The error of a tree $T$ is measured as the sum of the errors of its leaf nodes: \begin{equation} R(T) = \sum_{t \in \widetilde{T}}R(t), \end{equation}

where $\widetilde{T}$ is the set of leaf nodes of the tree $T$.
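The two formulas can be checked by hand on a toy example. The sketch below uses base R only and made-up values. As an assumption for illustration, the estimate $\hat{y}(t)$ is taken to be the largest output observed in the node; node_error() is a hypothetical helper, not a function of the eat package.

```r
# Toy example (made-up values): N = 6 observations split into two leaf nodes
y    <- c(10, 12, 15, 20, 22, 25)
node <- c(1, 1, 1, 2, 2, 2)
N    <- length(y)

# Assumption for illustration: the node estimate is the largest output in the node
node_error <- function(y_node, N) {
  y_hat <- max(y_node)
  sum((y_node - y_hat)^2) / N
}

R_t <- sapply(split(y, node), node_error, N = N)  # R(t) for each leaf node
R_T <- sum(R_t)                                   # R(T): sum over the leaf nodes

R_t  # 34/6 for each node
R_T  # 68/6
```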

The function returns an EAT object.

EAT(data, x, y, 
    fold = 5,
    numStop = 5,
    max.depth = NULL,
    max.leaves = NULL,
    na.rm = TRUE)
single_model <- EAT(data = PISAindex, 
                    x = 15, # input 
                    y = 3) # output

print() returns the tree structure for an EAT object where:

print(single_model)

summary() returns the following information of an EAT object:

summary(single_model)

size() returns the number of leaf nodes of an EAT model:

size(single_model)

frontier.levels() returns the frontier levels of the outputs at the leaf nodes:

frontier.levels(single_model)

descrEAT() returns a list with measures of centralization and dispersion, as well as the root mean square error (RMSE) for each node. In multioutput scenarios, the measurements are shown for each output. In case of a single output, the result of the function is a data frame. The following information is obtained:

descriptiveEAT <- descrEAT(single_model)

descriptiveEAT

Additionally, EAT_object[["tree"]][[id_node]] or EAT_object$tree[[id_node]] returns a list that allows knowing the characteristics of a given node in greater detail. The elements that define a node are the following:

Note that:

single_model[["tree"]][[5]]

Categorical variables

The types of variables accepted by the EAT function are the following:

types <- data.frame("Variable" = c("Independent variables (inputs)", 
                                   "Dependent variables (outputs)"),
                    "Integer" = c("x", "x"),
                    "Numeric" = c("x", "x"),
                    "Factor" = c("", ""),
                    "Ordered factor" = c("x", ""))

kableExtra::kable(types, align = rep("c", 5)) %>%
  kableExtra::kable_styling("striped", full_width = F)

The Efficiency Analysis Trees methodology does not allow unordered categorical variables. At this time, only ordered factors can be entered as inputs. It is important to note that ordered = TRUE must be included in the factor construction so as not to produce an error.

# Transform Continent to an (unordered) factor
PISAindex_factor_Continent <- PISAindex
PISAindex_factor_Continent$Continent <- as.factor(PISAindex_factor_Continent$Continent)
# This call is expected to fail because Continent is an unordered factor
error_model <- EAT(data = PISAindex_factor_Continent, 
                   x = c(2, 15), # input
                   y = 3) # output
# Categorize GDP_PPP into 4 groups: Low, Medium, High, Very high.
PISAindex_GDP_PPP_cat <- PISAindex
PISAindex_GDP_PPP_cat$GDP_PPP_cat <- cut(PISAindex_GDP_PPP_cat$GDP_PPP,
                                         breaks = c(0, 16.686, 31.419, 47.745, Inf),
                                         include.lowest = TRUE,
                                         labels = c("Low", "Medium", "High", "Very high"))

class(PISAindex_GDP_PPP_cat$GDP_PPP_cat) # "factor" --> error

# It is necessary to indicate ordered = TRUE before applying the EAT function

PISAindex_GDP_PPP_cat$GDP_PPP_cat <- factor(PISAindex_GDP_PPP_cat$GDP_PPP_cat, 
                                            ordered = TRUE)

class(PISAindex_GDP_PPP_cat$GDP_PPP_cat) # "ordered" "factor" --> correct
categorized_model <- EAT(data = PISAindex_GDP_PPP_cat, 
                         x = c(15, 19), # input
                         y = 3) # output
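A likely reason for this restriction, stated here as an assumption rather than taken from the package internals, is that a tree split compares a variable against a threshold, and such comparisons are only defined for ordered factors. The following base R sketch, with made-up values, shows the difference:

```r
# Base R only: threshold comparisons on ordered and unordered factors
levels_gdp <- c("Low", "Medium", "High", "Very high")
gdp <- c("High", "Low", "Very high", "Medium")

unordered <- factor(gdp, levels = levels_gdp)
ordered_f <- factor(gdp, levels = levels_gdp, ordered = TRUE)

# Defined for ordered factors
ordered_f <= "Medium"  # FALSE TRUE FALSE TRUE

# Not meaningful for unordered factors: returns NA (with a warning)
suppressWarnings(unordered <= "Medium")
```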

frontier()

The frontier function displays the frontier estimated by the EAT function through a plot from ggplot2. The frontier estimated by FDH can also be plotted if FDH = TRUE. Observed DMUs can be shown in a scatterplot if observed.data = TRUE, and their color, shape and size can be modified with observed.color, pch and size, respectively. Finally, row names can be included with rwn = TRUE.

frontier(object,
         FDH = FALSE,
         observed.data = FALSE,
         observed.color = "black",
         pch = 19,
         size = 1,
         rwn = FALSE,
         max.overlaps = 10)

To continue, the frontier of the previous model is displayed. It can be seen how the frontier obtained by the EAT function generalizes the results of the frontier obtained through FDH, thus avoiding overfitting. The boundary estimated through Efficiency Analysis Trees generates 3 steps corresponding to the 3 leaf nodes (nodes 3, 4 and 5) obtained with the EAT function. For each of these steps, a frontier level in terms of the output is given with respect to the amount of input used (in this case level of PFC). In addition, we can appreciate 6 DMUs on the frontier: ALB (Albania), MDA (Moldova), SRB (Serbia), RUS (Russia), HUN (Hungary) and SGP (Singapore). Note that the first vertical plane of the frontier does not appear, but if it did, ALB would be on it. These DMUs are efficient and the rest of the DMUs below their specific step should increase the amount of output obtained or reduce the amount of input utilized until reaching the boundary to be efficient.

frontier <- frontier(object = single_model,
                     FDH = TRUE, 
                     observed.data = TRUE,
                     rwn = TRUE)

plot(frontier)

The number of steps in the frontier does not always match the number of leaf nodes, because the estimates of two or more nodes may be identical. This is necessary to ensure the estimation of a monotonically increasing frontier. In the following example, the number of leaf nodes is 5; however, the predictions for nodes 4 and 5 are the same and therefore the frontier only has 4 steps.

single_model_md <- EAT(data = PISAindex, 
                       x = 15, # input 
                       y = 3, # output
                       max.leaves = 5) 
size(single_model_md)
single_model_md[["model"]][["y"]]
frontier_md <- frontier(object = single_model_md,
                        observed.data = TRUE)

plot(frontier_md)

Modeling a multioutput scenario. Feature selection.

multioutput_model <- EAT(data = PISAindex, 
                         x = 6:18, # input 
                         y = 3:5 # output
                         ) 

rankingEAT()

The second example presents a multiple output scenario where 13 inputs are used to model the 3 available outputs. In these situations, a selection of the most contributing variables may be recommended in order to reduce overfitting, improve precision and reduce future training times. rankingEAT() allows a selection of variables by calculating a score of importance through the Efficiency Analysis Trees technique. The user can specify the number of decimal units (digits), include a barplot with the scores of importance (barplot) and display a horizontal line in the graph to facilitate the cut-off point between important and irrelevant variables (threshold).

rankingEAT(object,
           barplot = TRUE,
           threshold = 70,
           digits = 2)

The importance score represents how influential each variable is in the model. In this case, the cut-off point is set at 70 and therefore the following variables are considered important: AAE (Access to Advanced Education), WS (Water and Sanitation), NBMC (Nutrition and Basic Medical Care), HW (Health and Wellness) and S (Shelter).

rankingEAT(object = multioutput_model,
           barplot = TRUE,
           threshold = 70,
           digits = 2)

Graphical representation by a tree structure

plotEAT()

frontier() allows us to clearly see the regions of the input space generated by EAT(); however, this is impossible with more than two variables (one input and one output). For scenarios with multiple inputs and/or outputs, the typical tree structure, showing the relationships between the predicted and predictor variables, is given instead.

In each node, we can obtain the following information:

Furthermore, the nodes are colored according to the variable by which the division is performed or they are black, in the case of being a leaf node.

plotEAT(object)

Below are the 3 ways to control the size of a tree model: numStop, max.depth and max.leaves.

Size control by numStop

reduced_model1 <- EAT(data = PISAindex, 
                      x = c(6, 7, 8, 12, 17), # input
                      y = 3:5, # output
                      numStop = 9)
plotEAT(object = reduced_model1)

# Leaf nodes: 8
# Depth: 6

Size control by max.depth

reduced_model2 <- EAT(data = PISAindex, 
                      x = c(6, 7, 8, 12, 17), # input
                      y = 3:5, # output
                      numStop = 9,
                      max.depth = 5)
plotEAT(object = reduced_model2)

# Leaf nodes: 6
# Depth: 5

Size control by max.leaves

reduced_model3 <- EAT(data = PISAindex, 
                      x = c(6, 7, 8, 12, 17), # input
                      y = 3:5, # output
                      numStop = 9,
                      max.leaves = 4)
plotEAT(object = reduced_model3)

# Leaf nodes: 4
# Depth: 3

EAT tuning

In this section, the PISAindex database is divided into a training subset with 70% of the DMUs and a test subset with the remaining 30%. Next, the bestEAT function is applied to find the values of the hyperparameters that minimize the error calculated on the test sample for an Efficiency Analysis Trees model fitted with the training sample.

n <- nrow(PISAindex) # Observations in the dataset
selected <- sample(1:n, n * 0.7) # Training indexes
training <- PISAindex[selected, ] # Training set
test <- PISAindex[- selected, ] # Test set

bestEAT()

The bestEAT function requires a training set (training) on which to model an Efficiency Analysis Trees model (with cross-validation) and a test set (test) on which to calculate the error. The number of trees built is given by the number of different combinations that can be given by the numStop, fold, max.depth and max.leaves arguments. Note that it is not possible to enter NULL and a certain value in max.depth or max.leaves arguments at the same time. bestEAT() returns a data frame with the following columns:

bestEAT(training, test,
        x, y,
        numStop = 5,
        fold = 5,
        max.depth = NULL,
        max.leaves = NULL,
        na.rm = TRUE)

For example, if the arguments numStop = {3, 5, 7} and fold = {5, 7} are entered, 6 models of Efficiency Analysis Trees are constructed with {numStop = 3, fold = 5}, {numStop = 3, fold = 7}, {numStop = 5, fold = 5}, {numStop = 5, fold = 7}, {numStop = 7, fold = 5} and {numStop = 7, fold = 7}.
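The same six combinations can be listed with expand.grid() from base R. This is only a sketch of the enumeration; bestEAT() may build and order the combinations differently internally.

```r
# All combinations of the candidate hyperparameter values
grid <- expand.grid(numStop = c(3, 5, 7), fold = c(5, 7))

grid
nrow(grid)  # 6
```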

Tuning for:

S_PISA + R_PISA + M_PISA ~ NBMC + WS + S + HW + AAE.

numStop = {3, 5, 7} and fold = {5, 7}

bestEAT(training = training, 
        test = test,
        x = c(6, 7, 8, 12, 17),
        y = 3:5,
        numStop = c(3, 5, 7),
        fold = c(5, 7))

The best Efficiency Analysis Trees model is given by the hyperparameters {numStop = 3, fold = 7} with RMSE = 56.82 and 24 leaf nodes. However, this model is too complex. Therefore, we select the model with parameters {numStop = 7, fold = 5}, which has a slightly higher RMSE (59.14) but only 10 leaf nodes. Now we check the results of this model.

bestEAT_model <- EAT(data = PISAindex,
                     x = c(6, 7, 8, 12, 17),
                     y = 3:5,
                     numStop = 7,
                     fold = 5)
summary(bestEAT_model)

Efficiency scores. Graphical representation.

Efficiency Analysis Trees model: efficiencyEAT()

The efficiency scores are numerical values that indicate the degree of efficiency of a set of DMUs. A dataset (data) and the corresponding indexes of input(s) (x) and output(s) (y) must be entered. It is recommended that the dataset with the DMUs whose efficiency is to be calculated coincide with the one used to estimate the frontier. However, it is also possible to calculate the efficiency scores for a new dataset. The efficiency scores are calculated using the mathematical programming model specified in the scores_model argument. The following models are available:

If FDH = TRUE, scores are also calculated through a FDH model. Finally, a brief summary of the distribution of the scores calculated for each model is also included.

For this section, the previously created single_model is used:

efficiencyEAT(data, x, y, 
              object,
              scores_model,
              digits = 3,
              FDH = TRUE,
              na.rm = TRUE)
# single_model <- EAT(data = PISAindex, 
                    # x = 15,
                    # y = 3)

scores_EAT <- efficiencyEAT(data = PISAindex,
                            x = 15, 
                            y = 3,
                            object = single_model, 
                            scores_model = "BCC.OUT",
                            digits = 3,
                            FDH = TRUE,
                            na.rm = TRUE)
scores_EAT2 <- efficiencyEAT(data = PISAindex,
                             x = 15, 
                             y = 3,
                             object = single_model, 
                             scores_model = "BCC.INP",
                             digits = 3,
                             FDH = TRUE,
                             na.rm = TRUE)
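For intuition about what an output-oriented score measures, consider the single-input, single-output case under FDH: the score of a DMU reduces to the best output observed among the DMUs that use no more input, divided by its own output. The sketch below uses base R, made-up data and a hypothetical helper, fdh_output_score(). It is not how efficiencyEAT() is implemented, since that function solves a mathematical programming model against the EAT frontier.

```r
# Hypothetical helper: output-oriented FDH score with one input and one output
fdh_output_score <- function(x, y) {
  vapply(seq_along(x), function(i) {
    max(y[x <= x[i]]) / y[i]  # best output reachable with no more input
  }, numeric(1))
}

# Made-up data: four DMUs
x <- c(2, 4, 6, 8)
y <- c(1, 3, 2, 4)

fdh_output_score(x, y)  # 1.0 1.0 1.5 1.0
```

A score of 1 marks a DMU on the frontier. The third DMU scores 1.5 because another DMU produces 3 units of output with less input, so its output could be expanded by 50%.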

Convex Efficiency Analysis Trees model: efficiencyCEAT()

efficiencyCEAT() returns the efficiency scores for the convex frontier obtained through an Efficiency Analysis Trees model. In this case, if DEA = TRUE, scores are also calculated through a DEA model.

efficiencyCEAT(data, x, y, 
               object,
               scores_model,
               digits = 3,
               DEA = TRUE,
               na.rm = TRUE)
scores_CEAT <- efficiencyCEAT(data = PISAindex,
                              x = 15, 
                              y = 3,
                              object = single_model, 
                              scores_model = "BCC.INP",
                              digits = 3,
                              DEA = TRUE,
                              na.rm = TRUE)

efficiencyJitter()

efficiencyJitter() returns a jitter plot from ggplot2. This graphic shows how DMUs are grouped into leaf nodes in a model built using the EAT function. Each leaf node groups DMUs with the same level of resources. The dot and the black line represent, respectively, the mean value and the standard deviation of the scores of its node. Additionally, the labels of efficient DMUs are always displayed, based on the model entered in the scores_model argument. Finally, the user can specify an upper bound (upb) and a lower bound (lwb) in order to also show the labels of the DMUs whose efficiency score lies between them. Scores from a convex Efficiency Analysis Trees (CEAT) model can also be used.

efficiencyJitter(object, df_scores,
                 scores_model,
                 lwb = NULL, upb = NULL)
efficiencyJitter(object = single_model,
                 df_scores = scores_EAT$EAT_BCC_OUT,
                 scores_model = "BCC.OUT",
                 lwb = 1.2)
efficiencyJitter(object = single_model,
                 df_scores = scores_EAT2$EAT_BCC_INP,
                 scores_model = "BCC.INP",
                 upb = 0.65)

Graphically, for a single input and output scenario it is observed that if the BCC models are used to obtain the efficiency scores:

# frontier <- frontier(object = single_model,
                     # FDH = TRUE, 
                     # observed.data = TRUE,
                     # rwn = TRUE)

plot(frontier)

efficiencyDensity()

efficiencyDensity() returns a density plot from ggplot2. In this way, the similarity between the scores obtained by the different available methodologies can be verified.

efficiencyDensity(df_scores,
                  model = c("EAT", "FDH"))

In our example:

efficiencyDensity(df_scores = scores_EAT[, 3:4],
                  model = c("EAT", "FDH"))

efficiencyDensity(df_scores = scores_CEAT[, 3:4],
                  model = c("CEAT", "DEA"))

The curse of dimensionality

When the ratio of the sample size to the number of variables (inputs and outputs) is low, the standard methods of efficiency analysis (especially FDH) tend to evaluate a large number of DMUs as technically efficient. This problem is known as the curse of dimensionality. To show it, the efficiency scores of the multioutput_model (section 2) with 16 variables and 72 DMUs are calculated:

# multioutput_model <- EAT(data = PISAindex, 
                         # x = 6:18, 
                         # y = 3:5
                         # ) 

cursed_scores <- efficiencyEAT(data = PISAindex,
                               x = 6:18, 
                               y = 3:5,
                               object = multioutput_model,
                               scores_model = "BCC.OUT",
                               digits = 3,
                               FDH = TRUE)
efficiencyDensity(df_scores = cursed_scores[, 17:18],
                  model = c("EAT", "FDH"))
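The effect can also be reproduced with simulated data. The sketch below is base R only and does not use the eat package; share_efficient() is a hypothetical helper and the data are random uniform draws, not PISAindex values. It compares the share of FDH-efficient units with 2 variables and with 16 variables for the same sample size.

```r
# Hypothetical helper: share of units that no other unit dominates, i.e. no
# other unit uses no more of every input, produces no less of every output
# and is strictly better in at least one of them.
share_efficient <- function(X, Y) {
  n <- nrow(X)
  efficient <- vapply(seq_len(n), function(i) {
    dominated <- vapply(seq_len(n), function(j) {
      j != i &&
        all(X[j, ] <= X[i, ]) && all(Y[j, ] >= Y[i, ]) &&
        (any(X[j, ] < X[i, ]) || any(Y[j, ] > Y[i, ]))
    }, logical(1))
    !any(dominated)
  }, logical(1))
  mean(efficient)
}

set.seed(1)
n <- 72  # same number of DMUs as PISAindex; the values themselves are simulated

low  <- share_efficient(matrix(runif(n), n),      matrix(runif(n), n))      # 1 input, 1 output
high <- share_efficient(matrix(runif(n * 13), n), matrix(runif(n * 3), n))  # 13 inputs, 3 outputs

c(low_dimension = low, high_dimension = high)
```

With only 72 units, the share of efficient units should be far larger in the 16-variable setting, which is the pattern described above.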

Random Forest

RFEAT()

Random Forest + Efficiency Analysis Trees (RFEAT) has also been developed with the aim of providing greater stability to the results obtained by the EAT function. The RFEAT function requires the data containing the variables for the analysis, x and y corresponding to the input and output indexes respectively, the minimum number of observations in a node for a split to be attempted (numStop) and na.rm to ignore observations with NA cells. All these arguments are used for the construction of the m individual Efficiency Analysis Trees that make up the random forest. Finally, the argument s_mtry indicates the number of inputs that can be randomly selected in each split. It can be set to any integer, although there are also certain predefined values. With $n_{x}$ denoting the number of inputs, $n_{y}$ the number of outputs and $n(t)$ the number of observations in a node, the available options in s_mtry are:

The function returns a RFEAT object.

RFEAT(data, x, y,
      numStop = 5, m = 50,
      s_mtry = "BRM",
      na.rm = TRUE)
forest <- RFEAT(data = PISAindex, 
                x = 6:18, # input 
                y = 3:5, # output
                numStop = 5, 
                m = 30,
                s_mtry = "BRM",
                na.rm = TRUE)
print(forest)

plotRFEAT()

plotRFEAT() returns the Out-Of-Bag error for the training dataset and a forest consisting of k trees. Note that the OOB error of early forests suffers from great variability.

plotRFEAT(forest)

rankingRFEAT()

As with rankingEAT(), the rankingRFEAT function calculates an importance score for the variables using a RFEAT object.

rankingRFEAT(object, 
             barplot = TRUE, 
             digits = 2)

For example (this function is usually computationally expensive, so the number of inputs is reduced):

forestReduced <- RFEAT(data = PISAindex, 
                       x = c(6, 7, 8, 12, 17), 
                       y = 3:5,
                       numStop = 5, 
                       m = 30,
                       s_mtry = "BRM",
                       na.rm = TRUE)
rankingRFEAT(object = forestReduced, 
             barplot = TRUE,
             digits = 2)

bestRFEAT()

As in bestEAT(), the bestRFEAT function is applied to find the optimal hyperparameters that minimize the root mean square error (RMSE) calculated on the test sample. In this case, the available hyperparameters are numStop, m and s_mtry.

bestRFEAT(training, test,
          x, y,
          numStop = 5,
          m = 50,
          s_mtry = c("5", "BRM"),
          na.rm = TRUE)

In our example:

# n <- nrow(PISAindex)
# selected <- sample(1:n, n * 0.7)
# training <- PISAindex[selected, ]
# test <- PISAindex[- selected, ]

bestRFEAT(training = training,
          test = test,
          x = c(6, 7, 8, 12, 17),
          y = 3:5,
          numStop = c(5, 10), # set of possible numStop
          m = c(20, 30), # set of possible m
          s_mtry = c("1", "BRM")) # set of possible s_mtry 

The best Random Forest + Efficiency Analysis Trees model is given by the hyperparameters {numStop = 5, m = 20, s_mtry = "BRM"} with RMSE = 54.18.

bestRFEAT_model <- RFEAT(data = PISAindex,
                         x = c(6, 7, 8, 12, 17),
                         y = 3:5,
                         numStop = 5,
                         m = 20,
                         s_mtry = "BRM")

efficiencyRFEAT()

As in efficiencyEAT(), the efficiencyRFEAT function returns the efficiency scores for a set of DMUs. However, in this case it is only available for the BCC model with output orientation. Again, the FDH scores can be requested using FDH = TRUE.

efficiencyRFEAT(data, x, y,
                object,
                digits = 2,
                FDH = TRUE)

In our example:

scoresRF <- efficiencyRFEAT(data = PISAindex,
                            x = c(6, 7, 8, 12, 17), # input
                            y = 3:5, # output
                            object = bestRFEAT_model,
                            FDH = TRUE)

Predictions

predictEAT() and predictRFEAT() return a data frame with the data and the expected output for a set of observations using Efficiency Analysis Trees and Random Forest + Efficiency Analysis Trees techniques respectively. In both cases, newdata refers to a data frame and x the set of inputs to be used. Regarding the object argument, in the first case it corresponds to an EAT object and in the second case to a RFEAT object.

In predictions using an EAT object, only one Efficiency Analysis Tree is used. However, for the RFEAT model, the output is predicted by each of the m individual trees trained and subsequently the mean value of all predictions is obtained.
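The averaging step can be sketched in base R. The per-tree predictions below are made-up numbers for a single output, and the matrix layout is an assumption for illustration, not the internal structure of a RFEAT object.

```r
# Hypothetical predictions of one output for 3 DMUs (rows) by m = 4 trees (columns)
tree_predictions <- matrix(c(500, 510, 495, 505,
                             450, 455, 460, 445,
                             520, 530, 525, 515),
                           nrow = 3, byrow = TRUE)

# Forest prediction: mean of the individual tree predictions for each DMU
forest_prediction <- rowMeans(tree_predictions)

forest_prediction  # 502.5 452.5 522.5
```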

predictEAT()

predictEAT(object, newdata, x)

predictRFEAT()

predictRFEAT(object, newdata, x)

Finally, the following example compares the predictions made by each of the methods (both models have been previously defined):

# bestEAT_model <- EAT(data = PISAindex,
#                      x = c(6, 7, 8, 12, 17),
#                      y = 3:5,
#                      numStop = 7,
#                      fold = 5)

# bestRFEAT_model <- RFEAT(data = PISAindex,
#                          x = c(6, 7, 8, 12, 17),
#                          y = 3:5,
#                          numStop = 5,
#                          m = 20,
#                          s_mtry = "BRM")

predictions_EAT <- predictEAT(object = bestEAT_model,
                              newdata = PISAindex,
                              x = c(6, 7, 8, 12, 17))

predictions_RFEAT <- predictRFEAT(object = bestRFEAT_model,
                                  newdata = PISAindex,
                                  x = c(6, 7, 8, 12, 17))
predictions <- cbind(PISAindex[, 3], PISAindex[, 4], PISAindex[, 5], 
                     predictions_EAT[, 6], predictions_EAT[, 7], predictions_EAT[, 8],
                     predictions_RFEAT[, 6], predictions_RFEAT[, 7], predictions_RFEAT[, 8]) %>%
  as.data.frame()

names(predictions) <- c("S_PISA", "R_PISA", "M_PISA",
                       "S_EAT", "R_EAT", "M_EAT",
                       "S_RFEAT", "R_RFEAT", "M_RFEAT")

kableExtra::kable(predictions) %>%
  kableExtra::kable_styling("striped", full_width = F) %>%
  kableExtra::column_spec(c(1, 2, 3), background = "#DBFFD6") %>%
  kableExtra::column_spec(c(4, 5, 6), background = "#FFFFD1") %>%
  kableExtra::column_spec(c(7, 8, 9), background = "#FFCCF9")
new <- data.frame(NBMC = c(90, 95, 93),
                  WS = c(87, 92, 99),
                  S = c(93, 90, 90),
                  HW = c(90, 91, 92),
                  AAE = c(88, 91, 89))

predictions_EAT <- predictEAT(object = bestEAT_model,
                              newdata = new,
                              x = 1:5)


MiriamEsteve/EAT documentation built on Jan. 18, 2022, 6:55 p.m.