Variable Selection in Data Envelopment Analysis with ADEA Method

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 6,
  fig.height = 4
)
library('adea')

Introduction

Variable selection in Data Envelopment Analysis (DEA) deserves careful attention before the results of an analysis are applied in a real-world context, because the outcomes can vary significantly depending on the variables included in the model. It is therefore a fundamental step in every DEA application.

ADEA provides a measure known as "load" to assess the contribution of a variable to a DEA model. In an ideal scenario where all variables contribute equally, all loads would be equal to 1. For instance, if the load of an output variable is 0.75, it signifies that its contribution is 75% of the average value for all outputs. A load value below 0.6 indicates that the variable's contribution to the DEA model is negligible.
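
As a rough illustration of how to read these values, the following sketch expresses two made-up output contributions relative to their average. The contribution figures and the names Output1 and Output2 are hypothetical, and this is not how adea computes loads internally; it only mirrors the interpretation described above.

```r
# Hypothetical contributions of two outputs to a DEA model
# (made-up numbers, not computed by adea).
contribution <- c(Output1 = 0.375, Output2 = 0.625)

# Express each contribution relative to the average contribution,
# so that equal contributions would all give a value of 1.
relative <- contribution / mean(contribution)
print(relative)

# Check which values fall under the 0.6 threshold mentioned in the text.
print(relative < 0.6)
```

Here the first output contributes 75% of the average and the second 125%, so neither would be regarded as negligible under the 0.6 threshold.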

For more information about loads, see the help page for adea in the package documentation, or see [@Fernandez2018] and [@Villanueva2021].

Let's load and inspect the "tokyo_libraries" dataset using the following code:

data(tokyo_libraries)
head(tokyo_libraries)

Stepwise variable selection

Two stepwise variable selection functions are provided. The first one eliminates variables one by one, creating a set of nested models. The following code sets up input and output variables and performs the call:

input <- tokyo_libraries[, 1:4]
output <- tokyo_libraries[, 5:6]
adea_hierarchical(input, output)

m <- adea_hierarchical(input, output)

The load of the first model (m$models[[6]]$load$load, the full model with six variables) falls below the minimum significance level, indicating that Area.I1 can be removed from the model.

When a variable is removed, one would expect the load of all remaining variables to increase. However, this doesn't occur after the second model. Therefore, the third model is inferior to the second, and there is no statistical rationale for selecting it.

To avoid this, a second stepwise variable selection function is provided. It is called as follows:

adea_parametric(input, output)

By default, both functions consider all variables for removal. The load.orientation parameter restricts which variables are included in the load analysis: input for input variables only, output for output variables only, or inoutput (the default) for all variables. The following call considers only output variables as candidates for removal:

adea_parametric(input, output, load.orientation = 'output')
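
For comparison, the same call can be restricted to the input side. This is a minimal sketch that reuses the data setup from earlier in this document and assumes only the 'input' value of load.orientation described above; its output is not shown here.

```r
library(adea)

# Same data setup as earlier in this document.
data(tokyo_libraries)
input <- tokyo_libraries[, 1:4]
output <- tokyo_libraries[, 5:6]

# Only input variables are candidates for removal.
m_input <- adea_parametric(input, output, load.orientation = 'input')
m_input
```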

Both adea_hierarchical and adea_parametric return an object with an element called models, a list that contains all computed models. An individual model can be accessed as follows:

m <- adea_hierarchical(input, output)
m4 <- m$models[[4]]
m4

where the number in square brackets is the total number of variables in the model.
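
Since the list is indexed by the number of variables, the stored models can also be walked in a loop. The sketch below is one possible way to do so. It assumes that positions with no stored model are NULL and that each model exposes its load as load$load, the same expression used earlier in this document.

```r
library(adea)

# Same data setup as earlier in this document.
data(tokyo_libraries)
input <- tokyo_libraries[, 1:4]
output <- tokyo_libraries[, 5:6]
m <- adea_hierarchical(input, output)

# Walk over the list of models; the index is the total number of variables.
for (k in seq_along(m$models)) {
  model <- m$models[[k]]
  # Skip positions where no model is stored (assumed to be NULL).
  if (is.null(model)) next
  cat(k, "variables, load:", format(model$load$load), "\n")
}
```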

By default, when the print function is used with an ADEA model, it displays only efficiencies. The summary function provides a more comprehensive output: