Feature or Variable Selection in Data Envelopment Analysis with Linear Programming
In adea: Alternate DEA Package

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 6,
  fig.height = 4
)
library('adea')

Introduction

Variable selection in Data Envelopment Analysis (DEA) is a crucial consideration that requires careful attention before the results of an analysis can be applied in a real-world context. This is because the outcomes can vary significantly depending on the variables included in the model. Therefore, variable selection is a fundamental step in every DEA application.

One of the methods proposed for this purpose is the feature selection method, which constructs a linear programming problem that maximizes an objective function related to the efficiencies of the decision-making units (DMUs).

The fsdea function implements the feature selection method described in [@Benites2020].
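For orientation, the linear program behind a standard input-oriented DEA model under constant returns to scale (the textbook CCR multiplier form) can be written as below. This is background only: it is not the feature selection program from the cited article, and whether fsdea builds on exactly this form is an assumption. The notation is also an assumption of this sketch: $x_{ij}$ is input $i$ of DMU $j$, $y_{rj}$ is output $r$ of DMU $j$, $v_i$ and $u_r$ are the input and output weights, and $k$ is the DMU being evaluated among $n$ DMUs with $m$ inputs and $s$ outputs.

```latex
\[
\begin{aligned}
\max_{u,\,v}\quad & \sum_{r=1}^{s} u_r\, y_{rk} \\
\text{s.t.}\quad & \sum_{i=1}^{m} v_i\, x_{ik} = 1, \\
& \sum_{r=1}^{s} u_r\, y_{rj} - \sum_{i=1}^{m} v_i\, x_{ij} \le 0, \qquad j = 1, \dots, n, \\
& u_r \ge 0, \quad v_i \ge 0 .
\end{aligned}
\]
```

The optimal objective value is the efficiency score of DMU $k$. Dropping a variable amounts to forcing its weight to zero, which is why efficiency scores cannot increase when variables are removed.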

Let's load and inspect the "tokyo_libraries" dataset using the following code:

data(tokyo_libraries)
head(tokyo_libraries)

Let's take the variables Area.I1, Books.I2, Staff.I3, and Populations.I4 as inputs, and Regist.O1 and Borrow.O2 as outputs:

input <- tokyo_libraries[, 1:4]
output <- tokyo_libraries[, 5:6]
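Selecting columns by position works here, but it silently breaks if the column order changes. As a minimal sketch, the same split can be made from the column names, assuming the naming convention seen above (inputs end in .I followed by a number, outputs in .O followed by a number). The data frame libs below is a toy stand-in with made-up values that only mimics the column names; it is not the real dataset.

```r
# Toy stand-in that only mimics the column names of tokyo_libraries.
# The values are illustrative, not real data.
libs <- data.frame(
  Area.I1        = c(1, 2, 3),
  Books.I2       = c(10, 20, 30),
  Staff.I3       = c(4, 5, 6),
  Populations.I4 = c(100, 200, 300),
  Regist.O1      = c(7, 8, 9),
  Borrow.O2      = c(70, 80, 90)
)

# Assumed convention: ".I<number>" marks an input, ".O<number>" an output.
is_input  <- grepl("\\.I[0-9]+$", names(libs))
is_output <- grepl("\\.O[0-9]+$", names(libs))

# drop = FALSE keeps a data frame even if only one column matches.
input_by_name  <- libs[, is_input, drop = FALSE]
output_by_name <- libs[, is_output, drop = FALSE]

names(input_by_name)
names(output_by_name)
```

Replacing libs with tokyo_libraries should give the same input and output sets as the positional selection above, provided the column names follow that convention.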

First, let's run a standard DEA analysis:

dea <- dea(input, output)
dea

Summarizing the model gives, among other statistics, the average efficiency:

summary(dea)

Reducing the number of variables

Suppose that we want a model with only 5 variables. The following call does the job:

dea5v <- fsdea(input, output, nvariables = 5)
dea5v

We can calculate the average efficiency for the new model by summarizing it:

summary(dea5v)

# Extract the mean efficiency of each model for the comparison in the text
deasummary <- summary(dea)
deaaverage <- deasummary$`Eff. Mean`
dea5vsummary <- summary(dea5v)
dea5vaverage <- dea5vsummary$`Eff. Mean`

Observe that the average efficiency has decreased from `r deaaverage` to `r dea5vaverage`. This could indicate that the new model is not an improvement over the previous one.

To delve deeper into the results, we can plot the efficiencies:

plot(dea$eff, dea5v$eff)
abline(a = 0, b = 1)

The graph reveals that most efficiencies are very similar, a fact that is confirmed by the correlation coefficient, `r cor(dea$eff, dea5v$eff)`, which is very close to 1.
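The correlation coefficient alone can hide a few large individual changes, so it may help to look at the differences directly. The helper below is a hypothetical sketch, not part of adea: compare_eff and its tol argument are names invented for this illustration, and the two vectors are made-up values rather than results from tokyo_libraries.

```r
# Hypothetical helper (not part of adea): summarize how two efficiency
# vectors for the same DMUs differ.
compare_eff <- function(eff_full, eff_reduced, tol = 0.01) {
  stopifnot(is.numeric(eff_full), is.numeric(eff_reduced),
            length(eff_full) == length(eff_reduced))
  delta <- eff_full - eff_reduced
  list(
    correlation  = cor(eff_full, eff_reduced),  # Pearson correlation
    max_abs_diff = max(abs(delta)),             # largest single change
    n_changed    = sum(abs(delta) > tol)        # DMUs changed by more than tol
  )
}

# Illustrative values only, not taken from tokyo_libraries.
eff_a <- c(1.00, 0.85, 0.60, 0.92, 0.75)
eff_b <- c(1.00, 0.80, 0.60, 0.90, 0.75)

res <- compare_eff(eff_a, eff_b)
print(res)
```

With the objects from this vignette, the equivalent call would be compare_eff(dea$eff, dea5v$eff).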

Reducing the number of outputs

In the previous case, reducing the number of variables removed one input. We would obtain the same result with the call fsdea(input, output, ninputs = 3). However, our goal may instead be to drop an output. Let's do this with the following call:

dea1o <- fsdea(input, output, noutputs = 1)
dea1o

dea1osummary <- summary(dea1o)
dea1oaverage <- dea1osummary$`Eff. Mean`

Observe that, again, the average efficiency has decreased, from `r deaaverage` to `r dea1oaverage`.

To delve deeper into the results, we can plot the new efficiencies:

plot(dea$eff, dea1o$eff)
abline(a = 0, b = 1)

In this case, the differences are larger than those between the previous two models. This is confirmed by a lower correlation coefficient, `r cor(dea$eff, dea1o$eff)`.

All of this could indicate that improving the model may require removing several variables simultaneously.
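One way to explore this is to fit models of decreasing size and track the average efficiency. The sketch below only uses calls that appear earlier in this vignette (fsdea with nvariables, and the eff component of the result). That fsdea accepts every value in the chosen range for this dataset is an assumption, and sizes and avg_eff are names introduced here for illustration.

```r
library(adea)

data(tokyo_libraries)
input <- tokyo_libraries[, 1:4]
output <- tokyo_libraries[, 5:6]

# Total number of variables to keep; 6 keeps all inputs and outputs.
# Assumption: fsdea accepts each of these values for this dataset.
sizes <- 6:4

# Fit one model per size and record its mean efficiency.
avg_eff <- vapply(sizes, function(k) {
  model <- fsdea(input, output, nvariables = k)
  mean(model$eff)
}, numeric(1))
names(avg_eff) <- paste0("nvariables=", sizes)

avg_eff
```

A sharp drop between two consecutive sizes would suggest that the variable removed at that step carries information the remaining ones do not.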