Feature or Variable Selection in Data Envelopment Analysis with Linear Programming

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 6,
  fig.height = 4
)
library('adea')

Introduction

Variable selection in Data Envelopment Analysis (DEA) is a crucial consideration that requires careful attention before the results of an analysis can be applied in a real-world context. This is because the outcomes can vary significantly depending on the variables included in the model. Therefore, variable selection is a fundamental step in every DEA application.

One of the methods proposed for this is what is known as the feature selection method. This method constructs a linear programming problem to maximize some objective function related to the dmu efficiencies.

fsdea function implements the feature selection method in the article [@Benites2020]

Let's load and inspect the "tokyo_libraries" dataset using the following code:

data(tokyo_libraries)
head(tokyo_libraries)

Lets't take inputs as: Area.I1, Books.I2, Staff.I3 and Populations.I4variables and outputs as Regist.O1 and Borrow.O2

input <- tokyo_libraries[, 1:4]
output <- tokyo_libraries[, 5:6]

First, let's do a standard DEA analysis with:"

dea <- dea(input, output)
dea

Summarizing DEA, we can calculate the average efficiency:

summary(dea)

Reducing the number of variables

Suposse that we want a model with only 5 variables, then the following call does the job:

dea5v <- fsdea(input, output, nvariables = 5)
dea5v

We can calculate the average efficiency for the new model by summarizing it:

summary(dea5v)
deasummary <- summary(dea)
deaaverage <- deasummary$`Eff. Med`
dea5vsummary <- summary(dea5v)
dea5vaverage <- dea5vsummary$`Eff. Med`

Observe that average efficiency has decreased from r deaaverage to r dea5vaverage. This could indicate that the new model is not an improvement over the previous one.

To delve deeper into the results, we can plot the efficiencies:

plot(dea$eff, dea5v$eff)
abline(a = 0, b = 1)

The graph reveals that most efficiencies are very similar, a fact that is confirmed by the correlation coefficient r cor(dea$eff, dea5v$eff) which is very close to 1.

Reducing the number of outputs

In the previous case, reducing the number of variables led to a decrease in one input. We achieved the same result with the call fsdea(input, output, ninputs = 3). However, perhaps our goal is to reduce an output. Let's achieve this with the following call:

dea1o <- fsdea(input, output, noutputs = 1)
dea1o
dea1osummary <- summary(dea1o)
dea1oaverage <- dea1osummary$`Eff. Mean`

Observe that, again, average efficiency has decreased from r deaaverage to r dea1oaverage.

To delve deeper into the results, we can plot the new efficiencies:

plot(dea$eff, dea1o$eff)
abline(a = 0, b = 1)

In this case, the differences are more significant compared to the previous two models. This is confirmed by a lower correlation coefficient r cor(dea$eff, dea1o$eff).

All these could indicate that to enhance the model, it may be necessary to remove multiple variables simultaneously.

References



Try the adea package in your browser

Any scripts or data that you put into this service are public.

adea documentation built on Nov. 23, 2023, 3:02 p.m.