In FJValverde/entropies: Plotting ternary entropy diagrams for joint distributions, contingency matrices, etc

This vignette aims at presenting the features of the Channel Binary Entropy Triangle (CBET) for the evaluation of supervised classification in an incremental manner.

Environment construction

library(tidyverse) # That famous Mr. Wickham!
library(caret)    # To build the classifiers.
library(e1071)
library(mlbench)  # Many databases for ML tasks
library(vcd)       # Categorical benchmarks
library(candisc)   # Wine dataset
library(entropies) # Processing and visualizing joint entropies
library(compositions)# Statistics work differently on compositional data

Some top level switches and options gathered in one place.

debugLevel <- 0 # Debug level 0-non-existent, 1-minimal, the greater the more verbose.
fancy <- TRUE  # set this for nicer on-screen visualization.
#fancy <- FALSE # Set this for either printing matter or more austere plots.
getPlot <- TRUE # Flag to obtain plots for publication. 
getPlot <- FALSE #Comment to get .jpeg files rather than plots of ets.
knitr::opts_chunk$set(comment=NA, fig.width=6, fig.height=4)
if (getPlot)
    knitr::opts_chunk$set(dev = 'pdf') # plots in pdf, better for publication

Some choices for visualization parameters, and primitives.

splitShapesForTypes <- c("X"=4, "Y"=1, "XY"=20) #To draw split diagrams
# Naive transformation from factors to numbers in 0 to num.factors - 1
factor.as.numeric <- function(f){
    nums <- as.numeric(f)
    return(nums - min(nums))
}

Datasets available for entropy analysis in this package

data(datasets)
if (getPlot){# For papers, it helps to have the table in latex.
    library(xtable)
    print.xtable(xtable(dplyr::select(datasets, name, K, n, m)))
}

## ----dataset processing----------------------------------
#data(datasets)
if(interactive()){#latex-ing the table for publications
    library(xtable)
    ds4latexing <- datasets %>% select(name, K, n, m) 
    row.names(ds4latexing) <- NULL
    names(ds4latexing) <- c("Dataset Name", 
                            "class card.", 
                            "num. features", 
                            "num. instances")
    thisLatex <- xtable(ds4latexing, 
                    caption="Some datasets considered in this study",
                    label="tab:datasets")
    align(thisLatex) <- xalign(thisLatex)
    thisLatex
}


# Uncomment the name of the database to be analyzed
# dsName <- "Ionosphere"
dsName <- "iris"
# dsName <- "BreastCancer"#Supply the name of the database to be analyzed#FACTORS
# dsName <- "Wine"
# dsName <- "Glass"#Cannot take logarithms for PCA: zeros returns -Inf
#dsName <- "Arthritis"#It has non-numeric factors.
# dsName <- "Sonar"
dsRecord <-  filter(datasets, name == dsName)
ds <- loadDataset(dsRecord$name,dsRecord$packName)

We'll use the r dsName data throughout this vignette.

Classifier design

Basic data from the set for classification

#id columns, if existent in dsRecord$idNumber
# log transform of everything but the class and any id if existant. 
if (!is.na(dsRecord$idNumber)){
    ds <- ds[,-dsRecord$idNumber]
}
#class column
ds.classNum <- which(names(ds)==dsRecord$className)
#take away the class, but keep it just in case.
class.ds <- ds[, ds.classNum]#saving the class. Warning A FACTOR!
ds <- ds[,-ds.classNum]
ds <- ds %>%     
    #transform factors to number
    mutate_if(is.factor,factor.as.numeric) %>%
    # Dispose of columns with NaN
    select_if(function(v) !any(is.na(v))) %>% 
    # Dispose of constant columns: they carry no information
    select_if(function(v)(var(v) > 0))
ncols <- ncol(ds)#Mnemonic shortcut: num of columns
dsDiscretized <- infotheo::discretize(ds, disc="equalwidth")
if (dsName != "Ionosphere"){
    log.ds <- log(ds)#this has to be made conditional on the database
    log.dsDiscretized <- infotheo::discretize(log.ds)
    #TODO: try to get rid of annoying warnings each time entropy is called. 
}
X <- as.matrix(ds)
Y <- class.ds
classes <- unique(Y)
numC <- length(classes)
print(sprintf("%s has %d classes with distribution: ", dsName, numC))
summary(Y)

Design a simple classifier

Throughout this vignette we use a k-nearest neigbour classifier.

In this initial evaluation, we first carry out a basic random partitioning of the data.

set.seed(2117)
inTrain <- createDataPartition(y=Y,
                               p=0.80, # Tries to do stratified sampling
                               list=FALSE)
trainX <- X[inTrain,]; trainY <- Y[inTrain]
testX <- X[-inTrain,]; testY <- Y[-inTrain]
#Basic model fitting
fit <- train(x=trainX, y=trainY, 
              method="knn",
              tuneLength = 15,
              preProcess = c("center", "scale"))

Evaluation using the CBET

Evaluation proceeds by obtaining the confusion matrices for the train and test sets...

## obtain a training caret::confusion matrix
trCM <- confusionMatrix(predict(fit,trainX), trainY)
#trEntropies <- 
trCoords <- jentropies(t(trCM$table))

## prediction and the test confusion matrix
predicted <- predict(fit, testX)
teCM <- confusionMatrix(predicted,testY)
#teEntropies <- 
teCoords <- jentropies(t(teCM$table))

And then printing the results in the Entropy Triangle for a single classification experiment:

experiments <- rbind(cbind(trCoords, Phase="train", method="knn"),
                     cbind(teCoords, Phase="test", method="knn")
                     )
experiments <- cbind(dSet=dsName, experiments)
# The basic plot for the entropy triangle training and testX in different colours and glyphs
gp <- ggmetern(ed=experiments %>% filter(type=="XY"), fancy) +
    geom_point(aes(colour=Phase, shape=dSet), size=1)  +
    labs(shape="Dataset") + 
    scale_colour_brewer(palette="Set1")
gp

Note that, at least for r dsName (and iris), there is a suspicious behaviour in the plot in that the classifier achieves a better information transfer (correlated with accuracy) in test than in training.

print(sprintf("Training accuracy= %f vs. Testing accuracy=%f ", trCM$overall[1], teCM$overall[1]))

This is part of the "evaluation paradox" for classifications: since the test must have a higher variance, there will be instances of train-test partitions where the performance on the testing set will be higher that on the training set. This is partially solved with n-fold validation.

A better picture with n-fold validation

To confirm this intuition and get all the value for our coin in the entropy triangle, in the following, we use n-fold validation to visualize several experiments and their mean performance.

First we create the folds: the number of folds is a parameter of this script.

numFolds <- 5
set.seed(1717) # For reproducibility
folds <- createFolds(Y, numFolds)
print("Check that the sampling was stratified...")
for(i in 1:numFolds){
    print(summary(Y[folds[[i]]]))
}
summary(Y)

Run the experiments

models <- c("knn") #c("knn", "logreg") 
results <- data.frame()
for (i in 1:numFolds){
    for (m in models){
        # 1. select training and testX data and classes
        trainObs <- unlist(folds[-i])
        testObs <- folds[[i]]
        trainX <- X[trainObs, ]; trainY <- Y[trainObs]
        testX <- X[testObs, ]; testY <- Y[testObs]
        # 2. Fit the model with the 
        model <- train(x=trainX, y=trainY, 
                       method=m,
                       tuneLength = 15,
                       preProcess = c("center", "scale"))
        # 3. Estimate the labels for the train set: confusion matrix, entropies, etc.
        trainYhat <- predict(model, trainX)
        trainCM <- confusionMatrix(trainYhat, trainY)
        print(trainCM$table)
        # 4. Estimate the labels for the test set
        testYhat <- predict(model, testX)
        testCM <- confusionMatrix(testYhat, testY)
        print(testCM$table)
        # 5. Gather results for 
        # CAVEAT: our framework supposes that in confusion matrices, rows are indexed by 
        # the reference, hence the transposition below
        results <- rbind(results, 
                         evaluate(t(trainCM$table)) %>% mutate(Fold=i,method=m, Phase="train",
                               Acc=trainCM$overall[1]),
                         evaluate(t(testCM$table)) %>% mutate(Fold=i,method=m, Phase="test",
                               Acc=testCM$overall[1])
        )
        print(sprintf("Fold %d, method %s Train accuracy = %f\t Test accuracy= %f", 
                      i, m, trainCM$overall[1],testCM$overall[1])
        )
    }
}
results <- cbind(dSet=dsName,results)#Watch it! This is only possible at last!

We show the plot for the result on a per-plot basis.

eT <- ggmetern(ed=results %>% filter(type=="XY"), fancy) + 
    geom_point(aes(colour=Phase, shape=dSet), size=2)  +
    labs(shape="Dataset") + 
    # scale_colour_manual(values=c("blue","red")) # Don't trust the training, that is the red
    scale_colour_manual(values=c("red", "blue")) # Don't trust the training, that is the blue
eT

Clearly, some test results are better than training results. What about centrality and dispersion measures?

results %>% filter(type=="XY") %>% select(-one_of(c("method"))) %>% 
    group_by(type, Phase) %>%
    summarise(meanAcc=mean(Acc), sdAcc=sd(Acc), meanEMA = mean(EMA), sdEMA = sd(EMA))

This agrees with the theory that insists on the variance of the testing instances being higher.

Note that the Entropy-Modified Accuracy (EMA) is a more pessimistic alternative measure to Accuracy.

Visualization with the split triangle

To use the split triangle to advantage we have to use an unbalanced dataset, e.g. Glass or Arthritis.

First we look at the entropies of the different folds.

eTsplit <- ggmetern(ed=results %>% filter(type!="XY"), fancy) + 
    geom_point(aes(colour=Phase, shape=type), size=2)  +
    labs(shape="Split Entropies") + 
    scale_shape_manual(values=splitShapesForTypes) +
    scale_colour_manual(values=c("train"="blue","test"="red"))

eTsplit + geom_text(data=results %>% filter(type == "XY"), aes(label=Fold,color=Phase))
              #color="blue") size=4, vjust=2, hjust=1)

The number of the fold appears in the place where the aggregate entropies appeared in the previous CBET.

We can see that the input entropies $H_{P_X}$ are much more concentrated. This is because the sampling was stratified. While in theory it should obtain similar reference class distributions for all folds, e.g. the "x" above should lie in a line parallel to the left side of the triangle, in practice the stratification is imperfect.

Also the predicted class distributions are much less entropic than the reference class distributions: their uncertainty has decreased, contrary to the data processing inequality. The reason for this is the classifier concentrating on majority classes, i.e. specializing, which increases the Accuracy.

To better view this phenomenon, we obtain the (compositional) means of the entropy values.

meanCompositions <- data.frame()
for(dsNam in unique(results$dSet)){
    for(ph in unique(results$Phase)){
        for(eType in unique(results$type)){
            meanCompositions <- 
                rbind(meanCompositions,
                      cbind(dSet=dsNam, Phase=ph, type=eType,
                            data.frame(as.list(
                                mean(acomp(
                                    results %>% 
                                    filter(dSet==dsNam, 
                                           Phase==ph, 
                                           type==eType) %>%
                                    select(one_of(
                                        c("DeltaH_P", "M_P", "VI_P"))))
                                     )))
                            )
                      )
        }
    }
}
meanCompositions %>% filter(type=="XY")

And now we add it with a different glyph but the same colors.

eTsplit + geom_point(data=meanCompositions, aes(colour=Phase, shape=type), size=4)

This shows that there is very little mutual information transfer. It was made worse by the classifier specializing, that is, increasing $\Delta H_{P_{XY}}$ at the expense of mutual information.

All in all, r models is a very bad classifier for r dsName.

Postscriptum

More information about the evaluation of classifiers with the Channel Binary Entropy Triangle can be found in

library(bibtex)
print(citation("entropies")['val:pel:14a'], style="text")

Session information

sessionInfo()

FJValverde/entropies documentation built on Oct. 12, 2023, 10:17 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

FJValverde/entropies
Plotting ternary entropy diagrams for joint distributions, contingency matrices, etc

In FJValverde/entropies: Plotting ternary entropy diagrams for joint distributions, contingency matrices, etc

Environment construction

Datasets available for entropy analysis in this package

Classifier design

Basic data from the set for classification

Design a simple classifier

Evaluation using the CBET

A better picture with n-fold validation

Visualization with the split triangle

Postscriptum

Session information

R Package Documentation

Browse R Packages

We want your feedback!

FJValverde/entropies Plotting ternary entropy diagrams for joint distributions, contingency matrices, etc

In FJValverde/entropies: Plotting ternary entropy diagrams for joint distributions, contingency matrices, etc

Environment construction

Datasets available for entropy analysis in this package

Classifier design

Basic data from the set for classification

Design a simple classifier

Evaluation using the CBET

A better picture with n-fold validation

Visualization with the split triangle

Postscriptum

Session information

R Package Documentation

Browse R Packages

We want your feedback!

FJValverde/entropies
Plotting ternary entropy diagrams for joint distributions, contingency matrices, etc