This vignette aims at presenting the features of the Channel Binary Entropy Triangle (CBET) for the evaluation of supervised classification in an incremental manner.
library(tidyverse) # That famous Mr. Wickham! library(caret) # To build the classifiers. library(e1071) library(mlbench) # Many databases for ML tasks library(vcd) # Categorical benchmarks library(candisc) # Wine dataset library(entropies) # Processing and visualizing joint entropies library(compositions)# Statistics work differently on compositional data
Some top level switches and options gathered in one place.
debugLevel <- 0 # Debug level 0-non-existent, 1-minimal, the greater the more verbose. fancy <- TRUE # set this for nicer on-screen visualization. #fancy <- FALSE # Set this for either printing matter or more austere plots. getPlot <- TRUE # Flag to obtain plots for publication. getPlot <- FALSE #Comment to get .jpeg files rather than plots of ets. knitr::opts_chunk$set(comment=NA, fig.width=6, fig.height=4) if (getPlot) knitr::opts_chunk$set(dev = 'pdf') # plots in pdf, better for publication
Some choices for visualization parameters, and primitives.
splitShapesForTypes <- c("X"=4, "Y"=1, "XY"=20) #To draw split diagrams # Naive transformation from factors to numbers in 0 to num.factors - 1 factor.as.numeric <- function(f){ nums <- as.numeric(f) return(nums - min(nums)) }
data(datasets) if (getPlot){# For papers, it helps to have the table in latex. library(xtable) print.xtable(xtable(dplyr::select(datasets, name, K, n, m))) }
## ----dataset processing---------------------------------- #data(datasets) if(interactive()){#latex-ing the table for publications library(xtable) ds4latexing <- datasets %>% select(name, K, n, m) row.names(ds4latexing) <- NULL names(ds4latexing) <- c("Dataset Name", "class card.", "num. features", "num. instances") thisLatex <- xtable(ds4latexing, caption="Some datasets considered in this study", label="tab:datasets") align(thisLatex) <- xalign(thisLatex) thisLatex } # Uncomment the name of the database to be analyzed # dsName <- "Ionosphere" dsName <- "iris" # dsName <- "BreastCancer"#Supply the name of the database to be analyzed#FACTORS # dsName <- "Wine" # dsName <- "Glass"#Cannot take logarithms for PCA: zeros returns -Inf #dsName <- "Arthritis"#It has non-numeric factors. # dsName <- "Sonar" dsRecord <- filter(datasets, name == dsName) ds <- loadDataset(dsRecord$name,dsRecord$packName)
We'll use the r dsName
data throughout this vignette.
#id columns, if existent in dsRecord$idNumber # log transform of everything but the class and any id if existant. if (!is.na(dsRecord$idNumber)){ ds <- ds[,-dsRecord$idNumber] } #class column ds.classNum <- which(names(ds)==dsRecord$className) #take away the class, but keep it just in case. class.ds <- ds[, ds.classNum]#saving the class. Warning A FACTOR! ds <- ds[,-ds.classNum] ds <- ds %>% #transform factors to number mutate_if(is.factor,factor.as.numeric) %>% # Dispose of columns with NaN select_if(function(v) !any(is.na(v))) %>% # Dispose of constant columns: they carry no information select_if(function(v)(var(v) > 0)) ncols <- ncol(ds)#Mnemonic shortcut: num of columns dsDiscretized <- infotheo::discretize(ds, disc="equalwidth") if (dsName != "Ionosphere"){ log.ds <- log(ds)#this has to be made conditional on the database log.dsDiscretized <- infotheo::discretize(log.ds) #TODO: try to get rid of annoying warnings each time entropy is called. } X <- as.matrix(ds) Y <- class.ds classes <- unique(Y) numC <- length(classes) print(sprintf("%s has %d classes with distribution: ", dsName, numC)) summary(Y)
Throughout this vignette we use a k-nearest neigbour classifier.
In this initial evaluation, we first carry out a basic random partitioning of the data.
set.seed(2117) inTrain <- createDataPartition(y=Y, p=0.80, # Tries to do stratified sampling list=FALSE) trainX <- X[inTrain,]; trainY <- Y[inTrain] testX <- X[-inTrain,]; testY <- Y[-inTrain] #Basic model fitting fit <- train(x=trainX, y=trainY, method="knn", tuneLength = 15, preProcess = c("center", "scale"))
Evaluation proceeds by obtaining the confusion matrices for the train and test sets...
## obtain a training caret::confusion matrix trCM <- confusionMatrix(predict(fit,trainX), trainY) #trEntropies <- trCoords <- jentropies(t(trCM$table)) ## prediction and the test confusion matrix predicted <- predict(fit, testX) teCM <- confusionMatrix(predicted,testY) #teEntropies <- teCoords <- jentropies(t(teCM$table))
And then printing the results in the Entropy Triangle for a single classification experiment:
experiments <- rbind(cbind(trCoords, Phase="train", method="knn"), cbind(teCoords, Phase="test", method="knn") ) experiments <- cbind(dSet=dsName, experiments) # The basic plot for the entropy triangle training and testX in different colours and glyphs gp <- ggmetern(ed=experiments %>% filter(type=="XY"), fancy) + geom_point(aes(colour=Phase, shape=dSet), size=1) + labs(shape="Dataset") + scale_colour_brewer(palette="Set1") gp
Note that, at least for r dsName
(and iris), there is a suspicious behaviour in the plot in that the classifier achieves a better information transfer (correlated with accuracy) in test than in training.
print(sprintf("Training accuracy= %f vs. Testing accuracy=%f ", trCM$overall[1], teCM$overall[1]))
This is part of the "evaluation paradox" for classifications: since the test must have a higher variance, there will be instances of train-test partitions where the performance on the testing set will be higher that on the training set. This is partially solved with n-fold validation.
To confirm this intuition and get all the value for our coin in the entropy triangle, in the following, we use n-fold validation to visualize several experiments and their mean performance.
First we create the folds: the number of folds is a parameter of this script.
numFolds <- 5 set.seed(1717) # For reproducibility folds <- createFolds(Y, numFolds) print("Check that the sampling was stratified...") for(i in 1:numFolds){ print(summary(Y[folds[[i]]])) } summary(Y)
Run the experiments
models <- c("knn") #c("knn", "logreg") results <- data.frame() for (i in 1:numFolds){ for (m in models){ # 1. select training and testX data and classes trainObs <- unlist(folds[-i]) testObs <- folds[[i]] trainX <- X[trainObs, ]; trainY <- Y[trainObs] testX <- X[testObs, ]; testY <- Y[testObs] # 2. Fit the model with the model <- train(x=trainX, y=trainY, method=m, tuneLength = 15, preProcess = c("center", "scale")) # 3. Estimate the labels for the train set: confusion matrix, entropies, etc. trainYhat <- predict(model, trainX) trainCM <- confusionMatrix(trainYhat, trainY) print(trainCM$table) # 4. Estimate the labels for the test set testYhat <- predict(model, testX) testCM <- confusionMatrix(testYhat, testY) print(testCM$table) # 5. Gather results for # CAVEAT: our framework supposes that in confusion matrices, rows are indexed by # the reference, hence the transposition below results <- rbind(results, evaluate(t(trainCM$table)) %>% mutate(Fold=i,method=m, Phase="train", Acc=trainCM$overall[1]), evaluate(t(testCM$table)) %>% mutate(Fold=i,method=m, Phase="test", Acc=testCM$overall[1]) ) print(sprintf("Fold %d, method %s Train accuracy = %f\t Test accuracy= %f", i, m, trainCM$overall[1],testCM$overall[1]) ) } } results <- cbind(dSet=dsName,results)#Watch it! This is only possible at last!
We show the plot for the result on a per-plot basis.
eT <- ggmetern(ed=results %>% filter(type=="XY"), fancy) + geom_point(aes(colour=Phase, shape=dSet), size=2) + labs(shape="Dataset") + # scale_colour_manual(values=c("blue","red")) # Don't trust the training, that is the red scale_colour_manual(values=c("red", "blue")) # Don't trust the training, that is the blue eT
Clearly, some test results are better than training results. What about centrality and dispersion measures?
results %>% filter(type=="XY") %>% select(-one_of(c("method"))) %>% group_by(type, Phase) %>% summarise(meanAcc=mean(Acc), sdAcc=sd(Acc), meanEMA = mean(EMA), sdEMA = sd(EMA))
This agrees with the theory that insists on the variance of the testing instances being higher.
Note that the Entropy-Modified Accuracy (EMA) is a more pessimistic alternative measure to Accuracy.
To use the split triangle to advantage we have to use an unbalanced dataset, e.g. Glass or Arthritis.
First we look at the entropies of the different folds.
eTsplit <- ggmetern(ed=results %>% filter(type!="XY"), fancy) + geom_point(aes(colour=Phase, shape=type), size=2) + labs(shape="Split Entropies") + scale_shape_manual(values=splitShapesForTypes) + scale_colour_manual(values=c("train"="blue","test"="red")) eTsplit + geom_text(data=results %>% filter(type == "XY"), aes(label=Fold,color=Phase)) #color="blue") size=4, vjust=2, hjust=1)
The number of the fold appears in the place where the aggregate entropies appeared in the previous CBET.
We can see that the input entropies $H_{P_X}$ are much more concentrated. This is because the sampling was stratified. While in theory it should obtain similar reference class distributions for all folds, e.g. the "x" above should lie in a line parallel to the left side of the triangle, in practice the stratification is imperfect.
Also the predicted class distributions are much less entropic than the reference class distributions: their uncertainty has decreased, contrary to the data processing inequality. The reason for this is the classifier concentrating on majority classes, i.e. specializing, which increases the Accuracy.
To better view this phenomenon, we obtain the (compositional) means of the entropy values.
meanCompositions <- data.frame() for(dsNam in unique(results$dSet)){ for(ph in unique(results$Phase)){ for(eType in unique(results$type)){ meanCompositions <- rbind(meanCompositions, cbind(dSet=dsNam, Phase=ph, type=eType, data.frame(as.list( mean(acomp( results %>% filter(dSet==dsNam, Phase==ph, type==eType) %>% select(one_of( c("DeltaH_P", "M_P", "VI_P")))) ))) ) ) } } } meanCompositions %>% filter(type=="XY")
And now we add it with a different glyph but the same colors.
eTsplit + geom_point(data=meanCompositions, aes(colour=Phase, shape=type), size=4)
This shows that there is very little mutual information transfer. It was made worse by the classifier specializing, that is, increasing $\Delta H_{P_{XY}}$ at the expense of mutual information.
All in all, r models
is a very bad classifier for r dsName
.
More information about the evaluation of classifiers with the Channel Binary Entropy Triangle can be found in
library(bibtex) print(citation("entropies")['val:pel:14a'], style="text")
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.