Introduction

This vignette shows how to carry out Exploratory Data Analysis (EDA) on datasets prior to building classifiers, engineering features, or carrying out any other processing.

The objective is to form expectations about the performance of later data-processing steps by observing the information content of the datasets. This is an instance of Infodynamics (or rather of Infostatics, to be precise!).

The illustration below suggests the setting of this exploration: we initially have a source of information $\overline K$, and an observation process (unavailable to us) has produced a random vector of features $\overline X$.

knitr::include_graphics('./figs/classifierChain.jpeg')

In this vignette we use datasets whose sources are single random variables (RVs) but whose observations are multivariate feature vectors, and we will use the Source Multivariate Entropy Triangle (SMET) to explore the random vector of observations as if it were the actual source of information.

To help build an understanding, we reproduce Figure 6 from [@val:pel:17b].

knitr::include_graphics('./figs/annotatedSMET_ESWAFig6.jpeg')

Environment construction

Knitting options

options(rmarkdown.html_vignette.check_title = FALSE)
knitr::opts_chunk$set(comment=NA, fig.width=6, fig.height=4)
knitr::opts_chunk$set(warning=FALSE) # Warnings should not appear in the knitted document

Library loading

library(tidyverse) # Acceding to Mr. Wickham's proposals!
library(entropies) # This package. Imports many others. 
library(ggtern)    # Primitives to print ternary densities
library(vcd)       # Categorical benchmarks
library(mlbench)   # ml benchmarks
library(candisc)   # Wine dataset

Global option definition

fancy <- TRUE  # Set this for nicer on-screen visualization
#fancy <- FALSE # Set this for printed matter

# A color blind palette from: http://www.cookbook-r.com/Graphs/Colors_%28ggplot2%29/#a-colorblind-friendly-palette
# The palette with grey:
#cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
# The palette with black:
cbbPalette <- c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

getPlot <- TRUE # Flag to obtain plot files for publication
getPlot <- FALSE # Comment this line out to get .jpeg files rather than on-screen plots of the ETs
if (getPlot)
    knitr::opts_chunk$set(dev = 'pdf') # better for publication

An example of SMET use

Datasets available

First we bring in an inventory of datasets to be explored. In a "real" application this would be the data to be explored.

data(datasets)
if(interactive()){#latex-ing the table for publications
    library(xtable)
    ds4latexing <- datasets %>% select(name, K, n, m) 
    row.names(ds4latexing) <- NULL
    names(ds4latexing) <- c("Dataset Name", 
                            "class card.", 
                            "num. features", 
                            "num. instances")
    thisLatex <- xtable(ds4latexing, 
                    caption="Some datasets considered in this study",
                    label="tab:datasets")
    align(thisLatex) <- xalign(thisLatex)
    thisLatex
}

Obtaining the entropies

The beginning step of the exploration is to obtain the entropies from all datasets and some other data for plotting. In a typical application this would involve only the particular datasets being considered.

We provide a convenience function `getDatasetSourceEntropies` for this vignette and others, coupled to the `loadDataset` function used below.
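
For instance, the entropies of a single dataset could be obtained as follows (a minimal sketch, assuming the "iris" record of the `datasets` inventory loaded above):

# Obtain the source entropies for a single dataset, here "iris"
dsRecord <- datasets %>% filter(name == "iris")
irisEdf <- getDatasetSourceEntropies(
    loadDataset(dsRecord$name, dsRecord$packName),
    dsRecord$name,
    className = dsRecord$className,
    idNumber = dsRecord$idNumber,
    withClass = TRUE,  # include the class variable in the analysis
    type = "total",
    method = "emp"     # empirical estimation of the entropies
)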

edf <- data.frame()
for(i in 1:nrow(datasets)){
    dsRecord <-  datasets[i, ]
    dsName <- dsRecord$name
    edf <- rbind(
        edf,
        getDatasetSourceEntropies(
            loadDataset(dsName,dsRecord$packName) , 
            dsName, 
            className = dsRecord$className,
            idNumber= dsRecord$idNumber,
            withClass=TRUE, 
            type="total",
            method="emp"#Method to work out entropies,
            )
        )
}
summary(edf)

This returns the entropic components needed to plot the information in the SMET, along with some bookkeeping information: the dataset and variable names, and whether the class variable was included.
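
As a quick check (a sketch, assuming the column names used in the rest of this vignette), we can peek at a few rows of the per-feature entropy components:

# Inspect the entropy components and bookkeeping columns of a few features
edf %>%
    select(dsName, name, H_Uxi, DeltaH_Pxi, M_Pxi, VI_Pxi, withClass, isClass) %>%
    head()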

Observing the entropy of the classes

It is important to quantify how much of the information in the class is captured in the features.

This can be captured in the triangle by considering an extended random vector $(\overline K, \overline X)$ and observing the source entropy for the classes: ideally, the information in the class should all be redundant, implying that the set of features actually captures all of the information in the class.

onlyClasses <- edf %>% filter(isClass==TRUE)
# Plot the triangle with only the class variable of each dataset
smetClasses <-  ggmetern(onlyClasses, fancy) + 
    geom_point(mapping=aes(shape=dsName), size=3) +
    scale_shape_manual(values=1:nrow(onlyClasses)) +
    labs(shape="Dataset")
if (fancy){
    smetClasses <- smetClasses + 
        ggtitle("How is the information in  Classes captured accross datasets?")
}
smetClasses
if (getPlot){
    dev.off()#Necessary to do the textual plot.
    ggsave("multisplit_for_class_features.jpeg", plot=smetClasses)
}

From this plot we can see how well the features of each dataset capture the information in its class variable.

Plotting aggregate data

Now we print the total entropy balance of $\overline X$ in the triangle for the different datasets. This allows us to compare the different relative compositions of the datasets.

aggregateEdf <-  edf %>% 
            mutate(name=dsName) %>% # interested in whole datasets
            group_by(name, withClass) %>% # and whether the class is considered or not
            summarise(H_Ux = sum(H_Uxi),
                    H_Px = sum(H_Pxi),
                    DeltaH_Px = sum(DeltaH_Pxi),
                    VI_Px = sum(VI_Pxi),
                    M_Px = sum(M_Pxi))

Let's visualize the aggregate data without the class. This tells us how redundant and balanced the random vector of observations $\overline X$ is, and how much information it offers.

aggregateSMET <-  
    ggmetern(aggregateEdf %>% filter(withClass == FALSE), fancy) + 
        geom_point(mapping=aes(shape=name), colour="blue", size=3) +
        scale_shape_manual(values=1:nrow(datasets)) +
        labs(shape="Dataset")
if (fancy){
    aggregateSMET <- aggregateSMET + 
        ggtitle("How redundant are the features on average in each dataset?")
}
aggregateSMET
if (getPlot){
    dev.off()#Necessary to do the textual plot.
    ggsave("aggregated_without_label.jpeg", plot=aggregateSMET)
}

We can now compare how redundant and how balanced each dataset's features are on aggregate.

Aggregate triangles cannot offer more information: if we want to consider each feature independently, we have to use the split triangle (see below).

The SMET does not report transmitted information

Since the class is part of the source, it carries some information about each of the features. We next see this effect.

smet <-  ggmetern(aggregateEdf, fancy) + 
    geom_point(mapping=aes(shape=name, color=withClass), size=3) +
    #scale_colour_manual(values=cbbPalette) +
    scale_color_manual(values=c("blue", "black")) +
    scale_shape_manual(values=1:nrow(datasets)) +
    labs(shape="Dataset", colour="Including class")
if (fancy){
    smet <- smet + ggtitle("Redundancy including the class")
}
smet
if (getPlot){
    dev.off()#Necessary to do the textual plot.
    ggsave("aggregated_withWo_label.jpeg", plot=smet)
}

Clearly the class has a lot of information about the features. But the SMET does not quantify how much of the information in the class is transmitted to the features, whether individually or as a set. For that we need the Channel Binary or Channel Multivariate Entropy Triangles (see their respective vignettes).

Plotting the multisplit data

We next go back to the question of how much independent information each of the features has.

We choose some of the interesting datasets from the diagram above to investigate:

#TODO: make a grid of these plots to be able to see anything different. 
#thisDsName <- "Ionosphere" # CAVEAT! Not enough different glyphs!!!
thisDsName <- "iris" # for paper, we first run this value, then "Glass"
# thisDsName <- "Glass"
# thisDsName <- "Arthritis"
# thisDsName <- "BreastCancer"
# thisDsName <- "Sonar"
# thisDsName <- "Wine"
# negatively subsetting recipe from Stack Overflow
thisEdf <- edf %>% filter(dsName == thisDsName & name != "ALL") %>% 
                      select(-starts_with("isClass"))
thisEdf <- rbind(thisEdf,
                 ungroup(aggregateEdf) %>% 
                      filter(name == thisDsName) %>% 
                      mutate(dsName=name, name = "@AVERAGE") %>%
                      select(name, 
                             H_Uxi=H_Ux, H_Pxi=H_Px, 
                             DeltaH_Pxi=DeltaH_Px,
                             M_Pxi=M_Px, 
                             VI_Pxi=VI_Px,
                             withClass, 
                             dsName)
                )

We create different geometries for different feature-set cardinalities: first consider the features in the dataset without the class variable.

thisEt <-  ggmetern(filter(thisEdf, withClass == FALSE),  fancy) 
if ((nrow(thisEdf) - 1)/2 >= 14){ # too many points to represent with glyphs
    thisEt <- thisEt + 
        stat_density_tern(geom='polygon',
                        aes(fill=..level..),
                        #base=base,  ###NB Base Specification
                        colour='grey50') + 
        scale_fill_gradient(low='green',high='red')  +
        geom_point(size=1)
}else {
    thisEt <- thisEt + geom_point(aes(shape=name), size=3) +
        scale_shape_manual(values=1:14) + 
        labs(shape="Feature") #+
    #ggtitle("Source Multivariate Entropies per Feature")
}
thisEt
if (getPlot){
    dev.off()
    ggsave(filename=sprintf("%s_without_class.jpeg", thisDsName))
}

By changing the chosen `dsName` we can look for explanations of why each dataset behaved as it did in the previous sections.
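
For instance, a minimal sketch of such an exploration, looping over some of the dataset names listed above and re-using `ggmetern` as before:

# Print one split SMET (without the class) per chosen dataset
for (dsn in c("iris", "Glass", "Arthritis")) {
    dsEdf <- edf %>% filter(dsName == dsn & name != "ALL")
    print(ggmetern(filter(dsEdf, withClass == FALSE), fancy) +
              geom_point(aes(shape=name), size=3) +
              scale_shape_manual(values=1:14) +
              labs(shape="Feature") +
              ggtitle(dsn))
}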

Next consider the same set with the class label included.

thisEt <-  ggmetern(thisEdf, fancy) 
if ((nrow(thisEdf) - 1)/2 >= 14){ # too many points to represent with glyphs
    thisEt <- thisEt + #geom_density_tern(aes(fill=..level..)) +
        stat_density_tern(geom='polygon',
                        aes(fill=..level..),
                        #base=base,  ###NB Base Specification
                        colour='grey50') + 
        scale_fill_gradient(low='green',high='red')  +
        geom_point(size=1)
}else {
    thisEt <- thisEt + geom_point(aes(shape=name, colour=withClass), size=3) +
        scale_shape_manual(values=1:14) + 
        #scale_color_discrete("color_blind")
        #scale_colour_manual(values=cbbPalette)
        scale_color_manual(values=c("TRUE" = "blue", "FALSE"="black"))
    #ggtitle("Source Multivariate Entropies per Feature")
}
thisEt + labs(shape="Feature", color="Using class") #+
if (getPlot){
    dev.off()
    ggsave(filename=sprintf("%sW_WO_class.jpeg", thisDsName))
}

Since the dataset being analyzed is iris, recall that its class variable is named "Species". In this diagram we can compare each feature's entropy balance with and without the class.

Other representations for information balances

We next compare two visualizations of the entropy in the features of a dataset.

A comparison with stacked bars

The first is stacked bars, where each component of the decomposition is marked in its standard color.

# In case the switch is ON for excluding the aggregate.
excludeAggregate <- TRUE
#excludeAggregate <- FALSE

analyzeWithClass <- FALSE
#analyzeWithClass <- TRUE
if (analyzeWithClass){
    smedf <- filter(thisEdf, withClass) %>% select(-withClass)
} else {
# For this once let's just use the entropy with no class variable
    smedf <- filter(thisEdf, !withClass) %>% select(-withClass) # source multivariate entropy data frame
}
p <- ggmebars(smedf, excludeAggregate, proportional=FALSE)
p + ylab("Source Multivariate Entropy") + xlab("Feature/Variable") #+
    #ggtitle("Absolute Source Multivariate Entropies per Feature")
    # scale_y_continuous(trans=log2_trans(),
    #                 breaks=trans_breaks("log2", function(x) 2^x),
    #                 labels=trans_format("log2", math_format(2^.x)))
# 
if (getPlot){
    dev.off()
    ggsave(filename=sprintf("%s_entropy_bars_noAgg.jpeg", 
                            thisDsName))
}

This is certainly illustrative and shows the partition nicely to the trained eye. Even the decreasing bars suggest that the most informative feature is the first one. However, the important quantity for a feature is its remanent entropy, so why not order the features that way?
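
A minimal sketch of that ordering, assuming the `smedf` data frame built above (the remanent entropy per feature is the `VI_Pxi` column):

# Order the features by decreasing remanent entropy
smedf %>%
    filter(name != "@AVERAGE") %>% # drop the aggregate pseudo-feature
    arrange(desc(VI_Pxi)) %>%
    select(name, VI_Pxi)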

The multisplit SMET can also offer this ordering information with a different encoding: using the fill colour of the glyphs to represent the maximum entropy per feature. The other quantities in the balance are not ordered but can simply be read off the diagram: in particular, the remanent entropy correlates with the distance from the right side.

ggmetern(filter(smedf, name != "@AVERAGE"), fancy) +
    geom_point(size=4, aes(shape=name, colour=H_Uxi)) + 
    scale_shape_manual(values=1:14) +
    scale_color_gradient(low="grey", high="black") +
    labs(shape="Feature", colour="$\\textit{H_{U_{X_i}}}") 
    #theme(legend.position="bottom")
if (getPlot){
    dev.off()
    ggsave(filename=sprintf("%s_smet_noAgg_absoluteEntropy.jpeg", 
                            thisDsName))
}

Note how this is a necessary exploratory step prior to envisaging any later feature transformation.

A comparison with pie charts

Another representation for this type of data is the pie chart. However, turning the stacked-bar graph into a pie chart is a bad idea, since the remaining information $VI_{P_{X_i}}$ is de-emphasized (compare its area with that in the stacked-bar graph above).

 p + ylab("Source Multivariate Entropy") + xlab("Feature/Variable") + coord_polar() #+
    #ggtitle("Relative Source Multivariate Entropies per Feature")
if (getPlot){
    dev.off()
    ggplot2::ggsave(filename=sprintf("%s_entropy_pie_noAgg.jpeg", 
                                     thisDsName))
}

We firmly believe the SMET is a better representation than this.

Postscriptum

More information about the evaluation of sources with the Source Multivariate Entropy Triangle can be found in:

library(bibtex)
print(citation("entropies")['val:pel:17b'], style="text")

Session information

sessionInfo()

