This almost-vignette tries to demonstrate the differences in source entropy calculation by using several distinct methods of discretization of any of a number of databases.
Eventually, what we prove is that "equalfreq" discretization is not appropriate at all for entropy calculations, as expected.
#knitr::opts_chunk$set(dev = 'pdf') # plots in pdf, better for publication knitr::opts_chunk$set(comment=NA, fig.width=6, fig.height=4) knitr::opts_chunk$set(warning=FALSE)# Should not appear in the knitted document
library(tidyverse) # Acceding to that infamous Mr. Wickham's requests! library(infotheo) # The functionality provided by this has to be rerouted through entropies library(entropies) # This package. Needs "entropy" and "caret" too. library(ggtern) # Ternary diagrams on ggplot library(vcd) # Categorical benchmarks library(mlbench) # ml benchmarkss library(candisc) # Wine dataset fancy <- TRUE # set this for nicer on-screen visualization #fancy <- FALSE # Set this for either printing matter
datasets <- getDatasets() #%>% filter(name=="iris") #Types of discretizations explored: look up meaning in infotheo::discretize discs <- c("equalwidth", "equalfreq","globalequalwidth")
Obtain the entropies and some other data for plotting from discretizations.
edf <- data.frame() for(i in 1:nrow(datasets)) { dsRecord <- datasets[i,] #filter(datasets, name == dsName) dsName <- dsRecord$name ds <- evalDataset(dsName) for (discType in discs) { disc.ds <- infotheo::discretize(ds, disc=discType) edf <- rbind( edf, cbind( disc = discType, getDatasetSourceEntropies(disc.ds, dsName, disc=discType, className = dsRecord$className, idNumber = dsRecord$idNumber, withClass = TRUE, type = "total", method = "emp" #Method to work out entropies, from infotheo::entropy ) ) ) } } str(edf)
Now let's plot on a per-database basis the results of the three discretizations.
We choose some of the interesting datasets from the diagram above to investigate:
#TODO: make a grid of these plots to be able to see anything different. # thisDsName <- "Ionosphere" # CAVEAT! Not enough different glyphs!!! thisDsName <- "iris" # first run this value, then say, "Glass" # thisDsName <- "Glass"# Try this for setting a gradient for features # thisDsName <- "Arthritis" # thisDsName <- "BreastCancer" # thisDsName <- "Sonar" # thisDsName <- "Wine" # negatively subsetting recipe from Stack Overflow thisEdf <- edf %>% filter(dsName == thisDsName & name != "ALL") %>% select(-starts_with("isClass")) # Create different geometries for different feature set cardinalities: # First consider the features in the dataset without the class variable thisEt <- ggmetern(filter(thisEdf, withClass == TRUE), fancy) if ((nrow(thisEdf) - 1)/2 > 14){#too many points to be represented with glpsh thisEt <- thisEt + #geom_density_tern(aes(fill=..level..)) + stat_density_tern(geom='polygon', aes(fill=..level..), colour='grey50') + scale_fill_gradient(low='green',high='red') + geom_point(aes(shape=disc), size=2) + scale_shape_manual(values=c(21, 22, 24)) + labs(fill="Feature", shape="Discretizations") }else { thisEt <- thisEt + geom_point(aes(shape=name,color=disc), size=3) + scale_shape_manual(values=1:14) + labs(shape="Feature",color="Discretization type") #+ } thisEt + ggtitle("Source Multivariate Entropies per Feature and discretization") ggsave(filename=sprintf("%s_discretizationss_W_class.jpeg", thisDsName))
We may conclude, as expected, that, for our purposes, equal frequency binning is not adequate, since it loses information of the uncertainty in the distribution domain.
For estimating entropies and plotting them in the SMET, equal width binning should be preferred, pending a more thorough evaluation of "globalequalwidth" discretization.
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.