Classification example using tensorflow"
In nn2poly: Neural Network Weights Transformation into Polynomial Coefficients

Multiclass classification example

After showing how to use nn2poly in a regression setting both in vignette("nn2poly-01-introduction") and vignette("nn2poly-02-supported-DL-frameworks"), we will see here a multiclass classification example using the iris dataset to showcase how nn2poly obtains a polynomial for each class (or output neuron).

In order to train our model we will use in this case the keras/tensorflow framework, and we will benefit from the provided support for it in nn2poly as we saw in vignette("nn2poly-02-supported-DL-frameworks").

library(nn2poly)
library(keras)

# This sets all needed seeds
tensorflow::set_random_seed(42)

Data preparation

First we load the iris dataset and scale the data to the $[-1,1]$ interval:

# Load the data
data(iris)

# Change response to numeric. In this case, Species was already numeric,
# but this step is needed if it is a factor variable.
iris$Species <- as.numeric(iris$Species)

# Define dimension p (number of predictor variables)
p <- dim(iris)[2] - 1

# Define objective classes
n_class <- max(iris[,(p+1)])

# Move objective classes from (1:3) to (0:2), needed for tensorflow
iris[,(p+1)] <- iris[,(p+1)] - 1

# Scale the data in the [-1,1] interval and separate train and test
# Only the predictor variables are scaled, not the response as those will be
# the different classes.
iris_x <- iris[,-(p+1)]
maxs <- apply(iris_x, 2, max)
mins <- apply(iris_x, 2, min)
data_x_scaled <- as.data.frame(scale(iris_x, center = mins + (maxs - mins) / 2, scale = (maxs - mins) / 2))
data <- cbind(data_x_scaled, iris[,(p+1)])

# Divide in train (0.75) and test (0.25)
index <- sample(1:nrow(data), round(0.75 * nrow(data)))
train <- data[index, ]
test <- data[-index, ]

train_x <- as.matrix(train[,-(p+1)])
train_y <- as.matrix(train[,(p+1)])

test_x <- as.matrix(test[,-(p+1)])
test_y <- as.matrix(test[,(p+1)])

Neural network

In this section we will create and train (imposing weight constraints) our original neural network that will be later explained by nn2poly().

Build and train the NN

First, we build the model.

keras_model <- function() {
  tensorflow::set_random_seed(42)

  nn <- keras::keras_model_sequential()
  nn <- keras::layer_dense(nn, units = 100, activation = "tanh", input_shape = p)
  nn <- keras::layer_dense(nn, units = 100, activation = "tanh")
  nn <- keras::layer_dense(nn, units = n_class, activation = "linear")

  nn
}

nn <- keras_model()
# Impose weight constraints provided by nn2poly package
nn <- add_constraints(nn, constraint_type = "l1_norm")
nn
#> Model: "sequential"
#> ________________________________________________________________________________
#>  Layer (type)                       Output Shape                    Param #     
#> ================================================================================
#>  dense (Dense)                      (None, 100)                     500         
#>  dense_1 (Dense)                    (None, 100)                     10100       
#>  dense_2 (Dense)                    (None, 3)                       303         
#> ================================================================================
#> Total params: 10903 (42.59 KB)
#> Trainable params: 10903 (42.59 KB)
#> Non-trainable params: 0 (0.00 Byte)
#> ________________________________________________________________________________

Note: In this case the NN has a linear output with the same number of neurons as the number of classes to predict (3 species, n_class). Then, the linear output will be transformed into a probability to find the most probable class but this step is done after training. Therefore, nn2poly will be used to obtain a polynomial that approximates this nn with linear outputs and then its results will also be transformed in probabilities to predict the highest probability class.

Then we compile and train the model, using a categorical crossentropy loss and accuracy as the chosen metric.

compile(nn,
        loss = loss_sparse_categorical_crossentropy(from_logits = TRUE),
        optimizer = optimizer_adam(),
        metrics = "accuracy")

history <- fit(nn,
               train_x,
               train_y,
               verbose = 0,
               epochs = 200,
               validation_split = 0.3
)

plot(history)

NN Predictions

In this case, to asses the NNs accuracy we have to transform their output into a probability, which we will do by using layer_activation_softmax() and then choosing the class with highest probability with k_argmax.

# Define the model probability model using our previously trained nn
probability_model <- keras_model_sequential() %>%
  nn() %>%
  layer_activation_softmax() %>%
  layer_lambda(k_argmax)

# Obtain the predicted classes with the NN to compare them
prediction_NN_class <- predict(probability_model, test_x)
#> 2/2 - 0s - 70ms/epoch - 35ms/step

Note: As said before, nn2poly() will approximate the linear output of the neural network, so we will need to perform these transformation also in the polynomial predictions to obtain a classification response.

We can also compute the linear output of the model (before converting it to a probability) and store it to compare with the polynomial output later.

# Also, the linear output can be predicted before the probability model
prediction_NN <- predict(nn, test_x)
#> 2/2 - 0s - 40ms/epoch - 20ms/step

Finally We can use here a confusion matrix to visualize the results, where we can see that that the NN correctly predicts almost all of the classes in the test data:

# Create a confusion matrix
cm <- caret::confusionMatrix(as.factor(prediction_NN_class), as.factor(test_y))
cm
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  0  1  2
#>          0 11  0  0
#>          1  0 11  1
#>          2  0  1 14
#> 
#> Overall Statistics
#>                                           
#>                Accuracy : 0.9474          
#>                  95% CI : (0.8225, 0.9936)
#>     No Information Rate : 0.3947          
#>     P-Value [Acc > NIR] : 7.82e-13        
#>                                           
#>                   Kappa : 0.9203          
#>                                           
#>  Mcnemar's Test P-Value : NA              
#> 
#> Statistics by Class:
#> 
#>                      Class: 0 Class: 1 Class: 2
#> Sensitivity            1.0000   0.9167   0.9333
#> Specificity            1.0000   0.9615   0.9565
#> Pos Pred Value         1.0000   0.9167   0.9333
#> Neg Pred Value         1.0000   0.9615   0.9565
#> Prevalence             0.2895   0.3158   0.3947
#> Detection Rate         0.2895   0.2895   0.3684
#> Detection Prevalence   0.2895   0.3158   0.3947
#> Balanced Accuracy      1.0000   0.9391   0.9449

Polynomial representation

Using `nn2poly`

After the NN has been trained, we can directly call nn2poly() on the model, where we only need to choose the value max_order which will limit the order of the computed coefficients in the final polynomial. In this case we will set it as 3.

Note that in this case, as we have 3 output neurons, there will be 3 output polynomials. The polynomials will be stored in the same way as in the regression case, in a list with labels and values, but in this case the values will be a matrix with 3 rows, where each row will be the polynomial obtained for each output neuron.

# Polynomial for nn
final_poly <- nn2poly(object = nn,
                      max_order = 3)
final_poly
#> $labels
#> $labels[[1]]
#> [1] 0
#> 
#> $labels[[2]]
#> [1] 1
#> 
#> $labels[[3]]
#> [1] 2
#> 
#> $labels[[4]]
#> [1] 3
#> 
#> $labels[[5]]
#> [1] 4
#> 
#> $labels[[6]]
#> [1] 1 1
#> 
#> $labels[[7]]
#> [1] 1 2
#> 
#> $labels[[8]]
#> [1] 1 3
#> 
#> $labels[[9]]
#> [1] 1 4
#> 
#> $labels[[10]]
#> [1] 2 2
#> 
#> $labels[[11]]
#> [1] 2 3
#> 
#> $labels[[12]]
#> [1] 2 4
#> 
#> $labels[[13]]
#> [1] 3 3
#> 
#> $labels[[14]]
#> [1] 3 4
#> 
#> $labels[[15]]
#> [1] 4 4
#> 
#> $labels[[16]]
#> [1] 1 1 1
#> 
#> $labels[[17]]
#> [1] 1 1 2
#> 
#> $labels[[18]]
#> [1] 1 1 3
#> 
#> $labels[[19]]
#> [1] 1 1 4
#> 
#> $labels[[20]]
#> [1] 1 2 2
#> 
#> $labels[[21]]
#> [1] 1 2 3
#> 
#> $labels[[22]]
#> [1] 1 2 4
#> 
#> $labels[[23]]
#> [1] 1 3 3
#> 
#> $labels[[24]]
#> [1] 1 3 4
#> 
#> $labels[[25]]
#> [1] 1 4 4
#> 
#> $labels[[26]]
#> [1] 2 2 2
#> 
#> $labels[[27]]
#> [1] 2 2 3
#> 
#> $labels[[28]]
#> [1] 2 2 4
#> 
#> $labels[[29]]
#> [1] 2 3 3
#> 
#> $labels[[30]]
#> [1] 2 3 4
#> 
#> $labels[[31]]
#> [1] 2 4 4
#> 
#> $labels[[32]]
#> [1] 3 3 3
#> 
#> $labels[[33]]
#> [1] 3 3 4
#> 
#> $labels[[34]]
#> [1] 3 4 4
#> 
#> $labels[[35]]
#> [1] 4 4 4
#> 
#> 
#> $values
#>              [,1]         [,2]         [,3]
#>  [1,]  0.04067109  2.838697004 -2.366766912
#>  [2,] -1.01692952  0.922284435  0.306762048
#>  [3,]  3.82618130 -1.532152815 -2.683000514
#>  [4,] -4.86343269 -0.367029700  5.201883882
#>  [5,] -6.05908248 -2.026009803  7.701675045
#>  [6,]  0.02216647 -0.068650500  0.031318449
#>  [7,] -0.09821698  0.115922745  0.007638440
#>  [8,]  0.08146315 -0.072443601 -0.025525980
#>  [9,]  0.04337399 -0.019371395 -0.029198248
#> [10,]  0.08763212 -0.122314949  0.007196786
#> [11,] -0.13090934  0.160297202  0.008898167
#> [12,] -0.03982065  0.118363240 -0.047742059
#> [13,] -0.03990296 -0.186114002  0.182448046
#> [14,] -0.22518863 -0.383706458  0.517483976
#> [15,] -0.21539519 -0.340739130  0.476053215
#> [16,]  0.01640856 -0.015861053 -0.004179781
#> [17,] -0.08649215  0.054035395  0.045407899
#> [18,]  0.08657797 -0.020808458 -0.071458194
#> [19,]  0.10626455  0.016833308 -0.120917295
#> [20,] -0.09226002  0.058721711  0.047211597
#> [21,] -0.12782353  0.099224893  0.053681832
#> [22,] -0.14300636  0.061378209  0.099556359
#> [23,]  0.08681554 -0.019495659 -0.072698532
#> [24,]  0.07906020 -0.071096510 -0.027609755
#> [25,]  0.10751970  0.019949727 -0.124596164
#> [26,] -0.12832404  0.057077780  0.086244551
#> [27,]  0.31410761 -0.112571487 -0.233324113
#> [28,]  0.39822386 -0.072171099 -0.351003640
#> [29,]  0.32366110 -0.115551628 -0.240170974
#> [30,] -0.64179985  0.150429244  0.539640723
#> [31,]  0.40593582 -0.069258417 -0.360602426
#> [32,]  0.17561230  0.009076263 -0.185280705
#> [33,]  0.51767960  0.109300583 -0.609898697
#> [34,]  0.51398607  0.101899652 -0.600694175
#> [35,]  0.34526975  0.146409617 -0.461815984
#> 
#> attr(,"class")
#> [1] "nn2poly"

Obtaining polynomial predictions

As said before, the obtained polynomial represents the neural network before including the softmax function and computing the class assigned to each observation. Then, we need to define again a keras sequential model that includes the class computation from the polynomial output. This polynomial output is obtained with predict() used on the nn2poly() output, in this case in matrix form, as the 3 polynomials are evaluated at the same time:

# Obtain the predicted values for the test data with our Polynomial Regression
prediction_poly_matrix <- predict(object = final_poly, newdata = test_x)

# Define probability model with keras for the polynomial outputs
probability_poly <- keras_model_sequential() %>%
  layer_activation_softmax() %>%
  layer_lambda(k_argmax)

# Class prediction with the polynomial outputs
prediction_poly_class <- predict(probability_poly,
                                 prediction_poly_matrix)
#> 2/2 - 0s - 121ms/epoch - 60ms/step

Visualizing the results

With the polynomial predictions, there are two options. We can represent in a diagonal line the linear outputs obtained directly from the polynomial and NN predictions, or compare the assigned classes after employing the probability models. Please note here that we compare the predictions (linear and classes) of the polynomials with the NN predictions and not the original data, as nn2poly's goal is to faithfully represent the NN behavior independently of how well the NN predicts.

First, let's observe the confusion matrix for the assigned classes:

# Confussion matrix between NN class prediction and polynomial class prediction
cm_poly <- caret::confusionMatrix(as.factor(prediction_NN_class), as.factor(prediction_poly_class))
cm_poly
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  0  1  2
#>          0 11  0  0
#>          1  0 12  0
#>          2  0  0 15
#> 
#> Overall Statistics
#>                                      
#>                Accuracy : 1          
#>                  95% CI : (0.9075, 1)
#>     No Information Rate : 0.3947     
#>     P-Value [Acc > NIR] : 4.568e-16  
#>                                      
#>                   Kappa : 1          
#>                                      
#>  Mcnemar's Test P-Value : NA         
#> 
#> Statistics by Class:
#> 
#>                      Class: 0 Class: 1 Class: 2
#> Sensitivity            1.0000   1.0000   1.0000
#> Specificity            1.0000   1.0000   1.0000
#> Pos Pred Value         1.0000   1.0000   1.0000
#> Neg Pred Value         1.0000   1.0000   1.0000
#> Prevalence             0.2895   0.3158   0.3947
#> Detection Rate         0.2895   0.3158   0.3947
#> Detection Prevalence   0.2895   0.3158   0.3947
#> Balanced Accuracy      1.0000   1.0000   1.0000

The polynomials obtain the same results as the original NN.

Then, we can extract a diagonal plot for each of the polynomials obtained, in total $3$ diagonal plots.

for (i in 1:3){
  print(
    nn2poly:::plot_diagonal(x_axis =  prediction_NN[,i],
                  y_axis =  prediction_poly_matrix[,i],
                  xlab = "NN prediction",
                  ylab = "Polynomial prediction")
        )
}

We can observe how all the polynomials obtain quite similar predictions to their equivalent NN predictions.

We can also plot the $n$ most important coefficients in absolute value to compare which variables or interactions are more relevant in each polynomial. In this case, we will have 3 plots for each NN again, one per polynomial at each output neuron. In this case, the obtained coefficients will represent the most important variables when assigning the probability to be in each class.

plot(final_poly, n = 8)

Note: The coefficients values are not in the same scale as the original polynomial due to the fact that we have scaled all the data before training, even the response variable Y. Furthermore, as data has been scaled to the $[-1,1]$ interval, interactions of order 2 or higher would usually need a higher absolute value than the lower order coefficients to be more relevant

Finally, the problem with Taylor expansion can be checked with the following plots, where each layer is represented with their activation function, its Taylor expansion, the error and also the density of the activation potentials that the activation functions receives at that layer.

It can be clearly seen with the activation potentials density, in green, that it is kept close to zero, thus having an accurate Taylor expansion around zero at each hidden layer.

nn2poly:::plot_taylor_and_activation_potentials(object = nn,
                                                data = train,
                                                max_order = 3,
                                                constraints = TRUE)
#> [[1]]