Introduction

Our R package implements a neural network for classification problems with an arbitrary number of hidden layers and units per layer. To improve performance, the bulk of the code is written in C and can be run in parallel on multiple CPU cores.

Neural Net Structure

The diagram below^[image taken from the http://cs231n.github.io/neural-networks-1/ website] represents the structure of a neural net.

[Figure: fully connected neural networks with an input layer, hidden layers and an output layer.]

The input layer holds the data $x^{(0)}$ - the number of nodes in the input layer equals the dimensionality of the data. Each hidden layer comprises units/nodes, and each of these has its own bias term and weight vector. The units in each layer take the outputs of the previous layer, compute a score by applying the linear transformation given by their weights and bias, and output a non-linear transformation of that score. The output of the nodes in the final layer is interpreted as the class probabilities for the classification problem. In the picture above $x$ would be three-dimensional; in the final picture the final layer has only one node, which is a setup for either a regression or a binary classification problem.

Pseudocode

Essentially, the algorithm for training a neural network is the following:

Feedforward pass

For simplicity, we will focus on one hidden layer only for a $K$-class classification problem. Each node in the $l^{th}$ hidden layer takes the input $x^{(l-1)}$ and transforms it according to $x_i^{(l)}=\sigma(x^{(l-1)T} w_i^{(l)}+b_i^{(l)})$, where $w_i^{(l)}$ and $b_i^{(l)}$ are respectively the weight vector and the bias term of the $i^{th}$ unit in the $l^{th}$ hidden layer, and $\sigma(z)=\frac{1}{1+\exp(-z)}$ is the sigmoid function. The output layer does almost the same, but uses the softmax instead of the sigmoid. The outputs are the class probabilities $P(y = k|x)$ for the class labels $k=1,\dots,K$.
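
As an illustration only (this is not the package's internal C implementation, and all names are made up), a minimal R sketch of the feedforward pass for a single hidden layer could look as follows; it returns both the hidden activations and the class probabilities so they can be reused in the backpropagation sketch below.

sigmoid <- function(z) 1 / (1 + exp(-z))

feedforward <- function(x, W1, b1, W2, b2) {
  # x: input vector; W1, b1: hidden-layer weights and biases; W2, b2: output layer
  score1 <- W1 %*% x + b1            # linear scores of the hidden units
  h <- sigmoid(score1)               # non-linear transformation of the scores
  score2 <- W2 %*% h + b2            # scores of the K output units
  p <- exp(score2 - max(score2))     # softmax (shifted for numerical stability)
  list(h = h, p = p / sum(p))        # p holds P(y = k | x) for k = 1, ..., K
}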

Backpropagation

We compute the error terms for all the nodes in the neural net. We proceed recursively over layers, starting from the last one and moving backwards, i.e. for $l=L,(L-1),\dots,1$ ($L$ being the output layer); for each node $i$ in a given layer $l$ we compute $\delta_i^{(l)}$, the error term for that node. These allow us to compute the gradients of the loss function (we call it $R$) with respect to the weights and the biases: $\frac{\partial R}{\partial w_{ij}^{(l)}}$ and $\frac{\partial R}{\partial b_{i}^{(l)}}$.
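
Continuing the one-hidden-layer sketch above (again purely illustrative, not the package's C code): with the softmax output and cross-entropy loss the output-layer error terms are simply $\delta^{(L)} = p - y$, where $y$ is the one-hot encoded label, and the hidden-layer error terms are obtained by propagating them back through the sigmoid.

backprop <- function(x, y_onehot, h, p, W2) {
  delta2 <- p - y_onehot                       # error terms of the output layer
  delta1 <- (t(W2) %*% delta2) * h * (1 - h)   # error terms of the hidden layer
  list(dW2 = tcrossprod(delta2, h),            # dR/dw for the output layer
       db2 = delta2,                           # dR/db for the output layer
       dW1 = tcrossprod(delta1, x),            # dR/dw for the hidden layer
       db1 = delta1)                           # dR/db for the hidden layer
}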

Parameters' update

By moving the weights and biases against the direction of the gradients computed in the backpropagation step above, we aim to minimise the (regularised) loss (taken to be the cross entropy in the classification scenario). We use the momentum update as well as regularisation. First update the momentum for layer $l$: $v^{(l)} = \mu v^{(l)} - \alpha(\Delta w^{(l)}+\lambda w^{(l)})$, where $\lambda$ is the regularization parameter, $\mu$ is the momentum coefficient and $\alpha$ is the learning step size. Then update the weights: $w^{(l)} = w^{(l)} +v^{(l)}$. Then repeat for the biases. Perform these updates for the output layer and all the hidden layers.
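
A sketch of this update rule for a single layer (illustrative R only; the default values for $\mu$, $\alpha$ and $\lambda$ are made up):

update_weights <- function(w, v, dw, mu = 0.9, alpha = 1e-3, lambda = 1e-4) {
  v <- mu * v - alpha * (dw + lambda * w)   # momentum update with regularisation
  w <- w + v                                # move the weights along the momentum direction
  list(w = w, v = v)                        # the same rule is repeated for the biases
}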

Implementation

In this R package we have provided two functions:

  1. fit_neural_network
  2. CV_neural_network

The first one is the main fitting routine. It takes in both training and test data sets, and the user can specify the neural net architecture: the number of hidden layers and the number of neurons in each layer. Moreover, the user can specify the parameters used in training the neural net: the step size of the batch gradient descent step and the regularization parameter $\lambda$.
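
For example, a call specifying a custom architecture might look like the sketch below; the argument names for the hidden-layer sizes and for $\lambda$ are hypothetical here, so check ?fit_neural_network for the exact interface.

library(OxWaSPneuralnets)
data(mnist)   # as in the minimal example below
# 'hidden_layers' and 'lambda' are illustrative argument names only
res <- fit_neural_network(train$x, train$y, test$x, test$y,
                          hidden_layers = c(100, 50),   # two hidden layers with 100 and 50 units
                          step_size = 0.0001,           # batch gradient descent step size
                          lambda = 0.01)                # regularization parameter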

This function calls a C routine that implements the pseudocode written above. In order to speed up the computation, the training set can be split into subsets so that, inside each epoch (see pseudocode), each CPU core runs the inner for-loop on its own subset of the training data. The number of cores can be chosen by the user.
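
Conceptually, the splitting works like the following R sketch (the actual partitioning happens inside the C code; the numbers and object names here are made up):

n_obs <- 1000                                            # number of training observations
n_cores <- 4                                             # chosen by the user
chunks <- split(seq_len(n_obs), rep_len(seq_len(n_cores), n_obs))
lengths(chunks)                                          # roughly equal subsets, one per CPU core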

The C function calls ISPC functions to exploit vectorization in the feedforward pass (in particular for the computation of the scores) and in the weight and bias updates of the gradient descent step^[Future developments will apply vectorization to the backpropagation step as well].

The function CV_neural_network makes it possible to tune the regularization parameter $\lambda$ by carrying out cross-validation on the training data and, finally, fitting the neural network with the best configuration to the whole training set to obtain predictions on the test data.
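
A hypothetical usage sketch is given below; the argument names for the grid of $\lambda$ values and for the number of folds are assumptions, so see ?CV_neural_network for the actual interface.

# Hypothetical sketch: 'lambdas' and 'n_folds' are illustrative argument names.
cv_res <- CV_neural_network(train$x[1:1000, ], train$y[1:1000],
                            test$x[1:250, ], test$y[1:250],
                            lambdas = 10^seq(-4, -1),   # candidate regularization parameters
                            n_folds = 5)                # number of cross-validation folds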

Minimal example

Here we provide a minimal example that fits a neural network to a subset of the MNIST hand-written digits dataset:

library(OxWaSPneuralnets)
data(mnist)                                        # load the bundled MNIST subset
# fit on the first 1000 training images and evaluate on the first 250 test images
res = fit_neural_network(train$x[1:1000, ], train$y[1:1000],
                         test$x[1:250, ], test$y[1:250],
                         n_iterations = 500, step_size = 0.0001)
plot(res)                                          # plot the fitted object
res                                                # print the result

For more details, see ?fit_neural_network or ?CV_neural_network.
