# Load libraries and read data
library(DiagrammeR)
library(cerebrum)
library(ggplot2)
library(purrr)
library(pdata)
library(extrafont)
library(extrafontdb)
library(magrittr)
The idea of this report is to build up a neural network from scratch in R, gaining an understanding of how the algorithm works and how it is used to predict unseen data. We will avoid treating the mathematical concepts with a great deal of rigour, and instead attempt to convey them through doing. There are excellent documents online explaining why things are the way they are, and that is not the purpose of this report.
So, what is a neural network and why should I care? It is a system made up of multiple neurons, where a neuron can be thought of simply as a function: it takes some input, processes that information in some way, and spits out a result based on how it regards that input.
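To make this concrete, here is a minimal sketch of a single neuron in R, using a hypothetical neuron helper and the sigmoid activation introduced later in this report;

# A single neuron: a weighted sum of its inputs plus a bias, squashed by an activation
neuron <- function(x, w, b) 1 / (1 + exp(-(sum(w * x) + b)))
# For example, inputs c(1, 3) with weights c(0.2, 0.4) and a bias of 0.4
neuron(x = c(1, 3), w = c(0.2, 0.4), b = 0.4)  # ~0.8581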
knitr::include_graphics(path = "./images/neural_neuron.pdf")
As mentioned earlier, a neural network is composed of these neurons, grouped into multiple layers. The layers themselves can be arranged in multiple ways, for example;
knitr::include_graphics(path = "./images/simple.pdf")
Before we go any further, let us develop some key points and notions that will help us understand what is happening when building and training a neural network.
In order to calculate the activations for each of the neurons (not including the input layer, where the activations are just the inputs $\mathbf{x}$), we define the following, for every layer
$$
a^{i}_{j} = \sigma\Big(\sum_{k=1}^{l_{i-1}} w^{i}_{jk} a^{i-1}_{k} + b^{i}_{j}\Big) ~~~~~~~~ \forall ~ i \in [l_2, ..., l_N]
$$
where $\sigma$ is known as the activation function, which for our purposes we choose to be the sigmoid function defined as,
$$
\sigma(\phi) = \frac{1}{1 + e^{-\phi}}
$$
The summation above is over the neurons in the previous layer, which will hopefully make sense when covering the worked example. It is simply the sum of the weights multiplied by their respective activations from the previous layer, with an additional term to account for the bias of each neuron. The activations for layer 2 are therefore calculated using the input vector from the input layer, i.e. $a^{1}_{j} \equiv x_{j}$.
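Dropping the $j$ index, the same relation can be written for a whole layer at once in matrix form (this is the form the R code below works with),
$$
a^{i} = \sigma\big(w^{i} a^{i-1} + b^{i}\big)
$$
where $w^{i}$ is the matrix of weights connecting layer $i-1$ to layer $i$, and $\sigma$ is applied element-wise.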
A simple implementation of this in R is just the following;
sig <- function(phi, prime = F) {
  # Calculate the exponential part
  expon <- phi %>% `*`(-1) %>% exp
  # Return sigmoid (or the derivative)
  return(if (prime) expon %>% `/`((1 + expon)^2) else 1 %>% `/`(1 + expon))
}
which can also return the derivative (as will be needed later).
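As a quick sanity check of the implementation above, at $\phi = 0$ the sigmoid should return 0.5 and its derivative 0.25 (since $\sigma^{\prime}(\phi) = \sigma(\phi)(1 - \sigma(\phi))$);

sig(0)             # 0.5
sig(0, prime = T)  # 0.25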
inp <- seq(from = -10, to = 10, by = 0.001)
df <- data.frame(x = inp, y = inp %>% cerebrum::sig(), stringsAsFactors = F)
# Plot the sigma function
g <- ggplot2::ggplot(data = df, ggplot2::aes(x, y)) %>%
  `+`(ggplot2::geom_line()) %>%
  `+`(pdata::plot_theme())
g
Once we calculate all the activations at each neuron, the most important are those in the output layer: the neuron with the highest activation is the one firing for a given set of input parameters. In order to measure the accuracy, a cost function is introduced, which is defined as,
$$
C(x) = \frac{1}{2T}\sum_{i = 1}^{T} || o(x_i) - a^{l_N}(x_i)||^{2} \approx \frac{1}{T}\sum^{T}_{i=1}C_i
$$
which is really just a mean squared error function. We have used the assumption that this cost can be approximated by taking the average over individual training examples, i.e. each training example contributes $C_i = \frac{1}{2}||o(x_i) - a^{l_N}(x_i)||^2$. The result is therefore a comparison between the actual output values and the final activation layer output values. This gives us some idea of the error of our neural network and provides a feedback mechanism for optimising it at a later stage. The summation is over all $T$ training examples, and the cost is a function of the input parameters of each training example.
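As a small illustration (using a hypothetical cost_single helper, with o the target vector and aL the output layer activations), the cost of a single training example is just;

# Quadratic cost of one training example: 0.5 * ||o - a^{l_N}||^2
cost_single <- function(o, aL) 0.5 * sum((o - aL)^2)
# For example, using the target and output activations from the worked example below
cost_single(o = c(0, 1), aL = c(0.7103, 0.9468))  # ~0.2537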
Up to now, there is no real way to know how to adjust any of the values to optimise the classification. In order to do so we assume that $C(x)$ isn't so rigid, and allow for a small deviation in its value, i.e. $C(x)\rightarrow C(x) + \Delta C(x)$. By introducing this value, we can propagate the error throughout the network, so that small changes in the weights and biases will account for this $\Delta C(x)$.
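To make this concrete, for small changes in the weights and biases the change in the cost is approximately
$$
\Delta C \approx \sum_{j, k, l}\frac{\partial C}{\partial w^{l}_{jk}}\Delta w^{l}_{jk} + \sum_{j, l}\frac{\partial C}{\partial b^{l}_{j}}\Delta b^{l}_{j}
$$
so choosing $\Delta w^{l}_{jk} = -\eta \, \partial C/\partial w^{l}_{jk}$ and $\Delta b^{l}_{j} = -\eta \, \partial C/\partial b^{l}_{j}$, for some small learning rate $\eta > 0$, guarantees that the cost decreases. The backpropagation algorithm described next is simply an efficient way of computing these partial derivatives.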
The following four equations are all that is required to apply the backpropagation algorithm and adjust the weights and biases. Note that the first equation is applied to kick things off, the second relates $\delta$'s in adjacent layers (flowing backwards), and finally, the partial derivatives with respect to the biases and weights are the quantities required in order to optimise them. See the worked example section for computing all of these components. We will see later that we can just drop the $j$ indexes, and write out the matrices in each layer.
$$
\begin{split}
\delta_j^{l_N} &= \frac{\partial C}{\partial a_j^{l_N}} \odot \sigma^{\prime}(\phi_j^{l_N}) \\
\delta_j^{l} &= ((w_{jk}^{l+1})^{\top}\delta_j^{l+1}) \odot \sigma^{\prime}(\phi_j^{l}) \\
\frac{\partial C}{\partial b_j^l} &= \delta_j^l \\
\frac{\partial C}{\partial w_{jk}^l} &= a^{l - 1}_k\delta_j^l
\end{split}
$$
We initiate a neural network with 4 layers, with 2 input neurons and 2 output neurons. The hidden layers consist of 3 and 4 neurons respectively. We will introduce the weights and biases at each stage of the calculation, otherwise the pictorial view of the network would look overwhelming. For now we set up a list object for the network in R and store it in exampleNN;
# Set up a simple neural network with the following nodes in each layer
exampleNN <- list(
  layers = 4,  # Number of layers in the network
  bias = list(
    matrix(c(0.4, 0.8, 0.1)),       # Layer 2 biases
    matrix(c(0.2, 0.4, 0.1, 0.8)),  # Layer 3 biases
    matrix(c(0.1, 0.6))             # Layer 4 biases
  ),
  weights = list(
    matrix(  # Weights between layers 1 & 2
      data = c(0.2, 0.1, 0.3, 0.4, 0.8, 0.1),
      nrow = 3
    ),
    matrix(  # Weights between layers 2 & 3
      data = c(0.3, 0.1, 0.9, 0.7, 0.6, 0.1, 0.2, 0.3, 0.5, 0.4, 0.2, 0.8),
      nrow = 4
    ),
    matrix(  # Weights between layers 3 & 4
      data = c(0.3, 0.6, 0.1, 0.9, 0.4, 0.7, 0.2, 0.7),
      nrow = 2
    )
  )
)
# Input arguments (training data)
inargs <- matrix(c(1, 3))
# Output arguments (classes)
outargs <- matrix(c(0, 1))
Now that we have introduced some ideas and terminology, we will propagate through the neural network by example for a single iteration, to see what is actually happening before we scale this up. Hopefully, after going through the example, the equations in the previous section will begin to make more sense, at least in terms of what they mean in the bigger picture. We start by taking the definition of $a^{i}_{j}$ and calculating these for layers 2, 3, and 4, where we define the first layer as the input layer (usually the input of your data set), the last layer as the output layer (for example, the number of output neurons can be the number of classes when classifying your data), and finally, any layers in between are referred to as hidden layers.
\newpage
# Calculate feed forward activation layer values sigma(w . a + b)
wInputs <- list()
activations <- list(inargs)
for (i in 1:(exampleNN$layers - 1)) {
  weightedInput <- exampleNN$weights[[i]] %*% inargs %>% `+`(exampleNN$bias[[i]])
  wInputs %<>% c(weightedInput %>% list)
  inargs <- weightedInput %>% sig()
  activations %<>% c(inargs %>% list)
}
# Return weighted inputs and the activation values
forward <- list(
  activations = activations,
  wInputs = wInputs
)
knitr::include_graphics(path = "./images/neural_part2.pdf")
Resulting in the following activations,
$$
\begin{split}
a^{2}_{1} &= \sigma(\textcolor{blue}{(0.2 \times 1) + (0.4 \times 3) + 0.4}) = \sigma(1.8) \approx 0.8581 \\
a^{2}_{2} &= \sigma(\textcolor{red}{(0.1 \times 1) + (0.8 \times 3) + 0.8}) = \sigma(3.3) \approx 0.9644 \\
a^{2}_{3} &= \sigma(\textcolor{green}{(0.3 \times 1) + (0.1 \times 3) + 0.1}) = \sigma(0.7) \approx 0.6682
\end{split}
$$
Therefore we have,
$$
\phi^2 = \begin{bmatrix} 1.8 \\ 3.3 \\ 0.7 \end{bmatrix}, ~~~ a^2 = \begin{bmatrix} 0.8581 \\ 0.9644 \\ 0.6682 \end{bmatrix}
$$
knitr::include_graphics(path = "./images/neural_part3.pdf")
Resulting in the following activations,
$$
\begin{split}
a^{3}_{1} &= \sigma(\textcolor{blue}{(0.3 \times 0.8581) + (0.6 \times 0.9644) + (0.5 \times 0.6682) + 0.2}) = \sigma(1.3702) \approx 0.7974 \\
a^{3}_{2} &= \sigma(\textcolor{red}{(0.1 \times 0.8581) + (0.1 \times 0.9644) + (0.4 \times 0.6682) + 0.4}) = \sigma(0.8495) \approx 0.7005 \\
a^{3}_{3} &= \sigma(\textcolor{green}{(0.9 \times 0.8581) + (0.2 \times 0.9644) + (0.2 \times 0.6682) + 0.1}) = \sigma(1.1988) \approx 0.7683 \\
a^{3}_{4} &= \sigma(\textcolor{orange}{(0.7 \times 0.8581) + (0.3 \times 0.9644) + (0.8 \times 0.6682) + 0.8}) = \sigma(2.2246) \approx 0.9024
\end{split}
$$
Therefore we have,
$$
\phi^3 = \begin{bmatrix} 1.3702 \\ 0.8495 \\ 1.1988 \\ 2.2246 \end{bmatrix}, ~~~ a^3 = \begin{bmatrix} 0.7974 \\ 0.7005 \\ 0.7683 \\ 0.9024 \end{bmatrix}
$$
knitr::include_graphics(path = "./images/neural_part4.pdf")
Resulting in the following activations,
$$
\begin{split}
a^{4}_{1} &= \sigma(\textcolor{blue}{(0.3 \times 0.7974) + (0.1 \times 0.7005) + (0.4 \times 0.7683) + (0.2 \times 0.9024) + 0.1}) = \sigma(0.8971) \approx 0.7103 \\
a^{4}_{2} &= \sigma(\textcolor{red}{(0.6 \times 0.7974) + (0.9 \times 0.7005) + (0.7 \times 0.7683) + (0.7 \times 0.9024) + 0.6}) = \sigma(2.8784) \approx 0.9468
\end{split}
$$
Therefore we have,
$$
\phi^4 = \begin{bmatrix} 0.8971 \\ 2.8784 \end{bmatrix}, ~~~ a^4 = \begin{bmatrix} 0.7103 \\ 0.9468 \end{bmatrix}
$$
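These hand calculations can be checked against the output of the feed forward snippet above;

# The weighted inputs and activations computed by the feed forward loop
forward$wInputs[[3]]      # should match phi^4, i.e. c(0.8971, 2.8784)
forward$activations[[4]]  # should match a^4, i.e. c(0.7103, 0.9468)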
Now that we have calculated all the activations at each neuron for a single iteration, we can finally apply the backpropagation algorithm to calculate the $\delta$'s, and therefore any changes that need to be made to the weights and biases in order to reach the desired output vector for each corresponding piece of input.
The following working and equations apply to the cost function $C(x)$ as defined before.
To start the algorithm, we calculate the first value, $\delta^{l_N}$, noting that for our quadratic cost $\partial C / \partial a^{l_N} = a^{l_N} - o$. Using our total number of layers, $N = 4$, and dropping the $j$ component, we can write this out in matrix form.
$$
\delta^{4} = (a^{4} - o) \odot \sigma^{\prime} (\phi^4) = \Big(\begin{bmatrix} 0.7103 \\ 0.9468 \end{bmatrix} - \begin{bmatrix} 0 \\ 1 \end{bmatrix} \Big) \odot \sigma'\Big(\begin{bmatrix} 0.8971 \\ 2.8784 \end{bmatrix}\Big) = \begin{bmatrix} 0.1462 \\ -0.002681 \end{bmatrix}
$$
The following snippet can do just that;
# Function for retrieving the last element of a list
last_element <- function(x) x %>% utils::tail(1) %>% `[[`(1)
# Calculate delta in the final layer first
# start with (a^4 - o)
nablaC <- forward$activations %>% last_element() %>% `-`(outargs)
# Calculate sigma'
sigPrime <- forward$wInputs %>% last_element() %>% sig(prime = T)
deltaL <- nablaC %>% matrixcalc::hadamard.prod(sigPrime)
This gives us the ability to calculate the changes in the biases and weights,
$$
\frac{\partial C}{\partial b^{4}} = \begin{bmatrix} 0.1462 \\ -0.002681 \end{bmatrix}
$$
$$
\frac{\partial C}{\partial w^{4}} = \delta^4 \cdot (a^{3})^{\top} = \begin{bmatrix} 0.1165 & 0.1024 & 0.1123 & 0.1319 \\ -0.00214 & -0.00187 & -0.00206 & -0.00242 \end{bmatrix}
$$
The above is a handy matrix form for calculating $\frac{\partial C}{\partial w^{l}_{jk}}$. A quick way of figuring out its dimension is to take the number of neurons in the $l^{\text{th}}$ and $(l - 1)^{\text{th}}$ layers. So by looking at our network, this will simply be ($2\times 4$). The above also looks a little different to the original definition, but we could have written it all out in full, where $j = [1, 2]$ and $k = [1, 2, 3, 4]$, i.e. where $j$ and $k$ index the neurons in the $l^{\text{th}}$ and $(l - 1)^{\text{th}}$ layers respectively. So, writing out all the components in accordance with the last equation of the set of backpropagation equations,
$$
\begin{split}
\frac{\partial C}{\partial w_{11}^{4}} &= a^{3}_{1}\delta^{4}_{1}, ~~ \frac{\partial C}{\partial w_{12}^{4}} = a^{3}_{2}\delta^{4}_{1}, ~~ \frac{\partial C}{\partial w_{13}^{4}} = a^{3}_{3}\delta^{4}_{1}, ~~ \frac{\partial C}{\partial w_{14}^{4}} = a^{3}_{4}\delta^{4}_{1}, \\
\frac{\partial C}{\partial w_{21}^{4}} &= a^{3}_{1}\delta^{4}_{2}, ~~ \frac{\partial C}{\partial w_{22}^{4}} = a^{3}_{2}\delta^{4}_{2}, ~~ \frac{\partial C}{\partial w_{23}^{4}} = a^{3}_{3}\delta^{4}_{2}, ~~ \frac{\partial C}{\partial w_{24}^{4}} = a^{3}_{4}\delta^{4}_{2}
\end{split}
$$
Now, for simplicity we set $\frac{\partial C}{\partial w_{jk}^{l}} \equiv \tilde{w}_{jk}^{l}$ (and similarly $\tilde{b}_{j}^{l}$). Therefore, we can cast all the components above as
$$
\begin{bmatrix} \tilde{w}_{11}^{4} & \tilde{w}_{12}^{4} & \tilde{w}_{13}^{4} & \tilde{w}_{14}^{4} \\ \tilde{w}_{21}^{4} & \tilde{w}_{22}^{4} & \tilde{w}_{23}^{4} & \tilde{w}_{24}^{4} \end{bmatrix} = \begin{bmatrix} \delta^{4}_{1}a^{3}_{1} & \delta^{4}_{1}a^{3}_{2} & \delta^{4}_{1}a^{3}_{3} & \delta^{4}_{1}a^{3}_{4} \\ \delta^{4}_{2}a^{3}_{1} & \delta^{4}_{2}a^{3}_{2} & \delta^{4}_{2}a^{3}_{3} & \delta^{4}_{2}a^{3}_{4} \end{bmatrix} = \begin{bmatrix} \delta^{4}_{1} \\ \delta^{4}_{2} \end{bmatrix} \begin{bmatrix} a^{3}_{1} & a^{3}_{2} & a^{3}_{3} & a^{3}_{4} \end{bmatrix} = \delta^4 \cdot (a^{3})^{\top}
$$
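In R, this outer product is just deltaL multiplied by the transposed layer 3 activations (a sketch of what the nab_w helper further below does);

# dC/db^4 is simply deltaL, and dC/dw^4 is its outer product with a^3
deltaL %*% t(forward$activations[[3]])  # a 2 x 4 matrix, as expected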
Now, by using the $\delta$ recurrence relation linking $\delta$'s in adjacent layers, we can use $\delta^4$ from above and calculate the remaining changes in the weights and biases. The calculation is as follows (dropping the $j$ index again);
$$
\delta^3 = ((w^{4})^{\top}\delta^{4}) \odot \sigma^{\prime}(\phi^3) = \begin{bmatrix} 0.3 & 0.6 \\ 0.1 & 0.9 \\ 0.4 & 0.7 \\ 0.2 & 0.7 \end{bmatrix} \begin{bmatrix} 0.14615 \\ -0.00268 \end{bmatrix} \odot \sigma^{\prime}\Big( \begin{bmatrix} 1.3702 \\ 0.8495 \\ 1.1988 \\ 2.2246 \end{bmatrix} \Big) = \begin{bmatrix} 0.00682 \\ 0.00256 \\ 0.01007 \\ 0.00241 \end{bmatrix}
$$
The changes in weights and biases are therefore
$$
\frac{\partial C}{\partial b^{3}} = \begin{bmatrix} 0.00682 \\ 0.00256 \\ 0.01007 \\ 0.00241 \end{bmatrix}
$$
and,
$$
\frac{\partial C}{\partial w^{3}} = \delta^3 \cdot (a^{2})^{\top} = \begin{bmatrix} ... & ... & ... \\ ... & ... & ... \\ ... & ... & ... \\ ... & ... & ... \end{bmatrix}
$$
The last iteration results in the following;
$$
\delta^2 = ((w^{3})^{\top}\delta^{3}) \odot \sigma^{\prime}(\phi^2) = \begin{bmatrix} 0.3 & 0.1 & 0.9 & 0.7 \\ 0.6 & 0.1 & 0.2 & 0.3 \\ 0.5 & 0.4 & 0.2 & 0.8 \end{bmatrix} \begin{bmatrix} 0.00682 \\ 0.00256 \\ 0.01007 \\ 0.00241 \end{bmatrix} \odot \sigma^{\prime}\Big( \begin{bmatrix} 1.8 \\ 3.3 \\ 0.7 \end{bmatrix} \Big) = \begin{bmatrix} 0.00159 \\ 0.00024 \\ 0.00186 \end{bmatrix}
$$
$$
\frac{\partial C}{\partial b^{2}} = \begin{bmatrix} 0.00159 \\ 0.00024 \\ 0.00186 \end{bmatrix}
$$
and,
$$
\frac{\partial C}{\partial w^{2}} = \delta^2 \cdot (a^{1})^{\top} = \begin{bmatrix} 0.0016 & 0.0048 \\ 0.0002 & 0.0007 \\ 0.0019 & 0.0056 \end{bmatrix}
$$
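A single step of this recurrence can also be written directly in R (the loop further below does exactly this for every remaining layer);

# One backwards step: delta^3 = (t(w^4) %*% delta^4) Hadamard sigma'(phi^3)
delta3 <- (t(exampleNN$weights[[3]]) %*% deltaL) %>%
  matrixcalc::hadamard.prod(forward$wInputs[[2]] %>% sig(prime = T))
delta3  # should match the delta^3 values calculated above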
The following snippets summarise the above, i.e. calculating all of the changes in the cost with respect to the weights and biases. We know the dimensions of both $\tilde{w}_{jk}^{l}$ and $\tilde{b}_{j}^{l}$, so we can initialise them with zeros;
# Set up function for initialising all elements to zero
zero_mat <- function(mat) {
  return(
    lapply(
      X = mat,
      FUN = function(y) {
        y[] <- 0
        return(y)
      }
    )
  )
}
# We know the sizes of both
nablaW <- exampleNN$weights %>% zero_mat()
nablaB <- exampleNN$bias %>% zero_mat()
Then we can start to backpropagate using the delta relation,
# Useful function for calculating the w tilde values
nab_w <- function(del, act, val) del %*% (act %>% `[[`(val) %>% t)
# Put in the very last nabla's
nablaB[[nablaB %>% length]] <- deltaL
nablaW[[nablaW %>% length]] <- nab_w(
  del = deltaL,
  act = forward$activations,
  val = forward$activations %>% length %>% `-`(1)
)
# Loop over the remaining layers.. (backwards)!
# We know that the errors are related by adjacent layers
for (i in (exampleNN$layers - 2):1) {
  # Calculate sigma prime
  sprime <- forward$wInputs[[i]] %>% sig(prime = T)
  # Calculate the new delta value
  nudel <- exampleNN$weights[[i + 1]] %>% t %*% deltaL %>% matrixcalc::hadamard.prod(sprime)
  # Update the b and w tilde values
  nablaB[[i]] <- nudel
  nablaW[[i]] <- nab_w(
    del = nudel,
    act = forward$activations,
    val = i
  )
  # Use this delta in the next iteration
  deltaL <- nudel
}
# Store the nabla's
backwards <- list(
  nablaB = nablaB,
  nablaW = nablaW
)
# Check these results against those calculated above
backwards
Now that we have these details for each piece of training data (just one in this case), we can use the stochastic gradient descent approach to approximate the changes to the weights and biases stored in exampleNN. Of course, in reality we use more than one row of training data to make the approximations, but to keep this simple we have just gone through the example once. In practice this will be carried out multiple times, in batches, and the approximation will be made on that data.
# Learning rate (an example value, normally chosen by experimentation)
eta <- 0.5
# Set up factor
fact <- eta %>% `/`(1)  # The 1 will normally be the number of training rows
# Update weights
exampleNN$weights <- lapply(
  X = 1:(exampleNN$weights %>% length),
  FUN = function(x) exampleNN$weights[[x]] %>% `-`(fact * backwards$nablaW[[x]])
)
# Update biases
exampleNN$bias <- lapply(
  X = 1:(exampleNN$bias %>% length),
  FUN = function(x) exampleNN$bias[[x]] %>% `-`(fact * backwards$nablaB[[x]])
)
Now that the weights and biases have finally been updated, the process begins again.
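To give a feel for how this scales up, here is a rough sketch of that outer loop, assuming hypothetical feed_forward() and back_propagate() helpers that wrap the snippets above, and a list training of input/output pairs;

# Repeat for a number of epochs, updating the network after every mini-batch
for (epoch in 1:30) {
  batches <- split(training, ceiling(seq_along(training) / 10))
  for (batch in batches) {
    # Backpropagate every example in the mini-batch
    nablas <- lapply(batch, function(row) {
      fwd <- feed_forward(exampleNN, row$input)   # forward pass, as above
      back_propagate(exampleNN, fwd, row$output)  # returns nablaB and nablaW
    })
    # Average the gradients over the batch and take a gradient descent step
    fact <- eta / length(batch)
    for (i in seq_along(exampleNN$weights)) {
      exampleNN$weights[[i]] <- exampleNN$weights[[i]] -
        fact * Reduce(`+`, lapply(nablas, function(n) n$nablaW[[i]]))
      exampleNN$bias[[i]] <- exampleNN$bias[[i]] -
        fact * Reduce(`+`, lapply(nablas, function(n) n$nablaB[[i]]))
    }
  }
}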