Backpropagation - Fixing blunders.

2nd November 2024

Experience is the name everyone gives to their mistakes. – Oscar Wilde

Index

  1. Neural Networks
  2. Backpropagation (You are here)
  3. Machine Learning
  4. MNIST Number Classification

In the last part we saw what a neural network is and how it learns. Now we need to fix its mistakes, since it's common to mess up while learning. If we recall what alters a neuron's activation, it's mainly a change in 3 things:

  1. Bias
  2. Weights
  3. Activation of the previous layer

Backpropagation is an algorithm to calculate the negative gradient of the cost function, which tells us how to nudge our weights and biases.

That's a long sentence, and to understand it we need the calculus behind it. The nudge itself is just one gradient-descent step, sketched below; computing the gradient is the hard part, so let us first see what affects the output of our neural network.
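Here is a minimal sketch of that nudge, assuming the gradients for a single weight and bias are already known (the names and numbers are purely illustrative):

```python
# Minimal sketch of one gradient-descent step (illustrative names and values).
learning_rate = 0.1

def gradient_step(w, b, dC_dw, dC_db, lr=learning_rate):
    """Nudge the weight and bias along the negative gradient of the cost."""
    w_new = w - lr * dC_dw
    b_new = b - lr * dC_db
    return w_new, b_new

# If increasing w increases the cost (positive gradient), w gets decreased.
print(gradient_step(w=0.5, b=0.1, dC_dw=0.8, dC_db=-0.2))
```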

For the sake of convention we shall superscript the layer. For example, the activation of the L^{th} layer will be represented as a^{(L)}.

Take a look at this flowchart:

w^{(L)}, a^{(L-1)}, b^{(L)} \\ ↓ \\ Z^{(L)} \\ ↓ \\ a^{(L)} \\ ↓ \\ C_0

As we can see from the above flow, the weights of the layer, the activation of the previous layer and our bias affect our pre-squished function, which we represent as some Z^{(L)}. This value affects our activation for the layer, which in turn affects our cost function. To write all this mathematically:

C_0 = (a^{(L)} - y)^2 \\ Z^{(L)} = w^{(L)} a^{(L-1)} + b^{(L)} \\ a^{(L)} = \sigma (Z^{(L)})

where y is 1 only at the expected activation. To propagate backwards, we need to find out how much a change in each of these affecting factors changes the cost function C_0. This function is then minimized to achieve ideal results.
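To see these three equations in action, here is a minimal forward-pass sketch for a chain with one neuron per layer, assuming the squishing function \sigma is a sigmoid (all names and numbers are illustrative):

```python
import math

def sigmoid(z):
    """The squishing function sigma (assumed to be a sigmoid here)."""
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative values for one neuron feeding into the next layer.
a_prev = 0.6   # a^(L-1): activation of the previous layer
w = 0.9        # w^(L): weight of this layer
b = -0.3       # b^(L): bias of this layer
y = 1.0        # expected activation

z = w * a_prev + b     # Z^(L) = w^(L) a^(L-1) + b^(L)
a = sigmoid(z)         # a^(L) = sigma(Z^(L))
cost = (a - y) ** 2    # C_0 = (a^(L) - y)^2
print(z, a, cost)
```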

Alright, let's talk math. We'll be needing \frac{\delta C_0}{\delta w^{(L)}}, the change in the cost function with respect to the weights. From the flowchart above, the chain rule breaks this into:

\frac{\delta C_0}{\delta w^{(L)}} = \frac{\delta z^{(L)}}{\delta w^{(L)}} \frac{\delta a^{(L)}}{\delta z^{(L)}} \frac{\delta C_0}{\delta a^{(L)}}
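Each factor on the right can be read straight off the definitions above; as a worked step in the same notation:

\frac{\delta z^{(L)}}{\delta w^{(L)}} = a^{(L-1)}, \quad \frac{\delta a^{(L)}}{\delta z^{(L)}} = \sigma'(z^{(L)}), \quad \frac{\delta C_0}{\delta a^{(L)}} = 2(a^{(L)} - y)

so the whole derivative becomes

\frac{\delta C_0}{\delta w^{(L)}} = a^{(L-1)} \, \sigma'(z^{(L)}) \, 2(a^{(L)} - y)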

Now this is just for one case, and a layer has many neurons. Let us assume the structure below for the network.

Let the previous layer, (L-1), have its neurons indexed by k and the last layer, (L), have its neurons indexed by j. For readability and convention, we shall subscript the neuron number and superscript the layer.

Our new cost function is now defined as:

C_0 = \sum_{j=0}^{n_L-1} (a_j^{(L)} - y_j)^2

When we differentiate it with respect to the activation of a neuron in the previous layer, the chain rule now sums over every neuron that activation feeds into:

\frac{\delta C_0}{\delta a_k^{(L-1)}} = \sum_{j=0}^{n_L-1} \frac{\delta z_j^{(L)}}{\delta a_k^{(L-1)}} \frac{\delta a_j^{(L)}}{\delta z_j^{(L)}} \frac{\delta C_0}{\delta a_j^{(L)}}
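To connect the derivation to something runnable, here is a small sketch of these last-layer gradients, again assuming a sigmoid activation and using plain lists (all sizes, weights and targets are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid, used for da/dz."""
    s = sigmoid(z)
    return s * (1.0 - s)

# Illustrative sizes: 2 neurons in layer L-1 (index k), 3 in layer L (index j).
a_prev = [0.5, 0.8]                          # a_k^(L-1)
W = [[0.1, 0.4], [-0.2, 0.3], [0.7, -0.5]]   # W[j][k] = w_jk^(L)
b = [0.0, 0.1, -0.1]                         # b_j^(L)
y = [1.0, 0.0, 0.0]                          # expected activations y_j

# Forward pass through the last layer.
z = [sum(W[j][k] * a_prev[k] for k in range(len(a_prev))) + b[j]
     for j in range(len(b))]
a = [sigmoid(zj) for zj in z]

# Backward pass: start from dC0/da_j = 2(a_j - y_j) and chain backwards.
dC_da = [2.0 * (a[j] - y[j]) for j in range(len(a))]
dC_dz = [dC_da[j] * sigmoid_prime(z[j]) for j in range(len(a))]
dC_dW = [[dC_dz[j] * a_prev[k] for k in range(len(a_prev))] for j in range(len(a))]
dC_db = list(dC_dz)

# The sum derived above: dC0/da_k^(L-1) = sum_j w_jk * sigma'(z_j) * 2(a_j - y_j).
dC_da_prev = [sum(W[j][k] * dC_dz[j] for j in range(len(a)))
              for k in range(len(a_prev))]

print(dC_dW)
print(dC_db)
print(dC_da_prev)
```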

The above math might require pen and paper to work out and understand properly, so take your time with it. But that's what backpropagation is: a bunch of calculus used to write an algorithm that determines how to minimize your cost function.

In the next part of the series we shall look at what machine learning models are and how they are classified.

← Previous
Neural Networks

Next →

Machine Learning