In this post, I’ll describe how a neural network with two hidden layers works. The code is highly unoptimized to make it as simple to understand as possible. I’ll train the model on a part of the MNIST dataset, so you will need to download this file containing both the labels (1st column) and the variables. The size of **y** is 42000×1, and the size of **X** is 42000×784. Each row of **X** is a 28×28 grayscale picture of a handwritten digit, flattened into 784 pixel values; each element of **y** is a digit from 0 to 9.
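Since the network will output a 10-dimensional prediction per image, the labels in **y** are typically one-hot encoded into a matrix **Y** before training. A minimal sketch of that encoding (the small array here is just a stand-in for the real 42000-element **y**):

```python
import numpy as np

# A small stand-in for y (the real y has shape (42000,), values 0..9).
y = np.array([3, 0, 9, 1])

# One-hot encode: row i has a 1 in column y[i] and zeros elsewhere.
Y = np.zeros((y.size, 10))
Y[np.arange(y.size), y] = 1
```
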

Here is the whole code, with explanations following it:

The neural network part is pretty short:
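As a sketch of what the forward pass looks like with two hidden layers: the layer sizes (784 → 64 → 32 → 10) and the sigmoid activation are my assumptions for illustration, not necessarily the exact choices used in the original code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.random((5, 784))               # 5 sample images, 784 pixels each

# Hypothetical layer sizes: 784 -> 64 -> 32 -> 10
W1 = rng.standard_normal((784, 64)) * 0.01
W2 = rng.standard_normal((64, 32)) * 0.01
W3 = rng.standard_normal((32, 10)) * 0.01

a1   = sigmoid(X @ W1)    # first hidden layer
a2   = sigmoid(a1 @ W2)   # second hidden layer
yhat = sigmoid(a2 @ W3)   # output: one score per digit 0-9
```
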

The most interesting part is probably the backpropagation:

**Step 1. Calculate the loss and its derivative**

The first thing you need to know here is the loss function you are going to use. Here I use the squared error (more precisely, half the squared error, 1/2 SE).

In order to propagate the loss, we first need to calculate its derivative w.r.t. the prediction vector **yhat**.

If you use any other loss function, you need to find its derivative w.r.t. **yhat**.
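As a sketch, with **yhat** the predictions and **Y** the one-hot labels, the 1/2 SE loss and its derivative come out like this (the 1/2 factor is convenient precisely because it cancels the 2 from the power rule, leaving `yhat - Y`):

```python
import numpy as np

rng = np.random.default_rng(0)
yhat = rng.random((5, 10))              # stand-in predictions
Y = np.zeros((5, 10))
Y[np.arange(5), [3, 1, 4, 1, 5]] = 1    # stand-in one-hot labels

loss  = 0.5 * np.sum((yhat - Y) ** 2)   # 1/2 squared error
dyhat = yhat - Y                        # d(loss)/d(yhat): the 1/2 cancels the 2
```
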

**Step 2. Calculate the gradient of the matrix of parameters W3**
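Assuming a sigmoid on the output layer (my assumption here), the chain rule first passes the loss derivative through the output activation and then accumulates it into **W3** via the second hidden layer's activations:

```python
import numpy as np

rng = np.random.default_rng(0)
a2    = rng.random((5, 32))                 # second hidden layer activations (stand-in)
W3    = rng.standard_normal((32, 10)) * 0.01
yhat  = 1.0 / (1.0 + np.exp(-(a2 @ W3)))    # sigmoid output
Y     = rng.random((5, 10))                 # stand-in labels
dyhat = yhat - Y                            # loss derivative from Step 1

# Chain rule: through the output sigmoid, then into W3.
dz3 = dyhat * yhat * (1.0 - yhat)   # sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
dW3 = a2.T @ dz3                    # gradient of W3, same shape as W3
```
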

**Step 3. Calculate the gradient of the second hidden layer**
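A sketch of this step, taking "gradient of the second hidden layer" to mean the gradient of the loss w.r.t. that layer's activations: the output-layer gradient is pushed back through **W3** (shapes and variable names here match my earlier assumptions, not necessarily the original code):

```python
import numpy as np

rng = np.random.default_rng(0)
W3  = rng.standard_normal((32, 10)) * 0.01
dz3 = rng.standard_normal((5, 10))   # output-layer gradient from Step 2 (stand-in)

# Propagate the gradient back through W3 to the second hidden layer's activations.
da2 = dz3 @ W3.T   # shape (5, 32): one gradient row per sample
```
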

**Step 4. Calculate the gradient of the matrix of parameters W2**
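A sketch of this step under the same assumptions (sigmoid hidden layers, my hypothetical sizes): the layer gradient first passes through the sigmoid derivative, then accumulates into **W2** via the first hidden layer's activations:

```python
import numpy as np

rng = np.random.default_rng(0)
a1  = rng.random((5, 64))            # first hidden layer activations (stand-in)
a2  = rng.random((5, 32))            # second hidden layer activations (stand-in)
da2 = rng.standard_normal((5, 32))   # layer gradient from Step 3 (stand-in)

# Through the sigmoid of layer 2, then into W2.
dz2 = da2 * a2 * (1.0 - a2)
dW2 = a1.T @ dz2   # gradient of W2, shape (64, 32)
```
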

**Step 5. Calculate the gradient of the first hidden layer**
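This mirrors Step 3 one layer down; as a sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
W2  = rng.standard_normal((64, 32)) * 0.01
dz2 = rng.standard_normal((5, 32))   # pre-activation gradient from Step 4 (stand-in)

# Push the gradient back through W2 to the first hidden layer's activations.
da1 = dz2 @ W2.T   # shape (5, 64)
```
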

**Step 6. Calculate the gradient of the matrix of parameters W1**
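And this mirrors Step 4, with the input **X** now playing the role of the previous layer's activations; a sketch under the same assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X   = rng.random((5, 784))           # input batch (stand-in)
a1  = rng.random((5, 64))            # first hidden layer activations (stand-in)
da1 = rng.standard_normal((5, 64))   # layer gradient from Step 5 (stand-in)

# Through the sigmoid of layer 1, then into W1.
dz1 = da1 * a1 * (1.0 - a1)
dW1 = X.T @ dz1   # gradient of W1, shape (784, 64)
```
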

**Step 7. Update parameters W1, W2, and W3**
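The update itself is plain gradient descent: each weight matrix moves a small step against its gradient. A sketch for **W3** (the learning rate `lr` is a hypothetical value; **W1** and **W2** are updated the same way):

```python
import numpy as np

rng = np.random.default_rng(0)
lr  = 0.1   # hypothetical learning rate

W3  = rng.standard_normal((32, 10)) * 0.01
dW3 = rng.standard_normal((32, 10))   # gradient from Step 2 (stand-in)

W3_before = W3.copy()
W3 -= lr * dW3   # step against the gradient; repeat for W1 and W2
```
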

The output will look like the following: