<center><h1>Backpropagation From Scratch</h1></center>

<center><h2><a href="https://deepcourse-epita.netlify.app/">Course link</a></h2></center>

To keep your modifications in case you want to come back later to this colab, do *File -> Save a copy in Drive*.

If you find a mistake, or know how to improve this notebook, please open an issue [here](https://gitlab.com/ey_datakalab/course_epita).

In this notebook, we'll code a **Logistic Regression** and a two-layer **Multi-Layer Perceptron** (MLP) from scratch.

We'll use **PyTorch**, a deep learning framework. It provides many utilities for neural networks (layers, optimizers, automatic differentiation), but we'll only use its API to manipulate tensors, as we would with numpy.

The goal will be to learn a model with **backpropagation** in order to classify digits (0, 1, 2, ..., 9) from images.

---

# 1. Loading Data

In [None]:
%matplotlib inline

In [None]:
import torch
import numpy as np
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt

digits = load_digits()

In [None]:
X = torch.tensor(digits["images"]).float()
Y = torch.tensor(digits["target"]).long()

print(f"Images shape: {X.shape}, targets shape: {Y.shape}")

We have 1797 images, each of size 8x8. As you can see, there is no **channel dimension**, meaning that our images are in grayscale. That is fine for now, as we only want to classify digits.

The targets have shape 1797 because the target of each image is simply an integer representing its digit.

Now that our data is loaded, we need to visualize it. Always look at your data before doing anything! Something may be wrong with the data (not in this case).

In [None]:
plt.subplot(1, 3, 1)
plt.imshow(X[42], cmap="gray")
plt.title(f"Digit: {Y[42]}");

plt.subplot(1, 3, 2)
plt.imshow(X[64], cmap="gray")
plt.title(f"Digit: {Y[64]}");

plt.subplot(1, 3, 3)
plt.imshow(X[1337], cmap="gray")
plt.title(f"Digit: {Y[1337]}");

For this dataset, the pixel values are integers ranging from 0 to 16. In deep learning, it's very important to normalize the data so that all inputs have roughly the same magnitude.

**Beware**: this preprocessing must be done for both the train and test sets! A lot of bugs come from using a slightly different preprocessing between the two sets.

In [None]:
print(f"Min and max value of images pixels [{X.min()}, {X.max()}]")
X = X / 16
print(f"Min and max value of normalized images pixels [{X.min()}, {X.max()}]")

An MLP only accepts vectors as inputs, so we will flatten our images into vectors:

In [None]:
# Flatten images as vectors
print(f"Images shape: {X.shape}")
X = X.view(X.shape[0], -1)
print(f"Flattened images shape: {X.shape}")
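
One way to make sure the train and test sets receive exactly the same preprocessing is to put the normalization and the flattening in a single function, and to call it on both. The sketch below is only illustrative: `preprocess` and the random `toy_train` / `toy_test` tensors are hypothetical names, not part of the notebook.

```python
import torch

def preprocess(images, max_value=16.0):
    """Scale pixel values to [0, 1] and flatten each image into a vector."""
    images = images.float() / max_value
    return images.reshape(images.shape[0], -1)

# Toy stand-ins for real train/test splits: random 8x8 images with values in [0, 16].
torch.manual_seed(0)
toy_train = torch.randint(0, 17, (5, 8, 8))
toy_test = torch.randint(0, 17, (3, 8, 8))

# The same function is applied to both sets, so they cannot drift apart.
toy_train_X = preprocess(toy_train)
toy_test_X = preprocess(toy_test)
print(toy_train_X.shape, toy_test_X.shape)
```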

---

# 2. Activation Functions

Now that the data is loaded and preprocessed, we need to code the non-linear activation functions that are essential to deep learning.

First let's start by `softmax`, the final activation, that will give us a probability per digit. It takes as input a vector of **logits** (the final outputs of the network before softmax, one value per digit) and returns a vector of probabilities.

Here is the formula for the $i^\text{th}$ probability:

$$\text{softmax}(\mathbf{x})_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$$

In [None]:
def softmax(x):
  pass # TODO

print(softmax(torch.tensor([1., 2., 3.])))
print(softmax(torch.tensor([3., -0.12, -4.2, 9])))

In [None]:
# Execute this cell to see the solution, but try to do it by yourself before!
!wget https://deepcourse-epita.netlify.app/code/backpropagation/softmax_base.py
%pycat softmax_base.py

Because the output vector is a probability distribution, all individual probabilities should sum to 1, let's check that:

In [None]:
print(softmax(torch.tensor([1., 2., 3.])).sum())
print(softmax(torch.tensor([3., -0.12, -4.2, 9])).sum())

What if our model is very confident about class 0?

In [None]:
softmax(torch.tensor([234., 3., 4.]))

Why did we get a NaN? How can we fix it?

In [None]:
def softmax(x):
  pass # TODO

softmax(torch.tensor([234., 3., 4.]))

In [None]:
# Execute this cell to see the solution, but try to do it by yourself before!
!wget https://deepcourse-epita.netlify.app/code/backpropagation/softmax_nan.py
%pycat softmax_nan.py

Subtracting a constant $c$ from every logit leaves the softmax perfectly identical:

$$\operatorname{softmax}(\mathbf{x} - c)_i = \frac{e^{x_i - c}}{\sum_j e^{x_j - c}}$$
$$\operatorname{softmax}(\mathbf{x} - c)_i = \frac{e^{-c} e^{x_i}}{e^{-c} \sum_j e^{x_j}}$$
$$\operatorname{softmax}(\mathbf{x} - c)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$$
$$\operatorname{softmax}(\mathbf{x} - c)_i = \operatorname{softmax}(\mathbf{x})_i$$
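
The identity above can also be checked numerically. The sketch below assumes we choose $c = \max_j x_j$, which is a common choice but not necessarily the one used in the course solution. The names `naive_softmax_sketch` and `shifted_softmax_sketch` are hypothetical.

```python
import torch

def naive_softmax_sketch(x):
    # Direct application of the formula, without any shift.
    exps = torch.exp(x)
    return exps / exps.sum()

def shifted_softmax_sketch(x):
    # Same formula, but with c = max(x) subtracted from every logit.
    exps = torch.exp(x - x.max())
    return exps / exps.sum()

small_logits = torch.tensor([1., 2., 3.])
large_logits = torch.tensor([234., 3., 4.])

# exp(234) exceeds the float32 range, so the first entry becomes inf.
print(torch.exp(large_logits))
# inf / inf gives nan in the naive version.
print(naive_softmax_sketch(large_logits))
# After the shift, the largest exponent is exp(0) = 1, so nothing overflows.
print(shifted_softmax_sketch(large_logits))
# On inputs that do not overflow, both versions agree.
print(torch.allclose(naive_softmax_sketch(small_logits), shifted_softmax_sketch(small_logits)))
```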


In practice, we will work with mini-batches, i.e. our probabilities tensor will be of shape $(B, C)$:

In [None]:
x = torch.tensor([
    [1., 2., 3.],
    [4., 9., -12.]
])
probabilities = softmax(x)

print(probabilities.sum())
print(probabilities.sum(dim=1))

The probabilities of the whole batch sum to 1! That's not what we want. What went wrong? Remember that most PyTorch functions can be applied along a specific dimension.
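
As a reminder of how reductions along a dimension behave, here is a small sketch on a toy tensor (the name `toy` is hypothetical). It only illustrates `dim` and `keepdim`; it is not the solution to the exercise.

```python
import torch

toy = torch.tensor([
    [1., 2., 3.],
    [4., 9., -12.]
])

# Without `dim`, the reduction runs over every element: a single scalar.
print(toy.sum())
# With dim=1, we get one value per row: shape (2,).
print(toy.sum(dim=1))
# keepdim=True keeps a size-1 axis: shape (2, 1), which broadcasts against (2, 3).
print(toy.sum(dim=1, keepdim=True).shape)
# max along a dimension returns both values and indices; we keep the values.
print(toy.max(dim=1, keepdim=True).values)
```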

In [None]:
def softmax(x):
  pass # TODO

x = torch.tensor([
    [1., 2., 3.],
    [4., 9., -12.]
])
probabilities = softmax(x)

print(probabilities.sum(dim=1))

In [None]:
# Execute this cell to see the solution, but try to do it by yourself before!
!wget https://deepcourse-epita.netlify.app/code/backpropagation/softmax_batch.py
%pycat softmax_batch.py
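
A simple way to gain confidence in a hand-written batched softmax is to compare it with PyTorch's built-in `torch.softmax`. The function `batched_softmax_sketch` below is a minimal sketch that may differ from the course solution, and `toy_logits` is a hypothetical input.

```python
import torch

def batched_softmax_sketch(x):
    # Subtract the row-wise maximum for numerical stability, then normalize each row.
    shifted = x - x.max(dim=1, keepdim=True).values
    exps = torch.exp(shifted)
    return exps / exps.sum(dim=1, keepdim=True)

toy_logits = torch.tensor([
    [1., 2., 3.],
    [4., 9., -12.],
    [234., 3., 4.]
])
toy_probs = batched_softmax_sketch(toy_logits)

# Each row should be its own probability distribution.
print(toy_probs.sum(dim=1))
# The result should match the reference implementation applied along dim=1.
print(torch.allclose(toy_probs, torch.softmax(toy_logits, dim=1)))
```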

Now, what about a loss? Let's code the cross-entropy!

Tips:
- Remember the dimensions: we have mini-batches
- Log of 0 is undefined, so what trick can we use?

In [None]:
def cross_entropy(probs, targets):
  pass  # TODO

probs = torch.tensor([
    [0.9, 0.1],
    [0.7, 0.3],
    [0.2, 0.8],
    [0.6, 0.4]
])

targets = torch.eye(2)[torch.tensor([0, 0, 0, 1])]
cross_entropy(probs, targets)

In [None]:
# Execute this cell to see the solution, but try to do it by yourself before!
!wget https://deepcourse-epita.netlify.app/code/backpropagation/ce.py
%pycat ce.py

The loss is largest when we are wrong by a large margin, i.e. when we assign a low probability to the ground-truth class. Likewise, it is smallest when we are extremely confident in the ground-truth class. You should now see that minimizing the cross-entropy maximizes the confidence (also known as the *likelihood*) in the ground-truth class.

Try running the cross-entropy with different probabilities to get an intuition about it.
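
To make this experiment concrete, here is a minimal sketch of a cross-entropy on one-hot targets, evaluated for several probabilities assigned to the true class. `cross_entropy_sketch` and its `eps` parameter are assumptions made for illustration, and may differ from the course solution.

```python
import torch

def cross_entropy_sketch(probs, targets, eps=1e-8):
    # probs and targets have shape (B, C); targets are one-hot.
    # eps avoids log(0) when a probability is exactly zero.
    per_sample = -(targets * torch.log(probs + eps)).sum(dim=1)
    return per_sample.mean()

# A single sample whose ground truth is class 0.
toy_target = torch.eye(2)[torch.tensor([0])]

toy_losses = []
for p in [0.99, 0.9, 0.5, 0.1, 0.01]:
    toy_probs = torch.tensor([[p, 1 - p]])
    loss_value = cross_entropy_sketch(toy_probs, toy_target).item()
    toy_losses.append(loss_value)
    print(f"p(true class) = {p:.2f} -> loss = {loss_value:.3f}")
```

The loss grows as the probability assigned to the true class shrinks, and it grows faster and faster as that probability approaches 0.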

---

# 3. Logistic Regression

Now that we have our `softmax` activation and `cross_entropy` loss, we can code a **logistic regression**. Behind this fancy name, it's simply a 1-layer neural network followed by a softmax.

Here is the forward formula:

$$\tilde{\mathbf{y}} = \mathbf{X}\mathbf{W} + \mathbf{b}$$
$$\hat{\mathbf{y}} = \text{softmax}(\tilde{\mathbf{y}})$$
$$\mathcal{L} = -\frac{1}{B} \sum_{b=1}^B \sum_{c=1}^C y_{b,c} \log \hat{y}_{b,c}$$

With $\mathbf{X} \in \mathbb{R}^{B \times N}$, $\mathbf{W} \in \mathbb{R}^{N \times C}$, and $\mathbf{b} \in \mathbb{R}^{C}$, where $B$ is the batch size, $N$ the number of input pixels, and $C$ the number of classes. The target $\mathbf{y}_b$ is one-hot: $y_{b,c}$ is 1 if sample $b$ belongs to class $c$, and 0 otherwise.

For the backward pass, we can simplify the formulas with a shortcut: directly take the gradient of the loss $\mathcal{L}$ with respect to (w.r.t.) the logits $\tilde{\mathbf{y}}$ (*see course for details*):

$$\nabla_\tilde{\mathbf{y}} \mathcal{L} = \hat{\mathbf{y}} - \mathbf{y}$$

Only two gradients are of interest: those w.r.t. the weights $\mathbf{W}$ and the bias $\mathbf{b}$, the parameters we want to update.

$$\nabla_\mathbf{W} \mathcal{L} = (\nabla_\mathbf{W} \tilde{\mathbf{y}})^T \nabla_\tilde{\mathbf{y}} \mathcal{L}$$
$$\nabla_\mathbf{b} \mathcal{L} = \nabla_\tilde{\mathbf{y}} \mathcal{L}$$

**Hint**: Look at the shape of each tensor if you're confused, i.e., the gradient $\nabla_\mathbf{W} \mathcal{L}$ should have the same shape as $\mathbf{W}$!


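We also recall that the gradient of the logits w.r.t. $\mathbf{W}$ is built from $\mathbf{X}$, and is not simply $\mathbf{X}$. For a single sample $\mathbf{x} \in \mathbb{R}^N$:
$$ \frac{\partial \tilde{y}_l}{\partial W_{k,l}} = \frac{\partial \left(\sum_{j=1}^{N} x_j W_{j,l} + b_l\right)}{\partial W_{k,l}} = x_k
$$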
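
A handy way to test hand-derived gradients is to compare them with automatic differentiation. The sketch below uses autograd only as a checker, on random toy tensors; all `toy_*` names are hypothetical. Note the division by $B$, which comes from the mean over the batch in the loss, and the sum over the batch for the bias, which is shared by all samples.

```python
import torch

torch.manual_seed(0)
toy_B, toy_N, toy_C = 4, 6, 3
toy_x = torch.randn(toy_B, toy_N)
toy_y = torch.eye(toy_C)[torch.tensor([0, 2, 1, 2])]  # one-hot targets

toy_w = (0.1 * torch.randn(toy_N, toy_C)).requires_grad_()
toy_b = torch.zeros(toy_C, requires_grad=True)

# Forward pass, tracked by autograd.
toy_logits = toy_x @ toy_w + toy_b
toy_probs = torch.softmax(toy_logits, dim=1)
toy_loss = -(toy_y * torch.log(toy_probs)).sum(dim=1).mean()
toy_loss.backward()

# Manual backward pass, following the formulas above.
with torch.no_grad():
    grad_logits = (toy_probs - toy_y) / toy_B   # gradient w.r.t. the logits
    grad_w = toy_x.T @ grad_logits              # same shape as toy_w: (N, C)
    grad_b = grad_logits.sum(dim=0)             # same shape as toy_b: (C,)

print(torch.allclose(grad_w, toy_w.grad, atol=1e-6))
print(torch.allclose(grad_b, toy_b.grad, atol=1e-6))
```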


In [None]:
class LogisticRegression:
    def __init__(self, input_size, nb_classes, learning_rate=0.5):
        self.w = None  # TODO
        self.b = None  # TODO

        self.learning_rate = learning_rate

    def forward(self, x):
        return  # TODO

    def fit(self, inputs, targets, train=True):
        # TODO
        return loss

    def backward(self, inputs, probs, targets):
        pass  # TODO, should be called by `fit`

    def accuracy(self, inputs, targets):
        pass  # TODO

In [None]:
# Execute this cell to see the solution, but try to do it by yourself before!
!wget https://deepcourse-epita.netlify.app/code/backpropagation/logreg.py
%pycat logreg.py

In [None]:
model = LogisticRegression(X.shape[1], len(torch.unique(Y)), 0.5)

Let's measure the accuracy of the untrained model. What does it roughly correspond to? Any idea why?

In [None]:
model.accuracy(X, Y)

Let's train! We are going to go through the whole dataset `nb_epochs` times, in chunks of `batch_size` images.

**Note** that we are training and testing on the same set here for simplicity. In later courses, or in real life, don't do that.
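
For reference, a held-out test set can be obtained by shuffling the indices once and cutting them into two disjoint parts. This is only a sketch on random toy data: the 80/20 ratio and all names below are assumptions, not part of the notebook.

```python
import torch

torch.manual_seed(0)

# Toy stand-ins for the flattened images and their labels.
toy_X = torch.rand(100, 64)
toy_Y = torch.randint(0, 10, (100,))

# Shuffle the indices once, then cut them into two disjoint parts.
permutation = torch.randperm(len(toy_X))
nb_train = int(0.8 * len(toy_X))
train_idx, test_idx = permutation[:nb_train], permutation[nb_train:]

train_X, train_Y = toy_X[train_idx], toy_Y[train_idx]
test_X, test_Y = toy_X[test_idx], toy_Y[test_idx]
print(train_X.shape, test_X.shape)
```

The model would then be fitted on `train_X` / `train_Y` only, and its accuracy reported on `test_X` / `test_Y`.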

In [None]:
batch_size = 32
nb_epochs = 10

epochs, accuracies, losses = [], [], []

for epoch in range(nb_epochs):
    for batch_index in range(0, len(X), batch_size):
        batch_X = X[batch_index:batch_index + batch_size]
        batch_Y = Y[batch_index:batch_index + batch_size]
    
        loss = model.fit(batch_X, batch_Y)
        
    loss = model.fit(X, Y, train=False)
    acc = model.accuracy(X, Y)
    
    print(f"Epoch: {epoch}, loss: {loss}, accuracy: {acc}")
    epochs.append(epoch)
    losses.append(loss)
    accuracies.append(acc)
    
plt.subplot(1, 2, 1)
plt.plot(epochs, losses)
plt.xlabel("Epoch")
plt.ylabel("Loss")

plt.subplot(1, 2, 2)
plt.plot(epochs, accuracies)
plt.xlabel("Epoch")
plt.ylabel("Accuracy");

--- 

# 4. Multi-Layer Perceptron

Now, let's build a Multi-Layer Perceptron, aka a neural network with hidden layers.

Hidden layers imply hidden activations. `tanh` is already implemented for you: `torch.tanh`.

This function is applied **element-wise**, meaning that it is applied independently to every element of the tensor (unlike softmax). We now need the gradient of this function:

In [None]:
def grad_tanh(tanh_results):
    return  # TODO

In [None]:
# Execute this cell to see the solution, but try to do it by yourself before!
!wget https://deepcourse-epita.netlify.app/code/backpropagation/gradtanh.py
%pycat gradtanh.py
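
A derivative can be checked numerically with finite differences. The sketch below assumes the standard identity $\tanh'(x) = 1 - \tanh^2(x)$ and compares it with a central difference in double precision. `grad_tanh_sketch` is a hypothetical name and may differ from the course solution.

```python
import torch

def grad_tanh_sketch(tanh_results):
    # Derivative of tanh, expressed with the already-computed tanh(x).
    return 1 - tanh_results ** 2

# Double precision keeps the finite-difference error small.
toy_points = torch.linspace(-3, 3, 7, dtype=torch.float64)
step = 1e-6

# Central difference: (f(x + h) - f(x - h)) / (2h).
numerical = (torch.tanh(toy_points + step) - torch.tanh(toy_points - step)) / (2 * step)
analytical = grad_tanh_sketch(torch.tanh(toy_points))

print(torch.allclose(numerical, analytical, atol=1e-8))
```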

Now the MLP is like the Logistic Regression we coded previously, but with hidden layers!

We'll start with only one hidden layer. The forward pass should be straightforward; for the backward pass, try to derive it by yourself. If you're really stuck, you can have a look at the solution code or the course, but only after you've tried enough!

In [None]:
class MLP:
    def __init__(self, input_size, hidden_size, nb_classes, learning_rate=0.5):
        self.w_hidden = None  # TODO
        self.b_hidden = None  # TODO

        self.w_output = None  # TODO
        self.b_output = None  # TODO

        self.learning_rate = learning_rate

    def forward(self, x):
        # TODO
        # Remember to keep all intermediary values that are needed for the backward pass
        pass

    def fit(self, inputs, targets, train=True):
        # TODO
        return loss

    def backward(self, inputs, targets, h_tilde, h, logits, probs):
        pass  # TODO

    def accuracy(self, inputs, targets):
        pass  # TODO

The same training code used for the Logistic Regression can also be used for the MLP.

That's the beauty of it: as long as our model can take in input images and predict their digits, the training loop doesn't care about the internals:

In [None]:
model = MLP(X.shape[1], 50, len(torch.unique(Y)), 0.5)
model.accuracy(X, Y)

batch_size = 32
nb_epochs = 10

epochs, accuracies, losses = [], [], []

for epoch in range(nb_epochs):
    for batch_index in range(0, len(X), batch_size):
        batch_X = X[batch_index:batch_index + batch_size]
        batch_Y = Y[batch_index:batch_index + batch_size]
    
        model.fit(batch_X, batch_Y)
        
    loss = model.fit(X, Y, train=False)
    acc = model.accuracy(X, Y)
    
    print(f"Epoch: {epoch}, loss: {loss}, accuracy: {acc}")
    epochs.append(epoch)
    losses.append(loss)
    accuracies.append(acc)
    
plt.subplot(1, 2, 1)
plt.plot(epochs, losses)
plt.xlabel("Epoch")
plt.ylabel("Loss")

plt.subplot(1, 2, 2)
plt.plot(epochs, accuracies)
plt.xlabel("Epoch")
plt.ylabel("Accuracy");

In [None]:
# Execute this cell to see the solution, but try to do it by yourself before!
!wget https://deepcourse-epita.netlify.app/code/backpropagation/mlp.py
%pycat mlp.py
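
As with the logistic regression, a hand-written backward pass for a one-hidden-layer MLP can be checked against autograd. The sketch below is one possible derivation via the chain rule, on random toy tensors. It assumes a `tanh` hidden activation and a mean cross-entropy loss, and all `toy_*` names are hypothetical.

```python
import torch

torch.manual_seed(0)
toy_B, toy_N, toy_H, toy_C = 5, 8, 4, 3
toy_x = torch.randn(toy_B, toy_N)
toy_y = torch.eye(toy_C)[torch.randint(0, toy_C, (toy_B,))]  # one-hot targets

toy_w_hidden = (0.1 * torch.randn(toy_N, toy_H)).requires_grad_()
toy_b_hidden = torch.zeros(toy_H, requires_grad=True)
toy_w_output = (0.1 * torch.randn(toy_H, toy_C)).requires_grad_()
toy_b_output = torch.zeros(toy_C, requires_grad=True)

# Forward pass, keeping the intermediate values needed by the backward pass.
h_tilde = toy_x @ toy_w_hidden + toy_b_hidden
h = torch.tanh(h_tilde)
logits = h @ toy_w_output + toy_b_output
probs = torch.softmax(logits, dim=1)
loss = -(toy_y * torch.log(probs)).sum(dim=1).mean()
loss.backward()

# Manual backward pass: apply the chain rule layer by layer, from output to input.
with torch.no_grad():
    grad_logits = (probs - toy_y) / toy_B
    grad_w_output = h.T @ grad_logits
    grad_b_output = grad_logits.sum(dim=0)
    grad_h = grad_logits @ toy_w_output.T
    grad_h_tilde = grad_h * (1 - h ** 2)        # element-wise tanh derivative
    grad_w_hidden = toy_x.T @ grad_h_tilde
    grad_b_hidden = grad_h_tilde.sum(dim=0)

manual = [grad_w_hidden, grad_b_hidden, grad_w_output, grad_b_output]
reference = [toy_w_hidden.grad, toy_b_hidden.grad, toy_w_output.grad, toy_b_output.grad]
print(all(torch.allclose(m, r, atol=1e-6) for m, r in zip(manual, reference)))
```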

--- 

# Notebook Summary

We learned how to:
- code a robust softmax
- code a logistic regression with forward and backward passes
- code an MLP with forward and backward passes

# Further Work

- Try to implement an MLP with two hidden layers, or three, or as many as the user wants, through a simple API.
- Does the initialization of the weights matter? Try tweaking it (more on this in later courses).