<center><h1>Localization</h1></center>

<center><h2><a href="https://deepcourse-epita.netlify.app/">Course link</a></h2></center>

To keep your modifications in case you want to come back later to this colab, do *File -> Save a copy in Drive*.

If you find a mistake, or know how to improve this notebook, please open an issue [here](https://gitlab.com/ey_datakalab/course_epita).

In [None]:
!wget https://deepcourse-epita.netlify.app/code/loc/imagenet.json

In [None]:
%matplotlib inline

In [None]:
import json

import matplotlib.pyplot as plt
import numpy as np
import torch
from torch import nn
from torch.nn import functional as F
from torchvision import transforms
import torchvision
from PIL import Image
from nltk.corpus import wordnet as wn

In this colab, we are going to **localize** objects in an image, without even training a model.

Or, to be more precise, using only a model pretrained for classification.

Let's load our resnet pretrained on ImageNet:

In [None]:
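# `weights=` replaces the deprecated `pretrained=True` (torchvision >= 0.13); `.eval()` makes batch norm use its running statistics.
net = torchvision.models.resnet50(weights=torchvision.models.ResNet50_Weights.IMAGENET1K_V1).eval()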

We need to modify its forward function. Ideally we would download the code and modify it. But on colab that is a bit impractical, so we are going to monkey-patch the forward function.
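As a minimal illustration of the monkey-patching trick itself, here is a sketch on a toy class. The `Greeter` class and the `shout` function are hypothetical names invented for this example, not part of torchvision. Calling `__get__` on a plain function binds it to one instance, so `self` is filled in automatically when the method is called.

```python
import types


class Greeter:
    def __init__(self, name):
        self.name = name

    def greet(self):
        return f"hello {self.name}"


def shout(self):
    # Replacement method: same signature as the original, different behaviour.
    return f"HELLO {self.name.upper()}"


patched = Greeter("ada")
untouched = Greeter("bob")

# Bind the function to this single instance; the class itself is left unchanged.
patched.greet = shout.__get__(patched, Greeter)

print(patched.greet())    # uses the patched method
print(untouched.greet())  # still uses the original method

# types.MethodType produces an equivalent bound method.
other = Greeter("eve")
other.greet = types.MethodType(shout, other)
print(other.greet())
```

Only the patched instance changes, which is why patching `net.forward` does not affect other ResNet instances.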

We strip the last global average pooling (`self.avgpool`) and the classifier (`self.fc`) and we add a new layer that we call `self.conv1x1`:

In [None]:
def forward(self, x):
  x = self.conv1(x)
  x = self.bn1(x)
  x = self.relu(x)
  x = self.maxpool(x)

  x = self.layer1(x)
  x = self.layer2(x)
  x = self.layer3(x)
  x = self.layer4(x)

  # We remove those two layers...
  # x = self.avgpool(x)
  # x = self.fc(x)

  # ... and we add this layer:
  x = self.conv1x1(x)

  return x


# Monkey-patching: bind our function to this instance as its `forward` method.
net.forward = forward.__get__(net, torchvision.models.ResNet)

We are going to exploit an important fact to do our localization for free.

Convolutions with a 1x1 kernel, also called **pointwise convolutions**, are actually a fully-connected layer applied independently to every pixel.

To do localization we need the spatial dimensions, which we lose with the global pooling and the fully-connected layer. Therefore we are going to convert the fully-connected layer into a 1x1 convolution.
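The equivalence can be checked numerically on a toy example. This is only a sketch under assumed sizes (8 input channels, 5 outputs, chosen arbitrarily), not the ResNet layers themselves, and the variable names are illustrative.

```python
import torch
from torch import nn

torch.manual_seed(0)

in_features, out_features = 8, 5  # toy sizes, chosen only for this illustration
fc = nn.Linear(in_features, out_features)
conv = nn.Conv2d(in_features, out_features, kernel_size=1)

# A Linear weight has shape (out, in); a 1x1 Conv2d weight has shape (out, in, 1, 1).
with torch.no_grad():
    conv.weight.copy_(fc.weight.view(out_features, in_features, 1, 1))
    conv.bias.copy_(fc.bias)

x = torch.randn(2, in_features, 4, 6)  # (batch, channels, height, width)

with torch.no_grad():
    conv_out = conv(x)
    # Apply the Linear layer to every pixel: move channels last, then move them back.
    fc_out = fc(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

print(conv_out.shape)
print(torch.allclose(conv_out, fc_out, atol=1e-5))
```

Both layers hold the same number of parameters; only the weight tensor gains two trailing dimensions of size 1.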

Inspect the weights and bias shapes of both layers, and try to find a way to transfer the parameters learned by one to the other:

In [None]:
print("fc weight and bias:", net.fc.weight.data.shape, net.fc.bias.data.shape)

conv1x1 = nn.Conv2d(2048, 1000, kernel_size=1)

print("conv1x1 weight and bias:", conv1x1.weight.data.shape, conv1x1.bias.data.shape)

In [None]:
# Execute this cell to see the solution, but try to do it by yourself before!
!wget https://deepcourse-epita.netlify.app/code/loc/conv.py
%pycat conv.py

And don't forget to provide this new conv to the network:

In [None]:
net.conv1x1 = conv1x1

Just to be sure, check the output dimension:

In [None]:
net(torch.randn(1, 3, 224, 224)).shape

Now we want to localize one class in particular. Let's say the class "dog". You can take another class if you want; I only use this one because it is the most frequent class of ImageNet.

Among ImageNet's 1000 classes, more than a hundred are dog breeds.

Luckily, ImageNet's classes are based on the hierarchy of **WordNet**. Therefore we can find all the classes belonging to "dog".
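The idea of walking up a hierarchy can be sketched without WordNet at all. The toy `hypernyms` dictionary below is invented for illustration (it is not real WordNet data) and maps each node to its parents. Checking every parent, rather than only the first one, is a design choice assumed here so that nodes with several parents are handled too.

```python
# Toy hierarchy, invented for this example: child -> list of parents.
hypernyms = {
    "beagle": ["hound"],
    "hound": ["hunting_dog"],
    "hunting_dog": ["dog"],
    "dog": ["canine", "domestic_animal"],
    "canine": ["carnivore"],
    "domestic_animal": ["animal"],
    "carnivore": ["animal"],
    "tabby": ["cat"],
    "cat": ["feline"],
    "feline": ["carnivore"],
    "animal": [],
}


def is_descendant(node, ancestor):
    """Return True if `ancestor` is reachable from `node` by following parents."""
    stack, seen = [node], set()
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(hypernyms.get(current, []))
    return False


dog_like = sorted(n for n in hypernyms if is_descendant(n, "dog"))
print(dog_like)
```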

First, we load wordnet:

In [None]:
import nltk
nltk.download('wordnet')

Then, we load the classes of ImageNet and their id (**synset**) in the WordNet hierarchy:

In [None]:
with open("imagenet.json") as f:
  imagenet_labels = json.load(f)

In [None]:
!head imagenet.json

We aggregate all dog classes:

In [None]:
class_to_localize = "dog"
nltk.download('omw-1.4')
# Use the public `name()` accessor rather than the private `_name` attribute.
parent_name = wn.synsets(class_to_localize)[0].name()
print(f"Parent is <{parent_name}>")

indexes = set()
names = set()

for index, metadata in imagenet_labels.items():
  # The id looks like "<offset>-n"; the offset identifies the noun synset.
  base_synset = wn.synset_from_pos_and_offset("n", int(metadata["id"].split('-')[0]))

  # Walk up the hierarchy (first parent only) until we reach the target or the root.
  synset = base_synset
  while synset.name() != parent_name:
    parents = synset.hypernyms()
    if len(parents) == 0:  # no more parents, we are at the root
      break

    synset = parents[0]

  if synset.name() == parent_name:
    indexes.add(int(index))
    names.add(base_synset.name())

indexes = torch.tensor(sorted(indexes))
print(f"There are {len(indexes)} classes")
sorted(names)[:10]

Let's try our model on an image with two dogs (but you can use any image you want):

In [None]:
!wget https://deepcourse-epita.netlify.app/code/loc/2dogs.jpg

In [None]:
image = Image.open("2dogs.jpg")
image.thumbnail((512, 512), Image.LANCZOS)  # `Image.ANTIALIAS` was removed in Pillow 10
image

We still need to preprocess our image in the same way as the images the model was trained on:

In [None]:
imagenet_mean = torch.tensor([0.485, 0.456, 0.406])
imagenet_std = torch.tensor([0.229, 0.224, 0.225])

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(imagenet_mean, imagenet_std),
])

Now write a function that returns the *attention* of the model, which will serve as our localization.

Here are the steps:

1. resize image to the asked dimensions
2. preprocess the image
3. extract the spatial logits
4. apply softmax along the correct dimension
5. only keep the channels in `indexes` that we computed from wordnet and sum them


*Pro-tip*: to avoid storing intermediate activations, which are only useful for the backward pass, use the context manager `torch.no_grad()`:

```python
with torch.no_grad():
  y = net(x)
```
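Steps 4 and 5 can be sketched on random tensors, independently of the network. The shapes and the `dog_indexes` values below are placeholders assumed for illustration; in the notebook they would come from `net` and from the WordNet lookup.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)

# Placeholder spatial logits with shape (batch, classes, height, width).
logits = torch.randn(1, 1000, 7, 7)
# Placeholder class indices standing in for the `indexes` tensor built earlier.
dog_indexes = torch.tensor([151, 152, 153])

# Softmax over the class dimension: each pixel gets its own class distribution.
probabilities = F.softmax(logits, dim=1)

# Keep only the selected classes and sum them: probability mass per pixel.
attention_map = probabilities[0, dog_indexes].sum(dim=0)

print(attention_map.shape)
print(float(attention_map.min()), float(attention_map.max()))
```

Because the softmax is taken over classes and not over positions, each value of the map is a probability between 0 and 1 for that pixel alone.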

In [None]:
def generate_attention(path, size=(224, 224)):
  image = Image.open(path)
  image.thumbnail(size, Image.LANCZOS)  # `Image.ANTIALIAS` was removed in Pillow 10

  # TODO

  return image, attention_map

In [None]:
# Execute this cell to see the solution, but try to do it by yourself before!
!wget https://deepcourse-epita.netlify.app/code/loc/attn.py
%pycat attn.py

Let's try it!

In [None]:
image, attn = generate_attention("2dogs.jpg", (224, 224))

plt.figure(figsize=(10, 8))
ax = plt.subplot(1, 2, 1)
ax.axis("off")
plt.imshow(image)

ax = plt.subplot(1, 2, 2)
ax.axis("off")
plt.imshow(attn)

Hmm... Not that great, right? I mean, we knew it wouldn't be very precise, but this is super bad.

But what if the resolution was larger?...

In [None]:
image, attn = generate_attention("2dogs.jpg", (512, 512))

plt.figure(figsize=(10, 8))
ax = plt.subplot(1, 2, 1)
ax.axis("off")
plt.imshow(image)

ax = plt.subplot(1, 2, 2)
ax.axis("off")
plt.imshow(attn)

Larger?

In [None]:
image, attn = generate_attention("2dogs.jpg", (1024, 1024))

plt.figure(figsize=(10, 8))
ax = plt.subplot(1, 2, 1)
ax.axis("off")
plt.imshow(image)

ax = plt.subplot(1, 2, 2)
ax.axis("off")
plt.imshow(attn)

Larger?!

In [None]:
image, attn = generate_attention("2dogs.jpg", (1500, 1500))

plt.figure(figsize=(10, 8))
ax = plt.subplot(1, 2, 1)
ax.axis("off")
plt.imshow(image)

ax = plt.subplot(1, 2, 2)
ax.axis("off")
plt.imshow(attn)

You should see the shapes of the dogs more precisely as the resolution (and the computational cost) increases.
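To make this trade-off concrete, the sketch below estimates the side of the spatial output for a given input side. It assumes the standard torchvision ResNet layout, in which the stem convolution, the max-pooling, and the first block of `layer2`, `layer3` and `layer4` each halve the resolution (rounding up), for a total stride of 32. The helper name `output_size` is hypothetical.

```python
def output_size(input_size):
    """Spatial size after the five stride-2 stages of a torchvision ResNet."""
    size = input_size
    for _ in range(5):  # conv1, maxpool, layer2, layer3, layer4
        size = (size + 1) // 2  # each stage halves the size, rounding up
    return size


for side in (224, 512, 1024, 1500):
    out = output_size(side)
    print(f"{side:>4} px -> {out} x {out} map ({out * out} positions)")
```

Since `thumbnail` preserves the aspect ratio, the real maps are rectangular rather than square, but each side follows the same rule.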

But as the resolution is larger, there are also more artefacts where the model thinks it has found a dog somewhere in the background.

A solution, which remains essential even in modern segmentation networks, is to exploit **multiple scales**.

Compute the attention for the same image resized to different dimensions, and combine all those attentions together. Several aggregation methods are possible, although I recommend a [geometric mean](https://en.wikipedia.org/wiki/Geometric_mean).
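As a hedged sketch of the aggregation only, here is a geometric mean computed over random maps of different sizes. The maps, the `target_size` and the `eps` constant are placeholders assumed for illustration; computing the mean in log space is one possible way to do it.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)

# Placeholder attention maps of different sizes, standing in for real ones.
maps = [torch.rand(4, 6), torch.rand(8, 12), torch.rand(16, 24)]
target_size = (16, 24)  # common size chosen for this illustration
eps = 1e-8  # avoids log(0)

log_sum = torch.zeros(target_size)
for attn in maps:
    # F.interpolate expects (batch, channels, height, width).
    resized = F.interpolate(attn[None, None], size=target_size)[0, 0]
    log_sum += torch.log(resized + eps)

# Geometric mean: exp of the arithmetic mean of the logs.
geometric_mean = torch.exp(log_sum / len(maps))

print(geometric_mean.shape)
```

With a geometric mean, a pixel has to score high at every scale to stay bright, so an artefact that shows up at a single scale is strongly attenuated.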

In [None]:
final_attn = # TODO

sizes = (112, 224, 512, 1024, 1500)
for size in sizes:
  image, attn = generate_attention("2dogs.jpg", (size, size))
  resized_attn = F.interpolate(attn[None, None], (32, 47))[0, 0]
  # TODO

final_attn = # TODO

plt.figure(figsize=(10, 8))
ax = plt.subplot(1, 2, 1)
ax.axis("off")
plt.imshow(image)

ax = plt.subplot(1, 2, 2)
ax.axis("off")
plt.imshow(final_attn)

In [None]:
# Execute this cell to see the solution, but try to do it by yourself before!
!wget https://deepcourse-epita.netlify.app/code/loc/geom.py
%pycat geom.py

It should be better than previous results.

Of course, we don't have a super precise segmentation of the classes with this method. But we managed to do some crude localization without any training.

If we wanted to improve the results, we could finetune this 1x1 convolution using actual segmentation data. This is more or less what the **Fully Convolutional Network** paper, published at CVPR 2015, did! Read the paper [here](https://arxiv.org/pdf/1411.4038.pdf), and if you're feeling courageous, implement it!
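A very rough sketch of what such finetuning could look like, on random tensors only. Everything here is an assumption made for illustration (toy channel and class counts, random features standing in for a frozen backbone, bilinear upsampling, SGD with a learning rate of 0.1); it is not the exact recipe of the paper.

```python
import torch
from torch import nn
from torch.nn import functional as F

torch.manual_seed(0)

num_channels, num_classes = 32, 4  # toy sizes, not those of a real ResNet
conv1x1 = nn.Conv2d(num_channels, num_classes, kernel_size=1)
optimizer = torch.optim.SGD(conv1x1.parameters(), lr=0.1)

# Placeholders: frozen backbone features and per-pixel ground-truth labels.
features = torch.randn(2, num_channels, 7, 7)
labels = torch.randint(0, num_classes, (2, 56, 56))

losses = []
for _ in range(20):
    logits = conv1x1(features)
    # Upsample the coarse logits to the label resolution before the loss.
    logits = F.interpolate(
        logits, size=labels.shape[-2:], mode="bilinear", align_corners=False
    )
    loss = F.cross_entropy(logits, labels)  # averaged over all pixels

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.3f}")
```

Only the 1x1 convolution receives gradients here, which matches the idea of keeping the pretrained backbone as it is.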