<center><h1>Transfer Learning for CNNs</h1></center>

<center><h2><a href="https://deepcourse-epita.netlify.app/">Course link</a></h2></center>

To keep your modifications in case you want to come back later to this colab, do *File -> Save a copy in Drive*.

If you find a mistake, or know how to improve this notebook, please open an issue [here](https://gitlab.com/ey_datakalab/course_epita).

In [None]:
!rm -rf PetImages *.docx *.txt *zip
!wget https://download.microsoft.com/download/3/E/1/3E1C3F21-ECDB-4869-8368-6DEBA77B919F/kagglecatsanddogs_3367a.zip
!unzip -q kagglecatsanddogs_3367a.zip

--2021-06-04 15:05:41--  https://download.microsoft.com/download/3/E/1/3E1C3F21-ECDB-4869-8368-6DEBA77B919F/kagglecatsanddogs_3367a.zip
Resolving download.microsoft.com (download.microsoft.com)... 2.18.233.19, 2a02:26f0:5c:486::e59, 2a02:26f0:5c:4ac::e59
Connecting to download.microsoft.com (download.microsoft.com)|2.18.233.19|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 824894548 (787M) [application/octet-stream]
Saving to: ‘kagglecatsanddogs_3367a.zip’


2021-06-04 15:05:47 (128 MB/s) - ‘kagglecatsanddogs_3367a.zip’ saved [824894548/824894548]



In [None]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [None]:
import time

import torch
from torch import nn
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder
from torchvision import transforms
import torchvision
from PIL import Image

Let's define the preprocessing. Because we will use a model pretrained on ImageNet, we need to reuse the exact same preprocessing (here, a normalization based on the ImageNet mean and std).

If you don't (please try), performance will suffer dramatically. This is a common source of bugs, so I'm sure you will encounter it one day. So when transfer learning performs poorly, check the preprocessing first!

In [None]:
imagenet_mean = torch.tensor([0.485, 0.456, 0.406])
imagenet_std = torch.tensor([0.229, 0.224, 0.225])

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(imagenet_mean, imagenet_std),
])
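
To make the normalization step concrete, here is a small sketch in plain PyTorch of what `transforms.Normalize` computes for each channel, together with the inverse operation, which is handy when you want to display a preprocessed image. The helper names `normalize` and `denormalize` are introduced here for illustration only; they are not part of the notebook.

```python
import torch

imagenet_mean = torch.tensor([0.485, 0.456, 0.406])
imagenet_std = torch.tensor([0.229, 0.224, 0.225])

def normalize(img):
  # img: float tensor of shape (3, H, W) with values in [0, 1].
  # Each channel is shifted by its mean and divided by its std.
  return (img - imagenet_mean[:, None, None]) / imagenet_std[:, None, None]

def denormalize(img):
  # Inverse of `normalize`, useful before plotting an image.
  return img * imagenet_std[:, None, None] + imagenet_mean[:, None, None]

img = torch.rand(3, 224, 224)          # stand-in for the output of ToTensor()
restored = denormalize(normalize(img))
print(torch.allclose(restored, img, atol=1e-5))
```

A pretrained network has only ever seen inputs on this scale, which is why skipping or changing the normalization hurts accuracy.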

First, load the dataset, whose structure is `class_name/id.jpg`:

In [None]:
dataset = ImageFolder('PetImages', transform=transform)
len(dataset)

25000

To go faster, we are going to use fewer than 25k images, but if you have time, use all the images: you will get better performance.

In [None]:
dataset, _ = torch.utils.data.random_split(dataset, [2000, 23000])
len(dataset)

2000

Let's split our dataset into train and test sets (we omit the validation set for simplicity here, but in a real-life project it is essential!):

In [None]:
train_dataset, test_dataset = torch.utils.data.random_split(dataset, [1500, 500])
len(train_dataset), len(test_dataset)

(1500, 500)
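
For reference, a three-way split with a validation set could look like the sketch below. The split sizes (1200/300/500), the seed, and the toy `TensorDataset` standing in for the image dataset are all assumptions made for illustration. Passing a seeded `generator` to `random_split` makes the split reproducible across runs.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Toy stand-in for the 2000-image subset: random inputs and binary labels.
toy_dataset = TensorDataset(torch.randn(2000, 8), torch.randint(0, 2, (2000,)))

# A seeded generator gives the same split every time the cell is run.
generator = torch.Generator().manual_seed(0)
train_set, val_set, test_set = random_split(
  toy_dataset, [1200, 300, 500], generator=generator
)
print(len(train_set), len(val_set), len(test_set))
```

The validation set is then used to choose hyperparameters (learning rate, number of epochs), and the test set is only evaluated once at the end.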

Now let's create our loaders:

In [None]:
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32)

print(f"Nb batches in train: {len(train_loader)}")
print(f"Nb batches in test: {len(test_loader)}")

Nb batches in train: 47
Nb batches in test: 16


Now, we create a ResNet18 model with its weights pretrained on ImageNet:

In [None]:
net = torchvision.models.resnet18(pretrained=True).cuda()
net

ResNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
  

In [None]:
nb_params = sum(p.numel() for p in net.parameters())
print('Nb of parameters', nb_params)

Nb of parameters 11689512


Notice that the network ends with a global average pooling (`AdaptiveAvgPool2d`) and a classifier (`fc`).

We are going to extract the features and dump them on disk. We therefore don't need the last `fc` layer, so we replace it with an identity:

In [None]:
net.fc = nn.Identity()

In [None]:
def extract_features(loader, net):
  pass # TODO
  # Return a tensor of features, and a tensor of labels

In [None]:
# Execute this cell to see the solution, but try to do it by yourself before!
!wget https://deepcourse-epita.netlify.app/code/cnn/extract.py
%pycat extract.py
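
If you get stuck, here is one possible way to write such a function. It is a sketch, not necessarily the course solution downloaded above: the name `extract_features_sketch` and the `device` argument are assumptions made so that the example also runs on CPU (pass `device="cuda"` to match the rest of the notebook). The toy network and loader only exist to check the output shapes.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

def extract_features_sketch(loader, net, device="cpu"):
  # Eval mode so that BatchNorm uses its running statistics.
  net.eval()
  all_features, all_labels = [], []
  with torch.no_grad():  # no gradients needed, so no activations are stored
    for x, y in loader:
      all_features.append(net(x.to(device)).cpu())
      all_labels.append(y)
  # Concatenate the per-batch results into two tensors.
  return torch.cat(all_features), torch.cat(all_labels)

# Toy check: a tiny "network" that outputs 512 features per input.
toy_net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 4 * 4, 512))
toy_loader = DataLoader(
  TensorDataset(torch.randn(10, 3, 4, 4), torch.arange(10) % 2), batch_size=4
)
feats, labels = extract_features_sketch(toy_loader, toy_net)
print(feats.shape, labels.shape)
```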

In [None]:
train_features, train_labels = extract_features(train_loader, net)
print(train_features.shape, train_labels.shape)

torch.Size([1500, 512]) torch.Size([1500])


In [None]:
test_features, test_labels = extract_features(test_loader, net)
print(test_features.shape, test_labels.shape)

torch.Size([500, 512]) torch.Size([500])
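
The cells above keep the features in memory. If you want to actually dump them on disk and reload them in a later session, `torch.save` and `torch.load` are enough. The sketch below uses random tensors and a temporary directory; the file name `train_features.pt` is an arbitrary choice.

```python
import os
import tempfile

import torch

# Stand-ins for the extracted features and labels.
features = torch.randn(6, 512)
labels = torch.tensor([0, 1, 0, 1, 1, 0])

# Save both tensors in a single file.
path = os.path.join(tempfile.mkdtemp(), "train_features.pt")
torch.save({"features": features, "labels": labels}, path)

# Later (possibly in another session), load them back.
loaded = torch.load(path)
print(torch.equal(loaded["features"], features))
```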


Now, let's define a dataset and loader for features instead of images:

In [None]:
class FeaturesDataset(torch.utils.data.Dataset):
  def __init__(self, features, labels):
    self.features = features
    self.labels = labels

  def __len__(self):
    return len(self.features)

  def __getitem__(self, index):
    return self.features[index], self.labels[index]


train_feat_loader = DataLoader(FeaturesDataset(train_features, train_labels), batch_size=64)
test_feat_loader = DataLoader(FeaturesDataset(test_features, test_labels), batch_size=64)

len(train_feat_loader), len(test_feat_loader)

(24, 8)

In [None]:
def eval_model(net, loader, loss_fn):
  net.eval()
  acc, loss = 0., 0.
  c = 0
  for x, y in loader:
    with torch.no_grad():
      # No need to compute gradient here thus we avoid storing intermediary activations
      logits = net(x.cuda()).cpu()

    loss += loss_fn(logits, y).item()
    preds = logits.argmax(dim=1)
    acc += (preds.numpy() == y.numpy()).sum()
    c += len(x)

  acc /= c
  loss /= len(loader)
  net.train()
  return round(100 * acc, 2), round(loss, 5)

And now train a simple linear classifier:

In [None]:
classifier = nn.Linear(512, 2).cuda()
optimizer = torch.optim.SGD(classifier.parameters(), lr=0.01)
epochs = 10

for epoch in range(epochs):
  for features, labels in train_feat_loader:
    optimizer.zero_grad()

    features, labels = features.cuda(), labels.cuda()
    logits = classifier(features)
    loss = F.cross_entropy(logits, labels)
    
    loss.backward()
    optimizer.step()

  train_acc, train_loss = eval_model(classifier, train_feat_loader, F.cross_entropy)
  test_acc, test_loss = eval_model(classifier, test_feat_loader, F.cross_entropy)

  print(f"Epoch {epoch}, train: {train_acc}%/{train_loss}, test: {test_acc}%/{test_loss}")

Epoch 0, train: 94.73%/0.21507, test: 94.4%/0.2187
Epoch 1, train: 95.27%/0.16476, test: 94.6%/0.16754
Epoch 2, train: 95.67%/0.14195, test: 94.4%/0.14547
Epoch 3, train: 96.07%/0.12816, test: 95.0%/0.13279
Epoch 4, train: 96.4%/0.11853, test: 95.0%/0.1244
Epoch 5, train: 96.67%/0.11123, test: 95.0%/0.11836
Epoch 6, train: 96.87%/0.10536, test: 95.4%/0.11376
Epoch 7, train: 96.93%/0.10045, test: 95.6%/0.11011
Epoch 8, train: 97.0%/0.09624, test: 95.8%/0.10714
Epoch 9, train: 97.07%/0.09255, test: 95.8%/0.10465


Performance is already great, even though we only trained the final classifier!

That's the magic of transfer learning. Try training from scratch, even on a dataset as simple as Cat vs Dog, and you will see that it is much harder.

Moreover, ImageNet has 1000 classes, and more than a hundred of them are breeds of dogs and cats. The extracted features are therefore already well suited to this task.

# Finetuning the CNN

But we can do better. Our previous approach has two drawbacks:
- because we extract the features only once, we cannot apply a different image augmentation at each iteration
- we don't tune the CNN at all, although doing so would most likely improve performance

We are now going to tune part of the CNN. Ideally, tuning the whole CNN should give the best performance, but training would also be slower. So for now we will only tune the last block of our ResNet18. If you have more time, feel free to try tuning the whole network.

Let's restore the classifier of our ResNet, but this time with only 2 output classes instead of 1000:

In [None]:
net.fc = nn.Linear(512, 2).cuda()

To know what to freeze, we have to look at the original codebase here: https://github.com/pytorch/vision/blob/master/torchvision/models/resnet.py#L144

Using the `freeze()` function provided below, selectively freeze part of the network so that only the final block (called `layer4`) and the classifier (`fc`) are trained:

In [None]:
def freeze(module):
  for param in module.parameters():
    param.requires_grad = False
  # Important because some layers, like BatchNorm, behave differently in train vs test mode:
  module.eval()


# TODO, freeze part of the network

In [None]:
# Execute this cell to see the solution, but try to do it by yourself before!
!wget https://deepcourse-epita.netlify.app/code/cnn/freeze.py
%pycat freeze.py

In [None]:
optimizer = torch.optim.SGD( # only provide learnable parameters
    filter(lambda x: x.requires_grad, net.parameters()), lr=0.01
)

epochs = 10

for epoch in range(epochs):
  for features, labels in train_loader:
    optimizer.zero_grad()

    features, labels = features.cuda(), labels.cuda()
    logits = net(features)
    loss = F.cross_entropy(logits, labels)
    
    loss.backward()
    optimizer.step()

  train_acc, train_loss = eval_model(net, train_loader, F.cross_entropy)
  test_acc, test_loss = eval_model(net, test_loader, F.cross_entropy)

  print(f"Epoch {epoch}, train: {train_acc}%/{train_loss}, test: {test_acc}%/{test_loss}")

Epoch 0, train: 98.87%/0.04115, test: 98.6%/0.04536
Epoch 1, train: 99.0%/0.05112, test: 97.6%/0.06854
Epoch 2, train: 99.4%/0.03492, test: 98.0%/0.05846
Epoch 3, train: 99.8%/0.02415, test: 98.0%/0.04936
Epoch 4, train: 99.87%/0.01804, test: 98.2%/0.04476
Epoch 5, train: 99.87%/0.01358, test: 98.0%/0.04325
Epoch 6, train: 100.0%/0.00985, test: 98.2%/0.04368
Epoch 7, train: 100.0%/0.00824, test: 98.2%/0.04358
Epoch 8, train: 100.0%/0.00672, test: 98.0%/0.04587
Epoch 9, train: 100.0%/0.00533, test: 98.6%/0.03897


Results should be better, close to 99% on the test set. Of course, the problem here is easy, so unfreezing more of the CNN won't provide large gains.

But in general, it's usually better to finetune the whole CNN. In order not to "lose" the original features learned on ImageNet, we usually finetune with a lower learning rate than the one used for the original training.
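
One common way to do this in PyTorch is to give the optimizer several parameter groups, with a smaller learning rate for the pretrained layers than for the new classifier. The sketch below uses a tiny stand-in model, and the learning rates (`1e-3` and `1e-2`) are illustrative assumptions, not tuned values.

```python
import torch
from torch import nn

# Tiny stand-in: a pretrained "backbone" and a freshly initialized "head".
model = nn.ModuleDict({
  "backbone": nn.Linear(16, 8),
  "head": nn.Linear(8, 2),
})

optimizer = torch.optim.SGD(
  [
    # Pretrained layers: small learning rate to preserve their features.
    {"params": model["backbone"].parameters(), "lr": 1e-3},
    # New classifier: falls back to the default learning rate given below.
    {"params": model["head"].parameters()},
  ],
  lr=1e-2,
)
print([group["lr"] for group in optimizer.param_groups])
```

With the ResNet18 of this notebook, the two groups could for instance be the parameters of `net.layer4` and those of `net.fc`.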

Finally, transfer learning depends heavily on the similarity between the source domain (here ImageNet) and the target domain (here cat vs dog). If you wanted to apply transfer learning to medical imagery, for example, the gain would likely be much lower.
