Cross-Pollination is all you need: Transfer learning in AI

Generated using DALL·E

In our previous blog, we talked about the theory behind transfer learning. We discussed the recommended method: freeze the layers initialized with pretrained weights for the first few epochs, and train only the new or modified layers during that time. But is this really the best approach?

That’s what we will try to find out in this post. We will share our experiments, our methodology, and the results.

If you want to cut to the chase and find the best method for transfer learning, just scroll down to the last section of this post. But if you enjoy following the journey, read on!

The Dataset

For our experiments, we used the Quick, Draw! doodle classification dataset; specifically, the 28x28 numpy bitmap images. These images came from 10 categories: tractor, toothpaste, toothbrush, skull, spider, toilet, mountain, sword, marker, and sheep. The training and test datasets consisted of 10,000 images per category. Following are some sample images.
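As a rough illustration of how such a split could be assembled, here is a sketch assuming the per-category numpy bitmap files have already been loaded as flat `(N, 784)` arrays (the `build_split` helper and its parameters are our own illustrative names, not code from the post):

```python
import numpy as np

# The 10 categories used in the post. The Quick, Draw! numpy bitmaps ship
# as one array per category, each row a flattened 28x28 grayscale image.
CATEGORIES = ["tractor", "toothpaste", "toothbrush", "skull", "spider",
              "toilet", "mountain", "sword", "marker", "sheep"]

def build_split(arrays, per_class=10_000):
    """Stack per-category arrays into one (images, labels) split.

    `arrays` is a list of (N, 784) uint8 arrays, one per category, in the
    same order as CATEGORIES. Pixels are rescaled to [0, 1].
    """
    images, labels = [], []
    for label, arr in enumerate(arrays):
        subset = arr[:per_class].reshape(-1, 28, 28).astype(np.float32) / 255.0
        images.append(subset)
        labels.append(np.full(len(subset), label, dtype=np.int64))
    return np.concatenate(images), np.concatenate(labels)
```

The label index simply follows the category's position in the list, which keeps the train and test splits consistent.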

Apart from observing model performance on the clean test dataset, we also added synthetic noise to the data to measure the out-of-the-box noise robustness of each training methodology. Following are some noisy images.

We used EfficientNet-B0, a model built to solve the ImageNet classification task. We devised various strategies to build models using transfer learning, alongside a basic CNN, to see how well each transfer learning method performed compared to the others and against the basic CNN model.

The Training Approaches

The training approaches we used were:

  • Approach 1: A basic CNN built from scratch.
  • Approach 2: The EfficientNet-B0 architecture used without porting weights at all. This tells us exactly how powerful the architecture itself is, without the knowledge gained from solving the ImageNet task.
  • Approach 3: Transfer learning the recommended way: initialize the unmodified layers with pretrained weights, freeze those weights for a few epochs while training only the new layers, then unfreeze and train the whole network. The usual suggestion is to keep them frozen until the model converges; since we re-initialized only 2 layers and our dataset has 100,000 images, we froze them for 2 epochs.
  • Approach 4: Transfer learning with gradient clipping, where we limit to 1.0 the norm of the gradient that backpropagates through the network. This stops the network from completely forgetting its ImageNet knowledge due to the large gradients generated by the randomly initialized layers in the first few steps. This is our alternative to the recommended way.
  • Approach 5: Transfer learning in a very unrecommended way, i.e. initialize with pretrained weights and modify them right from the beginning, with no gradient clipping and no weight freezing. This approach is theoretically unrecommended because the huge gradients from the randomly initialized layers cause the network to forget its pretrained knowledge, moving it too far, too quickly from the good optimum it was in.
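The clipping step at the heart of approach 4 can be sketched in plain numpy. This mirrors what `torch.nn.utils.clip_grad_norm_` does in PyTorch: if the combined gradient norm exceeds the threshold, every gradient is scaled down proportionally (the numpy implementation below is our own illustrative version, not the post's code):

```python
import numpy as np

def clip_global_norm(grads, max_norm=1.0):
    """Rescale a list of gradient arrays so their combined L2 norm
    is at most `max_norm`; gradients under the threshold pass through."""
    total = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    if total <= max_norm:
        return grads
    scale = max_norm / total
    return [g * scale for g in grads]
```

Because the scaling is global, the relative direction of the update is preserved; only its magnitude is capped, which is what protects the pretrained weights from the large early gradients of the new layers.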

Here is a validation accuracy summary on the clean and noisy validation images.


Here are our inferences from the above results:

  • The basic CNN (approach 1) beats EfficientNet when pretrained weights are not used (approach 2).
  • Once EfficientNet is initialized with pretrained weights (approaches 3, 4 & 5), accuracy is much better than both the basic CNN (approach 1) and EfficientNet without pretrained weights (approach 2). This shows how important the knowledge derived from solving the ImageNet task is for utilizing the full potential of the EfficientNet architecture.
  • It also shows that AI models built to solve a much harder task can solve a relatively simpler doodle classification task. Does that mean EfficientNet is the best and most efficient way to solve doodle classification? No, seeing that EfficientNet has ~4.65x the parameters and still beats the basic CNN by only 2%. A bigger basic CNN could be the way forward if you want to build the network that solves the doodle classification task most efficiently. And that’s what we did for Thursday’s Doodle Race game.
  • The results also show that the various transfer learning approaches (3, 4 & 5) are very close, with approach 4 narrowly winning. This contrasts with what the theory suggests: the recommended way should win quickly and convincingly.
  • A very curious thing unfolds when you look at the noisy-data accuracies. The best out-of-the-box noise robustness comes from the model built using our alternative to the theoretically recommended way (approach 4).
  • What is shocking here is the noise robustness of the model trained the recommended way (approach 3). It is much worse than the unrecommended way (approach 5), our alternative (approach 4), and even the model built without knowledge derived from the ImageNet task (approach 2). This shows how much emphasis the recommended way puts on solving for the exact dataset; it comes crashing down when the dataset is slightly modified, at least that's what the above results indicate.

So, can we conclusively state from the above results that the unrecommended way (approach 5) works just as well as the recommended way? No, because an argument can be made that for a simple 10-class doodle classification task, a 100,000-image fine-tuning dataset is very big, so the huge, diverse data itself may contain enough information to negate the knowledge forgetting that happens with the unrecommended (approach 5) strategy. To check this, we repeated the same experiments after reducing the dataset from 10,000 to 100 images per category.

In this run too, we see that our alternative to the recommended way (approach 4) beats the theoretical recommended way (approach 3) convincingly and is very close to beating the unrecommended way (approach 5).

Had the recommended way won significantly in this low-data setting, we could have concluded that the unrecommended way forgot important prior knowledge, and hence the recommended way won.

Now, this should not stand as a testament to always completely ignoring the theory and the recommended way. Perhaps, on this dataset, the pretrained model’s weights happened to be arranged such that knowledge forgetting was not significant even when some layers were initialized randomly. Also, the number of randomly initialized layers/neurons was much smaller than the number of unmodified, pretrained layers.

So, what have all these experiments been about?

Once again, these experiments have been about verifying the theoretically recommended way because we saw in our previous blog that it hurts performance in a few cases. These experiments have also been about, perhaps, finding a better alternative to the theoretically recommended way.

So, which way is the best way for transfer learning?

AI is a sum of many variables and not something we fully understand. The best bet, according to us and these experiments, is to clip the gradient and immediately start training all the layers, rather than freezing them for a few steps and training only the randomly initialized layers.
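That recipe can be sketched as a single PyTorch training step. We use a tiny stand-in model here instead of EfficientNet-B0, and the layer sizes and learning rate are illustrative assumptions, not the post's actual configuration:

```python
import torch
import torch.nn as nn

# Tiny stand-in model; in the post this would be EfficientNet-B0 with a
# re-initialized classifier head.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 64), nn.ReLU(),
                      nn.Linear(64, 10))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

images = torch.rand(32, 1, 28, 28)       # dummy batch of doodles
labels = torch.randint(0, 10, (32,))     # dummy labels

# Approach 4: train *all* layers from step one, no freezing, but clip the
# global gradient norm to 1.0 so the fresh head's large early gradients
# cannot blow away the pretrained weights.
optimizer.zero_grad()
loss = loss_fn(model(images), labels)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```

The only difference from a plain fine-tuning loop is the one `clip_grad_norm_` call between `backward()` and `step()`, which makes this an easy default to try first.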

This conclusion is based purely on our own experiments and a few other use cases (this and this), which found that freezing layers actually hurts performance with a large enough dataset.

Of course, if you can afford it, it’s better to experiment with all the above methods and more, but we would still start with the gradient clipping method.

Sunil Khedar

CTO & Co-founder