In our daily lives, we routinely transfer knowledge gained from one activity or task to a related one, often without noticing. When we encounter a new problem, we first recognize it and then apply relevant past experience, which makes the task easier to complete. Deep learning follows the same idea under the name transfer learning, which allows already trained models to be reused for related applications. In this article, we discuss transfer learning as it is commonly used in deep learning, the popular models in this domain, and a critical comparison of their features.
Table of Contents
- Understanding the Transfer Learning
- What is Transfer Learning?
- Basic Applications of Transfer Learning
- When can it fail?
- ImageNet Dataset
- Commonly used Transfer Learning Models
- VGG Family
- Feature Comparison of Transfer Learning Models
- Comparison of Models with respect to Classification Accuracy
Understanding the Transfer Learning
What is Transfer Learning?
Transfer learning is a method of reusing the feature representations learned by a pre-trained model so that we do not need to train a new model from scratch. A pre-trained model is usually trained on a huge dataset such as ImageNet, and the weights obtained from it can be reused in your own neural network for a related application. The resulting model can be used directly for predictions on a related new task, or it can serve as the starting point for further training. This approach not only reduces training time but can also lower the generalization error.
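The idea of reusing learned weights and training only a new part of the model can be illustrated with a toy sketch in plain Python. Everything here is an assumption made for illustration: `FROZEN_WEIGHTS` is a hypothetical stand-in for a pre-trained layer, and the head is a simple logistic regression rather than a real network. In practice you would load a real pre-trained model through a framework instead.

```python
import math

# Hypothetical "pre-trained" layer: fixed weights standing in for features
# learned on a large source dataset. They are never updated below.
FROZEN_WEIGHTS = [[1.0, -1.0], [0.5, 0.5]]

def extract_features(x):
    """Frozen feature extractor: a linear layer followed by ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)))
            for row in FROZEN_WEIGHTS]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_head(samples, labels, epochs=200, lr=0.5):
    """Train only the new classification head (logistic regression)."""
    head_w = [0.0, 0.0]
    head_b = 0.0
    # The base is frozen, so its features can be computed once up front.
    features = [extract_features(x) for x in samples]
    for _ in range(epochs):
        for f, y in zip(features, labels):
            p = sigmoid(sum(w * fi for w, fi in zip(head_w, f)) + head_b)
            error = p - y  # gradient of the log loss w.r.t. the logit
            head_w = [w - lr * error * fi for w, fi in zip(head_w, f)]
            head_b -= lr * error
    return head_w, head_b

def predict(x, head_w, head_b):
    f = extract_features(x)
    z = sum(w * fi for w, fi in zip(head_w, f)) + head_b
    return 1 if sigmoid(z) >= 0.5 else 0

# Toy target task: label is 1 when the first input exceeds the second.
samples = [(2, 0), (3, 1), (0, 2), (1, 3), (4, 1), (1, 4)]
labels = [1, 1, 0, 0, 1, 0]
head_w, head_b = train_head(samples, labels)
predictions = [predict(x, head_w, head_b) for x in samples]
```

Only `head_w` and `head_b` change during training; the frozen weights stay exactly as they were, which is what keeps training cheap.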
Basic Applications of Transfer Learning
A common application of transfer learning is using models pre-trained on ImageNet for real-world image classification problems; this works because such models have already been trained on 1000 classes. For example, if you are developing software that predicts the kind of flower in a photo, you can use a pre-trained model as the starting point. Similarly, text classification requires word representations in some vector space, but training such a vector space from scratch is challenging and time-consuming. Here you can use pre-trained word embeddings such as GloVe.
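As a rough sketch of how pre-trained word embeddings are reused, the snippet below averages word vectors into a sentence vector and compares sentences by cosine similarity. The `EMBEDDINGS` table is a hypothetical three-dimensional stand-in; real GloVe vectors are much larger and are loaded from a downloaded file.

```python
import math

# Hypothetical 3-dimensional vectors standing in for pre-trained embeddings.
EMBEDDINGS = {
    "good":  [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "bad":   [-0.9, 0.1, 0.0],
    "movie": [0.0, 0.9, 0.3],
}
DIM = 3

def sentence_vector(tokens):
    """Average the embeddings of known tokens; unknown tokens are skipped."""
    known = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not known:
        return [0.0] * DIM
    return [sum(column) / len(known) for column in zip(*known)]

def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

good_movie = sentence_vector(["good", "movie"])
great_movie = sentence_vector(["great", "movie"])
bad_movie = sentence_vector(["bad", "movie"])
similar = cosine(good_movie, great_movie)
dissimilar = cosine(good_movie, bad_movie)
```

With these toy vectors, "good movie" ends up much closer to "great movie" than to "bad movie", and a classifier trained on such sentence vectors never has to learn the word representations itself.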
When can it fail?
Transfer learning will not work well when the features learned by the pre-trained layers are not sufficient to differentiate the classes in your problem. When the new dataset is not similar to the one the model was originally trained on, the features transfer poorly. In conventional model building, we sometimes remove layers to improve accuracy; if you do the same with a pre-trained model, you reduce the number of trainable parameters, which can result in overfitting. Finding the right number of layers to remove without overfitting is a time-consuming process.
In the rest of this article, we will take a look at some popular pre-trained architectures, including the VGG family, ResNet, Inception, and Xception. All these models are trained on the ImageNet dataset and can be realised with frameworks like TensorFlow, Keras, and PyTorch.
ImageNet Dataset
ImageNet is a large dataset of annotated photographs intended for computer vision research. It consists of about 14 million images in more than 21,000 groups or classes, and more than 1 million of the images have bounding box annotations. When we hear about ImageNet in the context of deep learning, it usually refers to the ImageNet Large Scale Visual Recognition Challenge, or ILSVRC for short. The goal of this challenge is to train a model that can correctly classify an image into one of 1000 separate object categories.
When it comes to image classification, the ImageNet challenge acts as the benchmark standard for computer vision algorithms. The leaderboard of the challenge is mostly dominated by CNNs and other deep learning techniques.
Commonly used Transfer Learning Models
Now, we will discuss the popular and commonly used models in transfer learning. Most of these models that we will discuss further are used in the task of image classification.
Inception
The Inception microarchitecture was introduced by Szegedy et al. in 2014 in their paper Going Deeper with Convolutions. The complete module with dimension reduction is shown below.
The goal of this module is to act as a multi-level feature extractor by computing 1×1, 3×3, and 5×5 convolutions within the same module of the network. The outputs of these filters are then stacked along the channel dimension before being fed into the next layer in the network.
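The effect of dimension reduction and channel stacking can be checked with simple arithmetic. The sketch below counts multiplications for a 5×5 branch with and without a 1×1 reduction layer, and adds up the channels of the concatenated output. The sizes (a 28×28×192 input and the per-branch filter counts, including a pooling branch) are assumptions chosen for illustration, not values given in this article.

```python
def conv_multiplications(height, width, in_channels, out_channels, kernel):
    """Multiplications for a stride-1 convolution that preserves spatial size."""
    return height * width * out_channels * kernel * kernel * in_channels

# Assumed input feature map size for this illustration.
H, W, C_IN = 28, 28, 192

# 5x5 branch applied directly to all 192 input channels.
direct_5x5 = conv_multiplications(H, W, C_IN, 32, 5)

# Same branch, but a 1x1 convolution first reduces 192 channels to 16.
reduced_5x5 = (conv_multiplications(H, W, C_IN, 16, 1)
               + conv_multiplications(H, W, 16, 32, 5))

# Branch outputs are stacked along the channel dimension, so the
# output depth is the sum of the (assumed) per-branch filter counts.
branch_channels = {"1x1": 64, "3x3": 128, "5x5": 32, "pool_projection": 32}
output_channels = sum(branch_channels.values())

print(f"5x5 without reduction: {direct_5x5:,} multiplications")
print(f"5x5 with 1x1 reduction: {reduced_5x5:,} multiplications")
print(f"Concatenated output channels: {output_channels}")
```

Under these assumptions, the reduced branch needs roughly a tenth of the multiplications of the direct one, which is why the 1×1 layers matter for keeping the module affordable.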
The architecture of this model includes:
- 1×1 convolution with 128 filters for dimension reduction and rectified linear activation
- Fully connected layer with 1024 units and a rectified linear activation
- Dropout layer with 70% ratio
- Linear layer with softmax loss as the classifier
Originally this architecture was called GoogLeNet; subsequent versions have simply been called Inception vN, where N refers to the version of the model.
Inception offers the following advantages:
- Multiple convolution filters applied to the same input provide multi-level feature extraction, while dimension reduction keeps the computational cost low.
- High performance gain on CNNs.
- Trains faster than the VGG family.
- The model is considerably smaller than VGG: VGG weights can take up to about 500 MB, whereas Inception takes about 100 MB.
It has no immediate disadvantages, but further improvements to the architecture have been proposed, such as increasing the limit of divergence, as was done in the subsequent architecture, i.e., Xception.
Xception
Xception was proposed by François Chollet, the creator and maintainer of the Keras library. It is an extension of the Inception architecture that replaces the standard Inception modules with depthwise separable convolutions.
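To see why depthwise separable convolutions are attractive, it helps to count weights. The sketch below compares a standard 3×3 convolution with a depthwise 3×3 convolution followed by a 1×1 pointwise convolution. Biases are ignored, and the channel count of 728 is an assumption picked for illustration.

```python
def standard_conv_params(kernel, in_channels, out_channels):
    """Weights in a standard convolution (biases ignored)."""
    return kernel * kernel * in_channels * out_channels

def separable_conv_params(kernel, in_channels, out_channels):
    """Weights in a depthwise separable convolution (biases ignored)."""
    depthwise = kernel * kernel * in_channels   # one spatial filter per channel
    pointwise = in_channels * out_channels      # 1x1 convolution mixes channels
    return depthwise + pointwise

# Assumed layer shape for this illustration.
KERNEL, CHANNELS = 3, 728

standard = standard_conv_params(KERNEL, CHANNELS, CHANNELS)
separable = separable_conv_params(KERNEL, CHANNELS, CHANNELS)

print(f"standard 3x3:  {standard:,} weights")
print(f"separable 3x3: {separable:,} weights")
print(f"reduction factor: {standard / separable:.1f}x")
```

The separable version splits spatial filtering from channel mixing, so for wide layers its weight count approaches one ninth of the standard 3×3 convolution.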
As the architecture below shows, Xception is a linear stack of depthwise separable convolution layers with residual connections. This makes the architecture very easy to define and modify; it takes only about 40 lines of code with a high-level API such as Keras or TensorFlow.
As you can see below, the data first goes through the entry flow, then through the middle flow, which is repeated eight times, and finally through the exit flow. All convolutional layers are followed by batch normalization.
The linear stack of layers makes training faster than Inception. Xception has roughly the same number of parameters as Inception, yet it slightly outperforms Inception on the ImageNet dataset and outperforms it by a wide margin on the JFT dataset (Google’s internal dataset). Performing better with almost the same number of parameters is a key advantage of this model.
VGG Family
The VGG model was first proposed by Simonyan and Zisserman from the Visual Geometry Group (VGG) at the University of Oxford in their paper Very Deep Convolutional Networks for Large-Scale Image Recognition. The network is characterized by its simplicity: it uses only 3×3 convolutional layers stacked on top of each other in increasing depth, with the volume size reduced by max-pooling layers. Two fully connected layers, each with 4096 nodes, are then followed by a softmax layer.
During training, the input to the model is a fixed-size RGB image, and the only preprocessing is subtracting the mean RGB value, computed on the training set, from each pixel. The image is then passed through the stack of convolutional layers, which use filters with a very small receptive field of 3×3, the smallest size that can capture the notion of left/right, up/down, and center. Spatial pooling is carried out by five max-pooling layers, which follow some of the convolutional layers and operate over a 2×2 pixel window with a stride of 2. The final layer is the same for all architectures, i.e., softmax.
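The preprocessing and pooling steps described above can be sketched in plain Python. Images are represented here as nested lists of `[R, G, B]` pixels, the helper names are hypothetical, and the 224-pixel input size is an assumption for illustration.

```python
def mean_rgb(images):
    """Per-channel mean over every pixel of every training image."""
    pixels = [px for image in images for row in image for px in row]
    return [sum(channel) / len(pixels) for channel in zip(*pixels)]

def subtract_mean(image, mean):
    """Subtract the training-set mean RGB value from each pixel."""
    return [[[value - m for value, m in zip(px, mean)] for px in row]
            for row in image]

def size_after_pooling(size, num_pools, window=2, stride=2):
    """Spatial size after repeated max-pooling without padding."""
    for _ in range(num_pools):
        size = (size - window) // stride + 1
    return size

# Two tiny 1x2 "training images" used only to demonstrate the arithmetic.
train_images = [
    [[[10, 20, 30], [30, 40, 50]]],
    [[[20, 30, 40], [40, 50, 60]]],
]
mean = mean_rgb(train_images)
centered = subtract_mean(train_images[0], mean)

# Five 2x2 max-pooling layers with stride 2 on an assumed 224-pixel input.
final_size = size_after_pooling(224, num_pools=5)
```

Each pooling layer halves the height and width, so five of them reduce a 224×224 input to a 7×7 feature map before the fully connected layers.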
As the depth of the network increases, training becomes slow and the network architectures themselves become quite large.
The architectures of all the VGG variants are shown below with their parameters:
The main advantages of VGG are:
- Easy to understand and explain
- Works well for classical problems such as cats vs. dogs classification, where it achieves a baseline of about 80%
Its main disadvantage is:
- The large number of weight parameters results in high inference time
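To get a feel for where the weights of a VGG network come from, the sketch below counts parameters layer by layer. It assumes the 16-layer configuration (thirteen 3×3 convolutions in five blocks, then fully connected layers of 4096, 4096, and 1000 units) and a 224×224 RGB input; these specifics are assumptions for the illustration.

```python
# Assumed VGG16 layout: (number of 3x3 conv layers, filters) per block.
VGG16_BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]
FC_UNITS = [4096, 4096, 1000]

def count_vgg16_params(input_size=224, input_channels=3):
    """Return (convolutional parameters, fully connected parameters)."""
    conv_params = 0
    channels = input_channels
    size = input_size
    for num_convs, filters in VGG16_BLOCKS:
        for _ in range(num_convs):
            conv_params += 3 * 3 * channels * filters + filters  # weights + biases
            channels = filters
        size //= 2  # each block ends with 2x2 max-pooling, stride 2

    fc_params = 0
    units_in = size * size * channels  # flattened final feature map
    for units_out in FC_UNITS:
        fc_params += units_in * units_out + units_out
        units_in = units_out
    return conv_params, fc_params

conv_params, fc_params = count_vgg16_params()
total_params = conv_params + fc_params
print(f"convolutional layers:   {conv_params:,}")
print(f"fully connected layers: {fc_params:,}")
print(f"total:                  {total_params:,}")
```

Under these assumptions, the fully connected layers hold the large majority of the roughly 138 million parameters, which explains both the model size and the slow inference.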
ResNet
Unlike traditional sequential networks such as AlexNet, OverFeat, and VGG, ResNet is a form of exotic architecture that relies on microarchitecture modules, also known as network-in-network architectures. Microarchitecture refers to the set of building blocks used to construct the network.
First introduced by He et al. in 2015, ResNet has become a seminal work, demonstrating that extremely deep networks can be trained using standard SGD through the use of residual modules. Further accuracy can be obtained by updating the residual module to use identity mappings, as demonstrated in a follow-up paper.
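A residual module computes some transformation F(x) and adds the input back, so the output is F(x) + x. The following is a minimal sketch of that idea on plain Python lists, with two small linear layers standing in for the convolutions of a real block; the helper names and weights are hypothetical.

```python
def linear(vector, weights):
    """Matrix-vector product; each row of weights produces one output."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

def relu(vector):
    return [max(0.0, x) for x in vector]

def residual_block(x, w1, w2):
    """y = F(x) + x, where F is linear -> ReLU -> linear."""
    fx = linear(relu(linear(x, w1)), w2)
    return [f + xi for f, xi in zip(fx, x)]  # skip connection adds the input

x = [1.0, -2.0]
zeros = [[0.0, 0.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
half = [[0.5, 0.0], [0.0, 0.5]]

# With all-zero weights the block simply passes its input through.
passthrough = residual_block(x, zeros, zeros)

# With non-zero weights the block adds a learned correction to the input.
corrected = residual_block(x, identity, half)
```

Because the block falls back to the identity when its weights are zero, stacking many such blocks does not have to make the network worse, which is part of why very deep ResNets remain trainable.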
Even though ResNet is much deeper than the VGG family, its weights take up less space because it uses global average pooling rather than fully connected layers. This brings the model size of its most popular variant, ResNet50, down to about 100 MB.
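The saving from global average pooling can be made concrete with a small sketch. It first shows what the operation does to a tiny feature map, then compares the parameters of a layer that follows a flattened feature map with one that follows a pooled feature map. The shapes used (7×7×512 into 4096 units, and 2048 pooled channels into 1000 classes) are assumptions for illustration.

```python
def global_average_pool(feature_map):
    """Reduce an H x W x C feature map to C values by averaging each channel."""
    pixels = [px for row in feature_map for px in row]
    return [sum(channel) / len(pixels) for channel in zip(*pixels)]

def dense_params(units_in, units_out):
    """Weights plus biases of a fully connected layer."""
    return units_in * units_out + units_out

# A tiny 2x2 feature map with 2 channels.
feature_map = [[[1, 10], [3, 20]],
               [[5, 30], [7, 40]]]
pooled = global_average_pool(feature_map)

# Assumed VGG-style head: flatten 7x7x512, then a 4096-unit dense layer.
flatten_head = dense_params(7 * 7 * 512, 4096)

# Assumed ResNet-style head: pool to 2048 values, then a 1000-class layer.
pooled_head = dense_params(2048, 1000)
```

Pooling collapses each channel to a single number before the classifier, so the final layer needs about two million parameters here instead of more than a hundred million.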
Below you can see the original architecture of the most basic ResNet model, i.e., ResNet-12.
(Architecture of ResNet-12, a basic model. Image Source)
ResNet offers the following advantages:
- Helps to tackle the vanishing gradient problem
- Accelerates training
- Gives higher accuracy, especially for classification problems
- Each block learns the difference between its output and its input, so if a learned feature is not useful for the final decision, the weights for that feature can become zero
Its disadvantages are:
- Increased architectural complexity
- Adding skip connections requires taking the dimensionality between layers into account
Feature Comparison of Different Transfer Learning Models
Let us summarize these models by comparing their important features against each other. This comparison is presented in the table below.
|Model||No. of Parameters||Complexity||Speed||Unique Feature|
|Inception||23.62 million||Low||High||Replaces large filters with smaller ones|
|Xception||22.85 million||Low||High||Depthwise convolution followed by pointwise convolution|
|VGG||138 million||High||Low||Uses small kernels|
|ResNet||23 million||Low||High||Identity mapping based on skip connections|
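The parameter counts in the table can be turned into rough weight-file sizes. The sketch below assumes every parameter is stored as a 32-bit float (4 bytes) and ignores any file format overhead, so the results are estimates only.

```python
# Parameter counts (in millions) taken from the table above.
PARAMS_MILLIONS = {
    "Inception": 23.62,
    "Xception": 22.85,
    "VGG": 138.0,
    "ResNet": 23.0,
}
BYTES_PER_PARAM = 4  # assumption: weights stored as 32-bit floats

def approx_size_mb(params_millions, bytes_per_param=BYTES_PER_PARAM):
    """Approximate size in megabytes (1 MB = 10**6 bytes)."""
    return params_millions * 1e6 * bytes_per_param / 1e6

sizes_mb = {name: approx_size_mb(p) for name, p in PARAMS_MILLIONS.items()}
for name, size in sizes_mb.items():
    print(f"{name}: about {size:.0f} MB")
```

These estimates are consistent with the sizes quoted earlier in the article: roughly 500 MB for VGG and roughly 100 MB for the other models.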
Comparison of Models with respect to Classification Accuracy
When these models were applied to classification on the ImageNet dataset, their top-1 and top-5 accuracies were obtained; a comparison of these accuracies is presented in the table below.
|Model||Top-1 Accuracy||Top-5 Accuracy|
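For readers unfamiliar with the two metrics, top-1 accuracy counts a prediction as correct only when the highest-scoring class is the true label, while top-5 accuracy counts it as correct when the true label is among the five highest-scoring classes. A minimal sketch of the computation follows; the scores and labels are made-up toy values over four classes, not results for any of the models above.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest scores."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)

# Toy class scores for three samples over four classes.
scores = [
    [0.1, 0.6, 0.2, 0.1],
    [0.5, 0.1, 0.3, 0.1],
    [0.2, 0.3, 0.4, 0.1],
]
labels = [1, 2, 0]

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
top3 = top_k_accuracy(scores, labels, k=3)
```

Top-k accuracy can never decrease as k grows, which is why reported top-5 figures are always at least as high as the corresponding top-1 figures.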
Going through the comparisons presented in both tables above, we can say that the Xception model appears to outperform its peers not only in terms of features but also in terms of classification accuracy.
In this article, we have seen what transfer learning is, along with its use cases, and where it can fail. We then discussed the architectures of several models, including the VGG family, along with their merits and demerits. We also compared their features and their top-1 and top-5 accuracies. A practical realisation and comparison of VGG and ResNet, with full implementations on real-life problems, can be seen in this article.