The volume of generated data keeps growing, and so do the cost and complexity of annotating it. Self-supervised learning methods address this annotation problem by learning directly from raw, unlabeled data. In this article, we discuss one type of self-supervised learning, known as contrastive self-supervised learning (contrastive SSL). Contrastive SSL methods build representations by learning the similarities and differences between objects. The major points to be discussed in this article are listed below.
Table of Contents
- About Self-Supervised Learning
- What is Contrastive Learning?
- Non-Contrastive Vs Contrastive Self-Supervised Learning
- Contrastive Self-Supervised Learning in Detail
- Examples of Contrastive Self-Supervised Learning
About Self-Supervised Learning
Self-supervised learning is a branch of machine learning that is useful when the available data is unlabeled. It can be seen as an intermediate form between supervised and unsupervised learning, and it is usually based on neural networks.
In self-supervised learning, a neural network learns in two steps:
- First, the network solves a pretext task using pseudo-labels (labels generated automatically from the data), which initializes its weights.
- Then, the actual task is performed with supervised or unsupervised learning.
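As a rough illustration of the first step, the sketch below builds pseudo-labels for a pretext task. Rotation prediction is used here only as an assumed example of such a task; the article does not name one, and the helper names `rotate90` and `make_pretext_dataset` are hypothetical.

```python
import random

def rotate90(image, k):
    """Rotate a 2D list-of-lists image clockwise by k quarter turns."""
    for _ in range(k % 4):
        image = [list(row) for row in zip(*image[::-1])]
    return image

def make_pretext_dataset(images, seed=0):
    """Pair each rotated image with its rotation index as a pseudo-label."""
    rng = random.Random(seed)
    dataset = []
    for image in images:
        k = rng.randrange(4)  # pseudo-label: 0, 1, 2 or 3 quarter turns
        dataset.append((rotate90(image, k), k))
    return dataset

# Two tiny "images"; no human annotation is needed to label them.
images = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
pretext = make_pretext_dataset(images)
```

A network trained to predict `k` from the rotated image would give the initial weights for the second step.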
Self-supervised learning has produced promising and accurate results in recent years, and large companies such as Meta and Google use it for image, video, and audio processing.
The basic idea behind self-supervised learning is to train algorithms on lower-quality (for example, unlabeled) data, whereas other learning processes focus on improving the final outcome of the algorithms. Self-supervised learning methods can roughly be divided into two classes:
- Contrastive self-supervised learning
- Non-contrastive self-supervised learning
This article focuses on contrastive self-supervised learning, so we first need to understand what contrastive learning is, which the next section explains.
What is Contrastive Learning?
Machine learning algorithms are easy to train when well-labelled data is available. In many cases, however, the data is unlabeled or its labels are not good enough for training. Contrastive learning helps in such cases by drawing a training signal from how samples relate to one another.
Contrastive learning is an approach to finding similar and dissimilar information in a dataset. It can also be viewed as a classification problem in which the data is classified on the basis of similarity and dissimilarity.
Contrastive learning works by learning an encoder f such that:
score(f(x), f(x+)) >> score(f(x), f(x-))
- x+ is a positive sample, which is similar to x
- x- is a negative sample, which is dissimilar to x
- score is a function that measures the similarity between two samples
A softmax classifier can then be used to tell similar samples apart from dissimilar ones. One of the best-known examples of this approach is the SimCLR framework from Google.
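To make the role of the softmax concrete, here is a minimal sketch of a contrastive loss for a single sample. It assumes the encoder outputs f(x), f(x+) and f(x-) are already available as plain vectors and uses the dot product as the score function; both choices are illustrative and not taken from any particular framework.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def contrastive_loss(anchor, positive, negatives):
    """Softmax cross-entropy in which the positive sample is the correct class."""
    scores = [dot(anchor, positive)] + [dot(anchor, n) for n in negatives]
    m = max(scores)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[0]  # equals -log softmax(scores)[0]

anchor = [1.0, 0.0]
positive = [0.9, 0.1]
negatives = [[-1.0, 0.0], [0.0, -1.0]]

# Low loss when the true positive scores highest ...
good = contrastive_loss(anchor, positive, negatives)
# ... and a higher loss when a dissimilar sample is treated as the positive.
bad = contrastive_loss(anchor, negatives[0], [positive, negatives[1]])
```

Minimizing this loss pushes score(f(x), f(x+)) above every score(f(x), f(x-)), which is the inequality stated above.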
The above image is a representation of SimCLR, in which a CNN encoder and MLP layers are trained simultaneously. They are trained to produce projections that are similar for different augmented versions of the same image.
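The objective that SimCLR trains these layers with is the normalized temperature-scaled cross-entropy (NT-Xent) loss. The sketch below is a simplified, pure-Python version: the CNN and MLP are omitted, `z1[i]` and `z2[i]` stand in for the projections of two augmented views of image i, and the toy vectors and the temperature value are assumptions chosen for illustration.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def nt_xent(z1, z2, temperature=0.5):
    """Average NT-Xent loss over all 2N views of N images."""
    z = [normalize(v) for v in z1 + z2]
    n = len(z1)
    total = 0.0
    for i in range(2 * n):
        j = (i + n) % (2 * n)  # index of the other view of the same image
        sims = [sum(a * b for a, b in zip(z[i], z[k])) / temperature
                for k in range(2 * n)]
        # Every other view in the batch (k != i) appears in the denominator.
        denom = sum(math.exp(sims[k]) for k in range(2 * n) if k != i)
        total += math.log(denom) - sims[j]
    return total / (2 * n)

z1 = [[1.0, 0.0], [0.0, 1.0]]
z2 = [[0.9, 0.1], [0.1, 0.9]]
aligned = nt_xent(z1, z2)           # views of the same image agree
mismatched = nt_xent(z1, z2[::-1])  # views paired with the wrong image
```

The loss is lower for `aligned` than for `mismatched`, which is the behaviour training relies on.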
The example above shows what contrastive learning is. Let us now distinguish between non-contrastive and contrastive self-supervised learning to understand the topic in more detail.
Non-Contrastive vs Contrastive Self-Supervised Learning
|#|Contrastive Self-Supervised Learning|Non-Contrastive Self-Supervised Learning|
|1|Contrastive SSL uses both positive and negative samples from the data|Non-contrastive SSL uses only positive samples from the data|
|2|The distance between the positive samples is minimized in contrastive SSL|Non-contrastive SSL works on the useful local minimum from the data|
|3|In this learning, backpropagation can be utilized without any extra predictor|In this learning, an extra predictor is required on the current state for backpropagation|
|4|Networks in this learning are more complex and can be considered a group of networks|Networks in this learning are less complicated; they can be described as simple linear networks|
Contrastive Self-Supervised Learning in Detail
As discussed above, the major goal of self-supervised learning is to learn from lower-quality data, and the goal of contrastive learning is to distinguish between similar and dissimilar data. Contrastive self-supervised learning is one class of self-supervised learning. It mainly addresses the question “what is a good representation of the data?”
Take computer vision as an example. Here the task of self-supervised learning is to learn a visual representation of the image data, and the question becomes “what is a good visual representation?” An obvious answer is “a representation that can easily be used in downstream tasks”. In most cases, the representation learned by self-supervised learning is passed to algorithms for downstream tasks such as face recognition and object detection, and it is evaluated by the performance on those tasks. This evaluation tells us how useful the learned representation is, but it does not explain why we get such performance in the downstream tasks. Contrastive self-supervised learning lets us form intuitions and conjectures about why a learned representation is effective.
Contrastive learning is combined with self-supervised learning to obtain a better representation of the data. To understand a learned representation, we need some fundamental knowledge of its components:
- Invariance measurement: Invariance to changes that do not alter the category of the data is a crucial component of a representation. In computer vision, we consider invariance to image transformations, which is very helpful when the representation is applied to downstream tasks.
A good representation of visual data should be largely invariant to these transformations. Mathematically, a representation function h is invariant to a transformation t : X → X if h(t(x)) = h(x) for every input x.
- Augmentation: Contrastive learning works with both positive and negative samples, and much of the procedure depends on how positive samples are obtained from the data. Most networks for contrastive learning use augmented versions of the training data at training time. In computer vision, for example, randomly cropped parts of the same image are used as positive pairs, which teaches the network to match the features of partially visible images. This augmentation is largely responsible for the high-quality results of contrastive SSL, so measuring the effect of augmentation is crucial for understanding the representation.
- Dataset biases: As with any type of learning, contrastive SSL approaches are trained on particular datasets, and some of the effects seen in training can be caused by biases in that data. These effects can be positive or negative, and they also affect the learned representation. In computer vision, most contrastive SSL approaches are trained on the ImageNet dataset, whose images are biased towards being object-centric. Representations that do not account for such biases can achieve seemingly enhanced performance.
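The first two components can be sketched in a few lines. The code below builds a positive pair from two random crops of the same image and checks the condition h(t(x)) = h(x) for a toy representation. The representation `h` (sorted pixel values) and the helper names are assumptions made for illustration only; a real representation would be a trained network.

```python
import random

def random_crop(image, size, rng):
    """Return a size x size crop taken at a random position of a 2D image."""
    top = rng.randrange(len(image) - size + 1)
    left = rng.randrange(len(image[0]) - size + 1)
    return [row[left:left + size] for row in image[top:top + size]]

def hflip(image):
    """Mirror the image horizontally."""
    return [row[::-1] for row in image]

def h(image):
    """Toy representation: the sorted pixel values, ignoring pixel positions."""
    return sorted(p for row in image for p in row)

def is_invariant(rep, t, samples):
    """Check rep(t(x)) == rep(x) on every sample."""
    return all(rep(t(x)) == rep(x) for x in samples)

rng = random.Random(0)
image = [[r * 4 + c for c in range(4)] for r in range(4)]

# Two random crops of the same image form a positive pair.
view_a = random_crop(image, 2, rng)
view_b = random_crop(image, 2, rng)

flip_invariant = is_invariant(h, hflip, [image, view_a, view_b])
crop_invariant = is_invariant(h, lambda x: random_crop(x, 2, rng), [image])
```

This toy `h` is invariant to horizontal flips but not to cropping, which shows why invariance has to be measured separately for each transformation.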
So far, we have seen the basic intuition behind contrastive self-supervised learning and how a representation and its components affect the process. Let’s take a look at some frameworks that implement contrastive self-supervised learning.
Examples of Contrastive Self-Supervised Learning
Some examples of contrastive self-supervised learning are listed below:
- MoCo (Momentum Contrast) is a framework for visual representation learning. The work behind it was done by K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick.
- PIRL (Self-Supervised Learning of Pretext-Invariant Representations) aims to build image representations that are meaningful and do not require a large number of training images. This work was done by Ishan Misra and Laurens van der Maaten.
- Google’s SimCLR advances the state of the art in self-supervised learning, semi-supervised learning and image classification.
- wav2vec is a model for unsupervised pre-training for speech recognition that learns representations of raw audio. It was developed by Facebook AI Research.
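To give a flavour of how one of these frameworks works, MoCo keeps a queue of previously encoded keys to use as negative samples and updates its key encoder as a slow moving average of the query encoder. The sketch below shows only these two ingredients, on flat lists of numbers; the parameter values, queue size and variable names are assumptions for illustration.

```python
from collections import deque

def momentum_update(key_params, query_params, m=0.999):
    """theta_k <- m * theta_k + (1 - m) * theta_q, applied element-wise."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]

# Hypothetical tiny queue of past keys that serve as negatives.
queue = deque(maxlen=3)
for step_key in ([0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]):
    queue.append(step_key)  # the oldest key is dropped once the queue is full

# The key encoder moves only slightly towards the query encoder.
key_params = momentum_update([0.0, 1.0], [1.0, 0.0], m=0.9)
```

A momentum close to 1 keeps the keys in the queue consistent with each other even though they were encoded at different training steps.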
In this article, we have given an overview of self-supervised learning and contrastive learning. Combining the two gives contrastive SSL, which is one class of self-supervised learning. Its ability to provide a meaningful representation of the data is what sets contrastive SSL apart from other self-supervised approaches.