Today, the quantity of generated data and the complexity of annotating it are both increasing rapidly. Self-supervised learning methods address the annotation problem by learning directly from raw, unlabelled data, which has made them one of the most important areas of AI research today. There are several ways to train models without annotated data. Yann LeCun, Chief AI Scientist at Meta, recently tweeted his preference for non-contrastive learning. Analytics India Magazine has analysed this long-standing debate: contrastive learning or non-contrastive learning?
Contrastive learning is a machine learning approach in which a model learns which items in a dataset are similar and which are dissimilar. It can also be framed as a classification problem, in which data points are grouped by similarity and dissimilarity. Contrastive methods learn representations by contrasting positive and negative examples. Contrastive pre-training has shown strong empirical results in computer vision tasks. For instance, Hénaff et al., 2019, evaluated a linear classifier on representations learned by contrastive methods from unlabelled ImageNet data and found that it surpassed the accuracy of supervised AlexNet. Similarly, He et al., 2019, found that contrastive pre-training on ImageNet transfers effectively to other downstream tasks and outperforms its supervised pre-training counterparts.
The contrastive method learns representations by minimising the distance between two views of the same data point and maximising the distance between views of different data points. In other words, it pulls positive pairs together and pushes negative pairs apart in the representation space.
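The pull-together, push-apart idea can be sketched with a classic margin-based pairwise loss. This is only one illustrative formulation, not necessarily the loss used in the papers cited here; the function name `pairwise_contrastive_loss` and the `margin` parameter are assumptions made for this example.

```python
import math

def pairwise_contrastive_loss(u, v, same_source, margin=1.0):
    """Margin-based contrastive loss for a single pair of embeddings.

    same_source=True  -> positive pair (two views of the same data point)
    same_source=False -> negative pair (views of different data points)
    """
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    if same_source:
        # Positive pair: any remaining distance is penalised.
        return distance ** 2
    # Negative pair: penalised only while closer than the margin.
    return max(0.0, margin - distance) ** 2

# A positive pair that is far apart is costly; a negative pair that is
# already far apart costs nothing.
positive_far = pairwise_contrastive_loss([0.0, 0.0], [3.0, 4.0], same_source=True)
negative_far = pairwise_contrastive_loss([0.0, 0.0], [3.0, 4.0], same_source=False)
```

The margin stops the loss from pushing negatives apart indefinitely: once a negative pair is further apart than `margin`, it no longer contributes.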
For example, suppose the model has to differentiate between a cat and a dog. It does so by learning which data points are similar and which are different. Programmers can apply combinations of augmentations to the training data to produce different versions of the same image. These versions are then encoded into vector representations, and the model is trained to output similar representations for similar images so that it can tell a cat from a dog. As illustrated in this post, it should recognise that cats have pointy ears while dogs have droopy ears.
Contrastive learning in self-supervised vs supervised models. Source: Google AI
What is dimensional collapse in contrastive learning?
Google AI explained the roles of positives and negatives in contrastive learning: “These contrastive learning approaches typically teach a model to pull together the representations of a target image (a.k.a., the “anchor”) and a matching (“positive”) image in embedding space, while also pushing apart the anchor from many non-matching (“negative”) images.” Since labels are unavailable, the positive can be an augmentation of the anchor, and the negatives are chosen to be the other samples from the training minibatch. Because this sampling is random, some negatives may actually belong to the same class as the anchor, and such false negatives can degrade the quality of the representation. Facebook AI Research (FAIR) described the loss function of contrastive learning in similar terms. “(It) is intuitively simple: minimise the distance in representation space between positive sample pairs while maximising the distance between negative sample pairs,” the team said.
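To make the anchor, positive and in-batch negative roles concrete, here is a minimal sketch of an InfoNCE-style loss in plain Python. It is a simplified, one-directional variant written for illustration: the function names, the toy two-dimensional embeddings and the `temperature` value are all assumptions, and real implementations operate on batched tensors in a deep learning framework.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE-style loss over a minibatch.

    anchors[i] and positives[i] are embeddings of two views of sample i.
    For anchor i, every positives[j] with j != i serves as a negative.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates in the batch.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        # Cross-entropy where the matching view is the correct "class".
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each positive is close to its anchor
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each positive is close to the wrong anchor
loss_aligned = info_nce(anchors, aligned)
loss_swapped = info_nce(anchors, swapped)
```

The loss is small when each anchor is most similar to its own positive and large when it is more similar to another sample in the batch. A false negative, meaning another batch sample of the same class, would be pushed away by this loss even though it should stay close.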
A paper co-authored by Yann LeCun found that contrastive learning can lead to dimensional collapse, “whereby the embedding vectors end up spanning a lower-dimensional subspace instead of the entire available embedding space”. In theory, the repulsion between negative pairs in the contrastive approach should prevent dimensional collapse, but the research showed otherwise. In contrastive learning, the embedding vectors fall into a lower-dimensional subspace because of two main mechanisms:
* strong augmentation along feature dimensions
* implicit regularisation driving models toward low-rank solutions
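One rough way to see what "spanning a lower-dimensional subspace" means is to count how many independent directions a set of embeddings actually occupies. The sketch below does this with Gaussian elimination on mean-centred embeddings. It is an illustrative proxy only: the function name `spanned_dimensions`, the tolerance and the toy data are assumptions, and practical diagnostics usually inspect the singular value spectrum of the embeddings rather than a hard rank.

```python
def spanned_dimensions(embeddings, tol=1e-8):
    """Count linearly independent directions in mean-centred embeddings."""
    n = len(embeddings)
    dim = len(embeddings[0])
    means = [sum(row[j] for row in embeddings) / n for j in range(dim)]
    rows = [[row[j] - means[j] for j in range(dim)] for row in embeddings]

    rank = 0
    for col in range(dim):
        # Among the rows not yet used as pivots, pick the largest entry.
        pivot = max(range(rank, n), key=lambda r: abs(rows[r][col]), default=None)
        if pivot is None or abs(rows[pivot][col]) < tol:
            continue  # no variation left along this column
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(rank + 1, n):
            factor = rows[r][col] / rows[rank][col]
            rows[r] = [a - factor * b for a, b in zip(rows[r], rows[rank])]
        rank += 1
    return rank

# Toy 3-dimensional embeddings (made-up numbers for illustration).
healthy = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
collapsed = [[1.0, 2.0, 0.0], [2.0, 4.0, 0.0], [3.0, 6.0, 0.0], [4.0, 8.0, 0.0]]

healthy_rank = spanned_dimensions(healthy)
collapsed_rank = spanned_dimensions(collapsed)
```

In the `collapsed` example, every embedding lies on a single line, so only one of the three available dimensions carries information.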
Lack of collapse in non-contrastive self-supervised techniques
In contrast to the collapse seen in contrastive methods, FAIR found that non-contrastive methods suffer from a lesser collapse problem of a different nature. The study cited alternative approaches such as Grill et al. (2020) and Chen & He (2020), who used a stop-gradient and an extra predictor to prevent collapse without negative pairs, and Caron et al. (2018; 2020), who used an additional clustering step. Unlike contrastive methods, which rely heavily on a large number of negative samples, non-contrastive methods do not rely on explicit negative samples. Instead, the dynamics of the alignment of eigenspaces between the predictor and its input correlation matrix play a key role in preventing complete collapse.
What are non-contrastive self-supervised techniques?
The non-contrastive approach relies only on positive sample pairs. For instance, FAIR illustrated this with training data containing two versions of a cat picture, the original in colour and another in black and white. No negative examples, such as an unrelated photo of a mountain, are included. This may seem counterintuitive, since a model trained only on positive samples would be expected to collapse, but FAIR found that such models learn good representations despite the lack of negative examples. “We’ve found the training of a non-contrastive self-supervised learning framework converges to a useful local minimum but not the global trivial one. Our work attempts to show why this is,” the team stated.
The non-contrastive approach uses an extra predictor and a stop-gradient operation. Two popular non-contrastive methods, BYOL and SimSiam, have shown that the predictor and the stop-gradient are needed to prevent representational collapse in the model.
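The roles of the predictor and the stop-gradient can be sketched as the forward computation of a SimSiam-style symmetric loss. Several things here are assumptions for illustration: plain Python has no automatic differentiation, so `stop_gradient` is an identity placeholder that only marks where a framework would block the gradient, and `linear_predictor` stands in for the small neural network that real methods use as the predictor.

```python
import math

def stop_gradient(x):
    # Placeholder: identity in the forward pass. In an autodiff framework,
    # this is where the value would be treated as a constant so that no
    # gradient flows back through this branch.
    return x

def negative_cosine(p, z):
    """Negative cosine similarity between a prediction p and a target z."""
    z = stop_gradient(z)
    dot = sum(a * b for a, b in zip(p, z))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_z = math.sqrt(sum(b * b for b in z))
    return -dot / (norm_p * norm_z)

def linear_predictor(z, weights):
    """Hypothetical predictor head: a single linear map."""
    return [sum(w * x for w, x in zip(row, z)) for row in weights]

def symmetric_loss(z1, z2, weights):
    """Each view's prediction is matched to the other view's embedding."""
    p1 = linear_predictor(z1, weights)
    p2 = linear_predictor(z2, weights)
    return 0.5 * negative_cosine(p1, z2) + 0.5 * negative_cosine(p2, z1)

identity = [[1.0, 0.0], [0.0, 1.0]]
loss_same = symmetric_loss([3.0, 4.0], [3.0, 4.0], identity)        # views agree
loss_orthogonal = symmetric_loss([1.0, 0.0], [0.0, 1.0], identity)  # views disagree
```

Only positive pairs appear in this loss. It reaches its minimum of -1 when the prediction from one view points in the same direction as the embedding of the other view, and no negative samples are involved.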
Compared with the contrastive approach, the non-contrastive approach is simpler: it optimises a CNN to extract similar feature vectors for similar images. Such methods learn representations by minimising the distance between two views of the same image. In the cat example, the algorithm would pick up on characteristics like eyes, fur, paws and whiskers to identify the cat.