“VICReg could be used to model the dependencies between a video clip and the frame that comes after, therefore learning to predict the future in a video.” Adrien Bardes, Facebook AI Research
Humans have an innate capability to identify objects in the wild, even from a blurred glimpse. We do this efficiently by remembering only the high-level features needed for identification and ignoring the details unless required. In deep learning for object detection, contrastive learning pursues the same premise: learn representations that capture the big picture instead of doing the heavy lifting of processing every pixel-level detail. But contrastive learning has its own limitations.
According to Andrew Ng, pre-training methods can suffer from three common failings: generating an identical representation for different input examples (which leads to predicting the mean consistently in linear regression), generating dissimilar representations for examples that humans find similar (for instance, the same object viewed from two angles), and generating redundant parts of a representation (say, multiple vectors that represent two eyes in a photo of a face). The problems of representation learning, wrote Andrew Ng, boil down to variance, invariance, and covariance issues.
Andrew Ng’s observations refer to a new self-supervised algorithm called Variance-Invariance-Covariance Regularization (VICReg), introduced by researchers at Facebook AI, PSL Research University, and New York University, along with Turing Award recipient Yann LeCun. It builds on LeCun’s own Barlow Twins method.
The researchers designed VICReg to avoid the collapse problem, which contrastive methods handle less efficiently. They do this by introducing a simple regularisation term on the variance of the embeddings along each dimension individually, and combining that variance term with a decorrelation mechanism based on redundancy reduction and covariance regularisation. The authors state that VICReg performs on par with several state-of-the-art methods.
VICReg is a simple approach to self-supervised image representation learning, and its objectives are as follows:
- Learn invariance to different views with an invariance term.
- Avoid collapse of the representations with a variance regularisation term.
- Spread the information throughout the different dimensions of the representations with a covariance regularisation term.
The results show that VICReg performs on par with state-of-the-art methods and ushers in a new paradigm of non-contrastive self-supervised learning.
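To make the three objectives concrete, here is a minimal NumPy sketch of a loss with an invariance, a variance, and a covariance term. It is an illustration under stated assumptions, not the authors' implementation: the function name `vicreg_loss`, the weights `sim_w`, `var_w`, `cov_w`, the target standard deviation `gamma`, and the stabiliser `eps` are all chosen here for the example, and the inputs are plain arrays standing in for the embeddings of two augmented views.

```python
import numpy as np

def vicreg_loss(z_a, z_b, sim_w=25.0, var_w=25.0, cov_w=1.0, gamma=1.0, eps=1e-4):
    """Sketch of a VICReg-style loss on two (n, d) embedding batches.

    All default weights are assumptions made for this illustration.
    """
    n, d = z_a.shape

    # Invariance: mean squared distance between the two views of each input.
    invariance = np.mean(np.sum((z_a - z_b) ** 2, axis=1))

    def variance_term(z):
        # Hinge that penalises any dimension whose std falls below gamma.
        std = np.sqrt(z.var(axis=0, ddof=1) + eps)
        return np.mean(np.maximum(0.0, gamma - std))

    def covariance_term(z):
        # Squared off-diagonal covariance entries: penalises redundancy
        # between different embedding dimensions.
        zc = z - z.mean(axis=0)
        cov = zc.T @ zc / (n - 1)
        off_diag = cov - np.diag(np.diag(cov))
        return np.sum(off_diag ** 2) / d

    variance = variance_term(z_a) + variance_term(z_b)
    covariance = covariance_term(z_a) + covariance_term(z_b)
    total = sim_w * invariance + var_w * variance + cov_w * covariance
    parts = {"invariance": invariance, "variance": variance, "covariance": covariance}
    return total, parts

# A collapsed encoder outputs the same vector for every input. The two views
# agree perfectly, yet the variance term stays high (about 1.98 here).
collapsed = np.ones((8, 4))
collapsed_total, collapsed_parts = vicreg_loss(collapsed, collapsed)

# Well-spread embeddings (std around 2 per dimension) incur no variance penalty.
rng = np.random.default_rng(0)
spread = rng.normal(size=(256, 8)) * 2.0
spread_total, spread_parts = vicreg_loss(spread, spread + 0.01 * rng.normal(size=spread.shape))
```

Note that every term is computed either per pair of views or per embedding dimension. Nothing in the sketch compares the embedding of one example against that of a different example, which is the sense in which the approach is non-contrastive.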
What The Authors Had To Say
Talking to Analytics India Magazine about VICReg’s significance, the lead author, Adrien Bardes, who is also a resident PhD student at Facebook AI Research, Paris, said that self-supervised representation learning is a learning paradigm that aims to learn meaningful representations of unlabelled data. Recent approaches rely on Siamese networks and maximise the similarity between two augmented views of the same input. A trivial solution to that objective is for the network to output constant vectors, a failure known as the collapse problem. VICReg is a new algorithm that is also based on Siamese networks but prevents a collapse by regularising the variance and covariance of the network outputs. It achieves state-of-the-art results on several computer vision benchmarks while being a straightforward and interpretable approach.
When asked how VICReg addresses the shortcomings of contrastive learning methods, Bardes explained that contrastive learning methods are based on a simple principle. They place inputs that should encode similar information close to each other in the embedding space and prevent a collapse by pushing apart inputs that should encode dissimilar information. This process requires mining a massive number of negative pairs, that is, pairs of distinct inputs. Recent contrastive approaches for self-supervised learning use different strategies for mining these negative pairs: they can sample them from a memory bank, as in MoCo, or from the current batch, as in SimCLR. Both strategies are costly in time or memory. VICReg, on the other hand, does not require these negative pairs; it implicitly prevents a collapse by forcing the representations to be different from each other without making any direct comparison between different examples. It therefore does not require the memory bank of MoCo and works with much smaller batch sizes than SimCLR.
For Bardes, self-supervised learning is probably the most exciting topic in machine learning research. Annotating data is a very expensive process performed by humans who have biases and can make mistakes. It is, therefore, impossible to annotate the vast amount of data available today, for example, medical or astronomical data and images and videos on the Internet. Training models that leverage all these data can only be done using self-supervised learning. This is one of the motivations behind the development of VICReg.
Bardes believes that VICReg is applicable in any scenario where one wants to model the relationships within a data set. It can be used with any kind of data: images, videos, text, audio, or proteins. For example, you could use it to model the dependencies between a video clip and the frame that follows, thereby learning to predict the future in a video. Another example would be to understand the relationship between the graph of a molecule and its image as seen through a microscope.
“We are at the early stages of the development of self-supervised learning. Shifting from contrastive methods to non-contrastive methods is the first step towards more practical algorithms. Current approaches rely on hand-crafted data augmentations that can be viewed as a kind of supervision. The next step will probably be to get rid of these augmentations. Another promising direction consists in handling the uncertainties in modelling the data. Current methods are mostly deterministic and always model the same relation between two inputs. For example, if we go back to the frame prediction example, current methods would only model the possible future for a video clip. Future approaches will probably use latent variables that model the space of possible predictions,” concluded Bardes.
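The role of negative pairs can be illustrated with a simplified, one-directional InfoNCE loss that draws its negatives from the current batch. This is a sketch for intuition only, not the exact objective of SimCLR or MoCo; the function name `info_nce` and the `temperature` value are assumptions made for the example.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """Simplified in-batch contrastive loss on two (n, d) embedding batches.

    Row i of z_a and row i of z_b form the positive pair; every other row
    of z_b serves as a negative for row i of z_a.
    """
    # Cosine similarity requires unit-length rows.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)

    # (n, n) similarity matrix: each example is compared with every other
    # example in the batch, so the cost grows with the square of batch size.
    logits = z_a @ z_b.T / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The positives sit on the diagonal.
    return -np.mean(np.diag(log_prob))

# Distinct, well-separated embeddings: the loss is close to zero.
separated_loss = info_nce(np.eye(4), np.eye(4))

# Collapsed embeddings: positives and negatives are indistinguishable,
# so the loss equals log(n), here log(4).
collapsed_loss = info_nce(np.ones((4, 3)), np.ones((4, 3)))
```

In this sketch it is the off-diagonal entries of the n-by-n matrix, the negatives, that penalise a collapse. A method without negative pairs has to obtain the same protection from statistics of the embeddings themselves.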