10 Self-Supervised Learning Frameworks & Libraries To Use In 2021

Amit Raja Naik

Self-supervised learning is gathering steam, slowly but surely. A relatively new technique, it trains models on unlabelled data without human supervision. Yann LeCun described it best: reinforcement learning is the cherry on the cake, supervised learning is the icing, and self-supervised learning is the cake itself. In self-supervised or unsupervised learning, the system learns to predict part of its input from the parts it already has, he said.
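To make the idea of predicting part of the input concrete, here is a minimal sketch in Python. The function name `make_masked_example` and the `[MASK]` placeholder are assumptions chosen for illustration; they do not come from any library on this list. The sketch hides one element of an unlabelled sequence and keeps the hidden element as the training target, so the label comes from the data itself.

```python
import random


def make_masked_example(tokens, mask_token="[MASK]", rng=None):
    """Turn an unlabelled sequence into an (input, position, target) triple.

    One randomly chosen element is hidden; a model would be trained to
    recover it from the elements that remain visible.
    """
    rng = rng or random.Random()
    position = rng.randrange(len(tokens))
    masked = list(tokens)           # copy so the original stays untouched
    target = masked[position]       # the "label" is taken from the input itself
    masked[position] = mask_token
    return masked, position, target


# Usage: no human annotation is involved at any point.
masked, position, target = make_masked_example(["the", "cat", "sat", "down"])
```

The same pattern carries over to images (hide a patch) or audio (hide a time span); only the masking step changes.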

Most tech evangelists liken self-supervised learning models to young children, always curious and learning new information from observation. The latest examples of self-supervision include Facebook’s DINO and VISSL (a vision library for self-supervised learning), Google’s SimCLR, OpenSelfSup and SfMLearner.

Below, we have curated a list of the most popular self-supervised learning models, frameworks, and libraries.


DINO, short for self-DIstillation with NO labels, is a self-supervised learning method for vision transformers (ViT). It is used to segment unlabelled, random images and videos without supervision. By pairing self-supervised learning with a suitable architecture, the model generates highly accurate segmentation. DINO also requires limited computing resources to train models.
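The self-distillation idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the code from the DINO repository: the function names are made up for this example, and the temperatures and momentum are illustrative defaults. A student network is trained to match the output distribution of a teacher network, and the teacher's outputs are centred and sharpened so that the two do not collapse to a trivial solution.

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_total = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_total for s in scaled]


def self_distillation_loss(student_logits, teacher_logits, center,
                           student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between the teacher and student distributions.

    The teacher output is centred (center subtracted) and sharpened
    (low temperature). The temperatures here are illustrative values.
    """
    centred = [t - c for t, c in zip(teacher_logits, center)]
    teacher_probs = [math.exp(v) for v in log_softmax(centred, teacher_temp)]
    student_log_probs = log_softmax(student_logits, student_temp)
    return -sum(p * lp for p, lp in zip(teacher_probs, student_log_probs))


def update_center(center, teacher_batch, momentum=0.9):
    """Exponential moving average of the mean teacher output over a batch."""
    n = len(teacher_batch)
    batch_mean = [sum(column) / n for column in zip(*teacher_batch)]
    return [momentum * c + (1.0 - momentum) * m
            for c, m in zip(center, batch_mean)]
```

In a full training loop the student and teacher would see different augmented views of the same image, and no labels appear anywhere in the loss.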


Lightly is a computer vision framework for self-supervised learning. It helps in understanding and filtering raw image data and can be applied before any data annotation step. The learned representations can then be used to analyse and visualise datasets, and to select a core set of samples.
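One common way to pick a core set from learned representations is greedy farthest-point sampling, sketched below. This is an assumption about how such a selection could work, not Lightly's API; `select_coreset` is a hypothetical helper, and the embeddings are plain lists of floats.

```python
import math


def select_coreset(embeddings, k):
    """Greedily pick k indices whose embeddings are spread far apart.

    Starts from the first sample (an arbitrary choice), then repeatedly
    adds the sample farthest from everything selected so far.
    """
    if not 0 < k <= len(embeddings):
        raise ValueError("k must be between 1 and the number of embeddings")
    selected = [0]
    # Distance from every sample to its nearest already-selected sample.
    min_dist = [math.dist(e, embeddings[0]) for e in embeddings]
    while len(selected) < k:
        farthest = max(range(len(embeddings)), key=lambda i: min_dist[i])
        selected.append(farthest)
        min_dist = [min(d, math.dist(e, embeddings[farthest]))
                    for d, e in zip(min_dist, embeddings)]
    return selected


# Usage: near-duplicate points are skipped in favour of distant ones.
points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0]]
chosen = select_coreset(points, 3)
```

Selecting samples this way before annotation avoids paying to label many near-identical images.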


s3prl, short for Self-Supervised Speech Pre-training and Representation Learning, is an open-source toolkit. In this toolkit, self-supervised pre-trained speech models are called ‘upstream’ models and are used in multiple downstream tasks.


SimCLR is a simple framework for contrastive learning of visual representations. Its latest version, SimCLRv2, makes use of distilled as well as self-supervised models. It is primarily used for image segmentation and image classification.
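The contrastive objective behind this kind of framework can be sketched in plain Python. The code below is a minimal illustration of a normalised, temperature-scaled cross-entropy loss over paired embeddings. It is not taken from the SimCLR repository, the function names are chosen for this example, and the temperature is an illustrative value. Two augmented views of the same image form a positive pair; every other embedding in the batch acts as a negative.

```python
import math


def normalize(v):
    """Scale a non-zero vector to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def nt_xent_loss(z_a, z_b, temperature=0.5):
    """Contrastive loss for a batch of paired embeddings.

    z_a[i] and z_b[i] are embeddings of two augmented views of image i.
    """
    n = len(z_a)
    z = [normalize(v) for v in z_a + z_b]   # 2n unit-length embeddings
    total = 0.0
    for i in range(2 * n):
        positive = (i + n) % (2 * n)        # the other view of the same image
        sims = [sum(a * b for a, b in zip(z[i], z[j])) / temperature
                for j in range(2 * n)]
        # Log-sum-exp over all other samples; self-similarity is excluded.
        others = [s for j, s in enumerate(sims) if j != i]
        m = max(others)
        log_denominator = m + math.log(sum(math.exp(s - m) for s in others))
        total += log_denominator - sims[positive]
    return total / (2 * n)
```

The loss is low when each embedding is closer to its own pair than to the rest of the batch, which is what pushes views of the same image together and different images apart.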


OpenSelfSup is an open-source unsupervised or self-supervised representation learning toolbox based on PyTorch. Its code architecture is similar to that of MMDetection, and it is very flexible, integrating various self-supervised tasks, including classification, feature learning, joint clustering and contrastive learning.


SfMLearner is a self-supervised learning framework for estimating depth and ego-motion from monocular videos.


BYOL, or ‘Bootstrap Your Own Latent’, is a new approach to self-supervised image representation learning, available as a PyTorch implementation. It is one of the simpler methods for self-supervised learning and achieves cutting-edge results without contrastive learning or the need to design negative pairs. The repo offers a module that lets researchers train an image-based neural network right away on unlabelled or random image data.
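How a method can learn without negative pairs is easier to see in code. The sketch below follows the two ingredients described in the BYOL paper: an online network is trained to predict the output of a slowly moving target network, and the target's weights are an exponential moving average of the online weights. The function names and the flat lists of parameters are assumptions made for illustration, not the API of the PyTorch repo.

```python
import math


def normalize(v):
    """Scale a non-zero vector to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def byol_loss(online_prediction, target_projection):
    """Normalised mean squared error, equal to 2 - 2 * cosine similarity.

    Only a positive pair (two views of one image) enters the loss;
    no negative samples are required.
    """
    p = normalize(online_prediction)
    t = normalize(target_projection)
    return 2.0 - 2.0 * sum(a * b for a, b in zip(p, t))


def ema_update(target_params, online_params, tau=0.99):
    """Move target weights a small step toward the online weights.

    tau is the decay rate; 0.99 is an illustrative value.
    """
    return [tau * t + (1.0 - tau) * o
            for t, o in zip(target_params, online_params)]
```

Gradients flow only through the online network; the target is updated solely by `ema_update`, which is what keeps the two from collapsing to the same constant output in practice.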


DIG, or Dive Into Graphs, is a turnkey library that provides a unified testbed for higher-level, research-oriented graph deep learning tasks such as graph generation, self-supervised learning, explainability and 3D graphs. DIG also enables researchers to develop their methods within its extensible framework and to compare them seamlessly with existing baseline methods using standard datasets and evaluation metrics.


The VL-BERT library is used to pre-train generic visual-linguistic representations for tasks including visual commonsense reasoning, visual question answering, and referring expression comprehension.


EpipolarPose is a self-supervised learning method for 3D human pose estimation using multi-view or epipolar geometry. It does not need any 3D ground-truth data or camera extrinsics. During training, EpipolarPose estimates 2D poses from multi-view images and then utilises epipolar geometry to obtain a 3D pose and the camera geometry. Once trained, it takes an RGB image and produces a 3D pose.

Copyright Analytics India Magazine Pvt Ltd

Scroll To Top