Loosely modelled on the human brain, deep learning uses neural networks for object detection, speech recognition, translation, decision-making, and more. However, deep learning, a subset of machine learning, needs a massive amount of data to work well. Reducing this dependency on data is one of the top priorities of AI researchers.
Facebook vice president Yann LeCun, considered one of the godfathers of deep learning, presented the blueprint for self-supervised learning at the AAAI conference in 2020. In a recent blog post, LeCun wrote: “Practically speaking, it’s impossible to label everything in the world. There are also some tasks for which there’s simply not enough labeled data, such as training translation systems for low-resource languages. If AI systems can glean a deeper, more nuanced understanding of reality beyond what’s specified in the training data set, they’ll be more useful and ultimately bring AI closer to human-level intelligence.”
In self-supervised learning, systems don’t rely on labelled data sets to train and perform tasks. Instead, they learn directly from the raw information fed to them: text, images, and so on. This approach has already been used in NLP, where self-supervised pretraining of huge models has led to breakthroughs in machine translation, natural language inference, and question-answering.
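To make the idea concrete, here is a minimal sketch of how training pairs can be built from unlabelled text alone. It assumes a masked-word pretext task, which is one common choice in NLP and is used here only for illustration; the function name `make_masked_examples` and the `[MASK]` token are hypothetical and not taken from any system described in this article.

```python
MASK = "[MASK]"  # placeholder token, chosen for illustration only


def make_masked_examples(sentence):
    """Build (input, target) pairs from raw text by hiding one word at a time.

    No human annotation is involved: the hidden word itself is the label.
    """
    words = sentence.split()
    examples = []
    for i, word in enumerate(words):
        masked = words[:i] + [MASK] + words[i + 1:]
        examples.append((" ".join(masked), word))
    return examples


# Every word in the raw sentence yields one training pair.
pairs = make_masked_examples("the cat sat")
for model_input, target in pairs:
    print(model_input, "->", target)
```

The supervision signal comes entirely from the data, which is why no labelled data set is needed.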
Now, with SEER (SElf-supERvised), Facebook has adopted this approach for computer vision. SEER is a billion-parameter self-supervised computer vision model that can learn from any random group of images on the internet. These images need not be curated or labelled, which is otherwise a prerequisite for most computer vision training.
What Is SEER?
Self-supervised NLP models can have up to trillions of parameters and are trained on very large datasets. In general, the more data a model sees, the better it performs.
In NLP, semantic concepts can be broken down into discrete words, but computer vision is a lot trickier. Matching pixels to their corresponding concept is difficult because many images need to be assessed to understand the variation around a single concept.
To efficiently scale models to work with complex and high-dimensional image data, two components are needed:
- An algorithm that learns from a large number of random images without any metadata or annotations
- A convolutional network that is large enough to capture and learn every visual concept from this data.
To overcome these challenges, the team at Facebook adopted SwAV, an algorithm that uses online clustering to group images associated with similar visual concepts. With SwAV, the researchers were able to improve on the previous state of the art in self-supervised learning with six times less training time.
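The grouping idea can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not Facebook's or VISSL's implementation: it assumes each image already has similarity scores against a small set of cluster prototypes for two augmented views, uses a few Sinkhorn-style normalisation steps to spread images evenly across clusters, and then asks each view to predict the cluster assignment of the other. The function names and the values of `epsilon`, `temperature` and `n_iters` are assumptions chosen for readability.

```python
import math


def log_softmax(scores, temperature=0.1):
    """Log-probabilities over prototypes for one image."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    log_total = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_total for s in scaled]


def sinkhorn(scores, epsilon=0.05, n_iters=3):
    """Turn a batch of image-to-prototype scores into soft cluster codes.

    Alternating column and row normalisation keeps every prototype in use,
    so the batch cannot collapse into a single cluster.
    """
    n, k = len(scores), len(scores[0])
    m = max(max(row) for row in scores)
    q = [[math.exp((s - m) / epsilon) for s in row] for row in scores]
    for _ in range(n_iters):
        for j in range(k):  # each prototype receives total mass 1/k
            col = sum(q[i][j] for i in range(n))
            for i in range(n):
                q[i][j] /= col * k
        for i in range(n):  # each image receives total mass 1/n
            row = sum(q[i])
            q[i] = [v / (row * n) for v in q[i]]
    return [[v * n for v in row] for row in q]  # rows now sum to 1


def swapped_prediction_loss(scores_a, scores_b):
    """Each view predicts the cluster code computed from the other view."""
    codes_a, codes_b = sinkhorn(scores_a), sinkhorn(scores_b)
    total = 0.0
    for sa, sb, qa, qb in zip(scores_a, scores_b, codes_a, codes_b):
        total -= sum(q * lp for q, lp in zip(qb, log_softmax(sa)))
        total -= sum(q * lp for q, lp in zip(qa, log_softmax(sb)))
    return total / (2 * len(scores_a))


# Hypothetical scores for 4 images against 2 prototypes.
view_a = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
view_b_agree = [[0.8, 0.2], [0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]
view_b_disagree = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2]]

loss_agree = swapped_prediction_loss(view_a, view_b_agree)
loss_disagree = swapped_prediction_loss(view_a, view_b_disagree)
print(loss_agree < loss_disagree)
```

The loss is low when two views of the same image land in the same cluster and high when they do not, which is the signal that pushes images with similar concepts together without any labels.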
Further, to train the model at such a large scale, researchers used RegNet, a family of convolutional network architectures capable of scaling to billions, and potentially trillions, of parameters.
All-Purpose Library For SEER
Facebook also open-sourced an all-purpose library for self-supervised learning called VISSL (VIsion library for state-of-the-art Self-Supervised Learning). It is a PyTorch-based library that allows self-supervised learning at both small and large scale. VISSL contains a benchmark suite and a model zoo with over 60 pre-trained models for comparing modern self-supervised learning methods.
VISSL has the following features:
- Mixed precision from the NVIDIA Apex library, which reduces memory requirements.
- Gradient checkpointing from PyTorch, which trades compute for memory and so helps in training models on large batch sizes.
- The sharded optimiser from the FairScale library, which reduces memory usage.
- Optimisations for online self-supervised learning.
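A rough back-of-the-envelope calculation shows why these memory features matter at SEER's scale. The sketch below assumes a model of exactly one billion parameters, one optimiser state value per parameter, and eight GPUs; all three numbers are illustrative assumptions rather than figures reported for SEER, and the helper names are hypothetical. It also ignores activations and any full-precision copies that mixed-precision training may keep.

```python
BYTES_FP32 = 4  # single precision
BYTES_FP16 = 2  # half precision


def weight_memory_gb(num_params, bytes_per_value):
    """Memory needed to hold the weights alone, in gigabytes."""
    return num_params * bytes_per_value / 1e9


def sharded_state_gb(num_params, states_per_param, num_gpus,
                     bytes_per_value=BYTES_FP32):
    """Optimiser state held by each GPU when the state is split evenly."""
    total_gb = num_params * states_per_param * bytes_per_value / 1e9
    return total_gb / num_gpus


num_params = 1_000_000_000  # assumed round figure for a billion-parameter model

fp32_gb = weight_memory_gb(num_params, BYTES_FP32)
fp16_gb = weight_memory_gb(num_params, BYTES_FP16)
unsharded_gb = sharded_state_gb(num_params, states_per_param=1, num_gpus=1)
per_gpu_gb = sharded_state_gb(num_params, states_per_param=1, num_gpus=8)

print(f"weights in fp32: {fp32_gb} GB, in fp16: {fp16_gb} GB")
print(f"optimiser state: {unsharded_gb} GB unsharded, {per_gpu_gb} GB per GPU when sharded")
```

Under these assumptions, halving the precision halves the weight memory, and sharding divides the optimiser state by the number of GPUs.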
Self-supervised learning eliminates the need for human annotations and metadata. Other advantages include:
- It enables the computer vision community to work with larger and more diverse data sets.
- It lets models learn from random, unlabelled images.
- It can mitigate biases that may creep in with data curation.
- In cases such as medical imaging, where only limited datasets are available, SEER can help in specialising models.
- It enables faster and more accurate responses to rapid innovations in the field of computer vision.