Facebook’s New Billion-Parameter Model Might Just Change Computer Vision Forever

Loosely inspired by the human brain, deep learning uses neural networks for object detection, speech recognition, translation, decision-making, and more. However, deep learning, a subset of machine learning, needs a massive amount of data to work well. Reducing this data dependency is one of the top priorities of AI researchers.

Facebook vice president Yann LeCun, considered one of the godfathers of deep learning, presented the blueprint for self-supervised learning at the AAAI conference in 2020. In a recent blog post, LeCun wrote: “Practically speaking, it’s impossible to label everything in the world. There are also some tasks for which there’s simply not enough labeled data, such as training translation systems for low-resource languages. If AI systems can glean a deeper, more nuanced understanding of reality beyond what’s specified in the training data set, they’ll be more useful and ultimately bring AI closer to human-level intelligence”.

In self-supervised learning, systems don’t rely on labelled data sets to train and perform tasks. Instead, they learn from the raw data itself, such as text or images. This approach has already been used in NLP, where self-supervised pretraining of huge models has led to breakthroughs in machine translation, natural language inference, and question-answering.
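
To make the idea concrete, here is a minimal sketch of how a training signal can come from raw text alone, with no human labels. It uses masked-word prediction as the example task; the function name `masked_word_examples` and the `[MASK]` token are illustrative assumptions, not part of any system mentioned in this article.

```python
def masked_word_examples(sentence, mask_token="[MASK]"):
    """Build (input, target) training pairs from one unlabelled sentence.

    Each word is hidden in turn, and the hidden word itself becomes the
    target, so the "label" is taken from the data rather than an annotator.
    """
    words = sentence.split()
    examples = []
    for i, target in enumerate(words):
        masked = words[:i] + [mask_token] + words[i + 1:]
        examples.append((" ".join(masked), target))
    return examples


examples = masked_word_examples("cats chase mice")
for model_input, target in examples:
    print(model_input, "->", target)
```

Every sentence yields as many training pairs as it has words, which is why this kind of pretraining scales with the amount of raw text available.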

Now, with SEER (SElf-supERvised), Facebook has adopted this approach for computer vision. SEER is a billion-parameter self-supervised computer vision model that can learn from any group of images on the internet. The images need not be curated or labelled, which is otherwise a prerequisite for most computer vision training.

What Is SEER?

Self-supervised NLP models now use up to trillions of parameters and are trained on very large datasets. In general, more data has produced better models.

In NLP, semantic concepts can be broken down into discrete words, but computer vision is a lot trickier. Matching pixels to the concepts they depict is hard, because many images need to be assessed to capture the variation around a single concept.

To efficiently scale models to work with complex and high-dimensional image data, two components are needed:

  • An algorithm that can learn from a large number of random images without any metadata or annotations.
  • A convolutional network large enough to capture and learn every visual concept from that data.

To meet these requirements, the team at Facebook adopted SwAV, an online clustering algorithm that groups images associated with similar visual concepts. With SwAV, the researchers improved on the previous state of the art in self-supervised learning while using six times less training time.
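
The grouping idea can be sketched in a few lines of plain Python. This is a toy illustration of swapped prediction as described in the SwAV paper, not Facebook's implementation: two augmented views of each image are scored against a set of cluster prototypes, the scores are turned into balanced soft cluster codes, and each view is trained to predict the code of the other. The function names, the toy scores, and the values of `eps` and `temperature` are assumptions chosen for the example.

```python
import math


def softmax(row, temperature):
    """Numerically stable softmax over one row of prototype scores."""
    scaled = [s / temperature for s in row]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sinkhorn(scores, eps=0.05, n_iters=3):
    """Turn a batch of prototype scores (B x K) into soft cluster codes.

    Columns and rows are normalised in turn so that prototypes are used
    roughly equally across the batch and every row sums to 1.
    """
    b, k = len(scores), len(scores[0])
    top = max(max(row) for row in scores)  # subtract the max for stability
    q = [[math.exp((s - top) / eps) for s in row] for row in scores]
    for _ in range(n_iters):
        # Column step: each prototype receives total mass 1/K.
        col_sums = [sum(q[i][j] for i in range(b)) for j in range(k)]
        q = [[q[i][j] / (col_sums[j] * k) for j in range(k)] for i in range(b)]
        # Row step: each image receives total mass 1/B.
        row_sums = [sum(row) for row in q]
        q = [[v / (row_sums[i] * b) for v in q[i]] for i in range(b)]
    # Rescale so that every row is a probability distribution.
    return [[v * b for v in row] for row in q]


def swapped_prediction_loss(scores_a, scores_b, temperature=0.1):
    """Cross-entropy between one view's prediction and the other view's code."""
    codes_a, codes_b = sinkhorn(scores_a), sinkhorn(scores_b)
    loss = 0.0
    for i in range(len(scores_a)):
        p_a = softmax(scores_a[i], temperature)
        p_b = softmax(scores_b[i], temperature)
        # View A predicts the code of view B, and view B the code of view A.
        loss -= sum(q * math.log(p) for q, p in zip(codes_b[i], p_a))
        loss -= sum(q * math.log(p) for q, p in zip(codes_a[i], p_b))
    return loss / (2 * len(scores_a))


# Toy scores: 4 images, 2 prototypes, two augmented views per image.
view_a = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
view_b_agrees = [[0.8, 0.2], [0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]
view_b_disagrees = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2]]

loss_agree = swapped_prediction_loss(view_a, view_b_agrees)
loss_disagree = swapped_prediction_loss(view_a, view_b_disagrees)
print(loss_agree < loss_disagree)
```

The loss is low when both views of an image land in the same cluster and high when they do not, so minimising it pushes the network to assign different crops of the same image to the same visual concept without ever seeing a label.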

Further, to train the model at such a large scale, the researchers used RegNet, a family of convolutional network architectures capable of scaling to billions, and potentially trillions, of parameters.

All-Purpose Library For SEER

Facebook also open-sourced an all-purpose library for self-supervised learning called VISSL (VIsion library for state-of-the-art Self-Supervised Learning). It is a PyTorch-based library that allows self-supervised learning at both small and large scale. VISSL contains a benchmark suite and a model zoo with over 60 pre-trained models for comparing modern self-supervised learning methods.

VISSL has the following features:

  • Mixed precision from the NVIDIA Apex library, which reduces memory requirements.
  • PyTorch’s gradient checkpointing, which helps train models with large batch sizes by trading extra computation for lower memory use.
  • The sharded optimiser from the FairScale library, which reduces memory usage.
  • Optimisations for online self-supervised training.
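
A rough back-of-the-envelope sketch shows why these memory features matter at the billion-parameter scale. The numbers below are illustrative assumptions, not measurements from SEER or VISSL: the sketch assumes an optimiser that keeps two 32-bit values per parameter (as Adam-style optimisers do, which is not necessarily what SEER used) and eight workers sharing the optimiser state.

```python
def tensor_gb(n_values, bytes_per_value):
    """Memory needed to store n_values numbers, in gigabytes (10**9 bytes)."""
    return n_values * bytes_per_value / 1e9


def optimizer_state_gb(n_params, states_per_param=2, bytes_per_state=4, n_shards=1):
    """Optimiser state held by each worker when the state is split n_shards ways."""
    return tensor_gb(n_params * states_per_param, bytes_per_state) / n_shards


n_params = 1_000_000_000  # a hypothetical billion-parameter model

# Precision: the same values stored as 32-bit versus 16-bit floats.
fp32_gb = tensor_gb(n_params, 4)
fp16_gb = tensor_gb(n_params, 2)

# Sharding: every worker keeps all optimiser state versus one eighth of it.
full_state_gb = optimizer_state_gb(n_params)
sharded_state_gb = optimizer_state_gb(n_params, n_shards=8)

print(fp32_gb, fp16_gb, full_state_gb, sharded_state_gb)
```

Under these assumptions, sharding reduces the per-worker optimiser state in proportion to the number of workers, while lower precision halves the footprint of whatever is stored in 16 bits. Real savings depend on the optimiser, the model, and which tensors are actually kept in reduced precision.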

Wrapping Up

Self-supervised learning eliminates the need for human annotations and metadata. Other advantages include:

  • It enables the computer vision community to work with larger and more diverse data sets.
  • It lets models learn from random, unlabelled images.
  • It can mitigate biases that may creep in with data curation.
  • In areas such as medical imaging, where only limited datasets are available, it can help in specialising models.
  • It enables faster and more accurate responses to rapid innovations in the field of computer vision.

Shraddha Goled
I am a technology journalist with AIM. I write stories focused on the AI landscape in India and around the world with a special interest in analysing its long term impact on individuals and societies. Reach out to me at
