Last updated December 19, 2022
The Missing Link of Self-Supervised Learning

Data scarcity is seen as a major bottleneck to AI progress, but Meta AI’s Yann LeCun thinks otherwise.

Published on December 13, 2022

by Ayush Jain

A great example of self-supervised learning is the way humans learn. We “learn from experience” by observing the world around us, experimenting and testing what we find. In a recent interaction with Analytics India Magazine, the guru of self-supervised learning, Yann LeCun, explained why such methods are key to the future of artificial intelligence.

Offering a human comparison, LeCun said that the average human can process about ten images per second, or roughly one image every 100 milliseconds. By the time humans are five years old, they have already seen about a billion frames. Interestingly, Google, Instagram, and YouTube produce the same number of images in hours. “We have more data than we can use, but we don’t know how to use it,” said LeCun, against the backdrop of some of the challenges faced by self-supervised learning models.
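The “billion frames by age five” figure can be sanity-checked with simple arithmetic. The sketch below uses the article’s rate of ten images per second; the 16 waking hours per day is an assumption made here for illustration, not a number LeCun gave.

```python
# Back-of-the-envelope check of the "about a billion frames by age five" figure.
FRAMES_PER_SECOND = 10      # from the article: roughly one image every 100 ms
WAKING_HOURS_PER_DAY = 16   # assumption for illustration, not from the article
YEARS = 5


def frames_seen(years, waking_hours_per_day, frames_per_second):
    """Total frames seen, counting only waking time."""
    seconds_awake = years * 365 * waking_hours_per_day * 3600
    return seconds_awake * frames_per_second


total = frames_seen(YEARS, WAKING_HOURS_PER_DAY, FRAMES_PER_SECOND)
print(f"{total:,} frames")  # prints "1,051,200,000 frames"
```

Under these assumptions the total comes to roughly 1.05 billion frames, in line with the figure quoted. Young children sleep more than adults, so a lower waking-hours assumption (say 12 hours) would give closer to 0.8 billion, which is still the same order of magnitude.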

Limitations of self-supervised learning

Self-supervised learning models initially emerged in response to the challenges posed by supervised learning models. Chief among these is the need for labelled data, which is expensive and sometimes practically impossible to obtain. As a result, from a purely pragmatic view of short-term applications, there is a huge push to deploy more powerful self-supervised learning models.

On the flip side, self-supervised models face a much bigger issue: they can be fed poor-quality data. As models scale, so does the chance that they are trained on mislabelled data, leading to more bias and false output.

LeCun, however, sees it differently. He believes that the main issue is not the unavailability of data but how learning systems can take advantage of the data that is available. For example, the amount of language an infant needs to hear in order to learn it is quite small compared to the billions of words or pieces of text that language models have to be exposed to in order to perform well.

Similarly, machines using reinforcement learning can do well at games like Chess or Go, which are designed to be difficult for humans. But achieving such a feat requires an enormous amount of data, equivalent to several lifetimes of full-time play by humans. In a nutshell, machines are not very efficient at using data. A good way to progress here, according to LeCun, would be to discover new learning paradigms that allow machines to learn with less data.

LeCun, in a recent tweet, said that the impact of self-supervised models has been much larger than he had predicted. The success of models like ChatGPT and text-to-anything generation, along with the advancements made in protein folding models, attests to that.

Problems galore

Self-supervised learning as an ideal only works for large corporations like Meta, which possess terabytes of data to train state-of-the-art models. There are several other challenges as well. For one, as opposed to a supervised learning model, a self-supervised model minimises the human’s role in the process. This means there is a high chance that it would mislabel data, leading to errors in the output. Moreover, the costs of bad data have been hefty for businesses, with Gartner claiming that, on average, businesses lose nearly $9.7 million each year.

Contrary to LeCun’s claims, several researchers believe that we may run out of data. For instance, in analysing the growth of dataset sizes in machine learning, Villalobos et al. estimated that the total stock of unlabelled data will be exhausted within the coming decades. Their projections suggest that by 2026, we will be nearing the end of high-quality data, whereas low-quality data will last until sometime between 2030 and 2050. Thus, the growth of ML models might slow down unless data efficiency becomes a focus or new data sources are made available.

Likewise, Manu Joseph, creator of PyTorch Tabular, told AIM, “Collecting more data to train LLMs is a challenge since there is a shortage of good-quality test data, and most of the text on the internet is duplicated.”

However, the mountainous task of improving data efficiency is not yet a lost cause.

Solving one model at a time

Take, for instance, a recent study which showed that large language models (LLMs) can self-improve even with unlabelled datasets. Prior to this study, research suggested that fundamentally improving model performance above few-shot baselines still required fine-tuning on a sizable amount of high-quality supervised data. According to the study, however, the models could enhance their performance on reasoning datasets by training on their own generated labels, given input questions only. The research also shows that an LLM can self-improve even on its own generated questions and few-shot Chain-of-Thought prompts.
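The article does not spell out how a model ends up training on its own labels. One plausible reading, sketched below as an assumption rather than as the study’s confirmed method, is to sample several answers per question, take the majority answer as a pseudo-label, and keep only the questions where the samples mostly agree. Here `sample_answers` is a hypothetical stand-in for a real model call, and the sample count and agreement threshold are illustrative.

```python
import random
from collections import Counter


def sample_answers(question, n_samples, rng):
    """Hypothetical stand-in for sampling n answers from an LLM.

    For an addition question (a, b) it returns the right sum most of the
    time and an off-by-one error otherwise, to mimic a noisy model.
    """
    a, b = question
    return [a + b if rng.random() < 0.7 else a + b + 1 for _ in range(n_samples)]


def self_label(questions, n_samples=15, min_agreement=0.6, seed=0):
    """Build pseudo-labels by majority vote over sampled answers."""
    rng = random.Random(seed)
    pseudo_labelled = []
    for question in questions:
        answers = sample_answers(question, n_samples, rng)
        answer, votes = Counter(answers).most_common(1)[0]
        agreement = votes / n_samples
        # Keep only questions where the samples mostly agree with each other.
        if agreement >= min_agreement:
            pseudo_labelled.append((question, answer, agreement))
    return pseudo_labelled


questions = [(2, 3), (10, 7), (4, 4)]
for question, label, agreement in self_label(questions):
    print(question, "->", label, f"(agreement {agreement:.2f})")
```

The resulting (question, label) pairs would then serve as fine-tuning data without any human annotation. The agreement filter is the design choice that matters: it trades dataset size for label quality.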

Further, the DeepMind research team recently released a paper showcasing that its Epistemic Neural Networks (ENNs) enable fine-tuning large models with 50% less data. Going beyond the traditional framework of Bayesian neural networks, the team introduced ENNs, which are designed using an epinet, “an architecture that can supplement any conventional neural network, including large pretrained models, and can be trained with modest incremental computation to estimate uncertainty.” The researchers claim that ENNs will considerably improve the tradeoffs between prediction quality and computation.

An important issue with LLMs, they highlight, is that they cannot distinguish irreducible uncertainty over the next token from uncertainty that could be resolved with more training data. So, the team takes a different approach, relying on the epinet’s uncertainty estimates to help models “know what they don’t know” and increase their data efficiency, in contrast to current approaches to the problem, which generally require adding more training data.
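To make the idea of supplementing a conventional network more concrete, the toy sketch below shows only the additive structure: a fixed base predictor plus a small term that depends on a randomly drawn index `z`, with the spread of predictions across many draws of `z` read as an uncertainty estimate. Everything here is an assumption for illustration: the function names, the linear base model and the hand-picked weights are hypothetical, nothing is trained, and this is not DeepMind’s implementation.

```python
import random
import statistics


def base_net(x):
    """Stand-in for a frozen pretrained model: a fixed linear predictor."""
    return 2.0 * x + 1.0


def epinet(x, z, weights):
    """Small additive term that depends on the input and a random index z."""
    return sum(w * z_i for w, z_i in zip(weights, z)) * x


def enn_predict(x, z, weights):
    """Combined prediction: base output plus the index-dependent correction."""
    return base_net(x) + epinet(x, z, weights)


def predict_with_uncertainty(x, weights, n_index_samples=200, seed=0):
    """Mean and spread of predictions over many random draws of the index z."""
    rng = random.Random(seed)
    outputs = []
    for _ in range(n_index_samples):
        z = [rng.gauss(0.0, 1.0) for _ in weights]
        outputs.append(enn_predict(x, z, weights))
    return statistics.mean(outputs), statistics.stdev(outputs)


weights = [0.1, 0.2]  # hand-picked for illustration; a real epinet learns these
for x in (0.0, 1.0, 2.0):
    mean, spread = predict_with_uncertainty(x, weights)
    print(f"x={x}: prediction={mean:.3f}, uncertainty={spread:.3f}")
```

In this toy the spread simply grows with the size of the input. In a trained model the intent would be for it to shrink where data is plentiful and stay large where it is not, which is the signal that could be used to pick the most informative training examples.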

PS: The story was written using a keyboard.

Ayush Jain

Ayush is interested in knowing how technology shapes and defines our culture, and our understanding of the world. He believes in exploring reality at the intersections of technology and art, science, and politics.
