Going Beyond Large Language Models (LLMs)

"LLMs are encyclopaedic thieves," says Yoshua Bengio

What fuel is to vehicles, data is to AI systems. The buzz surrounding data in AI has reached a fever pitch, particularly in the context of large language models — the likes of GPT-3, PaLM and Bloom — but researchers are trying to solve several challenges that go beyond data and large language models. “Data is abundant, but accessibility is one of the issues,” said Yoshua Bengio, in an exclusive conversation with Analytics India Magazine.

Further, he said that a significant issue is having suitable data for different tasks and environments — a scenario where, according to Bengio, very little data is available. He believes that introducing causality into neural networks can help solve this issue. “Interestingly, humans seem to be really good at dealing with the sparsity of data on a new task,” Bengio added.

To address these data sparsity concerns, self-supervised learning models were initially introduced by Meta AI in response to the challenges of supervised learning. One of the major issues with supervised learning is acquiring labelled data, which is expensive and sometimes practically impossible. Supervised models face a bigger challenge still: as they scale, they can be trained on poor-quality or mislabelled data, leading to more bias and false outputs.


LLMs are Encyclopaedic Thieves

“Companies have pretty much exhausted the amount of data that is available on the internet. So, in other words, the current large language models are trained on everything that is available,” said Bengio. For instance, ChatGPT, which has managed to enthral the world by answering in a “human-adjacent” manner, is based on the GPT-3.5 architecture, which has 175 billion parameters.

According to BBC Science Focus, the model was trained using internet databases that included a humongous 570 GB of data sourced from Wikipedia, books, research articles, websites, web texts and other forms of content. To give you an idea, approximately 300 billion words were fed into the system.

The amount of text that humans produce will continue to increase, but we have, in a sense, reached a limit, Bengio believes. Further growth of systems like ChatGPT in terms of datasets is therefore constrained, and they still do not perform as well as humans in many respects. So, it is interesting to ask what will feed the demand for the data these systems are trained on, he added.

Bengio further said that the magnitude of data these systems need to attain their competence is roughly equivalent to a person reading every day, every waking hour, all their life — and then living 1,000 such lives. Machines, he noted, know much more than a four-year-old, or even an adult; however, they fail at reasoning. “LLMs are encyclopaedic thieves,” said Bengio, adding that they are not able to reason with that knowledge as consistently as humans do.

“My belief is that more data is better and bigger networks are better. But we’re still missing some important ingredients to achieve the kind of intelligence that humans have,” said Bengio.

Meanwhile, Yann LeCun told AIM that he believes the prime issue is not data unavailability, but that systems cannot take advantage of the data that is available. For example, the exposure to language an infant needs in order to learn a language is very small compared with the billions of texts or images that language models must be exposed to in order to perform well.

Tasmia Ansari
Tasmia is a tech journalist at AIM, looking to bring a fresh perspective to emerging technologies and trends in data science, analytics, and artificial intelligence.
