Going Beyond Large Language Models (LLMs)

"LLMs are encyclopaedic thieves," says Yoshua Bengio

What fuel is to vehicles, data is to AI systems. The buzz surrounding data in AI has reached a fever pitch, particularly in the context of large language models — the likes of GPT-3, PaLM, Bloom, etc. — but researchers are trying to solve several challenges that go beyond data and large language models. “Data is abundant, but accessibility is one of the issues,” said Yoshua Bengio in an exclusive conversation with Analytics India Magazine.

Further, he said that a significant issue is finding suitable data for different tasks and environments, where very little is typically available. He believes that causality in neural networks can help address this. “Interestingly, humans seem to be really good at dealing with the sparsity of data on a new task,” Bengio added.

To address such data concerns, self-supervised learning models, championed by Meta AI, were introduced in response to the limitations of supervised learning. One of the major issues with supervised models is their reliance on labelled data, which is expensive and sometimes practically impossible to obtain. They also face a bigger challenge as they scale: training on poor-quality or mislabelled data leads to more bias and incorrect output.

LLMs are Encyclopaedic Thieves

“Companies have pretty much exhausted the amount of data that is available on the internet. So, in other words, the current large language models are trained on everything that is available,” said Bengio. For instance, ChatGPT, which has managed to enthral the world by answering in a “human-adjacent” manner, is based on the GPT-3.5 architecture with 175 billion parameters.

According to BBC Science Focus, the model was trained using internet databases that included a humongous 570 GB of data sourced from Wikipedia, books, research articles, websites, web texts and other forms of content. To put that in perspective, approximately 300 billion words were fed into the system.

The amount of text that humans produce will continue to increase, but Bengio believes we have essentially reached a limit. Further growth of systems like ChatGPT in terms of datasets is therefore constrained, and they still don’t do as well as humans in many respects. So, he added, it is interesting to ask what is driving the demand for the data these systems are trained on.

Bengio further said that the magnitude of data these systems need to reach their competence is roughly what a person would absorb by reading every day, every waking hour, all their life, and then living 1,000 such lives. As a result, he said, machines know much more than a four-year-old, or even an adult, yet they fail at reasoning. “LLMs are encyclopaedic thieves,” said Bengio, adding that they are not able to reason with that knowledge as consistently as humans.
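Bengio’s comparison can be made concrete with a rough, illustrative calculation. All the human-reading figures below (reading speed, waking hours, lifespan) are assumptions, not numbers from the interview; under these assumptions, the 300-billion-word corpus reported for GPT-3 alone already amounts to dozens of reading lifetimes, and the multi-trillion-token corpora of newer models push that into the hundreds or beyond.

```python
# Back-of-envelope: how many human reading lifetimes does a 300-billion-word
# training corpus represent? All human-side figures here are assumptions.
WORDS_PER_MINUTE = 200       # assumed adult reading speed
WAKING_HOURS_PER_DAY = 16    # assumed waking time spent reading
YEARS_PER_LIFE = 80          # assumed lifespan
CORPUS_WORDS = 300e9         # figure reported for GPT-3's training data

# Words a person could read in one lifetime under these assumptions
words_per_life = WORDS_PER_MINUTE * 60 * WAKING_HOURS_PER_DAY * 365 * YEARS_PER_LIFE

lifetimes = CORPUS_WORDS / words_per_life
print(f"~{words_per_life / 1e9:.1f}B words per lifetime")
print(f"~{lifetimes:.0f} reading lifetimes for the corpus")
```

The exact multiple is highly sensitive to these assumptions, which is why such comparisons are best read as orders of magnitude rather than precise figures.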

“My belief is that more data is better and bigger networks are better. But we’re still missing some important ingredients to achieve the kind of intelligence that humans have,” said Bengio.

Meanwhile, Yann LeCun told AIM that in his view the prime issue is not data unavailability but that systems cannot take advantage of the data that is available. For example, the amount of language exposure an infant needs to learn a language is tiny compared with the billions of texts or images that language models must be exposed to in order to perform well.

Tasmia Ansari
Tasmia is a tech journalist at AIM, looking to bring a fresh perspective to emerging technologies and trends in data science, analytics, and artificial intelligence.
