What fuel is to vehicles, data is to AI systems. The buzz surrounding data in AI has reached a fever pitch, particularly in the context of large language models such as GPT-3, PaLM and Bloom, but researchers are trying to solve several challenges that go beyond data and large language models. “Data is abundant, but accessibility is one of the issues,” said Yoshua Bengio, in an exclusive conversation with Analytics India Magazine.
Further, he said that a significant issue is having suitable information for different tasks and environments. Bengio said there is very little data when it comes to this scenario. He believes that causality in neural networks can help solve this issue. “Interestingly, humans seem to be really good at dealing with the sparsity of data on a new task,” Bengio added.
To address data sparsity, self-supervised learning models were introduced, with Meta AI among their prominent proponents, in response to the challenges of supervised learning. One major issue was acquiring labelled data, which is expensive and sometimes practically impossible. Supervised models face an even bigger challenge when they are fed poor-quality data: as models scale, they can end up trained on mislabelled examples, leading to more bias and false outputs.
LLMs are Encyclopaedic Thieves
“Companies have pretty much exhausted the amount of data that is available on the internet. So, in other words, the current large language models are trained on everything that is available,” said Bengio. For instance, ChatGPT, which has managed to enthral the world by answering in a “human-adjacent” manner, is based on the GPT-3.5 architecture, which has 175B parameters.
According to BBC Science Focus, the model was trained using internet databases that included a humongous 570 GB of data sourced from Wikipedia, books, research articles, websites, web texts and other forms of content. To give you an idea, approximately 300 billion words were fed into the system.
The amount of text that humans produce will continue to increase, but we have more or less reached a limit, Bengio believes. Further growth of systems like ChatGPT in terms of datasets is limited, and they still don’t do as well as humans in many respects. So it is interesting to ask what is driving the demand for the data these systems are trained on, he added.
Bengio further said that the amount of data these systems need to become competent is roughly what a person would take in by reading every day, every waking hour, all their life, and then living 1,000 lives. Machines, he said, know much more than a four-year-old or even an adult, yet they fail at reasoning. “LLMs are encyclopaedic thieves,” said Bengio, adding that they are not able to reason with that knowledge as consistently as humans.
Meanwhile, Yann LeCun told AIM that he believes the prime issue is not data unavailability but that systems can’t take advantage of the available data. For example, the exposure to language that an infant needs to learn it is very small compared to the billions of texts or images that language models have to be exposed to in order to perform well.