It all began with reports of OpenAI’s latest model, called Q*, which could reportedly solve maths problems and demonstrated superior reasoning capabilities.
Created by OpenAI’s chief scientist Ilya Sutskever, Q* is notable for incorporating computer-generated, or synthetic, data in its research. This contrasts with methods that rely on real-world information, such as text or images sourced from the internet, as was done in the training of GPT.
This has triggered a discussion within the tech community about whether synthetic data would lead to AGI.
Not everyone believes in synthetic data
Meta’s AI scientist Yann LeCun has a very different viewpoint from OpenAI and believes that the combination of LLMs and synthetic data may not necessarily lead to AGI.
To put his viewpoint across, he expressed dissatisfaction with OpenAI’s Q*. “Please ignore the deluge of complete nonsense about Q*. One of the main challenges to improve LLM reliability is to replace Auto-Regressive token prediction with planning,” he posted on X.
LeCun has consistently argued for quite some time that, in order to achieve AGI, the reasoning capability of LLMs needs improvement, rather than simply bringing in more data.
Citing animals and humans as examples, he noted that they become smarter with vastly smaller amounts of training data. LeCun is betting on new architectures that would learn as efficiently as animals and humans do. “Using more data (synthetic or not) is a temporary stopgap made necessary by the limitations of our current approaches,” he added in his post on X.
To add to LeCun’s argument, Bojan Tunguz, machine learning scientist at NVIDIA, stated, “For tabular datasets, with which I have the most experience, synthetic data is worse than useless. I’ve heard similar stories from people who use it for training autonomous vehicles.”
Likewise, according to Jim Fan, senior AI scientist at NVIDIA, synthetic data is anticipated to play a significant role, but blind scaling alone will not be sufficient to reach AGI.
Moreover, unlike human-generated data, which is limited in quantity, synthetic data can be produced in far greater volume. Musk remarked, “It’s a little sad that you can fit the text of every book ever written by humans on one hard drive (sigh). Synthetic data will exceed that by a zillion,” bringing back the question of whether LLMs can be fed enough data.
OpenAI might be onto something big
Two years ago, Andrej Karpathy, as Tesla’s head of AI and computer vision, began working with synthetic data for auto-labelling, which involved tagging information in the images collected by Tesla’s fleet.
Interestingly, now at OpenAI, Karpathy might be onto something big, as his latest cryptic post on X said, “Thinking of centralisation and decentralisation lately”. He hinted that he might be thinking of building an AI system that uses both centralised and decentralised LLM models to give better results.
At AIM, we would like to refer to this new architecture as Hybrid LLMs, which might utilise synthetic data as required between the two LLMs, rather than necessarily involving the entire dataset.
Meanwhile, LeCun thinks that Q* might be OpenAI’s attempt at ‘planning’, which refers to a branch of AI that involves creating a sequence of actions or decisions to achieve a specific goal. Unlike some other machine learning approaches that focus on learning from data (like supervised learning), planning is more concerned with generating a series of steps or actions to reach a desired outcome.
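To make the idea of planning concrete, here is a minimal sketch of planning as search over action sequences. The toy domain (integer states, with ‘+1’ and ‘*2’ actions) and the `plan` helper are purely illustrative assumptions, not anything tied to Q* itself:

```python
from collections import deque

def plan(start, goal, max_depth=20):
    """Breadth-first search for a sequence of actions reaching the goal state."""
    actions = {"+1": lambda x: x + 1, "*2": lambda x: x * 2}
    queue = deque([(start, [])])   # (current state, actions taken so far)
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if state == goal:
            return path            # the plan: an ordered list of actions
        if len(path) >= max_depth:
            continue
        for name, fn in actions.items():
            nxt = fn(state)
            if nxt not in seen and nxt <= goal * 2:  # prune runaway states
                seen.add(nxt)
                queue.append((nxt, path + [name]))
    return None

print(plan(1, 10))  # → ['+1', '*2', '+1', '*2'], i.e. 1→2→4→5→10
```

Because breadth-first search expands shorter paths first, the returned plan is a shortest sequence of actions, which mirrors the idea of deliberately composing steps toward a goal rather than learning a direct input-to-output mapping.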
Interestingly, OpenAI is exploring something similar to planning with Q-learning and PPO as model-free approaches. Q-learning doesn’t require a pre-defined model of the environment, allowing the AI agent to learn and predict autonomously by interacting with the environment.
Here, synthetic data can be used to generate realistic training environments for Q-learning agents, which can help them to learn more effectively.
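As a rough illustration of this, here is a minimal tabular Q-learning sketch on a small, synthetically generated environment: a one-dimensional corridor with a goal at the right end. The environment, rewards, and hyperparameters are all illustrative assumptions, not OpenAI’s actual setup:

```python
import random

N = 6                # corridor length; state N - 1 is the goal
ACTIONS = (-1, 1)    # move left or right

def step(state, action):
    """Synthetic environment dynamics: move one cell along the corridor."""
    next_state = min(max(state + action, 0), N - 1)
    reward = 1.0 if next_state == N - 1 else 0.0
    return next_state, reward, next_state == N - 1

# Q-table: one value per (state, action) pair. No model of the environment
# is assumed; the agent learns purely by interacting with it.
Q = {(s, a): 0.0 for s in range(N) for a in ACTIONS}

def greedy(state):
    """Best-known action in a state, breaking ties randomly."""
    best = max(Q[(state, a)] for a in ACTIONS)
    return random.choice([a for a in ACTIONS if Q[(state, a)] == best])

alpha, gamma, epsilon = 0.5, 0.9, 0.1   # learning rate, discount, exploration
random.seed(0)

for _ in range(500):                     # training episodes
    state = 0
    for _ in range(100):                 # cap episode length
        action = random.choice(ACTIONS) if random.random() < epsilon else greedy(state)
        next_state, reward, done = step(state, action)
        # Q-learning update: bootstrap from the best action in the next state
        target = reward + gamma * max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (target - Q[(state, action)])
        state = next_state
        if done:
            break

# The learned greedy policy moves right (+1) from every non-goal state.
print({s: greedy(s) for s in range(N - 1)})
```

The point of the sketch is that the “environment” here is entirely computer-generated, yet the agent still learns a useful policy from it, which is the sense in which synthetic training environments can help Q-learning agents learn effectively.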
Furthermore, LeCun mentioned that OpenAI recently hired former Meta research scientist Noam Brown. Interestingly, two months ago, Brown posted on LinkedIn that OpenAI is hiring ML engineers for research on multi-step reasoning with LLMs.
He added that OpenAI recently achieved a new state-of-the-art result in math problem-solving (78% on the Hendrycks MATH benchmark), similar to what Q* reportedly accomplished. It appears that exploiting synthetic data might require a new architecture, distinct from the LLM, to better enable reasoning and progress toward AGI.