
Synthetic Data Alone won’t Achieve AGI 

LeCun thinks that Q* might be OpenAI’s attempt at “Planning”


It all began with reports of OpenAI’s latest model, called Q*, which could reportedly solve maths problems and demonstrate superior reasoning capabilities.

Created under OpenAI’s chief scientist Ilya Sutskever, Q* is notable for incorporating computer-generated, or synthetic, data in its research. This is in contrast to methods that rely on real-world information, such as text or images sourced from the internet, as was done in the training of GPT.

This has triggered a discussion within the tech community about whether synthetic data would lead to AGI.

Not everyone believes in synthetic data

Meta’s chief AI scientist Yann LeCun has a very different viewpoint from OpenAI and believes that the combination of LLMs and synthetic data will not necessarily lead to AGI.

To put his viewpoint across, he expressed dissatisfaction with OpenAI’s Q*. “Please ignore the deluge of complete nonsense about Q*. One of the main challenges to improve LLM reliability is to replace Auto-Regressive token prediction with planning,” he posted on X. 

LeCun has been arguing for quite some time that, in order to achieve AGI, the reasoning capability of LLMs needs to improve, rather than simply bringing in more data.

Citing the example of animals and humans, he said that they get smarter from vastly smaller amounts of training data. LeCun is betting on new architectures that would learn as efficiently as animals and humans do. “Using more data (synthetic or not) is a temporary stopgap made necessary by the limitations of our current approaches,” he added in his post on X.

To add to LeCun’s argument, Bojan Tunguz, machine learning scientist at NVIDIA, stated, “For tabular datasets, with which I have the most experience, synthetic data is worse than useless. I’ve heard similar stories from people who use it for training autonomous vehicles.”

Likewise, Jim Fan, senior AI scientist at NVIDIA, anticipates that synthetic data will play a significant role, but that blindly scaling it will not be sufficient to reach AGI.

Moreover, unlike human-generated data, which is limited in quantity, synthetic data can far surpass it in volume. Musk remarked, “It’s a little sad that you can fit the text of every book ever written by humans on one hard drive (sigh). Synthetic data will exceed that by a zillion,” bringing back the question of whether LLMs can take in enough data.

OpenAI might be onto something big 

Two years ago, Andrej Karpathy, as Tesla’s head of AI and computer vision, began working with synthetic data for auto-labeling, involving the tagging of information in the images collected by Tesla’s fleet. 

Interestingly, now at OpenAI, Karpathy might be on to something big, as his latest cryptic post on X said, “Thinking of centralisation and decentralisation lately”. He hinted that he may be thinking of building an AI system that uses both centralised and decentralised LLM models to give better results.

At AIM, we would like to refer to this new architecture as Hybrid LLMs, which might utilise synthetic data based on the requirements between two LLMs and may not necessarily involve the entire dataset.

Meanwhile, LeCun thinks that Q* might be OpenAI’s attempt at ‘planning’, which refers to a branch of AI that involves creating a sequence of actions or decisions to achieve a specific goal. Unlike some other machine learning approaches that focus on learning from data (like supervised learning), planning is more concerned with generating a series of steps or actions to reach a desired outcome.
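To make that distinction concrete, here is a minimal, illustrative sketch of planning as search (not OpenAI’s actual method): a breadth-first planner that produces a sequence of states leading from a start to a goal. The toy one-dimensional world and the `plan` helper are assumptions for illustration only.

```python
from collections import deque

def plan(start, goal, neighbors):
    """Breadth-first search over states: returns the shortest path start -> goal."""
    frontier = deque([[start]])
    visited = {start}
    while frontier:
        path = frontier.popleft()
        state = path[-1]
        if state == goal:
            return path  # the "plan": a sequence of states to execute
        for nxt in neighbors(state):
            if nxt not in visited:
                visited.add(nxt)
                frontier.append(path + [nxt])
    return None  # no sequence of actions reaches the goal

# Toy 1-D world with states 0..4; from each state you can step left or right.
steps = plan(0, 3, lambda s: [n for n in (s - 1, s + 1) if 0 <= n <= 4])
print(steps)  # [0, 1, 2, 3]
```

The point is the contrast LeCun draws: the planner decides by searching over future steps before committing, whereas an auto-regressive LLM commits to one token at a time.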

Interestingly, OpenAI is exploring something similar to planning with Q-learning and PPO, both model-free approaches. Q-learning doesn’t require a pre-defined model of the environment; the agent learns to act by iterating in the environment and updating its value estimates from the transitions it observes.

Here, synthetic data can be used to generate realistic training environments for Q-learning agents, which can help them to learn more effectively.

Furthermore, LeCun mentioned that OpenAI recently hired former Meta research scientist Noam Brown. Interestingly, two months ago, Brown posted on LinkedIn that OpenAI is hiring ML engineers for research on multi-step reasoning with LLMs.

He added that OpenAI recently achieved a new state-of-the-art result in math problem-solving (78% on the Hendrycks MATH benchmark), similar to what Q* reportedly accomplished. It appears that synthetic data might require a new architecture, distinct from an LLM, to better enable reasoning and progress toward AGI.


Siddharth Jindal

Siddharth is a media graduate who loves to explore tech through journalism and putting forward ideas worth pondering about in the era of artificial intelligence.