Last updated April 24, 2024

‘We May be Able to Create an Infinite Data Generation Engine with Synthetic Data,’ says Anthropic CEO

“If you remember AlphaGo, note that the model there just trains against itself, using only the rules of Go to adjudicate,” said Anthropic CEO Dario Amodei.

Published on April 24, 2024

by Donna Eva

Despite scepticism about whether synthetic data can match the quality of real data, Anthropic chief Dario Amodei believes it is possible to create an infinite data generation engine that can help build better AI systems.

“If you do it right, with just a little bit of additional information, I think it may be possible to get an infinite data generation engine,” said Amodei in an interview with CNBC while discussing the challenges and potential of using synthetic data to train AI models.

“We are working on several methods for developing synthetic data. These are ideas where you can take real data present in the model and have the model interact with real data in some way to produce additional or different data,” explained Amodei.

Citing AlphaGo, he said it is possible to inject a very small amount of new information and get out more than you started with.

“If you go back to systems eight years ago, so if you remember AlphaGo, note that the model there just trains against itself with nothing other than the rules of Go to adjudicate,” he added, saying that those little rules of Go, the little additional piece of information, is enough to take the model from “no ability at all to smarter than the best human at Go.”

Amodei's point is that a small amount of extra information, used well, could drive an effectively unlimited supply of training data. For context, the original AlphaGo was first bootstrapped from records of human expert games and then improved through reinforcement learning with self-play. Its successor, AlphaGo Zero, learned from self-play alone, given only the rules of Go.
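
To make the self-play idea concrete, here is a minimal sketch in Python. It is not how AlphaGo works (that system combined neural networks with tree search). It assumes a toy stone-taking game and a random policy, and the function names are hypothetical. The only outside information is the rule set, which decides who wins and so labels every position for free.

```python
import random

def legal_moves(stones):
    # The only knowledge supplied: the rules. Take 1, 2 or 3 stones per turn.
    return [m for m in (1, 2, 3) if m <= stones]

def self_play_game(rng, start=10):
    """One game of a random policy against itself, labelled by the outcome."""
    stones, player, history = start, 0, []
    while stones > 0:
        move = rng.choice(legal_moves(stones))
        history.append((stones, player, move))
        stones -= move
        player = 1 - player
    winner = 1 - player  # the player who took the last stone wins
    # Each record: (stones before the move, move, +1 if the mover won else -1).
    return [(s, m, 1 if p == winner else -1) for s, p, m in history]

def generate_dataset(n_games, seed=0):
    """Generate as many labelled examples as requested, with no human data."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_games):
        data.extend(self_play_game(rng))
    return data
```

Calling generate_dataset with a larger n_games yields more labelled examples at no extra labelling cost. A real system would also train the policy on this data and play again, so that the games improve over time.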

Meta’s AI chief, Yann LeCun, a self-supervised-learning proponent, has slightly different views, criticising reinforcement learning for being inefficient and impractical for real-world applications when used on its own.

“A lot of the success of machine learning at least until fairly recently was mostly with supervised learning. Reinforcement learning gave some people a lot of hope, but turned out to be so inefficient as to be almost impractical in the real world, at least in isolation, unless you rely much more on something called self-supervised learning, which is really what has brought about the big revolution that we’ve seen in AI over the last few years,” said LeCun.

Self-supervised learning is a technique in which the model discovers patterns and structure in data on its own, without explicit labels.

Other techniques

Besides reinforcement learning and self-supervised learning, LeCun also discussed other techniques for generating data and training AI systems. These include generative models such as GANs and VAEs, which generate new data by learning the distribution of existing data. “There are systems of this type that have been trained to produce images and they use other techniques like diffusion models,” he added.

Predictive learning is another method: the model forecasts future states or fills in missing parts of the data, and in doing so learns representations and dynamics. “A particular way of doing it is you take a piece of data… and then you train some gigantic neural net to predict the words that are missing,” said LeCun.
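
The data-preparation step behind this kind of missing-word prediction can be sketched as follows. This is an illustration only: the mask token, the 15% mask rate and the function name are assumptions, and the neural net that would be trained on these pairs is left out. The input is assumed to be a non-empty list of tokens.

```python
import random

MASK = "[MASK]"  # placeholder token, an assumption for this sketch

def make_masked_example(tokens, rng, mask_rate=0.15):
    """Hide a few tokens; return the masked sequence and the hidden answers."""
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for i in positions:
        targets[i] = masked[i]  # what the model should predict at position i
        masked[i] = MASK
    return masked, targets

tokens = "the model learns to predict missing words from context".split()
masked, targets = make_masked_example(tokens, random.Random(0))
```

The training labels come from the text itself, so no human annotation is needed, which is what makes the approach self-supervised.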

Then, there are energy-based models, which assign a single score, called an energy, to each configuration of the variables, with lower energy indicating a more compatible configuration. They can support various tasks, including generation and classification. “Energy-based models learn a scalar energy for each configuration of the variables of interest,” explained LeCun.
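
As a rough sketch of the scalar-energy idea, the snippet below uses a hand-written energy function rather than a learned one. The quadratic form and the parameter w are assumptions chosen for illustration. In a trained model, w would be adjusted so that observed pairs receive low energy.

```python
def energy(x, y, w=2.0):
    """Scalar energy of the pair (x, y); zero when y equals w * x."""
    return (y - w * x) ** 2

def infer(x, candidates, w=2.0):
    """Inference: pick the candidate y with the lowest energy for this x."""
    return min(candidates, key=lambda y: energy(x, y, w))

best = infer(3.0, [0.0, 2.0, 4.0, 6.0, 8.0])
```

Choosing among class labels and choosing among generated outputs are both cases of the same search for a low-energy configuration.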

The joint embedding predictive architecture (JEPA) is another technique for training AI systems. It encodes both the input and the target, then predicts the target's embedding from the input's embedding, which helps the model learn complex relationships in the data. “Instead of reconstructing y from x, you run both x and y through encoders… you do the prediction in representation space,” he explained.
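
The quoted description can be sketched as a loss computation. This is a simplified illustration, not LeCun's implementation: the encoders and predictor are fixed linear maps written as nested lists, and all names are hypothetical. A real system would train these components and would need a mechanism to stop the encoders from collapsing to a constant output.

```python
def linear(weights, vec):
    """Apply a weight matrix (list of rows) to a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def jepa_loss(x, y, enc_x, enc_y, predictor):
    """Squared error between predicted and actual embeddings of y."""
    s_x = linear(enc_x, x)            # embed the input
    s_y = linear(enc_y, y)            # embed the target
    s_y_hat = linear(predictor, s_x)  # predict the target embedding
    return sum((a - b) ** 2 for a, b in zip(s_y_hat, s_y))

# Toy encoders keep the first two coordinates and drop the third.
keep_two = [[1, 0, 0], [0, 1, 0]]
identity = [[1, 0], [0, 1]]

loss_same = jepa_loss([1, 2, 3], [1, 2, 9], keep_two, keep_two, identity)
loss_diff = jepa_loss([1, 2, 3], [2, 2, 9], keep_two, keep_two, identity)
```

Here loss_same is zero even though the third coordinates of x and y differ, because the encoder discards that detail. The error is measured in representation space, not on the raw data.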

Latent variable models also help in data generation. These models integrate hidden variables that explain inherent data variability, which is essential for complex generative tasks. “Latent variable models consist in models that have a latent variable z that is not given to you during training or during tests that you have to infer the value of,” mentioned LeCun.

Lastly, there is hierarchical planning. This technique is crucial for enabling AI systems to operate in complex, real-world environments where decisions need to be made at both strategic and tactical levels. Here, LeCun gave an example of planning a trip to Paris through high-level tasks (like getting to the airport) and detailed steps (like navigating to the departure gate), touching upon the reasoning aspect.
