Last month, Lex Fridman sat down with Yann LeCun and Yoshua Bengio at Meta’s Inside the Lab event. The trio discussed the latest advances in AI and machine learning and possible paths to human-level intelligence.
“I believe we are still far from human-level AI,” said Bengio. One way to think about the gap, he said, is to look at the problems humans are good at tackling compared to machines, and to take inspiration from how the brain switches between different modes of processing.
“We know a lot about conscious processing that we can integrate into machine learning. This includes how knowledge is represented in a modular way, how pieces of knowledge can be reused to solve new tasks on the fly, how species communicate with each other through stochastic hard attention, and so on,” Bengio explained.
“All that can be done with neural networks,” said Bengio.
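Stochastic hard attention, for instance, boils down to sampling one discrete choice, say, one specialised module out of several, instead of softly blending them all. The PyTorch sketch below is a toy illustration of that idea under our own assumptions; the router and module names are invented for the example and are not from Bengio’s work.

```python
import torch
import torch.nn as nn

class HardAttentionRouter(nn.Module):
    """Routes each input through ONE expert module, chosen by sampling
    from a categorical distribution (hard, stochastic attention)."""
    def __init__(self, dim: int, n_modules: int = 4):
        super().__init__()
        self.scorer = nn.Linear(dim, n_modules)   # produces attention logits
        self.experts = nn.ModuleList(
            nn.Linear(dim, dim) for _ in range(n_modules)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        dist = torch.distributions.Categorical(logits=self.scorer(x))
        idx = dist.sample()                       # one hard choice per input
        return torch.stack(
            [self.experts[int(i)](xi) for i, xi in zip(idx, x)]
        )

router = HardAttentionRouter(dim=8)
y = router(torch.randn(2, 8))                     # batch of 2 inputs
```

Because the choice is a discrete sample rather than a soft average, only the selected piece of knowledge is exercised for a given input, which is what makes the modules reusable and composable.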
Out of distribution
Bengio said training a system on a dataset poses practical problems. The data is collected in a particular way, perhaps in one country, and when the system is deployed in a different place or at a different time, it breaks down.
For instance, if you have learned to drive in North America, driving for the first time in London, where traffic keeps left, can be challenging, Bengio explained.
“So, out of distribution is, you learn in one city, and you have to be able to transfer that and operate successfully in another city. That is a fundamental aspect of human-level intelligence. Humans are able to somehow do this kind of thing by taking a leap into the unknown,” said Bengio.
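To make the out-of-distribution problem concrete, here is a toy numpy illustration of our own (not from the talk): a model fit on one input distribution degrades sharply when the inputs shift, much like the North America-to-London example.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(mean, n=1000):
    x = rng.normal(mean, 1.0, size=n)
    y = np.sin(x)                      # the underlying "world" to capture
    return x, y

x_train, y_train = make_data(mean=0.0)   # training city
x_ood, y_ood = make_data(mean=3.0)       # deployment city: shifted inputs

coeffs = np.polyfit(x_train, y_train, deg=3)   # fit only on training data

mse_in = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
mse_out = np.mean((np.polyval(coeffs, x_ood) - y_ood) ** 2)
print(f"in-distribution MSE:     {mse_in:.3f}")
print(f"out-of-distribution MSE: {mse_out:.3f}")  # typically far larger
```

The model never sees inputs around the shifted mean, so its extrapolation fails there; generalising across such shifts is exactly the leap into the unknown Bengio describes.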

Machine intelligence
“We can see that humans and animals can learn new skills, or acquire new knowledge much faster than any of the AI systems that we’ve built so far, or that we have conceived,” said LeCun.
“So, what kind of learning do humans use that we are not able to reproduce in machines?” asked LeCun. A teenager can learn to drive with 15-20 hours of practice, whereas millions of hours of training across different environments are still not enough for cars to drive themselves with the same degree of reliability, he said.
Citing Bengio’s example of learning to drive in different settings, LeCun said the laws of physics do not change when you move from North America to Britain. This understanding allows the teenager to drive without having to drive off a cliff to see what happens. An AI system, by contrast, would have to run off the cliff to figure out that it is a bad idea, and probably do so a few thousand times before it learns how not to. That, said LeCun, is what we are missing.
Self-supervised learning
“We don’t know yet how to do this with machines, but we have a few ideas like self-supervised learning,” said LeCun. Meta has made major strides in self-supervised learning with MoCo, Textless NLP, DINO, 3DETR, DepthContrast, data2vec, etc.
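As a rough illustration of the masked-prediction flavour of self-supervised learning behind methods such as data2vec, the PyTorch sketch below trains a student network to predict a teacher’s representation of the full input from a masked view. The sizes, networks and update rule here are illustrative assumptions, not Meta’s code.

```python
import copy
import torch
import torch.nn as nn

student = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 64))
teacher = copy.deepcopy(student)            # EMA copy, never backpropagated
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

for step in range(200):
    x = torch.randn(16, 32)                 # a batch of unlabeled vectors
    mask = (torch.rand_like(x) > 0.5).float()

    with torch.no_grad():
        target = teacher(x)                 # representation of the full input
    pred = student(x * mask)                # student sees only the masked view
    loss = ((pred - target) ** 2).mean()

    opt.zero_grad()
    loss.backward()
    opt.step()

    with torch.no_grad():                   # slow exponential moving average
        for p_t, p_s in zip(teacher.parameters(), student.parameters()):
            p_t.mul_(0.99).add_(0.01 * p_s)
```

No labels appear anywhere: the supervisory signal is manufactured from the data itself, which is the defining trait of self-supervised learning.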
“We are sort of in the early steps in thinking about the world model,” said Bengio. He said the model needs to be structured to generalise for ‘out of distribution’ use cases. “I am currently thinking of new algorithms that precisely allow me to do these kinds of things,” he added.
Recently, Bengio and his collaborators published ‘Bayesian structure learning with generative flow networks’. The paper proposes GFlowNets as an alternative to MCMC (Markov Chain Monte Carlo), one of the most popular methods for sampling high-dimensional distributions, for approximating the posterior distribution over the structure of Bayesian networks given a dataset of observations. Evaluating on simulated and real data, the team showed that their approach, DAG-GFlowNet, provides an accurate approximation of the posterior over DAGs and compares well against methods based on MCMC or variational inference.
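The GFlowNet idea is easier to see on a toy problem than on Bayesian network structures. The PyTorch sketch below trains a tiny GFlowNet with the trajectory-balance objective to sample binary strings with probability proportional to a reward; it is a simplified illustration of our own, not the paper’s DAG-GFlowNet.

```python
import torch
import torch.nn as nn

N_BITS = 4

def reward(bits: torch.Tensor) -> torch.Tensor:
    return bits.sum() + 1.0            # arbitrary positive reward

# The policy sees the partial string plus a one-hot of the current position
policy = nn.Sequential(nn.Linear(2 * N_BITS, 32), nn.ReLU(), nn.Linear(32, 2))
log_z = nn.Parameter(torch.zeros(()))  # learnable log partition function
opt = torch.optim.Adam([*policy.parameters(), log_z], lr=1e-2)

for step in range(2000):
    state = torch.zeros(N_BITS)
    log_pf = torch.zeros(())           # sum of forward log-probabilities
    for i in range(N_BITS):            # build a string one bit at a time
        inp = torch.cat([state, torch.eye(N_BITS)[i]])
        dist = torch.distributions.Categorical(logits=policy(inp))
        bit = dist.sample()
        log_pf = log_pf + dist.log_prob(bit)
        state[i] = bit.float()

    # Trajectory balance: (log Z + log P_F(trajectory) - log R(x))^2;
    # the backward-policy term vanishes because each string has exactly
    # one construction order in this toy setting
    loss = (log_z + log_pf - torch.log(reward(state))) ** 2
    opt.zero_grad()
    loss.backward()
    opt.step()
```

After training, the policy samples each string with probability roughly proportional to its reward, which is the property that makes GFlowNets attractive for approximating posteriors.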
Large differentiable neural networks
“We need to look into two aspects: what is the paradigm of learning that you have to use, and what is the architecture of the system that will learn this?” LeCun said.
He said deep learning will be part of the solution. “So, it might be some giant deep neural net that we train with a gradient-based type of algorithm because that is pretty much the only weapon we have at the moment for this kind of problem. At least, the only one that is efficient enough,” said LeCun.
However, when it comes to learning paradigms, LeCun said, he is a big advocate of self-supervised learning. “Human and nonhuman animals seem able to learn enormous amounts of background knowledge about how the world works through observation and through an incomprehensibly small amount of interactions in a task-independent, unsupervised way,” LeCun said. “It can be hypothesized that this accumulated knowledge may constitute the basis for what is often called common sense.”
Common sense allows humans and animals to predict future outcomes and fill in missing information. For example, when a doorbell rings, you know that someone is waiting outside, even without seeing the person.
The idea that humans, non-human animals and intelligent systems use world models is not new; it has a long history in psychology and engineering.
“So, what kind of self-supervision might allow us to do this?” asked LeCun.
Citing video prediction as an example, LeCun said you can train a system to predict the next frames in a video. Doing this at the pixel level is quite challenging, since the system has to reconstruct every detail of the pixels. Instead, LeCun suggested building an architecture where the prediction happens at a higher level of abstraction, where the useful information lives.
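One minimal way to express “predict at a higher level of abstraction” is to encode both frames and train a predictor entirely in embedding space, never reconstructing pixels. The PyTorch sketch below is an illustrative assumption of this idea, not Meta’s actual architecture; the stop-gradient on the target side is one common trick against representational collapse.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 128), nn.ReLU(),
                        nn.Linear(128, 32))
predictor = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 32))
opt = torch.optim.Adam([*encoder.parameters(), *predictor.parameters()],
                       lr=1e-3)

frame_t = torch.randn(8, 64, 64)      # stand-in for a batch of video frames
frame_t1 = torch.randn(8, 64, 64)     # the frames one step later

z_t = encoder(frame_t)
with torch.no_grad():
    z_t1 = encoder(frame_t1)          # abstract target: no pixel reconstruction
loss = ((predictor(z_t) - z_t1) ** 2).mean()

opt.zero_grad()
loss.backward()
opt.step()
```

Because the loss lives in the 32-dimensional embedding rather than the 4,096 pixels, the model is free to discard unpredictable detail and keep only the information useful for prediction.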
LeCun and his team at Meta are looking to make machines learn world models in a self-supervised way and then use them to predict, reason, and plan.
Baby steps to human-level intelligence
“If we look at the human condition, the thoughts that we have involve a discrete choice among different alternatives,” said Bengio.
LeCun suggested an architecture composed of six separate modules, each assumed to be differentiable: every module can compute a gradient estimate of some objective function with respect to its input and propagate that gradient information to upstream modules. (A structural sketch in code follows the list below.)
[Diagram of the six-module architecture. Source: Meta]
- The configurator module performs executive control
- The perception module accepts signals from sensors and estimates the current state of the world
- The world model module estimates missing information about the state of the world not provided by perception and predicts plausible future states of the world
- The cost module computes a single scalar value that predicts the agent’s level of discomfort. It consists of two submodules: the intrinsic cost, which is hard-wired and penalises violations of basic behavioural constraints, and the trainable critic, which predicts future values of the intrinsic cost
- The actor module computes proposals for action sequences
- The short-term memory module keeps track of the current and predicted world states, as well as the associated costs
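The plain-Python skeleton below turns that list into code. The interfaces and the placeholder cost are guesses reconstructed from the description above, not Meta’s implementation.

```python
class Agent:
    """Structural sketch of the six-module architecture described above."""
    def __init__(self):
        self.memory = []                   # short-term memory module

    def configure(self, task):             # configurator: executive control
        self.task = task

    def perceive(self, sensors):           # perception: estimate world state
        return {"observed": sensors}

    def imagine(self, state, action):      # world model: predict future states
        return {**state, "after": action}

    def cost(self, state):                 # intrinsic cost + critic -> scalar
        return float(len(str(state)))      # placeholder "discomfort"

    def act(self, state):                  # actor: propose action sequences
        candidates = ["brake", "steer", "wait"]
        # choose the action whose imagined outcome has the lowest cost
        best = min(candidates, key=lambda a: self.cost(self.imagine(state, a)))
        self.memory.append((state, best))  # track states and associated costs
        return best

agent = Agent()
agent.configure("drive")
print(agent.act(agent.perceive(sensors={"camera": "frame_0"})))
```

The essential point is the planning loop in act: the agent uses its world model to imagine the outcome of each candidate action and lets the cost module arbitrate, exactly the perceive-imagine-evaluate-act cycle the modules are meant to support.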
Meta’s bet on human-level AI
Self-supervised learning, modularity, and world models are the initial steps towards human-level intelligence. Meta is confident that its research will continue to produce a deeper understanding of both minds and machines, and help crack AGI in the near future.