Is Reinforcement Learning Still Relevant?

While there are various practical applications of reinforcement learning, the concept as a whole poses some limitations when used in developing autonomous machine intelligence.

Intelligence consists of various aspects, like learning, reasoning, and planning. Human beings, for example, have behavioural, social, and general intelligence, which can simply be termed common sense. The question of whether these abilities are learned or innately present in living beings makes us ask whether reinforcement learning (RL) or self-supervised learning (SSL) is the way forward towards artificial general intelligence (AGI).

Researchers and scientists are split over whether to use reinforcement learning or SSL to develop artificial general intelligence. While Google’s DeepMind has been making great progress using reinforcement learning, Meta AI has been continually pushing for self-supervised or unsupervised learning, with Tesla jumping on the bandwagon as well.

DeepMind’s famous paper ‘Reward is Enough’ claims that intelligence can be achieved through the principle of ‘reward maximisation’, which is essentially an extension of reinforcement learning algorithms and is, arguably, the closest to natural intelligence.


“If an agent can continually adjust its behaviour so as to improve its cumulative reward, then any abilities that are repeatedly demanded by its environment must ultimately be produced in the agent’s behaviour,” said researchers at DeepMind.
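
The “cumulative reward” in this quote is usually formalised as the return: the sum of the rewards an agent collects over an episode. The sketch below is a minimal illustration, assuming the common (but not universal) convention of a discount factor `gamma` that weights near-term rewards more heavily; with `gamma = 1.0` it is the plain cumulative sum. The function name is hypothetical.

```python
def discounted_return(rewards, gamma=1.0):
    """Return G = r_0 + gamma*r_1 + gamma^2*r_2 + ... for one episode."""
    total = 0.0
    # Walk backwards so each step applies G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


if __name__ == "__main__":
    print(discounted_return([1, 1, 1]))       # 3.0: plain cumulative reward
    print(discounted_return([1, 1, 1], 0.5))  # 1.75: later rewards count less
```

An agent that “improves its cumulative reward” is, in this framing, one whose behaviour raises the expected value of this quantity.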

Yann LeCun of Meta AI has repeatedly argued that the trial-and-error approach of RL is a risky way to develop intelligence. A baby, for example, does not learn to identify the objects around it by looking at millions of samples of the same object or by trying dangerous things and learning from them, but by observing, predicting, and interacting with them, even without supervision.

DeepMind says that by understanding mammalian vision and applying neuroscience to computer vision, we can probably build systems that categorise and differentiate objects. Such systems, however, are constrained to narrow artificial intelligence: they are designed to solve specific problems rather than to develop general problem-solving abilities.

DeepMind’s David Silver believes that a continual reinforcement learning framework, in which an agent repeatedly acts to maximise reward, is enough to produce attributes of human intelligence such as perception, language, and memory.

Recently, OpenAI used reinforcement learning from human feedback to fine-tune GPT-3. The resulting model, called InstructGPT, is very good at generating text that follows the intent behind single-sentence prompts. DeepMind has also developed groundbreaking models using reinforcement learning, such as AlphaGo and MuZero.
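
In the InstructGPT recipe, human labellers compare pairs of model outputs, a reward model is trained on those comparisons, and GPT-3 is then fine-tuned with reinforcement learning (PPO) to maximise that learned reward. The sketch below covers only the pairwise comparison loss for the reward model, -log σ(r_chosen - r_rejected). It is a minimal illustration, not OpenAI's code: the function name is hypothetical and the scores are made-up numbers standing in for a real reward model's outputs.

```python
import math


def pairwise_preference_loss(r_chosen, r_rejected):
    """Pairwise comparison loss: -log(sigmoid(r_chosen - r_rejected)).

    Low when the reward model scores the human-preferred response higher,
    high when it ranks the pair the wrong way round.
    """
    diff = r_chosen - r_rejected
    # -log(sigmoid(d)) == log(1 + exp(-d)); split on the sign of d so that
    # math.exp never receives a large positive argument and overflows.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


if __name__ == "__main__":
    # Hypothetical reward-model scores for two responses to one prompt.
    print(pairwise_preference_loss(2.0, -1.0))  # preferred response scored higher
    print(pairwise_preference_loss(-1.0, 2.0))  # preferred response scored lower
```

Minimising this loss over many human comparisons teaches the reward model to score responses the way labellers rank them; that learned score then plays the role of the reward in the RL fine-tuning step.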

Reinforcement learning pitfalls 

A dog that is given treats after performing a task stays obedient. This simple example of positive reinforcement makes researchers confident that AI can probably be trained the same way. Reinforcement learning in machines, however, is still in its development stages and can be quite challenging, since a dog has an innate nature and emotions that dispose it to be obedient.
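
The treat-for-obedience loop maps onto the simplest reinforcement learning setting, a multi-armed bandit. A minimal sketch, assuming two hypothetical actions ("obey" earns a treat worth 1, "ignore" earns nothing) and an epsilon-greedy learner that keeps a running average of the reward for each action; none of these names come from the article.

```python
import random


def train_bandit(rewards, steps=500, epsilon=0.1, seed=0):
    """Epsilon-greedy bandit: learn the average reward of each action."""
    rng = random.Random(seed)
    actions = list(rewards)
    values = {a: 0.0 for a in actions}  # estimated reward per action
    counts = {a: 0 for a in actions}    # times each action was tried
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.choice(actions)           # explore: try anything
        else:
            action = max(actions, key=values.get)  # exploit: best estimate so far
        reward = rewards[action]                   # the environment's "treat"
        counts[action] += 1
        # Incremental mean: move the estimate toward the observed reward.
        values[action] += (reward - values[action]) / counts[action]
    return values, counts


if __name__ == "__main__":
    values, counts = train_bandit({"obey": 1.0, "ignore": 0.0})
    print(values, counts)
```

Because the rewards here are deterministic, the estimate for "obey" becomes 1.0 after its first try and the learner exploits it from then on, only occasionally trying the other action while exploring.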

While there are various practical applications of reinforcement learning, the concept as a whole poses some limitations when used in developing autonomous machine intelligence.

  1. It requires a huge amount of data and computation.
  2. Noise in the data is one of the major problems with this method of learning: small changes in training can create large differences in test results.
  3. The large number of hyperparameters makes the algorithm hard to tune. Many of these hyperparameters are used to shape the reward, which can also bias the training data.
  4. Sample inefficiency makes it difficult to train in the real world. For example, it can take weeks to train an agent to walk, even in a simulated environment.
  5. Agents trained in simulation can behave unpredictably in the real world.
  6. Trial and error can be very costly and inefficient when training in the real world.
  7. Many algorithms assume that the agent has a finite number of actions and that its environment is a Markov decision process.
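
Several of these points show up even in the textbook setting behind point 7: tabular Q-learning on a small, finite Markov decision process. A minimal sketch, assuming a hypothetical five-state corridor in which the agent must discover, by trial and error, that moving right reaches a rewarding goal. The values of `alpha`, `gamma`, `epsilon`, and the episode count are illustrative hyperparameters that would have to be tuned (point 3), and the agent is given 300 episodes of trial and error for a problem a person would solve at a glance (point 4).

```python
import random

N_STATES = 5        # hypothetical corridor: states 0..4, state 4 is the goal
ACTIONS = (-1, +1)  # a finite action set: move left or move right


def step(state, action):
    """Hypothetical environment: deterministic moves, reward 1 at the goal."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward, nxt == N_STATES - 1


def q_learning(episodes=300, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy trial and error, breaking ties at random.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                best = max(q[(state, a)] for a in ACTIONS)
                action = rng.choice([a for a in ACTIONS if q[(state, a)] == best])
            nxt, reward, done = step(state, action)
            # Markov assumption: the target depends only on this transition.
            if done:
                target = reward
            else:
                target = reward + gamma * max(q[(nxt, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (target - q[(state, action)])
            state = nxt
    return q


def greedy_policy(q):
    """Best learned action for every non-goal state."""
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)}


if __name__ == "__main__":
    print(greedy_policy(q_learning()))
```

Each update looks only at the current state, action, reward, and next state, which is exactly the Markov assumption; with a continuous or very large action space, the `max` over actions could no longer be computed by simple enumeration.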

Reinforcement learning arrives at decisions by interacting with a simulation of a system, whereas training an AI model on a labelled dataset is limiting because the world is not available as a labelled dataset. Reinforcement learning can also continue as part of the training process after the model has been deployed and is already working.

Amalgamation of SSL & RL 

Researchers agree that instilling background knowledge in machines might be the way forward for AGI; however, the concept of “background” knowledge is hard to explain. Even when we look to animals to understand consciousness, it is not evident whether most of what they know is learnt over time or is part of an innate mechanism.

Autonomous machine intelligence is the common goal of both approaches, but in reinforcement learning there is always a human driving how the machine works, whereas unsupervised learning proposes that the machine learn from observation. Advocates of self-supervised learning point to the inefficiency of trial-and-error methods, but uncertainty remains a major barrier for self-supervised learning itself.

Sergey Levine of Berkeley AI Research recently proposed combining self-supervised learning with offline reinforcement learning. The idea is to let models understand the world without supervision while reinforcement learning explores a causal understanding of the world, expanding the available dataset almost infinitely.
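
Offline reinforcement learning differs from ordinary trial-and-error RL in that the agent learns only from a fixed log of past transitions and never interacts with the environment during training. A minimal sketch, assuming a hypothetical logged dataset of `(state, action, reward, next_state, done)` tuples for a three-state toy problem; it uses plain tabular fitted Q-iteration and omits the safeguards against distribution shift that practical offline RL methods add. It illustrates the offline setting only and is not Levine's proposed method.

```python
from collections import defaultdict

# Hypothetical log collected earlier by some other policy:
# (state, action, reward, next_state, done). Assumes a deterministic
# world with each (state, action) pair logged once.
DATASET = [
    ("start", "left", 0.0, "pit", True),
    ("start", "right", 0.0, "middle", False),
    ("middle", "right", 1.0, "goal", True),
    ("middle", "left", 0.0, "start", False),
]


def fitted_q_iteration(dataset, gamma=0.9, iterations=50):
    """Tabular fitted Q-iteration on a fixed dataset (no environment access)."""
    actions_in = defaultdict(set)  # actions that appear in the log per state
    for s, a, _, _, _ in dataset:
        actions_in[s].add(a)
    q = defaultdict(float)  # unseen (state, action) pairs default to 0.0
    for _ in range(iterations):
        new_q = defaultdict(float)
        for s, a, r, s_next, done in dataset:
            if done:
                new_q[(s, a)] = r
            else:
                # Bootstrap only from actions actually logged in s_next,
                # a crude guard against trusting actions never tried.
                future = max((q[(s_next, b)] for b in actions_in[s_next]), default=0.0)
                new_q[(s, a)] = r + gamma * future
        q = new_q
    return q


def greedy_action(q, state, actions):
    return max(actions, key=lambda a: q[(state, a)])


if __name__ == "__main__":
    q = fitted_q_iteration(DATASET)
    print(greedy_action(q, "start", ["left", "right"]))
```

The agent learns to go right from `start` purely from the logged data; in a real offline setting, the hard part is handling states and actions the log covers poorly.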

Yann LeCun proposed an architecture built around a world model in his June 2022 paper ‘A Path Towards Autonomous Machine Intelligence’. It uses a “cost module” that measures the energy cost of an action taken by the machine. When reinforcement learning is scaled to larger datasets, reward maximisation also needs to scale further. If the cost module can be combined with the reward mechanism of reinforcement learning, the architecture could produce the best outcomes for as little “energy” as possible, which seems like a plausible way forward.
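
The idea of choosing actions by minimising a cost predicted through a world model, rather than by real trial and error, can be shown with a toy planner. A minimal sketch, assuming a hypothetical one-dimensional world model and a hypothetical cost made of distance-to-goal plus an “energy” penalty on each action. In LeCun's proposal these modules are learned networks; here they are hand-written, and the planner simply enumerates every short sequence of discrete actions.

```python
from itertools import product

ACTIONS = (-1.0, 0.0, 1.0)  # hypothetical: push left, stay, push right
GOAL = 2.0                  # hypothetical target position


def world_model(position, action):
    """Hypothetical world model: predicts the next position after an action."""
    return position + action


def cost(position, action, energy_weight=0.1):
    """Hypothetical cost: distance from the goal plus an energy penalty on effort."""
    return abs(position - GOAL) + energy_weight * abs(action)


def plan(start, horizon=3):
    """Return the action sequence with the lowest total predicted cost."""
    best_seq, best_cost = None, float("inf")
    for seq in product(ACTIONS, repeat=horizon):
        position, total = start, 0.0
        for action in seq:
            position = world_model(position, action)  # imagined step, not a real one
            total += cost(position, action)
        if total < best_cost:
            best_seq, best_cost = seq, total
    return best_seq, best_cost


if __name__ == "__main__":
    print(plan(0.0))  # reach the goal in two pushes, then stop spending energy
```

Minimising this cost is equivalent to maximising a reward defined as its negative, which is one way to see how a cost module and an RL reward could be combined.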

Mohit Pandey
Mohit is a technology journalist who dives deep into the Artificial Intelligence and Machine Learning world to bring out information in simple and explainable words for the readers. He also holds a keen interest in photography, filmmaking, and the gaming industry.
