
Why Upside-Down Reinforcement Learning Is A Paradigm Shift In AI


Jürgen Schmidhuber, a computer scientist at the Swiss AI Lab, has transformed the way reinforcement learning (RL) works. Along with his team at the lab, Schmidhuber devised a methodology to carry out RL in the form of supervised learning. In other words, the team is turning traditional RL on its head, calling the approach Upside Down Reinforcement Learning (UDRL). Unlike traditional RL, where the focus is on maximising rewards based on actions, UDRL maps an AI agent's actions by providing rewards as inputs.

RL is vital for progress in the artificial intelligence landscape, and new methodologies like this one can become foundational for further advances.

[Image: Upside Down Reinforcement Learning. Source: deeplearn]

UDRL takes rewards as inputs: the model learns to interpret input observations together with reward-based commands and to map them to actions through supervised learning (SL) on past experience. For instance, UDRL models observe commands in the form of desired rewards: “get so much reward within so much time.”

To do so, AI agents interact with their environments and learn through gradient descent – an algorithm that iteratively minimises a loss function – to map self-generated commands to corresponding action probabilities. The research paper mentions that such self-acquired knowledge can extrapolate to solve new problems such as: “get even more reward within even less time.”
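To make this concrete, here is a minimal sketch of how such a behaviour function could be trained with supervised learning. It assumes a discrete-action environment and PyTorch; the class and function names are illustrative, not taken from the paper's reference implementation.

```python
# Hedged sketch of the UDRL idea: a behaviour function that maps
# (observation, desired return, desired horizon) to an action, trained
# by ordinary supervised learning on past episodes. Illustrative only.
import torch
import torch.nn as nn

class BehaviourFunction(nn.Module):
    """Maps (observation, desired return, desired horizon) to action logits."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + 2, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs, desired_return, desired_horizon):
        # The command ("get this much reward in this much time") is fed in
        # as two extra input features alongside the observation.
        command = torch.stack([desired_return, desired_horizon], dim=-1)
        return self.net(torch.cat([obs, command], dim=-1))

def supervised_update(model, optimiser, obs, reward_to_go, time_to_go, action):
    """One gradient-descent step: predict the action actually taken in a
    past episode when this (reward-to-go, time-to-go) command held."""
    logits = model(obs, reward_to_go, time_to_go)
    loss = nn.functional.cross_entropy(logits, action)
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()
```

During training, the commands are relabelled from what actually happened in past episodes; at evaluation time, the desired return and horizon are supplied by the user, which is where commands like “get even more reward within even less time” come in.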

Experiments

To check the performance of the AI agent, the researchers deployed the technique in various environments, such as deterministic, probabilistic, and partially observable environments, and obtained strong results. Even a pilot version of UDRL successfully outperformed traditional RL models on some challenging problems.

Citing a broader use case of UDRL, the Swiss AI Lab describes an Imitate-Imitator setup: a human imitates the robot so that the robot, in turn, learns to imitate the human. This methodology can be deployed to train robots simply by demonstrating a task visually. Here the demonstrated task serves as the command, playing the role the reward played before, and the robot maps it to actions.

For example, if you want the robot to assemble a smartphone, you perform the task in front of the robot while its camera records you. The recorded video is then used as a sequential command input to an RNN model, which is trained through supervised learning to imitate you and thereby complete the smartphone-assembly task.

Given numerous such videos, the robot can further learn to generalise and carry out tasks it has never done before. And if it fails to deliver the desired outcome, it can be trained again to improve its accuracy.
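As a rough sketch of this setup (the architecture and names below are assumptions for illustration, not details from the paper), a recurrent network could consume the demonstration video, reduced to a sequence of frame embeddings, as its command and be trained purely with supervised learning to reproduce the matching action sequence:

```python
# Hedged sketch of "video as command": an RNN policy reads a sequence of
# frame embeddings (the demonstration) and is trained with supervised
# learning to output the recorded action at each step. Illustrative only.
import torch
import torch.nn as nn

class VideoCommandPolicy(nn.Module):
    def __init__(self, frame_embed_dim, n_actions, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(frame_embed_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, frame_embeddings):
        # frame_embeddings: (batch, time, frame_embed_dim)
        hidden_states, _ = self.rnn(frame_embeddings)
        return self.head(hidden_states)  # per-step action logits

def imitation_step(policy, optimiser, frames, actions):
    """Supervised learning: match the recorded action at every time step."""
    logits = policy(frames)  # (batch, time, n_actions)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), actions.reshape(-1))
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()
```

The same interface could, in principle, accept other command modalities, such as embeddings of a text or speech description instead of video frames, which is the broader applicability discussed next.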

The concept of learning to use rewards and other goals as command inputs has broad applicability. The Imitate-Imitator approach is not limited to videos; one can also describe the task to the robot in natural language, through text or speech, and have it map those descriptions to actions.

Mapping actions from commands in this way could become a general recipe for training ML models through supervised learning.

Potential

Instead of making a model learn entirely on its own, helping it learn in a guided way through commands would allow firms to bring products to market more quickly. Almost all organisations struggle to train ML models, as it takes several weeks to reach acceptable accuracy. Robotics companies in particular could teach a machine by demonstrating in front of it rather than hard-coding instructions. Besides, UDRL might help self-driving car firms learn to manoeuvre on roads without going through millions of miles in simulated and real-world environments.

Outlook

The classical RL approach iterates over a myriad of methods to turn predictions into actions, whereas UDRL eliminates that iteration and creates a direct mapping from rewards, supplied as inputs, to actions. Maximising the reward has been a tedious task that required changing reward functions and hyperparameters, and it often involved trial and error to strike the right balance between collecting rewards and completing the task. Training by gradient descent on commands mitigates such strenuous work and expedites learning towards immediate results.

Although it has potential drawbacks such as local minima, underfitting, and overfitting, UDRL has outperformed traditional RL models in early experiments. It therefore has the potential to remodel the whole AI landscape.

Rohit Yadav

Rohit is a technology journalist and technophile who likes to communicate the latest trends around cutting-edge technologies in a way that is straightforward to assimilate. In a nutshell, he is deciphering technology. Email: rohit.yadav@analyticsindiamag.com