Why The Upside-Down Reinforcement Learning Is A Paradigm Shift In AI

Rohit Yadav

Jürgen Schmidhuber, a computer scientist at the Swiss AI Lab, has proposed a new way of doing reinforcement learning (RL). Together with his team at the lab, Schmidhuber devised a methodology that carries out RL in the form of supervised learning (SL). In other words, the team is turning traditional RL on its head, calling the result Upside Down Reinforcement Learning (UDRL). Traditional RL focuses on choosing actions that maximise rewards; UDRL instead gives desired rewards to the AI agent as inputs and learns to map them to actions.

RL is vital to progress in artificial intelligence, and new RL methodologies such as this one can become foundations for further advances.

Figure: Upside Down Reinforcement Learning (Source: deeplearn)

UDRL takes desired rewards as inputs alongside its ordinary observations and learns to interpret them as commands, mapping them to actions through SL on past experience. For instance, UDRL models observe commands in the form of desired rewards: “get so much reward within so much time.”
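One way to picture "SL on past experience" is to relabel a recorded episode after the fact: whatever reward the agent actually collected over some stretch of time is treated as the command it was following, and the action it took becomes the label. The sketch below is only an illustration under that assumption; the `Step` layout, the function name `episode_to_examples`, and the toy episode are hypothetical and not taken from the research paper.

```python
from typing import List, Tuple

# One recorded time step: (observation, action taken, reward received).
# This layout is an assumption made for illustration.
Step = Tuple[Tuple[float, ...], int, float]


def episode_to_examples(episode: List[Step]):
    """Turn one recorded episode into supervised (input, label) pairs.

    For every start step t1 and end step t2 >= t1, the reward actually
    collected from t1 to t2 and the number of steps it took form the
    command; the action taken at t1 is the label.
    """
    examples = []
    n = len(episode)
    for t1 in range(n):
        obs, action, _ = episode[t1]
        collected = 0.0
        for t2 in range(t1, n):
            collected += episode[t2][2]
            horizon = t2 - t1 + 1
            # Input: observation plus command (desired reward, desired time).
            examples.append(((obs, collected, horizon), action))
    return examples


# Hypothetical three-step episode with a one-dimensional observation.
episode = [((0.0,), 1, 1.0), ((1.0,), 0, 0.0), ((2.0,), 1, 2.0)]
examples = episode_to_examples(episode)
for model_input, label in examples:
    print(model_input, "->", label)
```

Each printed pair reads as "in this situation, when asked for this much reward within this many steps, this action was taken", which is exactly the kind of input-to-label data a supervised learner can fit.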

To do so, AI agents interact with their environments and learn through gradient descent (an algorithm that minimises a function through small iterative steps) to map self-generated commands to the corresponding action probabilities. The research paper mentions that such self-acquired knowledge can extrapolate to solve new problems such as: “get even more reward within even less time.”
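To make the idea of mapping commands to action probabilities by gradient descent concrete, here is a minimal sketch on a toy problem. Everything in it is an assumption for illustration: there are two actions, action 1 earns a reward of 1 and action 0 earns nothing, and the "model" is a single logistic unit rather than the neural networks a real UDRL agent would use.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical past experience: (reward that was obtained, action that obtained it).
# In hindsight, the obtained reward is treated as the command.
data = [(0.0, 0), (1.0, 1)]


def train(data, lr: float = 0.5, steps: int = 2000):
    """Fit P(action = 1 | command) = sigmoid(w * command + b)
    by gradient descent on the average cross-entropy loss."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        grad_w = 0.0
        grad_b = 0.0
        for command, action in data:
            p = sigmoid(w * command + b)
            # Gradient of the cross-entropy loss with respect to w and b.
            grad_w += (p - action) * command
            grad_b += (p - action)
        w -= lr * grad_w / len(data)
        b -= lr * grad_b / len(data)
    return w, b


w, b = train(data)


def action_probability(command: float) -> float:
    """Probability of choosing action 1 when given the desired reward."""
    return sigmoid(w * command + b)


print("P(action 1 | want reward 1):", round(action_probability(1.0), 3))
print("P(action 1 | want reward 0):", round(action_probability(0.0), 3))
```

After training, asking for a reward of 1 makes action 1 very likely, while asking for a reward of 0 makes it very unlikely. The agent never maximises reward explicitly; it simply learns which action goes with which command.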


To check the performance of the AI agent, the researchers applied the technique to various settings, including deterministic, probabilistic, and partially observable environments. Even a pilot version of UDRL outperformed traditional RL models on some challenging problems.

Citing a broader use case of UDRL, the Swiss AI Lab described how a human could imitate the robot so that the robot learns to imitate humans. This methodology can be deployed to train robots just by demonstrating a task visually. Here the demonstrated task plays the role of the command, just as the desired reward does in basic UDRL, and the robot maps it to an action.

For example, if you want the robot to assemble a smartphone, you perform the task in front of the robot while a camera records it. The recorded video is then used as a sequential command input to a recurrent neural network (RNN), which is trained through supervised learning to imitate you and thereby complete the smartphone assembly task.

Furthermore, given many such videos, the robot will learn to generalise and carry out tasks that it has never done before. If it fails to deliver the desired outcome, one can train it again to improve its accuracy.

The concept of learning to use rewards and other goals as command inputs has broad applicability. The Imitate-Imitator approach above is not limited to videos; one can also describe the task to the robot in natural language, through text or speech, so that the robot learns to map the descriptions to actions.

Mapping commands to actions could become a standard way of training ML models through SL.


Instead of leaving the model to learn on its own, guiding it through commands could allow firms to bring products to market more quickly. Many organisations struggle to train ML models because improving accuracy can take several weeks. Robotics companies in particular, instead of hard-coding instructions for the machine, could teach it by demonstrating the task in front of the robot. Besides, UDRL might help self-driving car firms teach vehicles to manoeuvre on roads without going through millions of miles in simulated and real-world environments.


The classical RL approach uses a myriad of methods to turn predictions into actions, whereas UDRL skips the iteration through numerous methodologies and creates a direct mapping from rewards, given as inputs, to actions. Maximising the reward used to be a tedious task that required changing reward functions and hyperparameters, and it often involved trial and error to strike the right balance between achieving rewards and completing the task. Training through SL with gradient descent eases such strenuous work and speeds up learning.

Although it has potential drawbacks such as local minima, underfitting, and overfitting, UDRL has outperformed traditional RL models on some problems. It therefore has the potential to reshape the AI landscape.

Copyright Analytics India Magazine Pvt Ltd
