The Promise Of Maximum Entropy Reinforcement Learning

Published on March 15, 2021
by Ram Sagar

Last year, Walmart scrapped its ambitious plan to deploy inventory-tracking robots. The retail giant ended its contract with Bossa Nova Robotics, saying that “different and simpler solutions” could be as effective. This suggests that the much-anticipated AI takeover is still a distant dream. Scaling introduces new challenges in robotics, as in any domain that tries to automate. In robotics, the reinforcement learning algorithms that guide the robot are spoon-fed to perform certain tasks reasonably well. But most reinforcement learning applications suffer from a lack of robustness: they do fine as long as there are no perturbations in the environment they are deployed in. The real world is full of such uncertainties, and it is not practical to hard-code every possible edge case into a system. Instead, the reinforcement learning system should adapt to new conditions. Scaling reinforcement learning algorithms and applying them in the real world requires policies smart enough to spot changes in the environment and adapt.

Algorithms that address robustness problems in RL typically take an existing reinforcement learning algorithm and add extra machinery on top. But these approaches, wrote researchers at Berkeley AI Research (BAIR), are difficult to implement and require tuning many additional hyperparameters. The researchers wanted a solution that is simpler and requires neither additional hyperparameters nor additional lines of code to debug. They found it in maximum entropy RL (MaxEnt RL).

Overview Of Maximum Entropy

Standard RL vs MaxEnt RL (Image credits: BAIR)

Unlike standard RL, MaxEnt RL is designed to learn a policy that gets high reward while acting as randomly as possible: alongside the reward, it maximizes the entropy of the policy. In effect, such policies are trained for perturbations. For example, in the above illustration, the researchers tasked the robot with pushing the white object to the green region. As shown on the left, standard RL always takes the shortest path to the goal, whereas MaxEnt RL (on the right) acts randomly.
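To make the "high reward while acting randomly" idea concrete, here is a minimal sketch for a single decision with three actions. It is an illustration under stated assumptions, not the researchers' code: the action values in q and the entropy weight alpha are hypothetical numbers, and the sketch relies on the well-known fact that, for one decision, the policy maximizing expected value plus alpha times entropy puts probability proportional to exp(Q / alpha) on each action.

```python
import math

def boltzmann_policy(q_values, alpha):
    """Action probabilities proportional to exp(Q / alpha)."""
    m = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp((q - m) / alpha) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

def greedy_policy(q_values):
    """Deterministic policy: all probability on the best action."""
    best = q_values.index(max(q_values))
    return [1.0 if i == best else 0.0 for i in range(len(q_values))]

def entropy(probs):
    """Shannon entropy in nats; terms with zero probability contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def maxent_objective(probs, q_values, alpha):
    """Expected value plus alpha-weighted policy entropy."""
    expected = sum(p * q for p, q in zip(probs, q_values))
    return expected + alpha * entropy(probs)

q = [1.0, 0.9, 0.2]  # hypothetical action values for one state
alpha = 0.5          # hypothetical weight on the entropy term
greedy = greedy_policy(q)
soft = boltzmann_policy(q, alpha)
print("greedy:", greedy, "objective:", maxent_objective(greedy, q, alpha))
print("soft:  ", [round(p, 3) for p in soft], "objective:", maxent_objective(soft, q, alpha))
```

The greedy policy commits to the single best action, while the soft policy keeps meaningful probability on the near-optimal second action. Raising alpha spreads the probability more evenly, and lowering it toward zero approaches the greedy choice.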

When the researchers introduced obstacles (red blocks) that were not present during training, the policy learned by standard RL almost always collided with them, rarely reaching the goal. The MaxEnt RL policy, by contrast, often chose routes around the obstacles and continued to reach the goal in a large fraction of trials.

For the policy to be robust to a larger set of disturbances, the researchers recommend increasing the weight on the entropy term and decreasing the weight on the reward term (as defined below). The idea here is that the adversary must choose dynamics that are “close” to the dynamics on which the policy was trained.

via BAIR

J_MaxEnt(π; p, r) denotes the entropy-regularized cumulative return of policy π, evaluated using dynamics p(s′∣s,a) and reward function r. The policy is trained using one dynamics p and evaluated on dynamics p̃ chosen by an adversary.
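For reference, the entropy-regularized return is commonly written as follows. This is the standard textbook form rather than a transcription of the BAIR figure; the horizon T and the entropy weight α are notational assumptions that this article does not spell out.

```latex
J_{\text{MaxEnt}}(\pi; p, r) =
  \mathbb{E}_{\substack{a_t \sim \pi(\cdot \mid s_t) \\ s_{t+1} \sim p(\cdot \mid s_t, a_t)}}
  \left[ \sum_{t=1}^{T} r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right],
\qquad
\mathcal{H}\big(\pi(\cdot \mid s)\big) = -\mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[\log \pi(a \mid s)\right]
```

Setting α = 0 recovers the standard RL return, while a larger α puts more weight on the entropy term relative to the reward term.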

Image credits: BAIR

The Berkeley researchers found MaxEnt RL policies to be robust even when the environment is adversarially perturbed. To demonstrate this, they trained both standard RL and MaxEnt RL on a peg insertion task, as shown above. During evaluation, the researchers changed the position of the hole to try to make each policy fail. When the hole was moved only slightly (≤ 1 cm), both the standard RL and MaxEnt RL policies always solved the task. However, when the hole was moved by up to 2 cm, the MaxEnt RL policy outclassed its counterpart and continued to succeed in more than 95% of trials. “This experiment,” declared the researchers, “validates our theoretical findings that MaxEnt really is robust to (bounded) adversarial disturbances in the environment.”
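The evaluation protocol described above (train on one setup, then shift the target and count successes) can be sketched as a toy loop. This is not the peg insertion experiment and does not reproduce its numbers: the one-dimensional setup, the tolerance, the number of attempts, and the Gaussian noise standing in for a trained stochastic policy are all hypothetical choices made only to show the shape of such a sweep.

```python
import random

def success_rate(noise_std, hole_shift, tolerance=0.5, attempts=20, trials=500, seed=0):
    """Fraction of trials in which at least one attempt lands in the hole.

    The policy aims at position 0 (where the hole was during training) plus
    Gaussian noise; the hole is moved to `hole_shift` only at evaluation time.
    All units and default values are arbitrary, hypothetical choices.
    """
    rng = random.Random(seed)
    successes = 0
    for _ in range(trials):
        for _ in range(attempts):
            # noise_std == 0 models a deterministic policy that never varies its aim
            aim = rng.gauss(0.0, noise_std) if noise_std > 0 else 0.0
            if abs(aim - hole_shift) <= tolerance:
                successes += 1
                break
    return successes / trials

# Sweep the size of the evaluation-time perturbation for both policies.
for shift in (0.0, 1.0, 2.0):
    deterministic = success_rate(noise_std=0.0, hole_shift=shift)
    stochastic = success_rate(noise_std=1.5, hole_shift=shift)
    print(f"shift={shift}: deterministic={deterministic:.2f}, stochastic={stochastic:.2f}")
```

In this toy, the deterministic policy succeeds only while the shift stays within the tolerance, whereas the randomized one keeps finding the hole some of the time. A trained MaxEnt policy is more structured than added noise, so the sketch shows only the evaluation pattern, not the mechanism studied by the researchers.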

Future Is Robust

Image credits: BAIR

Reinforcement learning systems are founded on the principles of Markov decision processes (MDPs), which, in their idealized form, are not available to the learning algorithm in a real-world environment. In practical and scalable real-world scenarios, RL systems usually run into the following challenges:

Absence of reset mechanisms
State estimation
Reward specification

For instance, to address the challenge of reset mechanisms, which RL algorithms almost always assume they have access to, the team at Berkeley suggested using a perturbation controller that perturbs the agent's current state so that it never stays in the same state for too long. This perturbation controller is trained with the goal of taking the agent to less explored states of the world.

In reinforcement learning, it is usually easier and more robust to specify a reward function than to specify a policy that maximises it. The researchers have also developed methods to compare reward functions directly, without training a policy.

The researchers underline the importance of robustness in RL algorithms. In this work, they have compared standard RL with maximum entropy RL. They found MaxEnt RL to be promising when faced with perturbations. The whole work can be summarised as follows:

The simplicity of MaxEnt RL compared with other robust RL algorithms suggests that it may be an appealing alternative.
MaxEnt RL implicitly solves a robust RL problem.
The MaxEnt RL objective corresponds to maximizing a lower bound on a robust RL objective.
Results show that, even when the environment is adversarially perturbed (so that the agent does as poorly as possible), MaxEnt RL policies remain robust.

Ram Sagar

I have a master's degree in Robotics and I write about machine learning advancements.
