Last year, Walmart scrapped its ambitious plan to deploy inventory-tracking robots. The retail giant ended its contract with Bossa Nova Robotics, citing that "different and simpler solutions" could be just as effective. The episode is a reminder that the much-anticipated AI takeover remains a distant dream. Scaling introduces new challenges in robotics, as in any domain that attempts automation. The reinforcement learning algorithms that guide robots are typically trained to perform specific tasks reasonably well, but most reinforcement learning applications lack robustness: they do fine as long as there are no perturbations in the ecosystem they are deployed in. The real world, however, is full of such uncertainties, and hard-coding every possible edge case into a system is simply not practical. You want the reinforcement learning system to adapt to new conditions. Scaling reinforcement learning algorithms and applying them in the real world requires policies smart enough to spot changes in the environment and adapt.
Algorithms that address robustness problems in RL typically take an existing reinforcement learning algorithm and add extra machinery on top. But these approaches, the researchers wrote, are difficult to implement and require tuning many additional hyperparameters. The researchers wanted a simpler solution, one that does not need additional hyperparameters or additional lines of code to debug, and they found it in maximum entropy RL (MaxEnt RL).
Overview Of Maximum Entropy RL
Standard RL vs MaxEnt RL (Image credits: BAIR)
Unlike standard RL, MaxEnt RL is designed to learn a policy that gets high reward while acting as randomly as possible. MaxEnt RL maximizes the entropy of the policy alongside the reward, so the policies it learns are, in effect, trained for perturbations. For example, in the illustration above, the researchers tasked the robot with pushing the white object to the green region. As shown on the left, standard RL always takes the shortest path to the goal, whereas MaxEnt RL (on the right) acts randomly, taking many different paths.
When the researchers introduced obstacles (red blocks) that were not present during training, the policy learned by standard RL almost always collided with the obstacle, rarely reaching the goal. The MaxEnt RL policy, by contrast, often chose routes around the obstacle, continuing to reach the goal in a large fraction of trials.
To make the policy robust to a larger set of disturbances, the researchers recommend increasing the weight on the entropy term and decreasing the weight on the reward term (as defined below). The idea is that the adversary must choose dynamics that are "close" to the dynamics on which the policy was trained.
MaxEnt RL: J_MaxEnt(π; p, r) is the entropy-regularized cumulative return of policy π, evaluated using:
- dynamics p(s′∣s,a),
- reward function r.
Robust RL setting:
- train the policy on a single set of dynamics p,
- evaluate it on dynamics p̃ chosen by an adversary.
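The entropy-regularized return can be sketched in a few lines of Python. This is an illustrative sketch, not the researchers' implementation; the entropy weight `alpha`, the discount `gamma`, and the sample trajectory are all assumptions made for the example.

```python
def discounted_return(rewards, gamma=0.99):
    # Standard RL objective: discounted sum of task rewards only.
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def maxent_return(rewards, entropies, alpha=0.1, gamma=0.99):
    # MaxEnt RL objective: each step's reward is augmented with
    # alpha * H(pi(.|s_t)), the policy's entropy at that state.
    # Raising alpha (and lowering the reward weight) favours more
    # random, and hence more perturbation-tolerant, behaviour.
    return discounted_return(
        [r + alpha * h for r, h in zip(rewards, entropies)], gamma
    )

# A deterministic policy (zero entropy at every step) scores identically
# under both objectives; a stochastic one earns an extra entropy bonus.
rewards = [1.0, 0.0, 2.0]
deterministic = maxent_return(rewards, [0.0, 0.0, 0.0])
stochastic = maxent_return(rewards, [0.7, 0.7, 0.7])
```

Note how the trade-off the researchers describe lives entirely in `alpha`: a larger entropy weight pushes the policy toward randomness at some cost in raw reward.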
Image credits: BAIR
The Berkeley researchers found MaxEnt RL policies to be robust even when the environment is changed adversarially. To demonstrate this, they trained both standard RL and MaxEnt RL on a peg insertion task, as shown above. During evaluation, the researchers changed the position of the hole to try to make each policy fail. Whenever the hole position was moved a little (≤ 1 cm), both the standard RL and MaxEnt RL policies always solved the task. However, when the hole position was moved up to 2 cm, the MaxEnt RL policy outclassed its counterpart and continued to succeed in more than 95% of trials. "This experiment," declared the researchers, "validates our theoretical findings that MaxEnt really is robust to (bounded) adversarial disturbances in the environment."
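The evaluation protocol above can be mimicked with a toy sketch. The success model below, where a policy solves the task only when the hole shift is within its tolerance, is a made-up stand-in for the real peg-insertion experiment, used only to show how a robustness sweep is scored.

```python
def success_rate(tolerance_cm, hole_shifts_cm, trials_per_shift=100):
    # Toy stand-in for the peg-insertion evaluation: assume a policy
    # succeeds whenever the hole is displaced within its tolerance.
    successes = sum(
        trials_per_shift for shift in hole_shifts_cm if abs(shift) <= tolerance_cm
    )
    return successes / (trials_per_shift * len(hole_shifts_cm))

shifts = [0.0, 0.5, 1.0, 1.5, 2.0]       # hole displacements tested (cm)
standard_rl = success_rate(1.0, shifts)  # assumed robust only to small shifts
maxent_rl = success_rate(2.0, shifts)    # assumed to tolerate larger shifts
```

The tolerance values here are illustrative assumptions; the point is the sweep structure, perturb the environment, re-run many trials, compare success fractions.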
Future Is Robust
Image credits: BAIR
Reinforcement learning systems are founded on the principles of Markov decision processes (MDPs), whose idealized assumptions rarely hold for a learning algorithm in a real-world environment. In practical, scalable real-world scenarios, RL systems usually run into the following challenges:
- The absence of reset mechanisms,
- State estimation,
- Reward specification.
For instance, to address the challenge of reset mechanisms (which RL algorithms almost always assume access to), the team at Berkeley suggested perturbing the agent's current state with a perturbation controller, so that it never stays in the same state for too long. This perturbation controller is trained with the goal of taking the agent to less explored states of the world.
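A minimal sketch of such a perturbation controller, assuming a count-based novelty bonus as the "less explored" signal (the post does not specify how novelty is measured, so the `novelty` function, the integer toy states, and the greedy action choice are all assumptions for illustration):

```python
from collections import Counter

visit_counts = Counter()  # how often each (discretised) state was seen

def novelty(state):
    # Count-based bonus: rarely visited states get a higher score.
    return 1.0 / (1.0 + visit_counts[state])

def perturb(state, actions, transition):
    # Greedy perturbation step: pick the action whose successor state
    # is least explored, so the agent never lingers in one state.
    action = max(actions, key=lambda a: novelty(transition(state, a)))
    next_state = transition(state, action)
    visit_counts[next_state] += 1
    return next_state

# Toy usage: integer states, actions step left or right.
step = lambda s, a: s + a
state = 0
for _ in range(5):
    state = perturb(state, [-1, +1], step)
```

In a real system the controller would itself be a learned policy trained on the novelty reward, rather than this one-step greedy rule.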
In reinforcement learning, it is usually easier and more robust to specify a reward function than a policy maximising that reward function. The researchers have also developed methods to compare reward functions directly, without training a policy.
The researchers underline the importance of robustness in RL algorithms. In this work, they compared standard RL with maximum entropy RL and found MaxEnt RL to be promising in the face of perturbations. The work can be summarised as follows:
- The simplicity of MaxEnt RL compared with other robust RL algorithms suggests that it may be an appealing alternative.
- MaxEnt RL models implicitly solve a robust RL problem.
- The MaxEnt RL objective corresponds to maximizing a lower bound on a robust RL objective.
- Results show that, even when the environment is optimized for perturbations (so that the agent does as poorly as possible), MaxEnt RL policies remain robust.
Read more here.