How To Automate Reward Design For Reinforcement Learning Systems

Ram Sagar

Despite the success of reinforcement learning (RL) algorithms, a few challenges remain pervasive. Rewards, which sit at the core of most RL systems, are tricky to design. A well-designed reward makes it more likely that the trained agent achieves the intended outcome.

In the context of reinforcement learning, a reward is the bridge that connects the motivations of the model with the task objective.

Reward design largely determines the robustness of an RL system. There are few restrictions on how a reward function is designed, and developers are free to formulate their own. The challenge, however, is that a poorly chosen reward can leave the agent stuck in a local optimum.

Reward functions are peppered with clues that nudge the agent in a certain direction. The clues in this context are mathematical terms written with efficient convergence in mind.
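To make the idea of "clues" concrete, here is a minimal sketch of a reward written as a weighted sum of terms for a walking agent. The term names (`forward`, `control`, `alive`), their default weights, and the `shaped_reward` helper are illustrative assumptions, not taken from any particular benchmark or from the work discussed in this article.

```python
from dataclasses import dataclass


@dataclass
class RewardWeights:
    """Hypothetical weights; each one is a 'clue' about what the agent should value."""
    forward: float = 1.0   # bonus per unit of forward velocity
    control: float = 0.1   # penalty per unit of squared action magnitude
    alive: float = 0.5     # bonus for every step the agent has not fallen


def shaped_reward(forward_velocity, action, is_alive, weights):
    """Combine the individual terms into a single scalar reward."""
    control_cost = sum(a * a for a in action)
    alive_bonus = weights.alive if is_alive else 0.0
    return (weights.forward * forward_velocity
            - weights.control * control_cost
            + alive_bonus)


# One step: moving forward at 2.0 units/s with a two-dimensional action.
step_reward = shaped_reward(2.0, [1.0, -1.0], True, RewardWeights())
```

Changing the weights changes what the agent is encouraged to do, which is why tuning them by hand tends to take many rounds of trial and error.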

Automating Reward Design

Machine learning practitioners, especially those who work with reinforcement learning algorithms, face a common challenge: making the agent realise that one task is more lucrative than another. To do this, they use reward shaping.


During the course of learning, the reward is edited based on the feedback that is generated on completion of tasks. This information is used to retrain the RL policy. This process is repeated until the agent performs desirable actions.

The difficulty of retraining policies and observing them over long durations raises the question of whether reward design can be automated, and whether there can be a proxy reward that promotes learning while also meeting the task objective.

In an attempt to automate reward design, the Robotics department at Google introduced AutoRL, a method that automates RL reward design by using evolutionary optimisation over a given objective.
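The general idea of evolutionary search over reward parameters can be sketched as a simple mutate-and-select loop. This is an illustration under stated assumptions, not AutoRL's actual implementation: `toy_task_objective` is a hypothetical stand-in for the expensive step of training a policy with the candidate reward weights and then measuring the task objective, and the population size, mutation scale, and number of generations are arbitrary.

```python
import random


def toy_task_objective(weights):
    """Hypothetical stand-in for 'train a policy with these reward weights,
    then measure the task objective'. Peaks at an arbitrary target."""
    target = (1.0, 0.1, 0.5)
    return -sum((w - t) ** 2 for w, t in zip(weights, target))


def evolve_reward_weights(objective, n_weights=3, population=8,
                          generations=30, sigma=0.3, seed=0):
    """Keep the better half of each generation and refill it with mutated copies."""
    rng = random.Random(seed)
    pop = [[rng.uniform(0.0, 2.0) for _ in range(n_weights)]
           for _ in range(population)]
    for _ in range(generations):
        # Rank candidates by the task objective, best first.
        ranked = sorted(pop, key=objective, reverse=True)
        parents = ranked[: population // 2]
        # Children are parents perturbed with Gaussian noise.
        children = [[w + rng.gauss(0.0, sigma) for w in rng.choice(parents)]
                    for _ in range(population - len(parents))]
        pop = parents + children
    return max(pop, key=objective)


best_weights = evolve_reward_weights(toy_task_objective)
```

Because the parents survive into the next generation, the best score in the population never gets worse. In a real setting each call to the objective would be a full training run, so the evaluations within a generation would normally be run in parallel.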

To measure its effectiveness, the team at Google applied AutoRL’s evolutionary reward search to four continuous control benchmarks from OpenAI Gym:

  1. Ant
  2. Walker2D
  3. HumanoidStandup
  4. Humanoid

The search was run with two RL algorithms: off-policy Soft Actor-Critic (SAC) and on-policy Proximal Policy Optimisation (PPO).

To assess AutoRL’s ability to reduce reward engineering while maintaining the quality of existing metrics, the team considered task objectives and standard returns.

Task objectives measure task achievement for continuous control: distance travelled for Ant, Walker2D, and Humanoid, and height achieved for HumanoidStandup. Standard returns, by contrast, are the metrics by which these tasks are normally evaluated.
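The difference between the two metrics can be shown on a single recorded episode. The sketch below assumes the episode has been logged as plain lists of per-step rewards and positions. The function names are hypothetical, and reading "height achieved" as the maximum height reached is an assumption, since the article does not specify how it is measured.

```python
def standard_return(rewards):
    """Sum of the environment's own per-step rewards over one episode."""
    return sum(rewards)


def distance_travelled(x_positions):
    """Task objective for locomotion: net displacement along the x axis."""
    return x_positions[-1] - x_positions[0]


def height_achieved(z_positions):
    """Task objective for standing up, taken here as the maximum height (assumption)."""
    return max(z_positions)


# A made-up three-step episode.
episode_rewards = [1.0, 0.5, -0.25]
episode_x = [0.0, 0.5, 2.0]
episode_z = [0.5, 1.25, 1.0]

episode_return = standard_return(episode_rewards)
episode_distance = distance_travelled(episode_x)
episode_height = height_achieved(episode_z)
```

A task objective depends on a single quantity of interest, whereas a standard return typically blends several hand-weighted terms into one number.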

Key Findings

The authors, in their paper, list the following findings:

  • Evolving rewards trains better policies than hand-tuned baselines and, on complex problems, outperforms hyperparameter-tuned baselines, with a 489% gain over hyperparameter tuning on a single-task objective for SAC on the Humanoid task.
  • Optimisation over simpler single-task objectives produces results comparable to the carefully hand-tuned standard returns, reducing the need for manual tuning of multi-objective tasks.
  • Under the same training budget, reward tuning produces higher-quality policies faster than tuning the learning hyperparameters.

The complexity of reward design has also led to the development of RL systems with alternative reward schemes. These systems study sequential social dilemmas in a multi-agent environment and monitor the influence of one agent’s actions on the others. These approaches have shown promising results in scenarios where there is very little scope for designing reward functions by hand.

The design of reinforcement learning systems gets special attention owing to its implications in the physical world. More than most other machine learning models, RL systems are of great use in domains such as robotics.

Be it a pick-and-place robot that is learning to set down fragile objects or a surgical tool that needs to make micro-cuts, the outcomes are usually critical.

Copyright Analytics India Magazine Pvt Ltd