Interaction between robots and their environment is an exciting area of study. In a lab or a controlled environment, multiple robots could be coordinated through a centralized planner. However, in real-life applications, a centralized planner may not be feasible.
Consider, for example, the rendezvous task: a group of wheeled robots must agree on a time and place to meet without explicitly communicating with each other. During the task, the individual robots must maintain network connectivity and avoid collisions. Decentralized rendezvous poses two main challenges: the obstacles in the environment, and the policies and dynamics of the other agents. Each robot must model its own and other agents’ motion and adapt to diverging intentions while using only limited information.
Google’s research team has proposed hierarchical predictive planning (HPP), a decentralized model-based reinforcement learning system to enable agents to align their goals on the fly. The team demonstrated that HPP is more effective in predicting and aligning trajectories, avoiding miscoordination, and transferring to the real world without additional fine-tuning.
Hierarchical predictive planning
HPP was first introduced at the Conference on Robot Learning 2020 in a paper titled ‘Model-based Reinforcement Learning for Decentralized Multiagent Rendezvous’.
Credit: Google AI blog
The learning system consists of three modules: prediction, planning, and control.
Each agent employs the prediction module to learn agent motion and to predict its own and other agents’ future positions, using ego-agent observations and LiDAR, respectively. These motion predictors feed each agent’s planning module, which evaluates different goal locations and maintains a belief distribution over where the team should converge. This belief distribution is periodically updated using evaluations provided by the prediction module.
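As an illustration, a drastically simplified motion predictor might roll an agent’s position forward toward a hypothesized goal. This is only a sketch: the class and method names are invented here, and the real HPP predictor is a learned model conditioned on ego and LiDAR observations rather than a hand-coded rule.

```python
import numpy as np

class MotionPredictor:
    """Toy stand-in for a learned motion predictor (names hypothetical):
    given a target agent's hypothesized goal and current position, roll
    the position forward by stepping a fixed distance toward the goal."""

    def __init__(self, step_size=0.5):
        self.step_size = step_size

    def predict(self, pose, goal, horizon=10):
        """Return the predicted sequence of 2-D positions over `horizon` steps."""
        pose = np.asarray(pose, dtype=float)
        goal = np.asarray(goal, dtype=float)
        trajectory = []
        for _ in range(horizon):
            direction = goal - pose
            dist = np.linalg.norm(direction)
            if dist > 1e-9:
                # Move step_size toward the goal, but never overshoot it.
                pose = pose + self.step_size * direction / max(dist, self.step_size)
            trajectory.append(pose.copy())
        return np.stack(trajectory)
```

In the actual system, each agent holds one such predictor per teammate (and one for itself), so the planner can query predicted trajectories under different candidate goals.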
The control module of each agent is equipped with a pre-trained navigation policy that steers the robot to a given location in an obstacle-ridden environment. The selected goal is given as input to the agent’s control module, whose policy then determines the best course of action for the robot.
Additionally, the approach proposed by Google’s team closes the loop between the control and the planning module for decentralized multiagent systems by using a sensor-informed prediction module.
Training the prediction model
HPP trains its motion predictors in simulation. Because these models have no access to other agents’ observations or control policies, the predictors are trained through self-supervision. First, to collect training data, all the agents and obstacles are placed in an environment. Each agent is given a random goal, and as the agents move toward their respective destinations, each one records its sensor observations and the poses of all the agents.
Next, using these recorded observations, each agent learns a separate predictor for every agent, including itself. The goals and labels are derived from the recorded experience: conditioned on a target agent’s goal, the model predicts where that agent will be in the future given its present position, respecting temporal causality. Predictor training uses only the information available to the agents at runtime, and takes place in environments independent of the deployment environments.
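The data-collection and labeling steps above could be sketched as follows. The function names, step sizes, and the straight-line motion rule are illustrative assumptions, not Google’s implementation; the sketch only shows how (pose, goal) → future-pose training pairs can be derived from logged trajectories.

```python
import numpy as np

def collect_episode(n_agents=3, steps=50, step_size=0.2, rng=None):
    """Simulate agents moving toward random goals while logging all poses.
    Returns the pose log of shape (steps + 1, n_agents, 2) and the goals."""
    rng = rng or np.random.default_rng(0)
    poses = rng.uniform(-5, 5, size=(n_agents, 2))
    goals = rng.uniform(-5, 5, size=(n_agents, 2))
    log = [poses.copy()]
    for _ in range(steps):
        direction = goals - poses
        dist = np.linalg.norm(direction, axis=1, keepdims=True)
        # Step toward each goal without overshooting it.
        poses = poses + step_size * direction / np.maximum(dist, step_size)
        log.append(poses.copy())
    return np.stack(log), goals

def make_training_pairs(log, goals, lookahead=5):
    """Derive self-supervised pairs: input (pose_t, goal), label pose_{t+lookahead}."""
    inputs, labels = [], []
    for t in range(len(log) - lookahead):
        for a in range(log.shape[1]):
            inputs.append(np.concatenate([log[t, a], goals[a]]))
            labels.append(log[t + lookahead, a])
    return np.array(inputs), np.array(labels)
```

A goal-conditioned regression model fit on these pairs would then serve as the motion predictor for the corresponding agent.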
At deployment, a model-based RL planner on each agent uses the learned predictors to guide it to the common meeting point. The planner simulates a centralized planner for fictitious agents, using the prediction models to roll out trajectories of agents moving toward a fixed goal.
For goal selection, each available goal option is scored using the anticipated system state, with the task reward favoring goals that bring the agents closer together. A cross-entropy method converts these goal evaluations into belief updates. Finally, the agent’s planner selects a goal and passes it to the agent’s control module.
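A minimal sketch of such a cross-entropy-style belief update over a discrete set of candidate goals might look like the following. The scoring function, smoothing constant, and all names are assumptions made for illustration; the paper’s actual reward and update details differ.

```python
import numpy as np

def score_goal(goal, predicted_poses):
    """Task reward: negative mean distance of predicted agent positions
    to the candidate goal, so goals that bring agents together score high."""
    return -np.mean(np.linalg.norm(predicted_poses - goal, axis=1))

def cem_update(belief, candidate_goals, predicted_poses,
               n_samples=50, elite_frac=0.2, rng=None):
    """One cross-entropy-style update of a categorical belief over goals:
    sample goals from the belief, keep the top-scoring elites, and refit
    the belief to the elite frequencies (with light smoothing)."""
    rng = rng or np.random.default_rng(0)
    idx = rng.choice(len(candidate_goals), size=n_samples, p=belief)
    scores = np.array([score_goal(candidate_goals[i], predicted_poses)
                       for i in idx])
    n_elite = max(1, int(elite_frac * n_samples))
    elites = idx[np.argsort(scores)[-n_elite:]]
    counts = np.bincount(elites, minlength=len(candidate_goals)).astype(float)
    new_belief = (counts + 1e-3) / (counts + 1e-3).sum()
    return new_belief
```

Iterating this update concentrates belief mass on goals the predicted trajectories can actually reach together, after which the planner hands the selected goal to the control module.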
“There are two main takeaways from our results. One is that HPP enables agents to predict and align trajectories, avoiding miscoordinations. The second takeaway is that HPP transfers directly into the real world without additional training,” the team said in a blog.