How Apple Will Use Self-Play To Reduce Collisions For Self-Driving Vehicles

While other tech giants such as Intel, Microsoft, and Amazon have been gradually moving into self-driving research and development and announcing upcoming projects, Apple has stayed quiet about its own self-driving plans, which is not something one expects from Apple. 2020 might be the year it finally makes some big announcements or raises the curtain on its improved designs. Apple might finally bring out Project Titan, which is a joint venture of Volkswagen and a startup it acquired in 2019.

This week, a published paper detailed Apple’s plans to make its self-driving project more robust and sophisticated using the deep reinforcement learning paradigm with self-play.

Training AI to Merge in Traffic

Merging behaviours are complex and require accurate prediction of other drivers' intentions and reactions, so traditional hard-coded behaviours lead to poor results. That is why reinforcement learning (RL) is used: RL learns policies directly through repeated interactions with an environment.


Apple research scientist Yichuan Charlie Tang details in the paper an iterative self-play procedure. The procedure creates progressively more diverse environments, which leads the agents to learn more sophisticated and robust policies. It is demonstrated in a challenging multi-agent traffic simulation in which agents must negotiate with each other to merge on or off the road successfully.

At the start, the environment is quite simple for the agents to interact in. As the agents learn, the complexity of the environment increases and diverse sets of agents are added to the agent ‘zoo’. Through self-play, the agents learn behaviours such as defensive driving, yielding, overtaking, and even the use of signal lights to communicate ‘intentions’ to other agents.

Apple’s Study To Make Self-Driving Better

In this study, the researchers implemented self-play within a two-dimensional traffic simulation built on the geometry of real roads, interpreted from satellite imagery.

This virtual representation of real roads from satellite imagery was then populated with agents capable of lane-following and safe lane changing. Simple rule-based policies turn out to be insufficient for dealing with the complexities of an environment like this. RL can give better results in such complex environments, with policies trained in the presence of the basic rule-based agents.

However, an RL policy trained this way overfits to the distribution of behaviours of the basic rule-based agents. To counter this overfitting, Tang and the team devised an iterative self-play algorithm in which previously trained RL policies are mixed with rule-based agents. This creates a diverse population of agents, and the policies in self-play can control multiple agents in the simulation simultaneously.
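
The iterative procedure described here can be sketched roughly as follows. This is only an illustration under stated assumptions, not the paper's code: `train_policy` is a hypothetical placeholder standing in for a full RL training run, policies are represented by plain strings, and the idea of adding one new policy per stage is a simplification.

```python
from typing import Callable, List


def iterative_self_play(
    num_stages: int,
    train_policy: Callable[[List[str], int], str],
) -> List[str]:
    """Grow an agent 'zoo' stage by stage.

    The zoo starts with rule-based agents only. After each stage the newly
    trained policy is added, so later stages train against a more diverse mix.
    """
    zoo = ["rule_based"]  # the first stage sees rule-based agents only
    for stage in range(1, num_stages + 1):
        # Hypothetical training run against a copy of the current zoo.
        new_policy = train_policy(list(zoo), stage)
        zoo.append(new_policy)
    return zoo


def fake_train(opponents: List[str], stage: int) -> str:
    # Placeholder: a real version would run RL against `opponents`.
    return f"rl_stage_{stage}"


zoo = iterative_self_play(3, fake_train)
```

Because each stage trains against everything produced so far, a policy cannot simply memorise the habits of the rule-based agents, which is the overfitting problem the self-play procedure is meant to address.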

As training goes on, the agents evolve and learn to operate in an increasingly complex and diverse environment. The agents' goals are to determine when to slow down, when to accelerate, and which gap to merge into. They must also learn to communicate intentions to other agents through their observable behaviour or via turn indicator signals and, ultimately, learn to estimate and predict the latent goals and beliefs of the other agents in the environment.

Throughout training, the agents worked in an environment of zipper merges, which are difficult because the left-lane driver intends to merge into the right lane and vice versa. In these merges, turn signals are used to negotiate, within a short amount of time, who goes first and which gap is filled.

The Simulation

Each simulation consisted of one AI-controlled agent in a zoo of rule-based agents, which followed their lane using adaptive cruise control, slowing down and speeding up relative to the vehicle in front. Because of the reward system, the AI agents gradually replaced the rule-based agents. The AI agents were rewarded for completing a merge and for travelling at any speed up to 15 metres per second, which is around 33.6 miles per hour.
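
To give a feel for how a rule-based agent might slow down and speed up relative to the vehicle in front, here is a minimal sketch of the Intelligent Driver Model (IDM), a standard car-following model. Using IDM here is an assumption based on the "IDM agents" the article mentions in the training stages. All parameter values are illustrative guesses rather than the paper's settings, and the 15 m/s desired speed is simply borrowed from the speed cap above.

```python
import math


def idm_acceleration(
    v: float,            # own speed (m/s)
    gap: float,          # distance to the lead vehicle (m), must be > 0
    dv: float,           # approach rate: own speed minus lead speed (m/s)
    v0: float = 15.0,    # desired speed (m/s); assumed
    T: float = 1.5,      # desired time headway (s); assumed
    a_max: float = 1.0,  # maximum acceleration (m/s^2); assumed
    b: float = 2.0,      # comfortable deceleration (m/s^2); assumed
    s0: float = 2.0,     # minimum standstill gap (m); assumed
    delta: float = 4.0,  # acceleration exponent; assumed
) -> float:
    """Return the IDM longitudinal acceleration in m/s^2."""
    # Desired gap grows with speed and with how fast we close in on the leader.
    # Clamping the dynamic part at zero is a common variant, not the only one.
    s_star = s0 + max(0.0, v * T + v * dv / (2.0 * math.sqrt(a_max * b)))
    # Free-road term pulls speed toward v0; interaction term brakes for the leader.
    return a_max * (1.0 - (v / v0) ** delta - (s_star / gap) ** 2)
```

With a long empty road ahead the model accelerates toward the desired speed; when closing in fast on a nearby leader it brakes hard, which is the kind of sudden braking a learning agent has to cope with.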

Thirty-two simulation episodes were run in parallel using an Nvidia Titan X graphics card. In each episode, around ten agents were launched with separate random destinations. An episode ended after 1,000 timesteps, when a collision occurred, or when the destinations were reached. Training was carried out in three stages:

  • 1st stage: The AI was trained in the sole presence of rule-based agents.
  • 2nd stage: Self-play training used 30% IDM (Intelligent Driver Model) agents, 40% agents controlled by the current policy and 30% RL agents from stage one.
  • 3rd stage: Agents from stage 2 were carried over into this stage.
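
As a rough illustration of the stage-two mix listed above, the snippet below draws a population of ten agents using the 30/40/30 proportions. Whether the paper samples each agent independently or fixes exact counts per episode is not stated in the article, so independent sampling is an assumption, and the labels and the `sample_population` helper are hypothetical names.

```python
import random
from collections import Counter

# Proportions taken from the stage-two description; labels are hypothetical.
STAGE_2_MIX = {
    "idm": 0.30,             # rule-based IDM agents
    "current_policy": 0.40,  # agents driven by the policy being trained
    "stage_1_rl": 0.30,      # frozen RL agents from stage one
}


def sample_population(mix, num_agents, rng):
    """Draw an agent type for each slot independently, weighted by the mix."""
    kinds = list(mix)
    weights = [mix[kind] for kind in kinds]
    return rng.choices(kinds, weights=weights, k=num_agents)


rng = random.Random(0)  # seeded only to make the sketch repeatable
population = sample_population(STAGE_2_MIX, 10, rng)
counts = Counter(population)
print(counts)
```

With only ten agents per episode the realised counts vary from episode to episode, which in itself adds some of the diversity the self-play procedure relies on.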

Problem with Rule-based Agents

When the AI agent shared the simulation with rule-based agents, it gradually replaced them, because the rule-based agents kept getting penalised for going out of bounds, colliding with other agents, or drifting away from the centre of the lane. As RL rewarded the AI agent, it stayed in the environment.
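
Putting the rewards and penalties mentioned in the article together, a per-step reward might look something like the sketch below. Only the ingredients come from the article (a merge bonus, a speed reward capped at 15 m/s, and penalties for collisions, leaving the road and drifting from the lane centre). The weights, the `StepInfo` structure and the exact functional form are assumptions made for illustration.

```python
from dataclasses import dataclass

MAX_REWARDED_SPEED = 15.0  # m/s, the speed cap mentioned in the article


@dataclass
class StepInfo:
    speed: float         # current speed (m/s)
    merged: bool         # True on the step the merge is completed
    collided: bool       # True if the agent hit another agent
    out_of_bounds: bool  # True if the agent left the road
    lane_offset: float   # lateral distance from the lane centre (m)


def step_reward(info: StepInfo) -> float:
    """Combine the reward terms; all weights are illustrative assumptions."""
    reward = 0.0
    # Speed is rewarded only up to the cap, so going faster earns nothing extra.
    capped = min(max(info.speed, 0.0), MAX_REWARDED_SPEED)
    reward += 0.1 * capped / MAX_REWARDED_SPEED
    if info.merged:
        reward += 1.0
    if info.collided:
        reward -= 1.0
    if info.out_of_bounds:
        reward -= 1.0
    # Small penalty that grows with distance from the lane centre.
    reward -= 0.05 * abs(info.lane_offset)
    return reward
```

For example, `step_reward(StepInfo(20.0, False, False, False, 0.0))` gives the same value as at 15 m/s, since speed beyond the cap is not rewarded.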

Often, when a collision involving AI agents occurred, the rule-based agents were found to be the culprit. The rule-based agents showed a tendency to brake suddenly, and because the AI agents were ultra-aggressive and never yielded, this resulted in collisions.

Outlook for Apple’s Self-Driving AI Agent

The researchers conducted around 250 random trials without adding exploration noise. Compared with the rule-based agents' success rate of 63%, the AI agents trained with the self-play strategy proposed by Tang and team achieved a 98% success rate in the presence of rule-based and other AI agents, a difference of 35 percentage points.

The self-trained agents learned human-like driving behaviour. One of the reasons Apple is quiet is that the algorithm isn’t perfect: the agents are still susceptible to collisions when braking and steering. According to Tang and team, however, the proposed self-play approach might open the door to zero collisions.

Sameer Balaganur
Sameer is an aspiring Content Writer. Occasionally writes poems, loves food and is head over heels with Basketball.
