Car navigation systems have come a long way since Honda’s 1981 Electro Gyrocator. The Cadillac Escalade is a good case in point: its augmented reality-enabled navigation system overlays directional information on a live street view projected onto the cluster display, and it also provides audio prompts.
Tesla’s Full Self-Driving system is expected to implement a vector-space bird’s-eye view soon, and Mercedes-Benz, Jeep, and Ford also offer similar, if not more intuitive, navigation systems.
Yet, these systems come nowhere close to human-level performance in understanding or reacting to their surroundings. Mitsubishi Electric Research Laboratories (MERL) has developed a scene-aware interaction system that leverages its proprietary Maisart compact AI technology to analyse multimodal sensing information and provide intuitive driving instructions through the context-dependent generation of natural language.
How does it work?
Almost a decade in the making, the scene-aware interaction system started as a simple idea in 2012: an intuitive navigation system that guides drivers in a more human-like manner. Building on this, the team at MERL first developed a system capable of viewing a video and answering questions about it. Then, to support experimental autonomous driving, the team needed a novel mapping technology that uses LiDAR data to locate the vehicle on a three-dimensional map; MERL’s mobile mapping system provided the necessary data with centimetre-level precision.
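MERL’s mobile mapping system itself is proprietary, but the general idea of locating a vehicle against a prebuilt three-dimensional map can be sketched with off-the-shelf point-cloud registration. The snippet below uses Open3D’s ICP on synthetic data purely as an assumed stand-in for that localisation step, not as MERL’s method.

```python
import numpy as np
import open3d as o3d

# Synthetic stand-ins: a prebuilt "map tile" and a live LiDAR scan that is the
# same geometry shifted by an unknown vehicle offset.
map_points = np.random.rand(2000, 3) * 20.0
scan_points = map_points + np.array([1.5, 0.3, 0.0])

map_cloud = o3d.geometry.PointCloud()
map_cloud.points = o3d.utility.Vector3dVector(map_points)
scan_cloud = o3d.geometry.PointCloud()
scan_cloud.points = o3d.utility.Vector3dVector(scan_points)

# Point-to-point ICP aligns the live scan to the map; the recovered 4x4
# transformation is an estimate of the vehicle's pose relative to the map.
result = o3d.pipelines.registration.registration_icp(
    scan_cloud, map_cloud, 2.0, np.eye(4),
    o3d.pipelines.registration.TransformationEstimationPointToPoint())
print(result.transformation)
```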
The next phase was to train deep learning models to detect tens or hundreds of object classes in a scene. The team then developed a regression neural network that takes into account the route and each object’s type, size, depth, distinctness, and distance from the intersection. In parallel, they created a machine-learning network that builds a graph of the relative depths and spatial locations of all the objects in the scene, and based the language processing on this graph. MERL also improved the system’s ability to distinguish vehicles by adding attributes such as colour, make or model, and logos. The final step was to leverage natural language processing (NLP) to render the appropriate instruction or warning as a sentence using a rules-based strategy.
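The details of MERL’s models are not public, so the following is only a minimal sketch of what such a landmark-scoring regression network could look like in PyTorch; the feature layout, layer sizes, and class IDs are illustrative assumptions rather than the actual system.

```python
import torch
import torch.nn as nn

class LandmarkSalienceNet(nn.Module):
    """Toy regression network scoring how useful each detected object is as a landmark.

    Hypothetical feature layout (not MERL's): a learned embedding of the object
    class plus scalar cues such as size, depth, distance from the intersection,
    and a distinctness score.
    """
    def __init__(self, num_classes=100, embed_dim=16, num_scalar_features=4):
        super().__init__()
        self.class_embed = nn.Embedding(num_classes, embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim + num_scalar_features, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, class_ids, scalar_features):
        x = torch.cat([self.class_embed(class_ids), scalar_features], dim=-1)
        return self.mlp(x).squeeze(-1)  # one salience score per object

# Example: score three detected objects near an upcoming turn
net = LandmarkSalienceNet()
class_ids = torch.tensor([3, 17, 42])            # e.g. postbox, car, tree (made-up IDs)
features = torch.tensor([[1.2, 8.0, 5.0, 0.9],   # size, depth, distance to turn, distinctness
                         [4.5, 12.0, 20.0, 0.3],
                         [6.0, 15.0, 30.0, 0.5]])
scores = net(class_ids, features)
best_landmark = class_ids[scores.argmax()]       # most salient object to mention in guidance
```

In such a scheme, the highest-scoring object near the manoeuvre point would be the landmark referenced in the spoken instruction.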

Outcome
The scene-aware interaction system combines LiDAR with other multimodal sensing information, such as images, video, audio, and localisation data, to recognise contextual objects and events. In addition, it leverages Mitsubishi Electric’s Attentional Multimodal Fusion technology to prioritise salient unimodal information and support appropriate word selection to describe the scene accurately.
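Mitsubishi Electric has not published the internals of Attentional Multimodal Fusion; as a rough illustration only, the snippet below shows a generic attention-weighted fusion over per-modality feature vectors, with the dimensions and modalities chosen arbitrarily.

```python
import torch
import torch.nn as nn

class AttentionalFusion(nn.Module):
    """Minimal sketch of attention-weighted fusion across modality features.

    Loosely inspired by the idea of prioritising salient unimodal information;
    this is an assumption, not Mitsubishi Electric's implementation.
    """
    def __init__(self, dim=256):
        super().__init__()
        self.score = nn.Linear(dim, 1)   # one relevance score per modality

    def forward(self, modality_feats):
        # modality_feats: (batch, num_modalities, dim), e.g. image/LiDAR/audio embeddings
        weights = torch.softmax(self.score(modality_feats), dim=1)
        return (weights * modality_feats).sum(dim=1)  # weighted sum -> fused feature

fusion = AttentionalFusion()
feats = torch.randn(2, 3, 256)   # batch of 2 scenes, three modalities each
fused = fusion(feats)            # (2, 256) fused scene representation
```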
Given that most automobiles already come with a host of cameras, millimetre-wave radar, and ultrasonic sensors for safety and autonomous driving, the scene-aware interaction system can be easily integrated as a conversational agent to assist drivers with intuitive route guidance.
Navigation systems powered by the scene-aware interaction technology would first identify distinguishable visual landmarks and dynamic elements of the scene and use them to generate intuitive guidance sentences. For instance, the system would instruct the driver to “turn right before the postbox” rather than “turn right in 50 metres”. In the event of an emergency or an imminent threat, it would issue warnings such as “a pedestrian is crossing the street” or flag other objects approaching or blocking the car.
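The rules-based generation step lends itself to simple templates. The toy functions below are purely illustrative (not MERL’s natural language generator) and show how a landmark-based instruction might be preferred over a distance-based fallback.

```python
def guidance_sentence(turn_direction, distance_m, landmark=None):
    """Toy rules-based instruction generator (hypothetical, for illustration only).

    Prefers a salient landmark when one is available; otherwise falls back
    to a plain distance-based instruction.
    """
    if landmark is not None:
        return f"Turn {turn_direction} before the {landmark}."
    return f"Turn {turn_direction} in {distance_m} metres."

def warning_sentence(obj, action):
    """Toy template for imminent-threat warnings."""
    return f"A {obj} is {action}."

print(guidance_sentence("right", 50, landmark="postbox"))    # Turn right before the postbox.
print(guidance_sentence("right", 50))                        # Turn right in 50 metres.
print(warning_sentence("pedestrian", "crossing the street")) # A pedestrian is crossing the street.
```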

Roadmap
Advances in deep neural network-based object recognition, video description, natural language generation, and spoken dialogue technologies enable machines to understand their surroundings better and interact with humans more naturally and intuitively.
The scene-aware interaction technology is expected to have a wide range of applications, including human-machine interfaces for in-vehicle infotainment, interaction with robots in building and factory automation systems, systems that monitor people’s health and well-being, surveillance systems that interpret complex scenes for humans and encourage social distancing, support for touchless operations of public equipment, and much more.