Google has introduced Room-Across-Room (RxR), a first-of-its-kind multilingual dataset for vision-and-language navigation (VLN). VLN is the task of steering an agent (such as a robot) through a 3D environment by following instructions given in natural language.
VLN is a major challenge in machine learning. While the technology today is advanced enough for robot-like agents to navigate complex human environments, following natural language commands is still a significant roadblock.
The RxR dataset addresses the role of language in VLN, helping systems better connect language commands to the physical world they describe.
What Is The VLN Challenge?
Even toddlers find it easy to follow verbal cues such as ‘turn left’ or ‘walk straight’. For artificial agents, however, following even the simplest navigation commands is a major challenge, for two reasons:
- The complexities that arise due to linguistic variations of the commands
- The noisy visual signals with rich dynamics from the real-world environment
It is important to distinguish VLN from other vision-and-language tasks, where the visual input is usually fixed. In VLN, the agent interacts with its environment, and the pixels it perceives change as it moves. The agent must therefore learn to ground natural-language commands in its visual perception of the world and choose the right course of action.
The RxR Dataset
The RxR dataset is the first multilingual dataset for VLN, with 126,069 human-annotated navigation instructions in three languages: English, Hindi, and Telugu. Each instruction describes a path through a photorealistic simulator rendering indoor environments: 3D captures of homes, offices, and public buildings from the Matterport3D dataset.
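To make the structure concrete, here is a minimal sketch of working with RxR-style annotation records, where each record pairs an instruction in one language with a path through the environment. The field names and values below are illustrative assumptions, not the dataset's exact schema:

```python
from collections import Counter

# Hypothetical RxR-style guide annotations: each record pairs a
# natural-language instruction with a path, i.e. a sequence of
# viewpoint IDs in a Matterport3D scan. Records here are made up.
records = [
    {"instruction_id": 1, "language": "en-US",
     "instruction": "Turn left and walk past the sofa.",
     "path": ["vp_01", "vp_02", "vp_03"]},
    {"instruction_id": 2, "language": "hi-IN",
     "instruction": "(Hindi instruction)",
     "path": ["vp_10", "vp_11"]},
    {"instruction_id": 3, "language": "te-IN",
     "instruction": "(Telugu instruction)",
     "path": ["vp_20", "vp_21", "vp_22", "vp_23"]},
]

# Group instructions by language, as RxR spans English, Hindi, and Telugu.
by_language = Counter(r["language"] for r in records)

# Path length in viewpoints is one simple measure of navigation difficulty.
avg_len = sum(len(r["path"]) for r in records) / len(records)
print(by_language, avg_len)
```

A real pipeline would read such records from the released annotation files rather than defining them inline.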
RxR also includes a detailed multimodal annotation called the pose trace, which establishes correspondences between language, vision, and movement in a 3D setting. Annotation involves two roles: Guide and Follower. First, Guide annotators move along a given path in the simulated environment while narrating it; the pose trace records their observations and time-aligns them with the spoken navigation instructions. Next, Follower annotators attempt to retrace the intended path by listening to the Guide's audio.
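The time-alignment idea above can be sketched as follows: given a time-stamped pose trace and word timings from the transcript, each word maps to the pose the Guide held while saying it. The data layout and timings are illustrative assumptions, not RxR's actual file format:

```python
import bisect

# Toy pose trace: (time_s, panorama_id, heading_deg) samples recorded
# while a Guide narrates the path. Values are made up for illustration.
pose_trace = [
    (0.0, "pano_A", 0.0),
    (2.5, "pano_A", 90.0),
    (5.0, "pano_B", 90.0),
    (8.0, "pano_C", 180.0),
]
# Word start times from a time-aligned transcript (also illustrative).
word_timings = [("turn", 2.4), ("left", 2.9), ("then", 5.1), ("stop", 8.2)]

pose_times = [p[0] for p in pose_trace]

def pose_at(t):
    """Return the most recent pose sampled at or before time t."""
    i = bisect.bisect_right(pose_times, t) - 1
    return pose_trace[max(i, 0)]

# Each word is grounded in the panorama the Guide saw while saying it.
aligned = {word: pose_at(t)[1] for word, t in word_timings}
print(aligned)
```

This kind of word-to-pose grounding is what lets researchers study which part of the scene an instruction refers to at each moment.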
The RxR dataset contains nearly 10 million words, roughly ten times more than any other existing VLN dataset, including R2R and Touchdown/Retouchdown. This scale matters because language tasks that require an agent to move through or interact with its environment have long suffered from a lack of data.
Further, the RxR benchmark addresses shortcomings of earlier benchmarks such as R2R, in which all paths are shortest paths of similar length, so an agent can often succeed simply by heading straight for the goal. Instead, with RxR:
- Agents can take paths that are longer and less predictable.
- Agents can navigate indirect routes to reach their destination.
- Viewpoints in each environment are covered uniformly, which maximises the diversity of landmarks along the paths.
These are important advantages: models trained this way generalise better to new environments and must follow the navigation instructions with greater fidelity.
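The contrast with shortest-path benchmarks can be sketched on a toy navigation graph: a breadth-first search gives the direct route length, while an RxR-style annotated path is deliberately longer and more indirect. The graph and paths below are invented for illustration:

```python
from collections import deque

# Toy navigation graph over panoramic viewpoints (made-up connectivity).
nav_graph = {
    "A": ["B", "D"],
    "B": ["A", "C"],
    "C": ["B", "E"],
    "D": ["A", "E"],
    "E": ["C", "D"],
}

def shortest_hops(graph, start, goal):
    """Breadth-first search: number of edges on the shortest path."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

# An R2R-style episode would use the 2-hop shortest route A -> D -> E;
# an RxR-style path may take the longer, indirect route through B and C.
annotated_path = ["A", "B", "C", "E"]
direct = shortest_hops(nav_graph, "A", "E")
print(len(annotated_path) - 1, direct)
```

Because the annotated path is longer than the direct route, an agent cannot score well just by beelining to the goal; it has to actually follow the instruction.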
Google’s team hopes that RxR will expand the scope of research on grounded natural-language learning and reduce the dependence on high-resource languages such as English for instructing agents.
Read the full paper here.
Along with the RxR dataset, Google has also released the Panoramic Graph Environment Annotation toolkit (PanGEA), a custom web-based annotation tool used to collect the RxR dataset. It supports speech recording and virtual pose tracking, and can align the resulting pose trace with the manual transcript.