Google has introduced Visual Captions, a system that uses verbal cues to enhance synchronous video communication with real-time visuals. The system suggests relevant visuals during conversations and is part of the ARChat project, which aims to facilitate augmented communication with real-time transcription.
To gather early design insights, 10 participants with diverse backgrounds were invited to discuss the idea. These discussions surfaced eight design dimensions for visual augmentation in conversations, including the timing of visual augmentations, their role in expressing and understanding speech content, the types and sources of visual content, meeting scale and setting, privacy settings, and how the interaction is initiated and carried out.
Based on the initial feedback, Visual Captions was designed to generate synchronous visuals that are semantically relevant to the ongoing conversation. The system was tested in various scenarios, including one-to-one remote conversations, presentations, and group discussions.
To train the system, a dedicated dataset called VC1.5K was created, pairing spoken sentences with the visual content, visual type, and visual source they call for, across a range of conversational contexts. A large language model was then trained on this dataset to predict these visual intents, outperforming keyword-based approaches and achieving high accuracy.
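Conceptually, each training example pairs a spoken sentence with a structured visual intent. A minimal sketch of that representation in Python (the field names, serialization format, and example values here are illustrative assumptions, not the project's exact schema):

```python
from dataclasses import dataclass

@dataclass
class VisualIntent:
    """Structured prediction target: what to show, as what, from where."""
    content: str   # e.g. "golden gate bridge"
    vis_type: str  # e.g. "photo", "map", "chart"
    source: str    # e.g. "image search", "personal album"

def to_training_text(sentence: str, intent: VisualIntent) -> str:
    """Serialize a <sentence, intent> pair as the kind of text-to-text
    example a language model could be trained on (format assumed)."""
    return (f"Input: {sentence}\n"
            f"Output: <{intent.content}, {intent.vis_type}, {intent.source}>")

example = to_training_text(
    "I saw the Golden Gate Bridge last summer",
    VisualIntent("golden gate bridge", "photo", "image search"),
)
print(example)
```

At inference time, the model would emit the same structured triple for a live transcript snippet, which downstream components can parse to retrieve a matching visual.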
User studies were conducted to assess the effectiveness of Visual Captions. Participants found the visuals informative, high-quality, and relevant, and judged that the predicted visual type and source matched the conversation's context.
Visual Captions was developed on the ARChat platform, integrating interactive widgets onto video conferencing platforms like Google Meet. The system captures user speech, predicts visual intents in real time, retrieves relevant visuals, and suggests them to users. It offers three levels of proactivity in suggesting visuals: auto-display, auto-suggest, and on-demand-suggest.
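The three proactivity levels can be sketched as a simple dispatch: under auto-display a retrieved visual is shown immediately, under auto-suggest it is proposed for the user to accept, and under on-demand-suggest it surfaces only after an explicit request. This is a behavioural sketch under those assumptions, not the actual UI logic:

```python
from enum import Enum

class Proactivity(Enum):
    AUTO_DISPLAY = "auto-display"        # show visuals without asking
    AUTO_SUGGEST = "auto-suggest"        # propose; user clicks to show
    ON_DEMAND = "on-demand-suggest"      # suggest only when the user asks

def handle_visual(level: Proactivity, user_requested: bool) -> str:
    """Decide what happens to a retrieved visual under each mode."""
    if level is Proactivity.AUTO_DISPLAY:
        return "display"
    if level is Proactivity.AUTO_SUGGEST:
        return "suggest"
    # on-demand: hold the visual until the user explicitly requests one
    return "suggest" if user_requested else "hold"

print(handle_visual(Proactivity.AUTO_DISPLAY, False))  # display
print(handle_visual(Proactivity.ON_DEMAND, False))     # hold
```

Keeping the mode as a single parameter like this makes it easy to let each user (or each meeting type) pick a different level of proactivity, which matches the study finding that preferences varied by social scenario.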
User studies, including controlled lab studies and in-the-wild deployment studies, were conducted to evaluate the system. Real-time visuals were found to enhance conversations by explaining concepts, resolving language ambiguities, and increasing engagement. Different levels of proactivity were preferred in different social scenarios.
To sum up, Visual Captions is a system that augments verbal communication with real-time visuals. It was trained on a dataset of visual intents and deployed on the ARChat platform. The system improves communication by leveraging visual cues and provides a foundation for further research in this field. Recognizing the role visuals play in everyday conversations opens the door to more effective communication tools that strengthen human connections.