Google Unveils Visual Captions to Revamp Video Communication with Real-Time Visuals

The system suggests relevant visuals during conversations and is part of the ARChat project, which aims to facilitate augmented communication with real-time transcription.
Google has introduced Visual Captions, a system that uses verbal cues to augment synchronous video communication with real-time visuals. As people talk, it suggests visuals relevant to the conversation. It is part of the ARChat project, which aims to facilitate augmented communication with real-time transcription.

To gather insights, the researchers invited 10 participants with diverse backgrounds. The discussions identified eight dimensions of visual augmentation in conversations, including the timing of visual augmentations, their role in expressing and understanding speech content, the types and sources of visual content, meeting scale and setting, privacy settings, how an interaction is initiated, and the methods of interaction.

Based on the initial feedback, Visual Captions was designed to generate synchronous visuals that are semantically relevant to the ongoing conversation. The system was tested in various scenarios, including one-to-one remote conversations, presentations, and group discussions.

To train the system, a dedicated dataset called VC1.5K was created. Each entry pairs a piece of language with the visual content, visual type, and visual source that suit it, across different contexts. A visual-intent prediction model built on a large language model was trained on this dataset; it outperformed keyword-based approaches and achieved high accuracy.
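To make the shape of such a dataset entry concrete, here is a minimal Python sketch. It is an assumption about structure only: the class name, field names, the example sentence, and the one-object-per-line serialization are invented for illustration and are not taken from VC1.5K itself.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class VisualIntentRecord:
    """One hypothetical entry: what was said and which visual would fit it."""
    language: str        # the transcribed sentence
    visual_content: str  # what the visual should depict
    visual_type: str     # kind of visual, e.g. "photo" (illustrative value)
    source: str          # where the visual would come from (illustrative value)


def to_jsonl(records):
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in records)


# Invented example, not an actual dataset entry.
example = VisualIntentRecord(
    language="We hiked up to the waterfall last weekend.",
    visual_content="waterfall hiking trail",
    visual_type="photo",
    source="public search",
)

print(to_jsonl([example]))
```

Keeping the four fields separate, rather than storing a single keyword, is what lets a model learn not just what to show but also which kind of visual to show and where to fetch it from.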

User studies were conducted to assess the effectiveness of Visual Captions. Participants found the visuals to be informative, high-quality, and relevant. The visual type and source accurately matched the conversation’s context.

Visual Captions was developed on the ARChat platform, integrating interactive widgets onto video conferencing platforms like Google Meet. The system captures user speech, predicts visual intents in real time, retrieves relevant visuals, and suggests them to users. It offers three levels of proactivity in suggesting visuals: auto-display, auto-suggest, and on-demand-suggest.
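The flow described above (capture speech, predict a visual intent, retrieve visuals, then suggest them according to the proactivity level) can be sketched roughly as follows. This is not Google's implementation: every name here is hypothetical, the predictor and retriever are toy stand-ins, and the behavior attached to each proactivity level is an assumption inferred from its name.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Proactivity(Enum):
    AUTO_DISPLAY = "auto-display"            # assumed: show a visual without asking
    AUTO_SUGGEST = "auto-suggest"            # assumed: offer candidates for the user to pick
    ON_DEMAND_SUGGEST = "on-demand-suggest"  # assumed: offer candidates only when asked


@dataclass
class VisualIntent:
    content: str
    visual_type: str
    source: str


@dataclass
class Action:
    kind: str           # "display", "suggest", or "none"
    visuals: List[str]


def handle_utterance(text, level, predict_intent, retrieve, user_requested=False):
    """Run one transcribed utterance through the predict -> retrieve -> suggest steps."""
    if level is Proactivity.ON_DEMAND_SUGGEST and not user_requested:
        return Action("none", [])
    intent = predict_intent(text)
    if intent is None:
        return Action("none", [])
    visuals = retrieve(intent)
    if not visuals:
        return Action("none", [])
    if level is Proactivity.AUTO_DISPLAY:
        return Action("display", visuals[:1])
    return Action("suggest", visuals)


def toy_predict(text):
    """Toy stand-in for the trained model (a keyword match, purely for the demo)."""
    if "bridge" in text.lower():
        return VisualIntent("Golden Gate Bridge", "photo", "public search")
    return None


def toy_retrieve(intent):
    """Toy stand-in for image retrieval; returns placeholder labels, not real images."""
    return [f"{intent.visual_type} of {intent.content} #{i}" for i in (1, 2)]


if __name__ == "__main__":
    utterance = "I walked across the bridge at sunset"
    for level in Proactivity:
        print(level.value, handle_utterance(utterance, level, toy_predict, toy_retrieve))
```

Passing the predictor and retriever in as functions keeps the proactivity logic separate from the model, so the same control flow would apply whichever intent model or image source is plugged in.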

User studies, including controlled lab studies and in-the-wild deployment studies, were conducted to evaluate the system. Real-time visuals were found to enhance conversations by explaining concepts, resolving language ambiguities, and increasing engagement. Different levels of proactivity were preferred in different social scenarios.

To sum up, Visual Captions is a system that augments verbal communication with real-time visuals. It has been trained using a dataset of visual intents and deployed on the ARChat platform. The system improves communication by leveraging visual cues and provides a foundation for further research in this field. By recognizing the importance of visuals in everyday conversations, more effective communication tools can be developed to enhance human connections.

Shritama Saha
Shritama Saha is a technology journalist who is keen to learn about AI and analytics. A graduate in mass communication, she is passionate about exploring the influence of data science on fashion, drug development, films, and art.
