Google Unveils New Universal Speech Model, Performs Better than OpenAI Whisper

A critical first step towards supporting 1,000 languages.

Google researchers have recently unveiled an update to their Universal Speech Model (USM), a step towards supporting 1,000 languages. According to the researchers, the model outperforms OpenAI's Whisper across all segments of automatic speech recognition — and it also powers better YouTube captions.

Researchers can request access to the USM API here.

The paper, ‘Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages’, shows that pre-training the model’s encoder on a large unlabelled multilingual dataset and then fine-tuning it on a smaller labelled set enables recognition of under-represented languages. Moreover, the training process adapts effectively to new languages and data.


The researchers demonstrated the effectiveness of the pre-trained encoder by fine-tuning it on YouTube Captions’ multilingual speech data. Despite the limited supervised data available from YouTube, the model achieves a word error rate (WER) below 30% on average across 73 languages, a milestone not reached before. On 18 of these languages, USM’s WER is on average 32.7% relatively lower than that of Whisper (large-v2), which was trained with more than 400k hours of labelled data. USM also outperforms Whisper across all segments of automatic speech recognition.
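For readers unfamiliar with the metric, a quick sketch of what a “relative WER reduction” means (the numbers below are illustrative, not from the paper):

```python
def relative_wer_reduction(baseline_wer: float, model_wer: float) -> float:
    """Percentage by which model_wer improves on baseline_wer, relatively."""
    return 100.0 * (baseline_wer - model_wer) / baseline_wer

# e.g. a baseline at 25% WER vs a model at 17% WER:
print(relative_wer_reduction(25.0, 17.0))  # 32.0 -> "32% relatively lower WER"
```

So a 32.7% relative reduction does not mean the WER dropped by 32.7 percentage points — it means USM’s error rate is roughly two-thirds of Whisper’s on those languages.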

The 1,000 Languages Initiative — to build a machine learning model supporting the world’s thousand most-spoken languages for better inclusivity globally — was launched last November. However, some of these languages are spoken by fewer than twenty million people, so the principal challenge is figuring out how to support languages with few speakers or limited available data.


The USM is a family of speech models with two billion parameters, trained on a vast dataset of 12 million hours of speech and 28 billion sentences of text covering over 300 languages. The models are used in YouTube (for closed captions) and can perform automatic speech recognition not only on widely spoken languages but also on under-resourced languages such as Amharic, Cebuano, Assamese, and Azerbaijani.


The updated model uses the standard encoder-decoder architecture. The Conformer, or convolution-augmented transformer, is used as the encoder. Its key component is the Conformer block, which consists of attention, feed-forward, and convolutional modules. The encoder takes the log-mel spectrogram of the speech signal as input and performs convolutional sub-sampling, after which a series of Conformer blocks and a projection layer are applied to obtain the final embeddings.
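A minimal sketch of that encoder data flow, using random weights and made-up layer sizes — this traces the shapes only and is not USM’s actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_subsample(x: np.ndarray, factor: int = 4) -> np.ndarray:
    # Stand-in for convolutional sub-sampling: keep every `factor`-th frame.
    return x[::factor]

def conformer_block(x: np.ndarray) -> np.ndarray:
    # Each block combines feed-forward, self-attention, and convolution
    # modules with residual connections; here each module is approximated
    # by a random linear map plus a nonlinearity.
    d = x.shape[-1]
    for _ in range(3):  # feed-forward / attention / convolution stand-ins
        w = rng.standard_normal((d, d)) / np.sqrt(d)
        x = x + np.tanh(x @ w)
    return x

log_mel = rng.standard_normal((100, 80))       # 100 frames, 80 mel bins
x = conv_subsample(log_mel)                    # sub-sampled to (25, 80)
x = x @ rng.standard_normal((80, 256)) / 9.0   # project to model dimension
for _ in range(2):                             # stack of Conformer blocks
    x = conformer_block(x)
embeddings = x @ rng.standard_normal((256, 256)) / 16.0  # final projection
print(embeddings.shape)  # (25, 256)
```

The point of the sub-sampling step is that the Conformer blocks then operate on a shorter sequence, which keeps the quadratic attention cost manageable for long audio.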

The model’s training starts with self-supervised learning on speech audio covering hundreds of languages. For this, the researchers use BEST-RQ, which is effective on multilingual tasks when very large amounts of unsupervised audio data are available.
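The core trick in BEST-RQ is a random-projection quantizer: a frozen random projection and a frozen random codebook turn continuous speech frames into discrete targets for BERT-style masked prediction, with no human labels. A toy sketch of the quantizer (dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

feat_dim, proj_dim, codebook_size = 80, 16, 512
projection = rng.standard_normal((feat_dim, proj_dim))     # frozen, never trained
codebook = rng.standard_normal((codebook_size, proj_dim))  # frozen, never trained

def quantize(frames: np.ndarray) -> np.ndarray:
    """Map each speech frame to the index of its nearest codebook entry."""
    z = frames @ projection                                  # project into codebook space
    z = z / np.linalg.norm(z, axis=-1, keepdims=True)        # normalise frames...
    c = codebook / np.linalg.norm(codebook, axis=-1, keepdims=True)  # ...and codes
    return np.argmax(z @ c.T, axis=-1)                       # nearest neighbour index

frames = rng.standard_normal((100, feat_dim))  # 100 unlabeled speech frames
targets = quantize(frames)                     # discrete labels, shape (100,)
```

During pre-training, spans of the input frames are masked and the encoder is trained to predict these indices — so the quantizer only manufactures supervision; it never needs to be learned itself.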

In the optional second step, the researchers used multi-objective supervised pre-training to incorporate additional text data, improving the model’s quality and language coverage. Whether to include this step depends on the availability of text data, but USM performs best with it.

In the last stage, the model is fine-tuned on the downstream tasks. Thanks to pre-training, it achieves good quality with only a small amount of supervised data from those tasks.
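The three-stage recipe can be summarised in pseudocode. The function names below are purely illustrative (USM’s API is access-gated, and none of these are real Google calls); the bodies are trivial stand-ins that just record which stages ran:

```python
# Hypothetical outline of the three-stage training recipe described above.
def best_rq_pretrain(unlabeled_audio):
    return ["best_rq_pretrain"]

def multi_objective_pretrain(encoder, text_corpus):
    return encoder + ["multi_objective_pretrain"]

def fine_tune(encoder, labeled_task_data):
    return encoder + ["fine_tune"]

def train_usm(unlabeled_audio, text_corpus, labeled_task_data):
    encoder = best_rq_pretrain(unlabeled_audio)       # stage 1: self-supervised
    if text_corpus is not None:                       # stage 2 is optional
        encoder = multi_objective_pretrain(encoder, text_corpus)
    return fine_tune(encoder, labeled_task_data)      # stage 3: downstream ASR

print(train_usm("audio", "text", "labels"))
# ['best_rq_pretrain', 'multi_objective_pretrain', 'fine_tune']
```

The structure makes the paper’s claim concrete: only stage 3 consumes labelled task data, which is why a small supervised set suffices.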


Tasmia Ansari
Tasmia is a tech journalist at AIM, looking to bring a fresh perspective to emerging technologies and trends in data science, analytics, and artificial intelligence.


