Google USM Shatters Language Barriers with Multilingual Speech Recognition Model

The model's encoder is pre-trained on a vast unlabeled multilingual dataset of 12 million hours that covers over 300 languages.

A team of researchers at Google has published a research paper, ‘Google USM: Scaling Automatic Speech Recognition’, that introduces the Universal Speech Model (USM) – a single large model that performs automatic speech recognition (ASR) in more than 100 languages.

The model’s encoder is pre-trained on a vast unlabelled multilingual dataset of 12 million hours covering over 300 languages, and then fine-tuned on a smaller labelled dataset. Multilingual pre-training with random-projection quantization and speech-text modality matching is used to achieve state-of-the-art performance on downstream multilingual ASR and speech-to-text translation tasks.

The team also demonstrates that despite using a labelled training set one-seventh the size of that used for the Whisper model, their model exhibits comparable or better performance on both in-domain and out-of-domain speech recognition tasks across many languages.


Training Process

The training process draws on three different types of datasets.


Training is divided into three stages. In the first, a Conformer backbone is trained on a vast unlabelled speech dataset to optimize the BEST-RQ objective. In the second, the speech representation learning model is trained further to jointly optimize several objectives: BEST-RQ on unlabelled speech; modality matching, supervised ASR, and duration-modelling losses on paired speech and transcript data; and a text reconstruction objective with an RNN-T decoder on unlabelled text. The final stage fine-tunes the encoder pre-trained in the earlier stages for the downstream ASR or AST tasks.

Firstly, they use unpaired audio datasets, including the YT-NTL-U, which is a massive collection of over 12 million hours of audio content sourced from YouTube, in over 300 different languages, without labels. The second unpaired audio dataset is Pub-U, which contains over 429,000 hours of speech content in 51 languages, sourced from public datasets, without labels.

Secondly, the team uses an unpaired text dataset called Web-NTL, which contains more than 28 billion sentences in over 1,140 different languages.

Lastly, the team utilizes two paired ASR corpora of audio with matching transcripts for supervised training. The first corpus is YT-SUP+, which includes 90,000 hours of labelled data from 73 languages and an additional 100,000 hours of pseudo-labelled en-US data generated from the YT-NTL-U dataset using the Noisy Student Training (NST) technique. The second corpus is Pub-S, which includes 10,000 hours of labelled data from multi-domain en-US public sources, and an additional 10,000 hours of labelled data from public sources in over 102 languages.

The process for building 2B-parameter Conformer models using various datasets involves the following steps:

The encoder of the model is pre-trained using BEST-RQ, which applies a BERT-style masked-prediction objective with a frozen random-projection quantizer. This unsupervised pre-training is performed using YT-NTL-U.
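In rough outline, BEST-RQ derives discrete training targets for each speech frame from a frozen random projection and a frozen random codebook, so no quantizer has to be learned. A minimal NumPy sketch of the target-assignment step (the dimensions, codebook size, and plain L2 distance here are illustrative assumptions, not the paper's exact settings):

```python
import numpy as np

def best_rq_targets(features, proj, codebook):
    """Assign each speech frame a discrete label via random projection.

    features: (T, d) frames; proj: fixed random (d, k) matrix;
    codebook: (V, k) fixed random codebook. Both are frozen, never trained.
    """
    projected = features @ proj  # (T, k)
    # Nearest codebook vector per frame (plain L2 here for brevity)
    dists = np.linalg.norm(projected[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1)  # (T,) discrete targets for masked prediction

rng = np.random.default_rng(0)
frames = rng.normal(size=(100, 80))     # 100 frames of 80-dim features
proj = rng.normal(size=(80, 16))        # frozen random projection
codebook = rng.normal(size=(1024, 16))  # frozen random codebook, 1024 entries
targets = best_rq_targets(frames, proj, codebook)
```

The encoder is then trained to predict these targets for masked frames, much like BERT predicts masked tokens.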

The model can be further prepared through a multi-objective supervised pre-training pipeline called MOST, which utilizes a combination of four datasets: YT-NTL-U, Pub-U, Web-NTL, and Pub-S. During this process, the BEST-RQ masked language model loss is combined with text-injection losses, including supervised ASR loss and modality matching losses. These losses are optimized as a weighted sum during training.

To prepare the model for downstream tasks, generic ASR models are trained with connectionist temporal classification (CTC) and Listen, Attend and Spell (LAS) decoders in supervised ASR training.

The team’s USM models have achieved excellent results in multilingual ASR and AST on datasets across different domains. These include SpeechStew (ASR for a single language), CORAAL (ASR for African American Vernacular English), FLEURS (multilingual ASR), YT (long-form ASR for multiple languages), and CoVoST (AST from English to multiple languages). The team also developed an ASR model for YouTube captions that, in 18 selected languages, performs better than Whisper, a general ASR system trained on more transcribed data.

They also found that BEST-RQ pre-training is an effective method for scaling speech representation learning to large datasets. When combined with text injection in MOST, it improves the quality of downstream speech tasks, achieving state-of-the-art performance on the FLEURS and CoVoST 2 benchmarks.

The MOST representations can adapt quickly to new domains by training lightweight residual adapter modules, which add only 2% more parameters while keeping the rest of the model frozen.
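A residual adapter of this kind is a small bottleneck network inserted into each layer while the backbone stays frozen. A minimal NumPy sketch (layer and bottleneck sizes are illustrative assumptions; USM's exact adapter design may differ):

```python
import numpy as np

def residual_adapter(x, w_down, w_up):
    """Lightweight residual adapter: down-project, nonlinearity, up-project,
    then add back the input. Only w_down/w_up are trained during adaptation;
    the frozen backbone's output x passes through unchanged via the residual."""
    h = np.maximum(x @ w_down, 0.0)  # ReLU bottleneck
    return x + h @ w_up              # residual connection

d_model, bottleneck = 1024, 16  # illustrative sizes
rng = np.random.default_rng(0)
w_down = rng.normal(scale=0.02, size=(d_model, bottleneck))
w_up = np.zeros((bottleneck, d_model))  # zero-init: adapter starts as identity
x = rng.normal(size=(4, d_model))
out = residual_adapter(x, w_down, w_up)
# Adapter weights per layer: 2 * d_model * bottleneck values, a small
# fraction of the layer's own parameters.
```

Because the bottleneck is narrow, the adapter adds only a few percent of extra parameters per domain while the shared model remains untouched.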

To improve the performance of ASR models on long-form speech inputs, the team also introduced chunk-wise attention, a scalable method that produces high-quality transcripts for long utterances in the YouTube evaluation sets when used with USM-CTC/LAS models.
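The core idea of chunk-wise attention is to restrict self-attention to fixed-size chunks of the input, so cost grows linearly with utterance length rather than quadratically. A minimal NumPy sketch of non-overlapping chunked attention (chunk size and shapes are illustrative; USM's exact formulation may differ):

```python
import numpy as np

def chunkwise_attention(q, k, v, chunk):
    """Self-attention computed independently within fixed-size chunks.

    q, k, v: (T, d) arrays. Each chunk of `chunk` frames attends only to
    itself, so cost is O(T * chunk * d) instead of O(T^2 * d)."""
    T, d = q.shape
    out = np.zeros_like(v)
    for start in range(0, T, chunk):
        s = slice(start, min(start + chunk, T))
        scores = q[s] @ k[s].T / np.sqrt(d)
        scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax over chunk
        out[s] = weights @ v[s]
    return out

rng = np.random.default_rng(0)
q = k = v = rng.normal(size=(10, 8))
out = chunkwise_attention(q, k, v, chunk=4)
```

Limiting each frame's attention window this way keeps memory bounded on hour-long audio, at the cost of no attention across chunk boundaries.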

Meanwhile, last year, NVIDIA’s NeMo framework was used to develop a Telugu automatic speech recognition (ASR) model, which won the IIIT-Hyderabad and Telugu ASR Challenge competitions with a low word error rate. The toolkit is open-source and supports multi-GPU and multi-node cluster training. The team preprocessed the data, removed errors, and trained the model for 160 epochs using NeMo, while fine-tuning pre-trained models on an NVIDIA DGX system for the open track. The model outperformed other ASR frameworks, achieving almost 2% fewer word errors than the runner-up.


Shritama Saha
Shritama is a technology journalist keen to learn about AI and analytics. A graduate in mass communication, she is passionate about exploring the influence of data science on fashion, drug development, films, and art.
