Google USM Shatters Language Barriers with Multilingual Speech Recognition Model

The model's encoder is pre-trained on a vast unlabeled multilingual dataset of 12 million hours that covers over 300 languages.

A team of researchers at Google has published a research paper, ‘Google USM: Scaling Automatic Speech Recognition’, which introduces the Universal Speech Model (USM) – a single large model that performs automatic speech recognition (ASR) in more than 100 languages.

The model’s encoder is pre-trained on a vast unlabeled multilingual dataset of 12 million hours covering over 300 languages, and then fine-tuned on a smaller labelled dataset. Multilingual pre-training with random-projection quantization and speech-text modality matching is used to achieve state-of-the-art performance on downstream multilingual ASR and speech-to-text translation tasks.

The team also demonstrates that despite using a labelled training set one-seventh the size of that used for the Whisper model, their model exhibits comparable or better performance on both in-domain and out-of-domain speech recognition tasks across many languages.

Training Process

The training process relies on three different types of datasets.

The training is divided into three stages. In the first stage, a Conformer backbone is trained on a vast unlabeled speech dataset by optimizing the BEST-RQ objective. In the second stage, the speech representation learning model is trained further against several objectives: BEST-RQ on unlabeled speech; modality matching, supervised ASR and duration modelling losses on paired speech and transcript data; and a text reconstruction objective with an RNN-T decoder on unlabeled text. In the final stage, the encoder pre-trained in the earlier stages is fine-tuned for the downstream ASR or AST tasks. A rough outline of the three stages is sketched below.
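
Put together, the pipeline can be pictured as three sequential optimization phases. The outline below is purely illustrative: the function names and arguments are hypothetical stand-ins, not Google's actual training code.

```python
# Illustrative three-stage outline of the USM training recipe (hypothetical names).

def train_best_rq(encoder, unlabeled_speech):
    """Stage 1: self-supervised pre-training of the Conformer encoder
    with the BEST-RQ masked-prediction objective."""
    ...

def train_most(encoder, unlabeled_speech, unlabeled_text, paired_data):
    """Stage 2 (MOST): continue pre-training on a weighted sum of BEST-RQ,
    modality-matching, supervised-ASR, duration-modelling and
    text-reconstruction (RNN-T) losses."""
    ...

def finetune(encoder, labeled_data, task="asr"):
    """Stage 3: fine-tune the pre-trained encoder for downstream ASR or AST."""
    ...
```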

Firstly, they use unpaired audio datasets, including the YT-NTL-U, which is a massive collection of over 12 million hours of audio content sourced from YouTube, in over 300 different languages, without labels. The second unpaired audio dataset is Pub-U, which contains over 429,000 hours of speech content in 51 languages, sourced from public datasets, without labels.

Secondly, the team uses an unpaired text dataset called Web-NTL, which contains more than 28 billion sentences in over 1,140 different languages.

Lastly, the team utilizes two paired ASR data corpora, each containing over 10,000 hours of audio content with matching text for supervised training. The first corpus is YT-SUP+, which includes 90,000 hours of labelled data from 73 languages and an additional 100,000 hours of pseudo-labelled en-US data generated using the Noisy Student Training (NST) technique from the YT-NTL-U dataset. The second corpus is Pub-S, which includes 10,000 hours of labelled data from multi-domain en-US public sources, and an additional 10,000 hours of labelled data from public sources in over 102 languages.
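
For quick reference, the corpora described above can be summarised as follows; the snippet simply restates the figures from the paragraphs above in one place.

```python
# Training corpora used for USM, as described above.
usm_datasets = {
    "YT-NTL-U": {"type": "unpaired audio", "size": "12M+ hours",     "languages": "300+"},
    "Pub-U":    {"type": "unpaired audio", "size": "429k+ hours",    "languages": "51"},
    "Web-NTL":  {"type": "unpaired text",  "size": "28B+ sentences", "languages": "1,140+"},
    "YT-SUP+":  {"type": "paired ASR",
                 "size": "90k hours labelled + 100k hours pseudo-labelled en-US (NST)",
                 "languages": "73"},
    "Pub-S":    {"type": "paired ASR",
                 "size": "10k hours en-US + 10k hours multilingual",
                 "languages": "102+"},
}
```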

The process for building 2B-parameter Conformer models using various datasets involves the following steps:

The encoder of the model is pre-trained using BEST-RQ, which is based on BERT-style masked prediction and incorporates a random-projection quantizer. This unsupervised pre-training is performed on YT-NTL-U.
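
The quantizer itself is simple to sketch: a frozen random matrix projects each speech frame, the nearest vector in a frozen random codebook supplies a discrete target, and the encoder is trained to predict those targets at masked positions. The snippet below is a minimal, self-contained illustration of that quantization step, assuming arbitrary dimensions; it is not Google's implementation.

```python
import torch

torch.manual_seed(0)

# Frozen, randomly initialised projection and codebook (never trained in BEST-RQ).
feat_dim, proj_dim, codebook_size = 80, 16, 8192
projection = torch.randn(feat_dim, proj_dim)
codebook = torch.nn.functional.normalize(torch.randn(codebook_size, proj_dim), dim=-1)

def best_rq_targets(features: torch.Tensor) -> torch.Tensor:
    """Map speech frames of shape (time, feat_dim) to discrete codebook indices."""
    projected = torch.nn.functional.normalize(features @ projection, dim=-1)
    # Nearest codebook entry by cosine similarity (equivalent to smallest L2
    # distance between the normalised vectors).
    return (projected @ codebook.T).argmax(dim=-1)

# Example: 100 frames of 80-dim log-mel features yield 100 discrete labels that
# the masked encoder is trained to predict with a cross-entropy loss.
frames = torch.randn(100, feat_dim)
labels = best_rq_targets(frames)
print(labels.shape)  # torch.Size([100])
```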

The model is then further prepared through a multi-objective supervised pre-training pipeline called MOST, which utilizes a combination of four datasets: YT-NTL-U, Pub-U, Web-NTL, and Pub-S. During this process, the BEST-RQ masked language model loss is combined with text-injection losses, including the supervised ASR loss and modality matching losses. These losses are optimized as a weighted sum during training.
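
The weighted-sum combination can be pictured as follows. The loss names and weights here are illustrative placeholders; the real objectives are computed on separate speech, text and paired batches rather than the dummy scalars used below.

```python
import torch

# Hypothetical per-objective losses for one training step (dummy scalars here).
losses = {
    "best_rq":             torch.tensor(2.3),  # masked prediction on unlabeled speech
    "modality_matching":   torch.tensor(0.7),  # align speech and text representations
    "supervised_asr":      torch.tensor(1.1),  # ASR loss on paired speech/transcripts
    "duration_model":      torch.tensor(0.4),  # duration modelling on paired data
    "text_reconstruction": torch.tensor(0.9),  # RNN-T decoder on unlabeled text
}

# Illustrative weights; the actual values are tuned hyperparameters.
weights = {"best_rq": 1.0, "modality_matching": 1.0, "supervised_asr": 1.0,
           "duration_model": 0.1, "text_reconstruction": 1.0}

total_loss = sum(weights[name] * loss for name, loss in losses.items())
print(float(total_loss))
```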

To prepare the model for downstream tasks, generic ASR models are trained with supervised ASR objectives using connectionist temporal classification (CTC) and Listen, Attend and Spell (LAS) decoders.
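
As a rough illustration of the CTC side of this fine-tuning, the snippet below computes a CTC loss over dummy encoder outputs with PyTorch's built-in nn.CTCLoss; the shapes and vocabulary size are arbitrary stand-ins for the USM encoder's actual outputs.

```python
import torch
import torch.nn as nn

# Dummy stand-ins for encoder outputs: 50 frames, batch of 2, 1,000-token vocabulary
# (index 0 reserved for the CTC blank symbol).
time_steps, batch, vocab = 50, 2, 1000
log_probs = torch.randn(time_steps, batch, vocab).log_softmax(dim=-1)

# Reference token sequences and their lengths.
targets = torch.randint(1, vocab, (batch, 12))
input_lengths = torch.full((batch,), time_steps, dtype=torch.long)
target_lengths = torch.full((batch,), 12, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(float(loss))
```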

Results

The team’s USM models have achieved excellent results in multilingual ASR and AST on various datasets across different domains. These include SpeechStew (ASR for a single language), CORAAL (ASR for African American Vernacular English), FLEURS (multilingual ASR), YT (long-form ASR for multiple languages), and CoVoST (AST from English to multiple languages). The team also developed an ASR model for YouTube captions that performs better than Whisper, a general ASR system trained on more transcribed data, in 18 selected languages.

They also found that BEST-RQ pre-training is an effective method for scaling speech representation learning to large datasets. When combined with text injection in MOST, it improves the quality of downstream speech tasks, achieving state-of-the-art performance on the FLEURS and CoVoST 2 benchmarks.

The MOST representations can adapt quickly to new domains by training lightweight residual adapter modules, which add only 2% more parameters while keeping the rest of the model frozen.
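
Residual adapters of this kind are typically small bottleneck networks inserted into each frozen encoder layer, with only the adapter weights updated for the new domain. The sketch below shows the standard down-project/non-linearity/up-project pattern with a residual connection; it is a generic adapter, not USM's exact module, and the dimensions are illustrative.

```python
import torch
import torch.nn as nn

class ResidualAdapter(nn.Module):
    """Bottleneck adapter: LayerNorm -> down-projection -> ReLU -> up-projection,
    added back to the input through a residual connection."""
    def __init__(self, d_model: int = 1024, bottleneck: int = 64):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(torch.relu(self.down(self.norm(x))))

# During adaptation the backbone is frozen and only adapters like this are trained,
# which is how the parameter overhead stays at a few percent of the full model.
adapter = ResidualAdapter()
hidden = torch.randn(4, 100, 1024)   # (batch, time, d_model) encoder activations
print(adapter(hidden).shape)         # torch.Size([4, 100, 1024])
```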

To improve the performance of ASR models on long-form speech inputs, the team also introduced chunk-wise attention, a scalable method that produces high-quality transcripts for long utterances in the YouTube evaluation sets when used with USM-CTC/LAS models.
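
The idea behind chunk-wise attention is to restrict self-attention so that each frame attends only within a fixed-size chunk of the sequence, which keeps computation and memory manageable for long utterances. The snippet below shows one simple way to realise that idea by reshaping the sequence into chunks and attending within each chunk; it is a generic illustration, not USM's exact formulation.

```python
import torch
import torch.nn as nn

def chunkwise_attention(x: torch.Tensor, attn: nn.MultiheadAttention,
                        chunk_size: int = 8) -> torch.Tensor:
    """Apply self-attention independently within fixed-size chunks.

    x: (batch, time, d_model); time is assumed to be a multiple of chunk_size
    for simplicity (real systems pad the final chunk).
    """
    batch, time, d_model = x.shape
    chunks = x.reshape(batch * time // chunk_size, chunk_size, d_model)
    out, _ = attn(chunks, chunks, chunks)          # attention confined to each chunk
    return out.reshape(batch, time, d_model)

attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
audio_states = torch.randn(2, 32, 64)                 # (batch, time, d_model)
print(chunkwise_attention(audio_states, attn).shape)  # torch.Size([2, 32, 64])
```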

Meanwhile, last year, NVIDIA’s NeMo framework was used to develop a Telugu automatic speech recognition (ASR) model, which won the IIIT-Hyderabad and Telugu ASR Challenge competitions with a low word error rate. The toolkit is open-source and supports multi-GPU and multi-node cluster training. The team preprocessed the data, removed errors, and trained the model for 160 epochs using NeMo, while fine-tuning pre-trained models on an NVIDIA DGX system for the open track. The results showed superiority over other ASR frameworks, with almost 2% fewer word errors than the runner-up.

Shritama Saha

Shritama (she/her) is a technology journalist at AIM who is passionate about exploring the influence of AI on different domains, including fashion, healthcare and banking.