
What Happened When Google Threw All Voice Data Into The Blender? Answer: SpeechStew

The SpeechStew model is trained on a combination of datasets to achieve state-of-the-art results on various speech recognition benchmarks.


Training large models is a massive challenge as it requires collecting and annotating vast amounts of data. It is particularly challenging in the case of speech recognition models.

To overcome this challenge, a team from Google Research and Google Brain has introduced an AI model, SpeechStew. The model is trained on a combination of datasets to achieve state-of-the-art results on various speech recognition benchmarks.

SpeechStew

The success of end-to-end speech recognition models has been directly linked to the abundance of training data and the use of large deep learning models. However, in noisy and low-resource datasets such as CHiME-6, these end-to-end methods struggle to achieve optimal results. 

Techniques such as multilingual training, multi-domain training, unsupervised pre-training, semi-supervised learning, and transfer learning are recommended to avoid overfitting and promote better generalisation.

The SpeechStew model applies multi-domain training and transfer learning to end-to-end speech recognition. The method introduces no additional hyperparameters and no external language model during inference. Broadly, it follows two main steps:

  • Combining all the available speech recognition data without any domain-dependent rebalancing or reweighting.
  • Training a single large neural network model containing 100 million to 1 billion parameters.

Here, the researchers combined all the available speech recognition data, amounting to a total of roughly 5,000 hours. The combined dataset included the AMI dataset (100 hours of meeting recordings), Switchboard (2,000 hours of telephone calls), broadcast news (50 hours of television news), Common Voice (crowdsourced recordings from Mozilla), TED-LIUM (450 hours of TED talks), and LibriSpeech (960 hours of audiobooks).
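A minimal PyTorch-style sketch of this two-step recipe might look as follows. The corpora, feature sizes, and model below are toy stand-ins chosen for illustration, not the authors' actual Conformer RNN-T pipeline; what the sketch shows is the structure: pool everything without reweighting, then train one model on the mixture.

import torch
from torch import nn
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Step 1: pool every corpus as-is, with no per-domain reweighting or rebalancing.
# toy_corpus stands in for loading AMI, Switchboard, broadcast news, Common Voice,
# TED-LIUM, LibriSpeech, etc.; here each item is a (feature vector, token id) pair.
def toy_corpus(num_utterances: int) -> TensorDataset:
    return TensorDataset(torch.randn(num_utterances, 80),
                         torch.randint(0, 100, (num_utterances,)))

mixed = ConcatDataset([toy_corpus(n) for n in (100, 2000, 50, 450, 960)])
loader = DataLoader(mixed, batch_size=32, shuffle=True)  # uniform sampling over the pooled "stew"

# Step 2: train a single model on the combined data. SpeechStew trains a large
# Conformer RNN-T (100 million to 1 billion parameters); a tiny classifier stands in here.
model = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 100))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for features, tokens in loader:
    loss = loss_fn(model(features), tokens)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()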

The SpeechStew model was then tested on several benchmarks and outperformed previous models. The researchers also observed that, with this technique, the model could be transferred to more challenging tasks.

To test the transfer learning capabilities, the team fine-tuned the SpeechStew model on the CHiME-6 dataset, which contains 40 hours of distant conversations recorded with microphones, and achieved good accuracy.

Performance

SpeechStew achieved state-of-the-art or near state-of-the-art results across tasks (AMI, Common Voice, TED-LIUM, WSJ, and Switchboard). The model also demonstrated strong transfer learning capabilities. The transfer learning results were encouraging since CHiME-6 is a particularly challenging task for an end-to-end speech recognition model, which generally suffers from overfitting issues. Training large models is expensive and impractical to do frequently.

The study showed that a user could fine-tune the pretrained model for only a few thousand gradient steps to achieve good performance, at a comparatively low cost.
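Continuing the toy sketch above, the transfer step amounts to reusing the multi-domain weights and taking a small number of gradient steps on the new domain. The 40-utterance stand-in for the 40-hour CHiME-6 set, the learning rate, and the step count below are illustrative assumptions, not the authors' settings.

# Fine-tuning the pretrained "stew" model on a small, noisy target domain.
# chime6_like is a toy stand-in for the 40-hour CHiME-6 fine-tuning set.
chime6_like = toy_corpus(40)
ft_loader = DataLoader(chime6_like, batch_size=8, shuffle=True)

# Start from the multi-domain weights rather than from scratch, and take only
# a few thousand gradient steps with a reduced learning rate.
ft_optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

steps, max_steps = 0, 3000
while steps < max_steps:
    for features, tokens in ft_loader:
        loss = loss_fn(model(features), tokens)
        loss.backward()
        ft_optimizer.step()
        ft_optimizer.zero_grad()
        steps += 1
        if steps >= max_steps:
            break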

“SpeechStew learns powerful transfer learning representations. We fine-tune SpeechStew on a noisy low resource speech dataset, CHiME-6. We achieve 38.9% WER without a language model, which compares to 38.6% WER to a strong HMM baseline with a language model,” the authors said.
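For context, the word error rate (WER) quoted above counts the word-level substitutions, deletions, and insertions needed to turn the model's transcript into the reference, divided by the number of reference words. A minimal implementation looks like this:

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard edit-distance dynamic program over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on a mat"))  # 1 error / 6 words ≈ 0.167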

Wrapping Up

The technique of mixing datasets to train neural networks is not new. However, this work differs from previous efforts in that the team could scale to much larger models.

SpeechStew leverages models of up to 1 billion parameters to yield strong empirical results. However, it has not yet been scaled to models as large as GPT-3, which contains 175 billion parameters.

“This simple technique of fine-tuning a general-purpose model to new downstream speech recognition tasks is simple, practical, yet shockingly effective. It is important to realise that the distribution of other sources of data does not perfectly match the dataset of interest. But as long as there is some common representation needed to solve both tasks, we can hope to achieve improved results by combining both datasets,” the team said in an interview.

Read the full paper here.

Shraddha Goled

I am a technology journalist with AIM. I write stories focused on the AI landscape in India and around the world with a special interest in analysing its long term impact on individuals and societies. Reach out to me at shraddha.goled@analyticsindiamag.com.