Training large models is a massive challenge as it requires collecting and annotating vast amounts of data. It is particularly challenging in the case of speech recognition models.
To overcome this challenge, a team from Google Research and Google Brain has introduced an AI model, SpeechStew. The model is trained on a combination of datasets to achieve state-of-the-art results on various speech recognition benchmarks.
SpeechStew
The success of end-to-end speech recognition models has been directly linked to the abundance of training data and the use of large deep learning models. However, in noisy and low-resource datasets such as CHiME-6, these end-to-end methods struggle to achieve optimal results.
Techniques such as multilingual training, multi-domain training, unsupervised pre-training, semi-supervised learning, and transfer learning are recommended to avoid overfitting and promote more generalisation.
The SpeechStew model applies multi-domain training and transfer learning to end-to-end speech recognition. This method does not introduce additional hyperparameters or require an external language model during inference. Broadly, it follows two main steps:
- Combining all the available speech recognition data without domain-dependent rebalancing or reweighting.
- Training a single large neural network model containing from 100 million up to a billion parameters.
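The first of these steps can be illustrated with a minimal sketch. It is not the authors' pipeline: the toy corpora, the `mix_corpora` and `batches` helpers, and the batch size are all hypothetical, chosen only to show what mixing without per-domain weights looks like. Every example goes into one pool, so larger corpora naturally contribute more examples to training.

```python
import random
from collections import Counter

# Hypothetical toy corpora standing in for real speech datasets.
# Each example is (utterance_id, domain); real examples would hold audio and text.
corpora = {
    "meetings": [(f"meet-{i}", "meetings") for i in range(2)],
    "telephone": [(f"tel-{i}", "telephone") for i in range(6)],
    "audiobooks": [(f"book-{i}", "audiobooks") for i in range(4)],
}


def mix_corpora(corpora):
    """Concatenate every corpus into one pool, with no per-domain weights."""
    pool = []
    for examples in corpora.values():
        pool.extend(examples)
    return pool


def batches(pool, batch_size, seed=0):
    """Shuffle the pool once and yield fixed-size batches (last may be smaller)."""
    rng = random.Random(seed)
    shuffled = list(pool)
    rng.shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]


pool = mix_corpora(corpora)
domain_counts = Counter(domain for _, domain in pool)
print(domain_counts)
for batch in batches(pool, batch_size=4):
    print([utt_id for utt_id, _ in batch])
```

A rebalanced alternative would attach a sampling weight to each domain. The point of the sketch is that no such weights, and therefore no extra hyperparameters, are needed here.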
Here, the researchers combined all the available speech recognition data, both labelled and unlabelled, amounting to a total of 5,000 hours. The combined dataset included the AMI dataset (with 100 hours of meeting recordings), Switchboard (2,000 hours of telephone calls), broadcast news (50 hours of television news), Common Voice crowdsourced from Mozilla, TED-Lium (450 hours of TED talks), and LibriSpeech (960 hours of audiobooks).
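To see what mixing without rebalancing implies, the short calculation below turns the hours quoted above into rough shares of the combined data. It is a back-of-the-envelope sketch that uses only this article's figures. It assumes each dataset's share is proportional to its hours, and because the article gives no figure for Common Voice, the hours not itemised above are grouped into a single remainder. The `mixing_shares` helper is hypothetical.

```python
# Hours as quoted in the article; Common Voice's size is not given there.
stated_hours = {
    "AMI": 100,
    "Switchboard": 2000,
    "Broadcast news": 50,
    "TED-Lium": 450,
    "LibriSpeech": 960,
}
total_hours = 5000  # approximate total quoted in the article


def mixing_shares(hours, total):
    """Return each dataset's fraction of the total, plus the unitemised remainder."""
    shares = {name: h / total for name, h in hours.items()}
    remainder = total - sum(hours.values())
    shares["Not itemised (incl. Common Voice)"] = remainder / total
    return shares


shares = mixing_shares(stated_hours, total_hours)
for name, share in shares.items():
    print(f"{name:34s} {share:6.1%}")
```

On these figures, telephone speech from Switchboard would make up about two-fifths of the mix, while broadcast news would account for about one percent.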
The SpeechStew model was then tested on several benchmarks, where it outperformed previous models. The researchers also observed that the technique carried over to more challenging tasks.
To test its transfer learning capabilities, the team fine-tuned the SpeechStew model on the CHiME-6 dataset, which contains 40 hours of distant conversations recorded on microphones, and achieved good accuracy.
Performance
SpeechStew achieved state-of-the-art or near state-of-the-art results across tasks (AMI, Common Voice, TED-Lium, WSJ, and Switchboard). The model also demonstrated strong transfer learning capabilities. The transfer learning results are encouraging because CHiME-6 is a particularly challenging task for end-to-end speech recognition models, which tend to overfit on it. Training large models is expensive and impractical to do frequently.
The study showed that a user can fine-tune a pretrained model for only a few thousand gradient steps and still achieve good performance, at low cost.
“SpeechStew learns powerful transfer learning representations. We fine-tune SpeechStew on a noisy low resource speech dataset, CHiME-6. We achieve 38.9% WER without a language model, which compares to 38.6% WER to a strong HMM baseline with a language model,” the authors said.
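The WER figures in the quote are word error rates: the number of word substitutions, deletions, and insertions needed to turn the model's transcript into the reference transcript, divided by the number of words in the reference. The sketch below is a plain illustration of that standard definition, not the evaluation code used in the paper. The `word_error_rate` function and the example sentences are made up for this purpose, and real evaluations usually normalise the text first (case, punctuation).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edits needed to turn the first j hypothesis words
    # into the reference words processed so far.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sit" for "sat") and one deletion ("the") over six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {wer:.1%}")
```

Lower is better, so the 38.9% reported for SpeechStew without a language model sits just behind the 38.6% of the HMM baseline that uses one.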
Wrapping Up
The technique of mixing datasets to train neural networks is not new. However, this work differs from previous efforts in that the team scaled to much larger models.
SpeechStew leverages models of up to 1 billion parameters to yield strong empirical results. However, that is still far smaller than models such as GPT-3, which contains 175 billion parameters.
“This simple technique of fine-tuning a general-purpose model to new downstream speech recognition tasks is simple, practical, yet shockingly effective. It is important to realise that the distribution of other sources of data does not perfectly match the dataset of interest. But as long as there is some common representation needed to solve both tasks, we can hope to achieve improved results by combining both datasets,” the team said in an interview.
Read the full paper here.