Researchers at Facebook AI recently introduced and open-sourced a new framework for self-supervised learning of representations from raw audio data known as wav2vec 2.0. The company claims that this framework can enable automatic speech recognition models with just 10 minutes of transcribed speech data.
Neural network models have gained much traction over the last few years due to their applications across various sectors. These models typically depend on vast quantities of labelled training data. However, labelled data is usually much harder to gather than unlabelled data.
Current speech recognition systems require thousands of hours of transcribed speech to reach acceptable performance. There are around 7,000 languages in the world and many more dialects, and for the vast majority of them transcribed speech is scarce or unavailable.
To mitigate such issues, the researchers open-sourced the wav2vec 2.0 framework, which is intended to make it more practical to develop Automatic Speech Recognition (ASR) for low-resource languages.
How wav2vec 2.0 Works
The successor to the wav2vec model, wav2vec 2.0 learns basic speech units in order to tackle a self-supervised task: it is trained to predict the correct speech unit for masked parts of the audio while learning what those speech units should be at the same time.
wav2vec 2.0 utilises a self-supervision method to push the boundaries by learning from unlabelled training data to enable speech recognition systems for many more languages, dialects, and domains.
In technical terms, wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantisation of the latent representations which are jointly learned.
Wav2vec 2.0 & Other Models
Similar to masked language modelling, this framework encodes the speech audio via a multi-layer convolutional neural network and then masks spans of the resulting latent speech representations.
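The span masking step can be illustrated with a small sketch. This is not the researchers' code: the function name is hypothetical, the defaults (a 6.5 per cent chance that a frame starts a span, and spans of 10 frames) follow the values reported in the wav2vec 2.0 paper rather than anything stated in this article, and drawing each start independently is a simplification of the paper's sampling.

```python
import random


def sample_span_mask(num_frames, start_prob=0.065, span_length=10, seed=None):
    """Return a list of booleans, True where a latent frame is masked.

    Each frame is independently chosen as a span start with probability
    `start_prob`; the `span_length` frames from that start are masked.
    Spans may overlap, so masked regions can be longer than one span.
    """
    rng = random.Random(seed)
    mask = [False] * num_frames
    for start in range(num_frames):
        if rng.random() < start_prob:
            for t in range(start, min(start + span_length, num_frames)):
                mask[t] = True
    return mask


# Illustrative usage: mask a 200-frame utterance (about 4 seconds of latents).
mask = sample_span_mask(num_frames=200, seed=0)
masked_fraction = sum(mask) / len(mask)
print(f"masked {masked_fraction:.0%} of frames")
```

Because spans overlap, the fraction of masked frames ends up well below the naive product of the start probability and the span length.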
The latent representations are then fed to a Transformer network to build contextualised representations, and the model is trained via a contrastive task where the true latent is to be distinguished from distractors.
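To make the contrastive task concrete, here is a minimal sketch of the loss for a single masked position. It assumes cosine similarity and a temperature of 0.1, as described in the wav2vec 2.0 paper; the function names and the toy vectors are illustrative and do not come from the released code.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(context, true_latent, distractors, temperature=0.1):
    """Negative log-probability of picking the true latent among distractors.

    `context` is the Transformer output at a masked position, `true_latent`
    is the target for that position, and `distractors` are targets taken
    from other masked positions.
    """
    candidates = [true_latent] + list(distractors)
    logits = [cosine_similarity(context, c) / temperature for c in candidates]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - logits[0]


# Toy example: the context vector points in the same direction as the true latent.
context = [1.0, 0.0]
true_latent = [2.0, 0.0]
distractors = [[0.0, 1.0], [-1.0, 0.0]]
loss = contrastive_loss(context, true_latent, distractors)
print(f"loss = {loss:.6f}")  # close to zero, since the true latent is the best match
```

The loss shrinks as the contextualised representation moves closer to the true latent and further from the distractors, which is the signal the model is trained on.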
Also, similar to the famous BERT (Bidirectional Encoder Representations from Transformers) model, the new wav2vec 2.0 model is trained by predicting speech units for masked parts of the audio.
One difficulty in applying a BERT-style approach to speech is that speech audio is a continuous signal that captures many aspects of the recording, with no precise segmentation into words or other units. wav2vec 2.0 tackles this issue by learning basic units that are 25ms long, which enables learning of high-level contextualised representations.
These units are then used to describe many different speech audio recordings, which makes wav2vec more robust. This helped the researchers build speech recognition systems that can outperform the best semi-supervised methods, even with 100 times less labelled training data.
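The 25ms figure can be reproduced from the shape of the convolutional encoder. The sketch below assumes the kernel widths and strides reported in the wav2vec 2.0 paper and 16 kHz input audio; neither is stated in this article, and the helper name is hypothetical.

```python
# Encoder configuration as reported in the wav2vec 2.0 paper (assumed here).
KERNEL_WIDTHS = (10, 3, 3, 3, 3, 2, 2)
STRIDES = (5, 2, 2, 2, 2, 2, 2)
SAMPLE_RATE_HZ = 16_000  # assumed input sample rate


def receptive_field_and_stride(kernel_widths, strides):
    """Return (receptive field, total stride) in input samples for stacked 1-D convolutions."""
    receptive_field = 1
    total_stride = 1
    for kernel, stride in zip(kernel_widths, strides):
        # Each layer widens the field by (kernel - 1) steps of the current stride.
        receptive_field += (kernel - 1) * total_stride
        total_stride *= stride
    return receptive_field, total_stride


rf, hop = receptive_field_and_stride(KERNEL_WIDTHS, STRIDES)
unit_ms = 1000 * rf / SAMPLE_RATE_HZ
hop_ms = 1000 * hop / SAMPLE_RATE_HZ
print(rf, unit_ms)   # 400 25.0
print(hop, hop_ms)   # 320 20.0
```

Under these assumptions each latent covers 400 samples, or 25ms of audio, and a new latent is produced every 20ms, so neighbouring units overlap slightly.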
According to a blog post, with just 10 minutes of transcribed speech along with 53K hours of unlabelled speech, this new model enables speech recognition models at a word error rate (WER) of 8.6 per cent on noisy speech and 5.2 per cent on clean speech on the standard LibriSpeech benchmark.
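For readers unfamiliar with the metric, word error rate is the number of word substitutions, deletions and insertions needed to turn the system's output into the reference transcript, divided by the number of words in the reference. A minimal sketch follows; the function name and example sentences are illustrative, and real evaluations normally also normalise text before scoring.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between the two strings, divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER = {wer:.1%}")  # WER = 33.3%
```

A WER of 5.2 per cent therefore means roughly one word-level error for every 19 words of reference transcript.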
The researchers showed that speech recognition models can be built with very small amounts of annotated data and still reach very good accuracy. According to the researchers, this model has opened the door for speech recognition models in many more languages, dialects and domains that previously required large amounts of transcribed audio data to provide acceptable accuracy.
The developers stated in a blog post, “Wav2vec 2.0 is part of our vision for machine learning models that rely less on labelled data, thanks to self-supervised learning.” They added, “We hope that the algorithm will enable improved speech technology for many more languages, dialects, and domains, and lead to improvements for existing systems.”
The researchers have made the code and pre-trained models available on GitHub.
Read the paper here.