
Facebook Is Giving Away This Speech Recognition Model For Free

Researchers at Facebook AI recently introduced and open-sourced a new framework for self-supervised learning of representations from raw audio data known as wav2vec 2.0. The company claims that this framework can enable automatic speech recognition models with just 10 minutes of transcribed speech data.

Neural network models have gained much traction over the last few years due to their applications across various sectors. These models rely on vast quantities of labelled training data, yet labelled data is usually far harder to gather than unlabelled data.



Current speech recognition systems require thousands of hours of transcribed speech to reach acceptable performance. There are around 7,000 languages in the world, and many more dialects; for the vast majority of them, transcribed speech is scarce or simply unavailable.

To mitigate such issues, the researchers open-sourced the wav2vec 2.0 framework, which can accelerate the development of Automatic Speech Recognition (ASR) for low-resource languages.

How wav2vec 2.0 Works

The successor to the original wav2vec model, wav2vec 2.0 learns a set of basic speech units and is trained to predict the correct speech unit for masked parts of the audio, learning the units and the prediction task jointly.

wav2vec 2.0 uses self-supervision to learn from unlabelled training data, pushing the boundaries of what speech recognition systems can do for many more languages, dialects, and domains.

In technical terms, wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantisation of the latent representations, which are learned jointly.
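
For illustration, the span-masking step can be sketched in a few lines of NumPy. The span length of 10 frames and start probability of 0.065 are the values reported in the paper; the sketch below is a simplification, not fairseq's actual implementation:

```python
import numpy as np

def span_mask(num_frames, span_len=10, start_prob=0.065, seed=0):
    """Sample a boolean mask over latent frames: each frame is chosen as a
    span start with probability `start_prob`, and that frame plus the next
    `span_len - 1` frames are masked (spans may overlap)."""
    rng = np.random.default_rng(seed)
    starts = rng.random(num_frames) < start_prob
    mask = np.zeros(num_frames, dtype=bool)
    for s in np.flatnonzero(starts):
        mask[s:s + span_len] = True
    return mask

mask = span_mask(500)
print(f"{mask.sum()} of {mask.size} latent frames masked")
```

Because spans overlap, roughly half of all frames end up masked in practice, which is what forces the model to infer speech units from surrounding context.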

Wav2vec 2.0 & Other Models

Similar to masked language modelling, this framework encodes the speech audio via a multi-layer convolutional neural network and then masks spans of the resulting latent speech representations. 

The latent representations are then fed to a Transformer network to build contextualised representations, and the model is trained via a contrastive task where the true latent is to be distinguished from distractors. 
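
The contrastive objective can be sketched as an InfoNCE-style loss in NumPy. This is a hedged, single-timestep illustration under the assumption of cosine-similarity scoring described in the paper, not the fairseq code itself:

```python
import numpy as np

def contrastive_loss(context, true_latent, distractors, temperature=0.1):
    """InfoNCE-style loss for one masked timestep: the Transformer's context
    vector should be more (cosine-)similar to the true quantised latent than
    to any of the K distractors sampled from other timesteps."""
    cands = np.vstack([true_latent, distractors])            # (1 + K, D)
    cos = cands @ context / (
        np.linalg.norm(cands, axis=1) * np.linalg.norm(context) + 1e-8)
    logits = cos / temperature
    m = logits.max()
    log_probs = logits - m - np.log(np.exp(logits - m).sum())
    return -log_probs[0]  # negative log-probability of the true latent

# Near-zero loss when the context matches the true latent and not the distractors:
loss = contrastive_loss(np.ones(8), np.ones(8), -np.ones((5, 8)))
```

Minimising this loss pushes the context representation towards the correct speech unit and away from the distractors, which is exactly the "distinguish the true latent" task described above.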

Also, similar to the famous BERT (Bidirectional Encoder Representations from Transformers) model, the new wav2vec 2.0 model is trained by predicting speech units for masked parts of the audio. 

Unlike the text BERT operates on, speech audio is a continuous signal that captures many aspects of the recording with no precise segmentation into words or other units. Wav2vec 2.0 tackles this issue by learning basic units that are 25ms long, which enable learning of high-level contextualised representations.
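
As a back-of-the-envelope illustration, the number of such latent units per clip follows from the convolutional encoder's geometry. The figures below (a 20ms hop and a 25ms window at 16kHz, i.e. a stride of 320 samples and a receptive field of 400 samples) are taken from the paper:

```python
SAMPLE_RATE = 16_000  # Hz
STRIDE = 320          # samples between latent frames (20 ms hop)
RECEPTIVE = 400       # samples each latent frame covers (25 ms window)

def num_latent_frames(num_samples):
    """How many ~25 ms latent speech units the convolutional encoder
    produces for a raw waveform of `num_samples` samples."""
    return max(0, (num_samples - RECEPTIVE) // STRIDE + 1)

print(num_latent_frames(SAMPLE_RATE))  # frames per 1 s of audio -> 49
```

So one second of audio yields roughly 49 latent frames, each a candidate "speech unit" for the masking and contrastive tasks described above.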

These units are then used to describe many different speech audio recordings and make wav2vec more robust. This helped the researchers build speech recognition systems that can outperform the best semi-supervised methods, even with 100x less labelled training data.

Wrapping Up

According to a blog post, with just 10 minutes of transcribed speech along with 53K hours of unlabelled speech, the new model achieves a word error rate (WER) of 8.6 per cent on noisy speech and 5.2 per cent on clean speech on the standard LibriSpeech benchmark.
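
WER counts word-level substitutions, insertions and deletions against a reference transcript. A minimal, self-contained implementation of the standard metric (not code from the wav2vec release):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance (substitutions +
    insertions + deletions) divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    d = list(range(len(hyp) + 1))  # distances for the previous row
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                   d[j - 1] + 1,     # insertion
                                   prev + (r != h))  # substitution / match
    return d[-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on a mat"))  # 1 error / 6 words
```

A WER of 5.2 per cent therefore means roughly one word in twenty is transcribed incorrectly on clean LibriSpeech audio.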

In this research, the researchers showed that speech recognition models can be built with very small amounts of annotated data while still achieving good accuracy. According to the researchers, this opens the door to speech recognition in many more languages, dialects and domains that previously required large amounts of transcribed audio to reach acceptable accuracy.

In the blog post, the researchers stated, “Wav2vec 2.0 is part of our vision for machine learning models that rely less on labelled data, thanks to self-supervised learning.” They added, “We hope that the algorithm will enable improved speech technology for many more languages, dialects, and domains, and lead to improvements for existing systems.”

The researchers have made the code and pre-trained models available on GitHub.

Read the paper here.


Ambika Choudhury
A Technical Journalist who loves writing about Machine Learning and Artificial Intelligence. A lover of music, writing and learning something out of the box.
