“Machine learning models need to be trained with large amounts of transcribed speech audio. This process is effective but expensive.”
Decades ago, asking for basic directions or information in an unfamiliar language would have been an arduous task. The advent of solutions such as Google Translate alleviated that problem to an extent. But is this enough? India alone, for example, is home to 23 official languages and thousands of unofficial ones, yet Google Translate supports just 11 of them. Other speech recognition technologies may support even fewer. Moreover, languages such as Basque and Swahili are far likelier to have limited AI speech recognition capabilities than Hindi, English or Mandarin.
Hence, the paths opened up by speech recognition technology are available to only a small fraction of the countless languages spoken all over the world. This is because most AI for speech recognition belongs to a category called supervised learning: the machine learning systems behind it need to be trained with large amounts of transcribed speech audio. This process is effective but expensive, and it must be repeated for each language. Such data is simply not available at that scale, if at all, for most languages and dialects spoken on Earth.
Facebook finds a way
In a recent blog post, Facebook revealed its new AI-based speech recognition technology, wav2vec-Unsupervised (or wav2vec-U), which aims to solve the problem of transcribing such languages. The method lets developers build speech recognition systems that do not require transcribed data.
The ML algorithm still requires some form of training: wav2vec-U is trained purely on recorded speech audio and unpaired text. The method first learns the structure of the target language’s speech from unlabelled audio. Using wav2vec 2.0, Facebook’s self-supervised speech representation model, together with a k-means clustering algorithm, wav2vec-U segments the voice recording into speech units loosely corresponding to individual sounds. For instance, the word “cat” corresponds to the sounds /K/, /AE/ and /T/. This segmentation allows the model to capture the structure of the language’s speech.
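To make the segmentation step concrete, here is a minimal sketch of how k-means clustering can turn a stream of frame-level audio representations into discrete speech units. The 2-D vectors below are fabricated stand-ins for wav2vec 2.0 embeddings, and the phoneme labels in the comments are purely illustrative; this is not Facebook’s actual pipeline.

```python
import numpy as np

# Toy stand-ins for wav2vec 2.0 frame representations: three well-separated
# blobs of 2-D vectors, each blob playing the role of one sound.
rng = np.random.default_rng(0)
frames = np.vstack([
    rng.normal(loc=[0, 0], scale=0.1, size=(20, 2)),   # frames of, say, /K/
    rng.normal(loc=[3, 3], scale=0.1, size=(20, 2)),   # frames of /AE/
    rng.normal(loc=[0, 6], scale=0.1, size=(20, 2)),   # frames of /T/
])

def kmeans(points, k, iters=20):
    """Minimal Lloyd's k-means with deterministic farthest-point init."""
    centroids = [points[0]]
    for _ in range(k - 1):
        # Next centroid: the point farthest from all chosen centroids.
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centroids], axis=0)
        centroids.append(points[d.argmax()])
    centroids = np.array(centroids)
    for _ in range(iters):
        labels = np.linalg.norm(
            points[:, None] - centroids[None], axis=-1).argmin(axis=1)
        for j in range(k):
            centroids[j] = points[labels == j].mean(axis=0)
    return centroids, labels

centroids, labels = kmeans(frames, k=3)

# Merge consecutive frames with the same cluster id into one segment,
# yielding variable-length "speech units" loosely analogous to sounds.
segments = [int(labels[0])]
for lab in labels[1:]:
    if lab != segments[-1]:
        segments.append(int(lab))

print(len(set(labels.tolist())), len(segments))  # → 3 3
```

In the real system the clusters are learned over high-dimensional self-supervised representations, but the idea is the same: nearby frames collapse into a shared discrete unit, exposing the sound structure of the language.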
To recognise the words in an audio recording, wav2vec-U uses a generative adversarial network (GAN) consisting of a generator and a discriminator network. The generator takes each audio segment, embedded in self-supervised representations, and predicts a phoneme, i.e. a sound unit, corresponding to a sound in the language. The discriminator then decides whether the predicted phoneme sequence looks realistic. According to Facebook, the initial transcriptions are extremely poor but improve over time via the discriminator’s feedback.
The discriminator, itself a neural network, is fed real text from various sources that has been phonemised. This lets it learn to distinguish the generator’s speech recognition output from genuine text. Early on, the generator might transcribe an utterance phonetically as the nonsensical “Tow are uu”; using the discriminator’s knowledge of real-world phonemised text, the system gradually refines such output into legible transcriptions.
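The adversarial setup described above can be sketched schematically. In this toy version, which is an assumption-laden simplification rather than Facebook’s implementation, the generator is a single linear layer mapping segment embeddings to a phoneme distribution, and the discriminator is replaced by a bigram plausibility score built from phonemised “real” text (the actual discriminator is a neural network, but its role is the same). The phoneme inventory and embedding size are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

PHONEMES = ["K", "AE", "T", "S", "IH"]  # toy inventory, not the real one

# Generator: a linear map from a 4-D segment embedding (standing in for the
# self-supervised representation) to logits over the phoneme inventory.
W = rng.normal(size=(4, len(PHONEMES)))

def generator(segment_embeddings):
    logits = segment_embeddings @ W
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = e / e.sum(axis=1, keepdims=True)      # softmax over phonemes
    return probs.argmax(axis=1)                   # predicted phoneme ids

# "Discriminator": scores how plausible a phoneme sequence looks, using
# bigrams collected from phonemised real text.
real_text = [["K", "AE", "T"], ["S", "IH", "T"]]  # phonemised real words
bigrams = set()
for word in real_text:
    bigrams.update(zip(word, word[1:]))

def discriminator(phoneme_seq):
    pairs = list(zip(phoneme_seq, phoneme_seq[1:]))
    if not pairs:
        return 0.0
    return sum(p in bigrams for p in pairs) / len(pairs)  # in [0, 1]

segments = rng.normal(size=(3, 4))                # three fake segment embeddings
pred = [PHONEMES[i] for i in generator(segments)]
score = discriminator(pred)
print(pred, score)
```

In a real GAN the plausibility score would be backpropagated to update the generator’s weights, so that implausible sequences like “Tow are uu” become progressively less likely; here the single forward pass is only meant to show which component plays which role.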
Facebook has run several evaluations of wav2vec-U. The first was on the TIMIT Acoustic-Phonetic Continuous Speech Corpus, a standard dataset used to evaluate automatic speech recognition systems. There, Facebook reports that its model reduced error rates by 57 per cent compared with the next best unsupervised method. Facebook also tested wav2vec-U against supervised models on the broader Librispeech benchmark and found it as accurate as effective supervised models from only a few years ago, even though it used no labelled training data.
Source: Facebook AI
Finally, Facebook also tried the method on languages other than English, since unsupervised speech recognition would be most impactful for languages with little labelled data. The results showed that wav2vec-U remains effective even for languages such as Kyrgyz and Tatar, which have limited data resources.
Wav2vec-U is the result of years of Facebook AI research on speech recognition and unsupervised machine translation. Facebook believes the system will allow more people to benefit from speech technology. The company admits, however, that much more research is needed to address concerns such as possible bias in the system, especially regarding racial and gender stereotypes. Facebook states that it has not yet investigated this in the model, though it notes that removing the transcription step might itself help avoid biases introduced through data labelling. Facebook is also releasing the code for wav2vec-U as open source, allowing more developers to build speech recognition systems without the need for labelled data.