Facebook’s First AI Model Based On Audio


Language models like BERT and GPT-3 took the AI world by storm and have been the talk of the data science town ever since. These models are a great innovation in text-based modelling, yet, to the best of popular knowledge, no comparable models had been built solely on audio. Well, until now. Enter Facebook AI and its GSLM.

Facebook AI claims to have created the next groundbreaking language model – the Generative Spoken Language Model (GSLM) – to overcome this impediment. According to Facebook AI, this is the first high-performance NLP model that is independent of text. GSLM works directly on raw audio signals, going from speech input to speech output without needing labels or transcripts, broadening the horizons for textless NLP in various oral languages.

The Model Architecture

The GSLM model consists of three components:


  • An encoder to convert speech into units based on the sound’s frequency.
  • An autoregressive unit-based model to predict the next unit based on its history.
  • A decoder to convert units into speech. 
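As a toy illustration, the three components above can be sketched in Python. The function names, the one-dimensional "audio frames", the codebook, and the bigram unit model are all invented for illustration and bear no relation to Facebook AI's actual implementation:

```python
# Toy sketch of the three GSLM stages; all names and data are illustrative.

def encode(frames, codebook):
    """Encoder: map each (toy, 1-D) audio frame to its nearest discrete unit."""
    return [min(codebook, key=lambda u: abs(codebook[u] - f)) for f in frames]

def predict_next_unit(history, bigram_counts):
    """Unit LM: predict the next unit as the most frequent successor of the
    last unit seen (a crude stand-in for the real autoregressive model)."""
    successors = bigram_counts.get(history[-1], {})
    return max(successors, key=successors.get) if successors else history[-1]

def decode(units, unit_to_wave):
    """Decoder: map each unit back to a short waveform snippet."""
    return [sample for u in units for sample in unit_to_wave[u]]
```

Chaining `encode`, repeated calls to `predict_next_unit`, and `decode` gives the speech-in, speech-out loop the article describes, with discrete units as the only intermediate representation.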

Behind the GSLM model were AI researchers with expertise in signal processing, speech processing, NLP, and psycholinguistics. The team believes that, given the model’s access to the expressions of oral language, it can open up possibilities for language models in all languages, as well as “incorporating nuances and intonations; encode irony, anger, and uncertainty; and use vocalisations like laughter, yawning, and mouth clicks”. The model could also be trained on audio-first experiences like podcasts and radio shows without first training an ASR system.

The Baseline Model

GSLM evaluates the baseline model on two end-to-end tasks: discrete resynthesis and speech generation. The team tested three encoders – CPC, wav2vec 2.0, and HuBERT – alongside a standard Transformer for unit language modelling and Tacotron 2 as the decoder, with k-means clustering and deduplication applied in between. These components were trained using self-supervision from raw audio, while the language model, specifically, was trained on pseudo-text derived from that raw audio.
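The k-means-plus-deduplication step can be pictured with a minimal sketch; the centroids and features below are invented one-dimensional values, not real encoder outputs:

```python
def features_to_pseudo_text(features, centroids):
    """Assign each continuous frame feature to its nearest k-means centroid,
    then collapse consecutive repeats into a deduplicated unit sequence."""
    ids = [min(range(len(centroids)), key=lambda i: (f - centroids[i]) ** 2)
           for f in features]
    deduped = []
    for u in ids:
        if not deduped or deduped[-1] != u:
            deduped.append(u)  # e.g. 12 12 12 7 7 -> 12 7
    return deduped
```

The resulting unit sequence is the "pseudo-text" the unit language model is trained on, playing the role that tokenised words play for a text LM.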

Model Architecture: FB AI

Model Temperature

The team leveraged a pretrained ASR system to convert the generated audio back to text. They used the phone error rate (PER) to measure the intelligibility of the resynthesised audio, and an area under the curve (AUC) metric to assess the linguistic quality and diversity of the conditional audio. To arrive at the AUC, sentences are sampled across a range of ‘temperatures’, a measure of the language model’s degree of invention: a lower temperature points to a more rigid model, while a higher temperature yields a more variable one.
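PER here is the standard phone error rate: the Levenshtein edit distance between the reference phone sequence and the ASR-transcribed one, normalised by the reference length. A minimal sketch (the phone labels below are made up):

```python
def phone_error_rate(ref, hyp):
    """PER = edit distance between reference and hypothesis phone
    sequences, divided by the reference length."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[n][m] / n
```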

A low temperature causes repetitive sentences, a medium temperature makes sentences locally coherent, and a high temperature makes them incoherent – sometimes not even composed of actual words.
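Temperature here is the usual sampling temperature: logits are divided by T before the softmax, so a small T sharpens the distribution (near-greedy, repetitive output) while a large T flattens it (diverse but incoherent output). A sketch with made-up logits:

```python
import math
import random

def sample_with_temperature(logits, temperature, rng=None):
    """Softmax-sample one unit index after scaling logits by 1/temperature."""
    rng = rng or random.Random()
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    r, acc = rng.random(), 0.0
    for unit, e in enumerate(exps):
        acc += e / total
        if r <= acc:
            return unit
    return len(exps) - 1
```

With logits [2.0, 0.5, 0.1], a temperature of 0.01 makes the top unit win essentially every draw, while a temperature of 10 samples the three units almost uniformly.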

For instance, a generated continuation example by FB AI prompts the model with the sentence “This reality begins to explain the dark pow[..]”.

At a medium temperature, HuBERT 100 can complete the word pow[..] to POWER and continue the sentence using similarly thematic, dark-inspired words like BLACKNESS.



The team found that the number of discrete units used by the quantiser is important: a higher number yields better outcomes, albeit at higher bitrates. This trend continues at the linguistic level, but only up to a point, beyond which adding units becomes detrimental. The outcome also differs between encoders, with HuBERT providing the best overall results. Lastly, the team discovered that automatic generation metrics correlated positively with human judgments, and that these metrics could be predicted by faster-to-compute zero-shot metrics from the Zero Resource Speech Benchmark.

Image: FB AI

The team obtained a latent representation by training a variational autoencoder that leverages vector quantisation. Dubbed VQ-VAE, the system is fed pitch information alongside simplified text-to-speech inputs.
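The vector-quantisation step at the heart of a VQ-VAE can be sketched as a nearest-neighbour lookup over a learned codebook; the codebook values below are invented, and a real VQ-VAE additionally needs a trick such as the straight-through estimator to pass gradients through this discrete step:

```python
def quantise(latent, codebook):
    """Replace a continuous latent vector with its nearest codebook entry,
    returning the discrete index and the quantised vector."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    idx = min(range(len(codebook)), key=lambda i: sq_dist(latent, codebook[i]))
    return idx, codebook[idx]
```

The discrete indices produced this way are what downstream components consume, which is what makes the representation compact and text-like.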

This architecture scored well on the LJSpeech and VCTK datasets in both objective metrics and subjective evaluation scores. To further improve the model, the team incorporated prosody into the language model and, to better its performance, added extra channels to the GSLM. As a result, GSLM can generate multiple realistic prosodic inpaintings for the same prompt, with novel content and prosody congruent with the prompt’s expressive style.

Find more demonstrations here: https://speechbot.github.io/pgslm

Avi Gopani
Avi Gopani is a technology journalist who seeks to analyse industry trends and developments from an interdisciplinary perspective at Analytics India Magazine. Her articles chronicle cultural, political and social stories that are curated with a focus on the evolving technologies of artificial intelligence and data analytics.
