Language models like BERT and GPT-3 took the AI world by storm and have been the talk of the data science town ever since. These models are a great innovation in text-based modelling, yet, to the best of popular knowledge, no language model had been built on audio alone. Well, until now. Enter Facebook AI and its GSLM.
Facebook AI claims to have created the next groundbreaking language model, the Generative Spoken Language Model (GSLM), to overcome this impediment. According to FB AI, this is the first high-performance NLP model that is independent of text. GSLM works directly from raw audio signals, taking speech as input and producing speech as output without needing labels or text, which broadens the horizons for textless NLP in various spoken languages.
The Model Architecture
The GSLM model consists of three components:
- An encoder to convert speech into discrete units that represent frequently recurring sounds in spoken language.
- An autoregressive unit-based model to predict the next unit based on its history.
- A decoder to convert units into speech.
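The three stages above can be pictured as a simple pipeline. The sketch below is a toy illustration only, not Facebook AI's code: `encode`, `train_unit_lm`, `continue_units`, and `decode` are hypothetical stand-ins, the "speech" is a short list of numbers, and the unit model is a bigram counter rather than a Transformer.

```python
from collections import Counter, defaultdict

def encode(signal, num_units=4):
    """Toy encoder (assumption): map each sample in [0, 1) to a discrete unit id."""
    return [min(int(x * num_units), num_units - 1) for x in signal]

def train_unit_lm(units):
    """Toy autoregressive model: count which unit follows which (a bigram table)."""
    table = defaultdict(Counter)
    for prev, nxt in zip(units, units[1:]):
        table[prev][nxt] += 1
    return table

def continue_units(table, prompt, steps):
    """Greedily extend the prompt one unit at a time, based on the last unit seen."""
    out = list(prompt)
    for _ in range(steps):
        followers = table.get(out[-1])
        if not followers:
            break  # no known continuation for this unit
        out.append(followers.most_common(1)[0][0])
    return out

def decode(units, num_units=4):
    """Toy decoder (assumption): map each unit back to the centre of its bin."""
    return [(u + 0.5) / num_units for u in units]

signal = [0.1, 0.3, 0.6, 0.9, 0.1, 0.3, 0.6, 0.9]
units = encode(signal)                               # speech -> units
lm = train_unit_lm(units)                            # learn unit-to-unit transitions
generated = continue_units(lm, units[:2], steps=4)   # predict further units
audio = decode(generated)                            # units -> "speech"
```

No text appears anywhere in the loop: the model only ever sees unit ids, which is the point of the textless setup.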
Behind the GSLM model were AI researchers with expertise in signal processing, speech processing, NLP, and psycholinguistics. The team believes that because the model has access to the expressions of oral language, it can open up possibilities for language models in all languages as well as “incorporating nuances and intonations; encode irony, anger, and uncertainty; and use vocalisations like laughter, yawning, and mouth clicks”. The model could also be trained on audio-first content like podcasts and radio shows without training an automatic speech recognition (ASR) system.
The Baseline Model
The team evaluated the baseline model on two end-to-end tasks: discrete resynthesis and speech generation. They tested three encoders (CPC, wav2vec 2.0, and HuBERT), each followed by k-means clustering and deduplication, along with a standard Transformer for language modelling and Tacotron 2 as the decoder. The encoders were trained using self-supervision from raw audio, while the language model, specifically, was trained on the pseudo-text derived from the raw audio.
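To make the clustering and deduplication step concrete, here is a minimal sketch in plain Python. It assumes the k-means centroids have already been learned and shows only the assignment of each feature frame to its nearest centroid, followed by collapsing consecutive repeats. The centroids, frames, and function names are made up for illustration.

```python
from itertools import groupby

def nearest_centroid(frame, centroids):
    """Index of the centroid closest to a feature frame (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((f - c) ** 2 for f, c in zip(frame, centroids[k])),
    )

def quantise(frames, centroids):
    """Turn a sequence of feature frames into a sequence of discrete unit ids."""
    return [nearest_centroid(frame, centroids) for frame in frames]

def deduplicate(units):
    """Collapse runs of the same unit into a single occurrence."""
    return [unit for unit, _ in groupby(units)]

# Hypothetical centroids; in practice they would come from fitting k-means
# on encoder features.
centroids = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
frames = [(0.1, 0.0), (0.0, 0.1), (0.9, 1.1), (1.0, 0.9), (0.1, 0.9), (0.1, 0.0)]

units = quantise(frames, centroids)   # [0, 0, 1, 1, 2, 0]
pseudo_text = deduplicate(units)      # [0, 1, 2, 0]
```

The deduplicated sequence plays the role of the pseudo-text on which the language model is trained.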
Figure: GSLM model architecture. Source: FB AI
The team used a pretrained ASR system to convert the generated audio back to text. This let them measure the intelligibility of the re-synthesised audio using the phoneme error rate (PER), and the linguistic quality and diversity of the conditionally generated audio using an area under the curve (AUC) metric. To arrive at the AUC, sentences are sampled across a range of ‘temperatures’, a measure of the language model’s degree of inventiveness. A lower temperature makes the model more rigid, while a higher temperature makes it more variable.
A lower temperature causes repetitive sentences, while a medium temperature makes sentences locally coherent. Lastly, the sentences become incoherent with a high temperature and sometimes aren’t even composed of actual words.
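The effect of temperature can be sketched with a standard temperature-scaled softmax. This is the usual textbook formulation, not necessarily the exact sampling code used for GSLM, and the logits below are invented for illustration.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; low temperature sharpens, high temperature flattens."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_unit(logits, temperature, rng):
    """Draw one unit id from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]                       # hypothetical scores for three units
cold = softmax_with_temperature(logits, 0.3)   # mass concentrates on the top unit
hot = softmax_with_temperature(logits, 3.0)    # mass spreads across all units

rng = random.Random(0)
samples = [sample_unit(logits, 1.0, rng) for _ in range(5)]
```

With the cold setting, the same top-scoring unit is chosen almost every time, which matches the repetitive output described above. With the hot setting, unlikely units are drawn often enough that the output drifts into incoherence.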
For instance, a continuation example from FB AI starts from the prompt “This reality begins to explain the dark pow[..]”.
At a medium temperature, HuBERT 100 completes the truncated word pow[..] as POWER and continues the sentence with thematically related words, such as BLACKNESS, inspired by “dark”.
The continuation of the prompt reads, “THIS REALITY BEGINS TO EXPLAIN THE DARK POWER OF THE MAGICAL BLACKNESS AND IN THE MIDST OF IT IS MAGICAL AS A SINGLE BLACKNESS OF THE PAIN.”
The team found that the number of discrete units used by the quantiser matters: a higher number yields better outcomes at the acoustic level, at the cost of higher bit rates. A similar trend holds at the linguistic level, but only up to a point, after which using too many units becomes detrimental. The outcome also differs between encoders, with HuBERT providing the best overall results. Lastly, the team found that automatic generation metrics correlated well with human judgments, and that these metrics were in turn predicted by faster-to-compute zero-shot metrics from the Zero Resource Speech Benchmark.
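One way to see why more units push the rate up, reading the rate as a bit rate: each unit drawn from a vocabulary of K symbols can carry at most log2(K) bits. The sketch below computes that upper bound. The figure of 50 units per second is a hypothetical frame rate chosen for illustration, and the real bit rate would be lower because units are not equally likely.

```python
import math

def max_bitrate(num_units, units_per_second):
    """Upper bound in bits per second, assuming every unit is equally likely."""
    return math.log2(num_units) * units_per_second

# Hypothetical setting: 50 units per second, with growing unit vocabularies.
for num_units in (50, 100, 200):
    print(num_units, round(max_bitrate(num_units, 50.0), 1))
```

The bound grows only logarithmically with the number of units, so doubling the vocabulary adds one bit per unit rather than doubling the rate.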
The team acquired a latent representation by training a variational autoencoder with vector quantisation. This system, known as a VQ-VAE, is fed pitch information and works together with a simplified text-to-speech system.
This architecture was evaluated on the LJSpeech and VCTK datasets using both objective metrics and subjective evaluation scores. To further improve the model, the team incorporated prosody into the language model by adding extra channels to the GSLM. As a result, GSLM can generate multiple realistic prosodic inpaintings for the same prompt, as well as novel content and prosody that are congruent with the prompt’s expressive style.
Find more demonstrations here: https://speechbot.github.io/pgslm