Facebook recently introduced a generative spoken language model (GSLM) as part of an approach it calls textless NLP. It is one of the first high-performance NLP models to break free of the dependence on text, unlike language models such as RoBERTa, BERT, and GPT-3, which are restricted to languages with very large text datasets.
GSLM uses the latest breakthroughs in representation learning, allowing it to work directly from raw audio signals, without any text or labels. According to Facebook, this opens the door to a new era of textless NLP applications for potentially every language spoken on Earth, even those with limited or no significant text datasets. In addition, it enables the development of NLP models that incorporate the full range of expressivity of oral language.
Check out the code and pretrained models related to textless NLP on GitHub.
How is textless NLP different?
In the past, connecting an NLP application to speech inputs meant that researchers first had to train an automatic speech recognition (ASR) system. This is a resource-intensive step that also introduces errors, encodes casual linguistic interactions poorly, and is available for just a handful of languages. With textless NLP, the researchers aim to make ASR unnecessary and to work in an end-to-end fashion, from speech input to speech output.
The baseline GSLM consists of three parts:
- An encoder that converts ‘speech’ into ‘discrete units’ that represent frequently recurring sounds in spoken language (S2u)
- An autoregressive, unit-based language model that is trained to predict the next discrete unit based on what it has seen before (pseudo-text)
- A decoder that converts units into speech (u2S)
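To make the three-part structure concrete, here is a minimal toy sketch of how the pieces fit together. It is not Facebook's implementation: the nearest-centroid encoder, the bigram count model, and the centroid-lookup decoder are simple stand-ins for the learned encoder, the transformer language model, and the speech synthesizer, and every name in it (`encode`, `BigramUnitLM`, `decode`) is hypothetical.

```python
import numpy as np

def encode(frames, centroids):
    """S2u stand-in: map each feature frame to the index of its nearest centroid."""
    # distances has shape (num_frames, num_units)
    distances = np.linalg.norm(frames[:, None, :] - centroids[None, :, :], axis=-1)
    return distances.argmin(axis=1).tolist()

class BigramUnitLM:
    """Unit language model stand-in: predicts the next unit from the previous one."""

    def __init__(self, num_units):
        # Start every count at 1 (add-one smoothing) so no transition is impossible.
        self.counts = np.ones((num_units, num_units))

    def fit(self, unit_sequences):
        for sequence in unit_sequences:
            for previous, following in zip(sequence, sequence[1:]):
                self.counts[previous, following] += 1

    def generate(self, prompt, steps):
        units = list(prompt)
        for _ in range(steps):
            # Greedy choice: the most frequent successor of the last unit.
            units.append(int(self.counts[units[-1]].argmax()))
        return units

def decode(units, centroids):
    """u2S stand-in: emit one feature frame per unit (a real decoder emits audio)."""
    return centroids[np.array(units)]

# Toy data: three "sound" centroids and four two-dimensional feature frames.
centroids = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
frames = np.array([[0.1, -0.1], [0.9, 1.2], [4.8, 5.1], [0.2, 0.0]])

units = encode(frames, centroids)                # speech features -> discrete units
lm = BigramUnitLM(num_units=len(centroids))
lm.fit([units])                                  # learn from the "pseudo-text"
continuation = lm.generate(units[:1], steps=3)   # continue a one-unit prompt
output_frames = decode(continuation, centroids)  # units -> (toy) speech features
```

The point of the sketch is only the data flow: audio features become a sequence of integers, a language model is trained on those integers exactly as it would be on text tokens, and a decoder turns generated integers back into a signal.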
Advantages of Textless NLP
- Textless NLP technology opens up the possibility of training models for any spoken language.
- Because of the rich expressivity of oral languages, textless NLP may work better than using text for training models. The model can capture the full expressivity of oral languages, including nuances and intonations, encode irony, anger, and uncertainty, and use vocalizations like yawning, laughter, mouth clicks, etc.
- Researchers can train models on audio-first experiences like podcasts, radio shows, and social audio apps without annotation or training an ASR. It opens up the possibility of a set of applications never seen before, including online expressive translation for multilingual video games, content search, and summarisation from archived audio.
- It may help developmental psychologists and speech and language clinicians understand how infants and toddlers learn to speak, and how speech is affected by variances in the linguistic input available in different languages.
Evaluating a Baseline Model
In the research paper ‘On Generative Spoken Language Modeling from Raw Audio,’ Facebook AI researchers tested three SOTA encoders, namely CPC, wav2vec 2.0, and HuBERT, followed by k-means clustering and deduplication (removing successive identical units). They also used a standard causal ‘transformer’ for language modelling and Tacotron 2, a standard text-to-speech system, as a decoder.
Further, the researchers trained their encoder and unit-based language model on 6,000 hours of Libri-Light and LibriSpeech (a large collection of audiobooks), and the decoder on LJSpeech and LibriSpeech. First, the entire stack was trained with self-supervised learning from raw audio, with no text or labels. Second, the language model and text-to-speech components were trained on pseudo-text derived from that raw audio.
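The quantization step described above (k-means clustering followed by deduplication) can be sketched in a few lines. This is a rough illustration only, assuming plain Lloyd's k-means over toy two-dimensional features rather than real CPC, wav2vec 2.0, or HuBERT representations; the function names `kmeans`, `quantize`, and `deduplicate` are hypothetical and do not come from the paper's code.

```python
from itertools import groupby

import numpy as np

def kmeans(features, k, iterations=20, seed=0):
    """Plain Lloyd's k-means; returns the k learned centroids."""
    features = np.asarray(features, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct data points (fancy indexing copies).
    centroids = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(iterations):
        labels = quantize(features, centroids)
        for j in range(k):
            members = features[labels == j]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids

def quantize(features, centroids):
    """Assign each frame to the index of its nearest centroid."""
    features = np.asarray(features, dtype=float)
    distances = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=-1)
    return distances.argmin(axis=1)

def deduplicate(units):
    """Collapse runs of successive identical units into a single unit."""
    return [unit for unit, _ in groupby(units)]

# Two well-separated groups of toy frames.
features = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                     [10.0, 10.0], [10.1, 10.0], [10.0, 10.1]])
centroids = kmeans(features, k=2)
units = quantize(features, centroids).tolist()
pseudo_text = deduplicate(units)  # two units, one per group of frames
```

Deduplication matters because neighbouring audio frames usually fall into the same cluster; collapsing the repeats leaves a shorter sequence that captures which sounds occur and in what order, not how long each one lasts.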
When comparing these different models, the researchers found that they could not analyze the generated pseudo-text directly because the units do not map one-to-one to letters or phonemes. So instead, they used a pretrained ASR system to convert the generated audio back to text. This enabled them to measure the intelligibility of the resynthesized audio using phoneme error rate (PER), and the linguistic quality and diversity of the conditionally or unconditionally generated audio using an area under the curve (AUC) metric.
PER compares the phonemes of the original input with the phonemes transcribed by the ASR. AUC, on the other hand, is obtained by sampling sentences across a range of ‘temperatures,’ which set how inventive the language model is allowed to be. The higher the temperature, the more erratic the model's output; the lower the temperature, the more rigid it is.
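As a rough illustration of the intelligibility metric, an error rate of this kind is commonly computed as the edit distance between the reference phoneme sequence and the ASR-transcribed one, divided by the length of the reference. The sketch below assumes that standard definition; it is not the paper's evaluation code, and the phoneme strings are made-up examples.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (substitute, delete, insert)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_symbol in enumerate(reference, start=1):
        current = [i]
        for j, hyp_symbol in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_symbol != hyp_symbol)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]

def phoneme_error_rate(reference, hypothesis):
    """Edit distance normalised by the number of reference phonemes."""
    if not reference:
        raise ValueError("reference must contain at least one phoneme")
    return edit_distance(reference, hypothesis) / len(reference)

# Hypothetical phoneme sequences for the word "hello".
reference = ["HH", "AH", "L", "OW"]
transcribed = ["HH", "AH", "L", "UW"]  # the ASR heard a different final vowel

per = phoneme_error_rate(reference, transcribed)  # one substitution out of four
```

Because the score is normalised by the reference length, it can exceed 1.0 when the transcription contains many inserted phonemes.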
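The effect of temperature can be seen in a small sketch of how it reshapes a language model's next-unit distribution. The logits here are invented toy scores and the function names are hypothetical; the paper's AUC metric sweeps over many temperatures and scores the resulting generations, which this example does not attempt to reproduce.

```python
import numpy as np

def temperature_softmax(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()  # subtract the max for numerical stability
    probabilities = np.exp(scaled)
    return probabilities / probabilities.sum()

def sample_unit(logits, temperature, rng):
    """Draw the index of the next unit from the temperature-scaled distribution."""
    probabilities = temperature_softmax(logits, temperature)
    return int(rng.choice(len(probabilities), p=probabilities))

logits = [2.0, 1.0, 0.0]  # toy scores for three candidate units

rigid = temperature_softmax(logits, temperature=0.1)     # mass piles onto unit 0
erratic = temperature_softmax(logits, temperature=10.0)  # close to uniform

rng = np.random.default_rng(0)
next_unit = sample_unit(logits, temperature=1.0, rng=rng)
```

Dividing the scores by a small temperature exaggerates the gaps between them, so the model almost always picks its top choice; dividing by a large temperature shrinks the gaps, so unlikely units are sampled far more often.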
Facebook researchers said that they discovered several things while performing these measurements:
- It matters how many ‘discrete units’ the quantizers use: a higher number results in better outcomes at the acoustic level.
- There is a similar trend at the linguistic level, but using too many units in certain areas becomes detrimental.
- Different encoders produced very different outcomes (HuBERT provided the best overall result).
- Automatic generation metrics correlate well with human judgments.
- These metrics were predicted by ‘faster-to-compute zero-shot’ metrics from the Zero Resource Speech Benchmark.
For instance, the automatic and human metrics (lower is better) for the three encoders (CPC, wav2vec 2.0 and HuBERT) are shown below, alongside a LogMel baseline for comparison. All of them are quantized using k-means with three dictionary sizes (50, 100, 200).
Check out more samples here.
In addition, in a paper titled ‘Text-Free Prosody-Aware Generative Spoken Language Modeling,’ Facebook researchers presented a prosody-aware generative spoken language model (pGSLM). This new model comprises a multi-stream transformer language model (MS-TLM) of speech, represented as discovered-unit and prosodic-feature streams, and an adapted HiFi-GAN model that converts MS-TLM outputs to waveforms.
In this study, the researchers devised a series of metrics for prosody modelling and generation, re-used metrics from GSLM for content modelling, and generated natural, meaningful, and coherent speech given a spoken prompt. Check out the audio samples here.
Facebook researchers said that they would continue to apply GSLM to casual and spontaneous speech and dialogue datasets, where text-based methods and ASR struggle most. In addition, the team believes that GSLM can be an effective method for pretraining models for downstream tasks that have little labelled or annotated data available, like spoken summarization, information retrieval, and sentiment analysis.
“Our goal is to leverage the tremendous advantages in expressivity and subtlety of meaning that oral language offers over written languages, which opens up an almost infinite collection of potential data for understanding human thought,” said the team.