
What is Google’s TRILLsson?

The 600M parameter Conformer model without relative attention is called Conformer Applied to Paralinguistics (CAP).


Deepfakes, chatbots, assistant robots, transformer models: these are some of the major ML developments of the last decade, but AI has yet to move beyond models like GPT-3, and AGI is still a dream.

How would you make a marriage stronger?

GPT-3: I would make sure that I was the one who was always right.

Today, most machine learning models still struggle with the paralinguistic aspects of speech. For example, they cannot fully grasp sarcasm, cultural context, tone, emotion, or even whether the speaker is wearing a mask. Another recurring issue is that state-of-the-art results tend to come from ultra-large models trained on private data, creating a disconnect from convenient public usage.

Publicly available paralinguistic model

In the paper titled “Universal Paralinguistic Speech Representations Using Self-Supervised Conformers,” Google introduced CAP12, the 12th layer of a 600M parameter model trained on the YT-U dataset using self-supervision. The model outperforms on most paralinguistic benchmarks, sometimes by large margins. In a follow-up paper, “TRILLsson: Distilled Universal Paralinguistic Speech Representations,” the big tech company introduced the TRILLsson models as small, performant and publicly available. The team claims to have reduced the size of CAP12 by 6x-100x while maintaining its performance. “To create TRILLsson, we apply knowledge distillation on appropriately-sized audio chunks and use different architecture types to train smaller, faster networks that are small enough to run on mobile devices,” they explained.

Training

The self-supervised CAP12 model was trained on the YT-U training dataset. YT-U was built from a random collection of YouTube videos and consists of roughly 900k hours of audio spanning diverse topics, background conditions, and speaker acoustics. The dataset is unlabeled and is used to self-train Conformer models.

Credits: Google AI Blog

The team further modified Wav2Vec 2.0, a state-of-the-art self-supervised training paradigm for automatic speech recognition that learns from raw audio without labels. It was combined with ultra-large Conformer models, and training on the YT-U dataset was scaled up to model sizes of 600M, 1B, and 8B parameters. This scaling was feasible because self-supervised training requires no labels.
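At its core, Wav2Vec 2.0 masks spans of latent speech features and trains the network to pick out the true target for a masked frame from a set of distractors via a contrastive (InfoNCE) loss. The sketch below illustrates only that loss for a single masked frame, in plain Python; it is a simplified illustration, not Google's implementation, and the function names and the fixed temperature are assumptions for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def contrastive_loss(context, target, distractors, temperature=0.1):
    """Wav2Vec 2.0-style InfoNCE loss for one masked frame:
    the context vector should be closer to the true target
    than to distractors sampled from other frames."""
    sims = [cosine(context, target)] + [cosine(context, d) for d in distractors]
    exps = [math.exp(s / temperature) for s in sims]
    # Negative log-probability of the true target among all candidates.
    return -math.log(exps[0] / sum(exps))
```

A context vector aligned with the true target yields a near-zero loss, while one aligned with a distractor is heavily penalised, which is what pushes the encoder to produce predictable latents without any labels.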

The 600M parameter Conformer model without relative attention is called Conformer Applied to Paralinguistics (CAP).

CAP12 excels on the NOSS Benchmark for Paralinguistic Tasks

The CAP12 model outperforms representations from six previous ultra-large models by significant margins, the team claims. The NOSS (NOn-Semantic Speech) benchmark was used to measure the quality of paralinguistic speech representations. It collects well-studied paralinguistic tasks across diverse datasets, such as speech emotion recognition, language identification, and speaker identification. The benchmark was chosen because it evaluates speech features on the order of one second or longer (rather than lexical features). It was further expanded with a mask-wearing task, a fake speech detection task, a dysarthria severity detection task from Project Euphonia, and a speech emotion recognition task.

The team demonstrated CAP12’s usefulness over previous representations on this expanded benchmark, finding that simple linear models on time-averaged CAP12 representations outperform complex, task-specific models on five of the eight paralinguistic tasks. CAP12 is also exceptionally good at emotion recognition tasks.
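A "linear model on time-averaged representations" means collapsing the per-frame embeddings of a clip into a single mean vector and scoring it with one dot product per class. The sketch below shows just that recipe in plain Python; the function names are illustrative, and real CAP12 embeddings are high-dimensional rather than the toy vectors used here.

```python
def time_average(frame_embeddings):
    """Collapse a (num_frames x dim) sequence of per-frame embeddings
    (e.g. CAP12 layer outputs) into one fixed-size clip-level vector."""
    n = len(frame_embeddings)
    dim = len(frame_embeddings[0])
    return [sum(frame[d] for frame in frame_embeddings) / n for d in range(dim)]

def linear_score(clip_vector, weights, bias):
    """A linear probe: a single dot product plus bias,
    with no task-specific layers on top."""
    return sum(w * x for w, x in zip(weights, clip_vector)) + bias
```

The appeal of this setup is diagnostic: if a probe this simple beats complex task-specific models, the paralinguistic information must already be present in the representation itself.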

Google’s TRILLsson

TRILLsson is an on-device, publicly available version of CAP12. The team leveraged knowledge distillation to train smaller, faster and mobile-friendly architectures. EfficientNet, Audio Spectrogram Transformer (AST), and ResNet architectures were used in the research, covering both fixed-length and arbitrary-length inputs. EfficientNet comes from a neural architecture search over vision models that identifies structures that are both performant and efficient. AST models are transformers adapted to audio inputs. ResNet is a standard architecture that has shown strong performance across many tasks.

Despite being 1%-15% the size of CAP12 and trained on only 6% of the data, the TRILLsson models retained 90-96% of its performance on average. The team also found that different architecture types perform best at different sizes: ResNet models excelled at the small end, EfficientNet in the middle, and AST models at the larger end.

Two knowledge distillation techniques were considered, global matching and local matching, to match a student with a fixed-size input to the output of a teacher with a variable-size input. In global matching, CAP12 embeddings are generated for an entire audio clip to produce the distillation target, which the student must then match from only a small audio segment. In local matching, the student network instead matches the average CAP12 embedding over just the smaller portion of audio that the student sees. The research used local matching.
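The difference between the two schemes comes down to which slice of the teacher's embeddings is averaged into the target. The sketch below contrasts them in plain Python with a mean-squared-error distillation loss; it is a toy illustration under assumed names, not the paper's training code.

```python
def clip_average(frames):
    """Mean embedding over a list of per-frame teacher embeddings."""
    n, dim = len(frames), len(frames[0])
    return [sum(f[d] for f in frames) / n for d in range(dim)]

def global_target(teacher_frames, _window):
    """Global matching: the target is the teacher's average over the
    WHOLE clip, even though the student only sees a short window."""
    return clip_average(teacher_frames)

def local_target(teacher_frames, window):
    """Local matching (the scheme used for TRILLsson): the target is
    the teacher's average over just the window the student sees."""
    start, end = window
    return clip_average(teacher_frames[start:end])

def distillation_loss(student_out, target):
    """Mean squared error between student output and chosen target."""
    return sum((s - t) ** 2 for s, t in zip(student_out, target)) / len(target)
```

Under local matching the student is never asked to predict information from audio it cannot see, which makes the objective learnable from short, mobile-friendly input chunks.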

Credits: Google AI Blog

In conclusion

Paralinguistic information is bimodal in an unexpected way. The team noticed that, across the CAP model’s intermediate layers, paralinguistic information gradually increases, then decreases, then increases again, before finally being lost toward the output layer. “Surprisingly, this pattern is also seen when exploring the intermediate representations of networks trained on retinal images,” the team noted.

Such smaller and faster paralinguistic speech models open up possibilities in speech recognition, text-to-speech production, and user-intent interpretation.


Avi Gopani

Avi Gopani is a technology journalist who analyses industry trends and developments from an interdisciplinary perspective at Analytics India Magazine. Her articles chronicle cultural, political and social stories curated with a focus on the evolving technologies of artificial intelligence and data analytics.