Facebook AI Research (FAIR) has published a research paper introducing Hidden Unit BERT (HuBERT), its latest approach for learning self-supervised speech representations. According to FAIR, self-supervised techniques for speech recognition are currently limited by three factors: first, each input utterance contains multiple sound units; second, no lexicon of input sound units is available during the pre-training phase; and third, sound units have variable lengths with no explicit segmentation.
To tackle this problem, HuBERT utilises an offline k-means clustering algorithm and learns the structure of spoken input by predicting the right cluster for masked audio segments. FAIR claims that HuBERT's simplicity and stability make it easily deployable for use cases in NLP and speech research.
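The clustering step can be illustrated with a minimal numpy sketch: a toy k-means (Lloyd's algorithm) assigns each acoustic frame a discrete pseudo-label, standing in for the offline clustering HuBERT runs over features such as MFCCs. The function name and toy data below are illustrative assumptions, not FAIR's implementation.

```python
import numpy as np

def kmeans_labels(features, k=4, iters=10, seed=0):
    """Assign each feature frame to one of k clusters (Lloyd's algorithm).

    `features` stands in for per-frame acoustic features (e.g. MFCCs);
    the returned labels play the role of HuBERT-style pseudo-targets.
    """
    rng = np.random.default_rng(seed)
    # initialise centroids from k randomly chosen frames
    centroids = features[rng.choice(len(features), k, replace=False)]
    for _ in range(iters):
        # distance of every frame to every centroid, shape (T, k)
        dists = np.linalg.norm(features[:, None] - centroids[None], axis=-1)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of its assigned frames
        for c in range(k):
            if (labels == c).any():
                centroids[c] = features[labels == c].mean(axis=0)
    return labels

# Toy "utterance": 100 frames of 13-dimensional features
frames = np.random.default_rng(1).normal(size=(100, 13))
targets = kmeans_labels(frames)   # one discrete pseudo-label per frame
```

In HuBERT these discrete targets are then used as the prediction targets for the masked regions of the audio.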
HuBERT is inspired by FAIR's DeepCluster method for self-supervised visual learning. DeepCluster, introduced in 2018, alternates between grouping a network's features with a standard clustering algorithm, k-means, and using the resulting cluster assignments as supervision to update the network's parameters. HuBERT further benefits from Google's Bidirectional Encoder Representations from Transformers (BERT) by borrowing its masked prediction loss over sequences, which captures the sequential nature of speech. The BERT-style model takes masked continuous speech features and predicts pre-determined cluster assignments. Because this predictive loss is applied only over the masked regions, the model must learn good high-level representations of the unmasked inputs to correctly infer the targets of the masked areas.
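The masked-regions-only loss described above can be sketched as a cross-entropy that averages only over masked frames. This is a hedged numpy illustration under simplifying assumptions; the real model is a Transformer with a more elaborate objective.

```python
import numpy as np

def masked_prediction_loss(logits, targets, mask):
    """Cross-entropy computed over masked frames only.

    logits:  (T, K) per-frame cluster scores from the model
    targets: (T,)   k-means pseudo-labels for each frame
    mask:    (T,)   True where the input audio was masked
    """
    # numerically stable log-softmax over the cluster dimension
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # negative log-likelihood of the target cluster for each frame
    per_frame = -log_probs[np.arange(len(targets)), targets]
    # the loss counts only frames whose input was masked
    return per_frame[mask].mean()

# Toy example: 12 frames, 4 clusters, uniform scores
T, K = 12, 4
logits = np.zeros((T, K))
targets = np.zeros(T, dtype=int)
mask = np.arange(T) % 3 == 0          # mask a third of the frames
loss = masked_prediction_loss(logits, targets, mask)  # ln(4) ≈ 1.386
```

Restricting the average to `mask` is what forces the model to infer the masked targets from the surrounding unmasked context.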
How does HuBERT work?
The HuBERT model learns both acoustic and language models from continuous audio inputs. First, the model encodes unmasked audio inputs into meaningful continuous latent representations, which corresponds to the classical acoustic modelling problem. The model then captures longer-range structure through representation learning via masked prediction.
The HuBERT approach: predicting hidden cluster assignments of the masked frames (MSK) y2, y3, y4 / Source: Facebook AI Research
The model seeks to reduce prediction error by capturing the long-range temporal relationships between the representations it has learned. Here, the consistency of the k-means mapping from audio inputs to discrete targets matters as much as its correctness, since consistency lets the model focus on modelling the sequential structure of the input. For instance, if an early clustering iteration cannot tell the /k/ and /g/ sounds apart, they end up in a single supercluster containing both sounds. The prediction loss then learns representations that model how other consonant and vowel sounds combine with this supercluster to form words. Armed with these newly learned representations, the next clustering iteration can create better clusters.
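This alternation, cluster, train, then re-cluster the learned representations, can be sketched as a loop. Everything here is a toy stand-in under stated assumptions: the hypothetical `sharpen` helper merely imitates the effect of masked-prediction training by producing features better aligned with the current targets, which the next clustering round then refines.

```python
import numpy as np

def kmeans(feats, k, iters=10, seed=0):
    """Minimal k-means, returning one cluster label per frame."""
    rng = np.random.default_rng(seed)
    cent = feats[rng.choice(len(feats), k, replace=False)]
    for _ in range(iters):
        lab = np.linalg.norm(feats[:, None] - cent[None], axis=-1).argmin(axis=1)
        cent = np.stack([feats[lab == c].mean(axis=0) if (lab == c).any()
                         else cent[c] for c in range(k)])
    return lab

def sharpen(feats, labels, k, pull=0.5):
    """Hypothetical stand-in for masked-prediction training: nudge each
    frame toward its cluster mean, mimicking how training yields
    representations better aligned with the current discrete targets."""
    out = feats.copy()
    for c in range(k):
        m = labels == c
        if m.any():
            out[m] += pull * (feats[m].mean(axis=0) - feats[m])
    return out

K = 8
feats = np.random.default_rng(2).normal(size=(200, 13))
labels = kmeans(feats, K)              # iteration 1: cluster raw audio features
for _ in range(2):                     # later iterations: re-cluster the
    feats = sharpen(feats, labels, K)  # "model's" learned representations
    labels = kmeans(feats, K)
```

In the actual HuBERT recipe, later clustering iterations run over hidden representations extracted from the partially trained model rather than over a toy transformation like this.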
FAIR pretrained HuBERT on the standard LibriSpeech corpus (960 hours) and on Libri-Light (60,000 hours) and found that the model matched or improved on the performance of Facebook's state-of-the-art speech recognition model wav2vec 2.0 across fine-tuning subsets of 10 minutes, 1 hour, 10 hours, 100 hours and 960 hours. The experiments used two HuBERT models: HuBERT L-LV60k and HuBERT XL-LV60k.
Facebook AI Research also tested HuBERT's performance in language generation, which it says is essential for modelling speech signals directly without relying on lexical resources such as supervised labels.
Using Generative Spoken Language Modeling (GSLM), which involves learning the acoustic and linguistic characteristics of a language without text or labels, Facebook has begun using learned speech representations from models such as Contrastive Predictive Coding (CPC), wav2vec 2.0 and HuBERT to synthesise speech. In both automatic and human evaluations, HuBERT generated samples that competed in quality with those of the supervised character-based top-line system.
Finally, FAIR also evaluated HuBERT with the MUltiple Stimuli with Hidden Reference and Anchor (MUSHRA) test, a codec-style listening test used to judge the output quality of lossy audio compression algorithms. Here, HuBERT came second only to uncompressed audio.
Many AI-powered speech recognition platforms have been working towards understanding and recognising speech simply by listening and interacting, without labels. For example, Facebook AI Research recently launched an AI that understood speech without labelled text. The tech giant also made its largest language database public to facilitate the development of speech recognition tools, concentrating in particular on languages such as Swahili, where labelled data is scarce.
With HuBERT, Facebook claims, the AI research community can develop Natural Language Processing (NLP) systems that are trained on audio rather than text samples. This would allow AI voice assistants to capture the expressivity of oral language and speak with the nuances and style of an actual person speaking the language. Such technology could bring more inclusive speech recognition and translation applications to speakers of rare languages and dialects, and of languages with a more limited written literature.