Over the years, Meta has been an active contributor to the open-source community, publishing a steady stream of influential research papers. The most cited paper of 2022 was Google DeepMind’s AlphaFold; in the same year, Meta took third place with ‘A ConvNet for the 2020s’, a collaboration with UC Berkeley that garnered 835 citations.
Continuing that record, Meta presented more than 20 papers at INTERSPEECH 2023, the conference of the International Speech Communication Association, held in Dublin. Let’s take a look at six of the most notable.
Multi-head State Space Model for Speech Recognition
The paper introduces the multi-head state space (MH-SSM) architecture, which is equipped with specialised gating mechanisms and uses parallel heads to capture both local and global temporal patterns in sequence data. Used as a replacement for multi-head attention in transformer encoders, MH-SSM outperforms the transformer transducer on the LibriSpeech speech recognition dataset.
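To give a feel for how parallel state space heads can cover different time scales, here is a toy sketch in plain Python. It is not the paper's architecture: the scalar recurrence, the `decay` parameter and the function names are assumptions made for illustration, and the gating mechanisms are left out entirely. A head with a small decay forgets quickly (local patterns), while a head with a decay close to 1 retains information for longer (global patterns).

```python
from typing import List


def ssm_head(inputs: List[float], decay: float) -> List[float]:
    """Run a scalar linear state space recurrence over a sequence.

    state_t = decay * state_{t-1} + (1 - decay) * input_t
    The output at each step is simply the current state.
    """
    state = 0.0
    outputs = []
    for value in inputs:
        state = decay * state + (1.0 - decay) * value
        outputs.append(state)
    return outputs


def multi_head_ssm(inputs: List[float], decays: List[float]) -> List[List[float]]:
    """Apply one recurrence per head and return one feature vector per time step."""
    per_head = [ssm_head(inputs, decay) for decay in decays]
    # Transpose from [head][time] to [time][head].
    return [list(step) for step in zip(*per_head)]


# Feed a single impulse through a fast-forgetting and a slow-forgetting head.
features = multi_head_ssm([1.0, 0.0, 0.0, 0.0], decays=[0.1, 0.9])
```

In `features`, the first head's response to the impulse dies away within a couple of steps, while the second head still carries a noticeable trace of it at the last step.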
The paper also presents the Stateformer, a model that incorporates MH-SSM layers into the transformer block. It achieves state-of-the-art results on the LibriSpeech task, with word error rates of 1.76% and 4.37% on the development sets and 1.91% and 4.36% on the test sets, all without relying on an external language model.
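For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance (substitutions, deletions and insertions) between the recogniser's output and the reference transcript, divided by the number of reference words. The sketch below is a minimal, generic implementation; the function name is ours, and it is not the scoring tool used in the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

A word error rate of 1.91% therefore corresponds to roughly two word errors per hundred reference words.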
Read the full paper here.
Modality Confidence Aware Training for Robust End-to-End Spoken Language Understanding
End-to-end (E2E) spoken language understanding (SLU) systems use a single model that combines audio and text representations from pre-trained speech recognition models, and they outperform traditional SLU systems in real-time on-device scenarios. However, these systems struggle when the text representations are poor because of errors in automatic speech recognition (ASR).
To address this, Meta proposes a new E2E SLU system that is more resilient to ASR errors because it merges audio and text data according to the estimated confidence of the ASR hypotheses. It relies on two new techniques: 1) a method to gauge the quality of ASR hypotheses, and 2) an approach to incorporate those estimates effectively into E2E SLU models. The method improves accuracy on the STOP dataset, and the paper includes analysis demonstrating its effectiveness.
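One simple way to picture confidence-based merging is a weighted average of the two representations. The sketch below is only an illustration under that assumption: the interpolation rule, the function name and the idea of a single scalar confidence are ours, and the paper's actual fusion mechanism may differ.

```python
from typing import List


def fuse_by_confidence(
    audio: List[float], text: List[float], confidence: float
) -> List[float]:
    """Blend audio and text embeddings using an ASR confidence in [0, 1].

    High confidence leans on the text embedding; low confidence falls
    back on the audio embedding. This is a hypothetical fusion rule.
    """
    if len(audio) != len(text):
        raise ValueError("embeddings must have the same dimension")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    return [confidence * t + (1.0 - confidence) * a for a, t in zip(audio, text)]


# A low-confidence ASR hypothesis contributes only a quarter of the result.
fused = fuse_by_confidence(audio=[1.0, 0.0], text=[0.0, 1.0], confidence=0.25)
```

The design intent is that a badly transcribed utterance does not drag the model down, because its text embedding is given little weight.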
You can check out the full paper here.
EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis
Self-learned low-bitrate units can be used for speech synthesis and capture intricate aspects of speech, but there is a lack of expressive datasets to study them on. To fill that gap, Meta has come up with ‘Expresso’, a dataset of scripted and improvised speech in 26 styles. The dataset underpins a benchmark in which the input is encoded into low-bitrate units and then resynthesized in a target voice while preserving content and style. Resynthesis quality is assessed using self-supervised encoders, considering tradeoffs between quality, bitrate, and style consistency. The dataset, metrics, and models are open source for further research.
Check out this link for further understanding.
Handling the Alignment for Wake Word Detection: A Comparison Between Alignment-Based, Alignment-Free & Hybrid Approaches
The paper discusses wake word detection in smart devices, which lets them activate efficiently upon hearing specific keywords. It explores the role of alignment in building a wake-word system for general phrases, comparing three approaches: alignment-based training with frame-wise cross-entropy, alignment-free training using Connectionist Temporal Classification (CTC), and a hybrid approach combining aligned and unaligned data. Results show that the alignment-free system performs better at the target operating point, and that the hybrid model, trained with only a small portion of the data (20%), still meets the performance criteria.
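What makes CTC alignment-free is that the model emits one label per audio frame, including a special blank, and the final sequence is obtained by merging consecutive repeats and then dropping blanks, so no frame-level alignment is needed in the training data. The sketch below shows only this standard collapse step; the blank symbol `_` and the function name are illustrative choices, and it is not the paper's wake-word system.

```python
from itertools import groupby
from typing import List

BLANK = "_"  # illustrative blank symbol


def ctc_collapse(frame_labels: List[str], blank: str = BLANK) -> List[str]:
    """Collapse per-frame CTC labels into an output sequence.

    Step 1: merge runs of identical consecutive labels.
    Step 2: remove blank labels.
    """
    merged = [label for label, _ in groupby(frame_labels)]
    return [label for label in merged if label != blank]


# Ten frames of per-frame labels collapse to the five letters of "hello".
# The blank between the two "l" runs is what keeps the double letter.
letters = ctc_collapse(list("hh_e_ll_lo"))
```

Many different frame-level labelings collapse to the same output, which is why CTC training can sum over them instead of requiring one fixed alignment.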
For more information, read the full paper here.
MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation
Meta has unveiled a new benchmark called MuAViC (Multilingual Audio-Visual Corpus) that brings audio-visual learning to speech translation in order to make it more accurate. Building on earlier models such as AV-HuBERT and RAVEn, which use visual information to improve English speech recognition, Meta AI used MuAViC to train its AV-HuBERT model to deliver superior speech translation in challenging noisy environments.
The model is robust to noise, relying more heavily on the visual modality when the audio modality is distorted. The models were tested in noisy and noise-free environments against a top-performing model on speech recognition and X-En speech translation tasks.
Read the full paper here.
ESPnet-SE++: Speech Enhancement for Robust Speech Recognition, Translation, and Understanding
The paper discusses recent advancements in integrating speech separation and enhancement (SSE) into the ESPnet toolkit. It highlights notable improvements over the prior ESPnet-SE work, including state-of-the-art speech enhancement (SE) models with associated training and evaluation methods. A new interface enables the flexible combination of speech enhancement with other tasks such as automatic speech recognition (ASR), speech translation (ST), and spoken language understanding (SLU).
The study includes experiments on specially curated synthetic datasets for noisy-reverberant multichannel ST and SLU tasks, which the paper introduces as reference datasets for future research. The established CHiME-4 and WSJ0-2Mix datasets are also used to assess both multi-channel and single-channel SE techniques. The findings emphasise the promising potential of integrating SE front-ends with tasks beyond ASR, particularly in multi-channel settings.
Take a look at the complete paper here.