
Meta’s 6 Premier Papers Presented at INTERSPEECH 2023

At the annual conference of the International Speech Communication Association, Meta presented more than 20 papers primarily focusing on speech and language processing


Over the years, Meta has been an avid contributor to the open-source community, publishing a steady stream of impactful research papers. The most cited paper of 2022 was Google DeepMind’s AlphaFold. In the same year, Meta secured third position with ‘A ConvNet for the 2020s’, a collaborative effort with UC Berkeley that garnered a remarkable 835 citations.

Continuing this legacy, Meta presented more than 20 brilliant papers at the prestigious conference of the International Speech Communication Association (INTERSPEECH 2023) in Dublin. Let’s take a look at the top six of them.

Read more: OpenAI’s Tiny Army vs Meta-Google’s Dream Team

Multi-head State Space Model for Speech Recognition  

The paper introduces a novel approach called the multi-head state space (MH-SSM) architecture, enhanced with specialised gating mechanisms that leverage parallel heads to capture both local and global temporal patterns within sequence data. This MH-SSM model serves as a replacement for multi-head attention in transformer encoders, surpassing the performance of the transformer transducer on the LibriSpeech speech recognition dataset.

Moreover, the paper presents the Stateformer, a model incorporating MH-SSM layers into the transformer block. The Stateformer achieves state-of-the-art results on the LibriSpeech task, with word error rates of 1.76% and 4.37% on the development sets and 1.91% and 4.36% on the test sets, all without relying on an external language model.
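To make the idea concrete, here is a minimal, illustrative PyTorch sketch of a multi-head state-space layer with gating, used as a drop-in replacement for multi-head self-attention in an encoder block. The layer sizes, the diagonal state transition, and the sigmoid gating are simplifying assumptions made for readability; this is not Meta’s implementation.

```python
# Illustrative multi-head state-space (MH-SSM) layer: a per-head linear recurrence
# over time with gating, standing in for multi-head self-attention.
import torch
import torch.nn as nn

class MultiHeadSSM(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_state: int = 16):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head, self.d_state = n_heads, d_model // n_heads, d_state
        # Per-head diagonal state transition (squashed into (0, 1) for stability) and I/O maps.
        self.a_logits = nn.Parameter(torch.randn(n_heads, d_state))
        self.b = nn.Linear(d_model, n_heads * d_state)
        self.c = nn.Parameter(torch.randn(n_heads, d_state, self.d_head) * 0.02)
        self.gate = nn.Linear(d_model, d_model)   # gating over the mixed head outputs
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                          # x: (batch, time, d_model)
        bsz, t_len, _ = x.shape
        a = torch.sigmoid(self.a_logits)            # (heads, d_state)
        u = self.b(x).view(bsz, t_len, self.n_heads, self.d_state)
        state = x.new_zeros(bsz, self.n_heads, self.d_state)
        outs = []
        for t in range(t_len):                      # linear recurrence: s_t = a * s_{t-1} + u_t
            state = a * state + u[:, t]
            outs.append(torch.einsum("bhs,hsd->bhd", state, self.c))
        y = torch.stack(outs, dim=1).reshape(bsz, t_len, -1)
        y = torch.sigmoid(self.gate(x)) * y         # gating combines local input with SSM output
        return self.out(y)

# Usage: swap nn.MultiheadAttention inside a transformer encoder block for MultiHeadSSM.
x = torch.randn(2, 100, 512)                        # (batch, frames, features)
print(MultiHeadSSM()(x).shape)                      # torch.Size([2, 100, 512])
```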

Read the full paper here. 

Modality Confidence Aware Training for Robust End-to-End Spoken Language Understanding 

End-to-end (E2E) spoken language understanding (SLU) systems use a single model that combines audio and text representations from pre-trained speech recognition models, outperforming traditional SLU systems in real-time, on-device scenarios. However, these systems struggle when text representations degrade due to errors in automatic speech recognition (ASR).

To address this, Meta proposes a new E2E SLU system that enhances resilience to ASR errors by merging audio and text data based on estimated confidence levels of ASR hypotheses through two new techniques: 1) a method to gauge the quality of ASR hypotheses, and 2) an approach to effectively incorporate them into E2E SLU models. The method demonstrates improved accuracy on the STOP dataset, backed by analysis showcasing its effectiveness.
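The following PyTorch sketch shows one plausible way such confidence-weighted fusion could look. The confidence estimator and the interpolation below are illustrative assumptions, not the exact mechanism from Meta’s paper.

```python
# Hedged sketch: fuse audio and ASR-derived text embeddings, weighted by an
# estimated confidence that the ASR hypothesis is trustworthy.
import torch
import torch.nn as nn

class ConfidenceAwareFusion(nn.Module):
    def __init__(self, d_model: int = 256):
        super().__init__()
        # Predicts a scalar in (0, 1) estimating the reliability of the ASR hypothesis.
        self.confidence = nn.Sequential(
            nn.Linear(2 * d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, 1),
            nn.Sigmoid(),
        )

    def forward(self, audio_emb, text_emb):
        # audio_emb, text_emb: (batch, d_model) pooled utterance-level representations
        conf = self.confidence(torch.cat([audio_emb, text_emb], dim=-1))
        # Lean on the text representation when confidence is high, on audio otherwise.
        return conf * text_emb + (1.0 - conf) * audio_emb

fusion = ConfidenceAwareFusion()
fused = fusion(torch.randn(4, 256), torch.randn(4, 256))   # (4, 256), feeds the SLU decoder
```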

You can check out the full paper here.

EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis

Meta has come up with ‘Expresso’, a dataset of scripted and improvised speech in 26 styles, built to tackle expressive speech synthesis from self-learned, low-bitrate units, an area that has long lacked expressive datasets. The dataset underpins a benchmark in which input speech is encoded into low-bitrate units and then resynthesized in a target voice while preserving content and style. Resynthesis quality is assessed using self-supervised encoders, considering the trade-offs between quality, bitrate, and style consistency. The dataset, metrics, and models are open source for further research.
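For orientation, here is a conceptual sketch of the resynthesis pipeline the benchmark evaluates: speech is mapped to low-bitrate discrete units and then regenerated in a target voice. `unit_encoder` and `unit_vocoder` are hypothetical placeholders for a self-supervised unit extractor and a unit-to-waveform vocoder, not calls from a specific released API.

```python
# Conceptual pipeline: speech -> discrete low-bitrate units -> speech in a target voice.
import torch

def resynthesize(wav: torch.Tensor, unit_encoder, unit_vocoder, target_speaker: int):
    """Encode speech into discrete units, then regenerate it in a target voice."""
    units = unit_encoder(wav)                           # (frames,) integer unit ids, low bitrate
    return unit_vocoder(units, speaker=target_speaker)  # waveform preserving content and style
```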

Check out this link for further understanding.

Handling the Alignment for Wake Word Detection: A Comparison Between Alignment-Based, Alignment-Free & Hybrid Approaches

The paper discusses wake word detection in smart devices, enabling them to activate efficiently upon hearing specific keywords. It explores alignment’s role in creating a wake-word system for general phrases, comparing three approaches: alignment-based training with frame-wise cross-entropy, alignment-free training using Connectionist Temporal Classification (CTC), and a hybrid approach combining aligned and unaligned data. Results show that the alignment-free system performs better for the target operating point, and the hybrid model, trained with a small portion of data (20%), meets performance criteria effectively.
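The two training criteria can be contrasted with PyTorch’s built-in losses, as in the hedged sketch below. The shapes and the toy wake-word vocabulary are assumptions for illustration; neither snippet reproduces Meta’s exact model.

```python
# Alignment-based (frame-wise cross-entropy) vs. alignment-free (CTC) wake-word training.
import torch
import torch.nn.functional as F

batch, frames, n_classes = 8, 120, 5            # classes: CTC blank + 4 wake-word sub-units
logits = torch.randn(batch, frames, n_classes)  # acoustic model outputs

# 1) Alignment-based: a forced alignment labels every frame, so ordinary
#    frame-wise cross-entropy applies.
frame_labels = torch.randint(0, n_classes, (batch, frames))
ce_loss = F.cross_entropy(logits.reshape(-1, n_classes), frame_labels.reshape(-1))

# 2) Alignment-free: only the unaligned label sequence is known; CTC marginalises
#    over all possible alignments (class 0 is reserved as the CTC blank).
log_probs = logits.log_softmax(-1).transpose(0, 1)    # (frames, batch, classes)
targets = torch.randint(1, n_classes, (batch, 3))     # e.g. 3 sub-word units per keyword
input_lengths = torch.full((batch,), frames)
target_lengths = torch.full((batch,), 3)
ctc_loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths, blank=0)
```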

For more information, read the full paper here. 

MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation

Meta has unveiled a new benchmark called MuAViC (Multilingual Audio-Visual Corpus) that incorporates audio-visual learning to achieve highly accurate speech translation. Building on its earlier AI models such as AV-HuBERT and RAVen, which use visual information to improve English speech recognition, Meta AI has used MuAViC to train its AV-HuBERT model to deliver superior speech translation in challenging, noisy environments.

The model handles noise gracefully, relying more heavily on the visual modality when the audio is distorted. The models were tested in noisy and noise-free environments against a top-performing model on speech recognition and X-En speech translation tasks.

Read the full paper here.  

ESPnet-SE++: Speech Enhancement for Robust Speech Recognition, Translation, and Understanding

The paper discusses recent advancements in integrating speech separation and enhancement (SSE) into the ESPnet toolkit. Notable improvements over prior ESPnet-SE work are highlighted, incorporating state-of-the-art speech enhancement models with associated training and evaluation methods. A novel interface has been devised, enabling the flexible combination of speech enhancement with other tasks like automatic speech recognition (ASR), speech translation (ST), and spoken language understanding (SLU).

The study includes experiments on specially curated synthetic datasets for noisy-reverberant multichannel ST and SLU, which the paper introduces as reference datasets for future research. Additionally, the established CHiME-4 and WSJ0-2Mix datasets are used to assess both multi- and single-channel SE techniques. The findings emphasise the promising potential of integrating SE front-ends with tasks beyond ASR, particularly in multi-channel settings.
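The front-end/back-end interface the paper describes can be pictured with the hedged, generic sketch below: an SE front-end cleans the signal before any downstream task consumes it. The `enhance` and `backend` callables are hypothetical stand-ins, not ESPnet-SE++ API calls.

```python
# Generic sketch of a speech-enhancement front-end feeding a downstream back-end.
from typing import Callable
import torch

def run_with_se_frontend(noisy_wav: torch.Tensor, enhance: Callable, backend: Callable):
    """Denoise/dereverberate/separate first, then hand off to ASR, ST, or SLU."""
    clean_wav = enhance(noisy_wav)   # SE front-end
    return backend(clean_wav)        # the same interface serves recognition, translation, or SLU
```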

Take a look at the complete paper here.

Read more: Google’s 6 Must-Read Papers Published at INTERSPEECH 2023

Shritama Saha

Shritama (she/her) is a technology journalist at AIM who is passionate about exploring the influence of AI on different domains, including fashion, healthcare, and banking.