Meta’s 6 Premier Papers Presented at INTERSPEECH 2023

At the annual conference of the International Speech Communication Association, Meta presented more than 20 papers, primarily focused on speech and natural language processing

Over the years, Meta has been an avid contributor to the open-source community, publishing a steady stream of impactful research papers. The most-cited paper of 2022 was Google DeepMind’s AlphaFold paper; that same year, Meta secured third position with ‘A ConvNet for the 2020s’, a collaboration with UC Berkeley that garnered a remarkable 835 citations.

Continuing that legacy, Meta presented more than 20 papers at INTERSPEECH 2023, the annual conference of the International Speech Communication Association, held in Dublin. Let’s take a look at the top six of them.



Multi-head State Space Model for Speech Recognition  

The paper introduces the multi-head state space model (MH-SSM), an architecture enhanced with specialised gating mechanisms that uses parallel heads to capture both local and global temporal patterns in sequence data. MH-SSM serves as a drop-in replacement for multi-head attention in transformer encoders, surpassing the performance of the transformer transducer on the LibriSpeech speech recognition dataset.

Moreover, the paper presents the Stateformer, a model that incorporates MH-SSM layers into the transformer block. The Stateformer achieves state-of-the-art results on the LibriSpeech task, with word error rates of 1.76% and 4.37% on the development sets and 1.91% and 4.36% on the test sets, all without relying on an external language model.
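The core idea of running several state-space recurrences in parallel, then concatenating and projecting their outputs in place of attention heads, can be illustrated with a toy NumPy sketch. This is a heavy simplification for intuition only: the actual MH-SSM uses learned, gated layers, and all dimensions and names below are made up.

```python
import numpy as np

def ssm_head(u, A, B, C):
    """One linear state-space recurrence over a sequence:
    x_t = A @ x_{t-1} + B @ u_t ;  y_t = C @ x_t."""
    state = np.zeros(A.shape[0])
    outputs = []
    for u_t in u:                      # scan over time steps
        state = A @ state + B @ u_t
        outputs.append(C @ state)
    return np.stack(outputs)           # (T, head_dim)

def multi_head_ssm(u, heads, W_out):
    """Run SSM heads in parallel and project the concatenated
    outputs -- the same in/out shape as a multi-head attention block."""
    outs = [ssm_head(u, A, B, C) for (A, B, C) in heads]
    return np.concatenate(outs, axis=-1) @ W_out   # (T, d_model)

# Toy configuration: 2 heads, model width 8, state size 4
rng = np.random.default_rng(0)
d_model, state_dim, head_dim, T = 8, 4, 4, 5
heads = [(0.9 * np.eye(state_dim),                       # stable transition A
          0.1 * rng.normal(size=(state_dim, d_model)),   # input map B
          0.1 * rng.normal(size=(head_dim, state_dim)))  # readout C
         for _ in range(2)]
W_out = 0.1 * rng.normal(size=(2 * head_dim, d_model))
u = rng.normal(size=(T, d_model))

y = multi_head_ssm(u, heads, W_out)
print(y.shape)   # (5, 8): sequence length and model width preserved
```

Because the recurrence only looks backwards in time, each head behaves like a causal filter over the sequence, which is what lets it stand in for attention in a streaming transducer.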

Read the full paper here. 

Modality Confidence Aware Training for Robust End-to-End Spoken Language Understanding 

End-to-end (E2E) spoken language understanding (SLU) systems employ a single model that combines audio and text representations from pre-trained speech recognition models, outperforming traditional cascaded SLU systems in real-time on-device scenarios. However, these systems struggle when text representations are degraded by errors in automatic speech recognition (ASR).

To address this, Meta proposes a new E2E SLU system that enhances resilience to ASR errors by merging audio and text data based on the estimated confidence of ASR hypotheses, through two new techniques: 1) a method to gauge the quality of ASR hypotheses, and 2) an approach to effectively incorporate that estimate into E2E SLU models. The method demonstrates improved accuracy on the STOP dataset.
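The intuition behind confidence-aware fusion can be sketched as a simple convex combination of the two modality embeddings. This is an illustrative toy, not Meta's fusion network: the function name and the scalar-confidence assumption are invented here, and the real system learns the integration end to end.

```python
import numpy as np

def fuse(audio_emb, text_emb, asr_confidence):
    """Weight the text branch by the estimated ASR confidence:
    when the hypothesis looks unreliable, lean on the audio branch."""
    c = float(np.clip(asr_confidence, 0.0, 1.0))
    return c * text_emb + (1.0 - c) * audio_emb

# Toy embeddings so the weighting is visible
audio = np.array([1.0, 0.0])
text = np.array([0.0, 1.0])

print(fuse(audio, text, 0.9))   # confident ASR -> mostly the text embedding
print(fuse(audio, text, 0.2))   # noisy ASR -> mostly the audio embedding
```

The key design point carried over from the paper is that the fusion weight is driven by an explicit estimate of hypothesis quality, rather than being fixed at training time.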

You can check out the full paper here.

EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis

Meta has come up with ‘Expresso’, a dataset of scripted and improvised speech in 26 styles, addressing the lack of expressive datasets for speech synthesis from self-learned, low-bitrate discrete units, which must capture intricate aspects of speech. The dataset underpins a benchmark in which input speech is encoded into low-bitrate units and then resynthesized in a target voice while preserving content and style. Resynthesis quality is assessed using self-supervised encoders, considering trade-offs between quality, bitrate, and style consistency. The dataset, metrics, and models are open source for further research.

Check out this link for further understanding.

Handling the Alignment for Wake Word Detection: A Comparison Between Alignment-Based, Alignment-Free & Hybrid Approaches

The paper discusses wake word detection in smart devices, enabling them to activate efficiently upon hearing specific keywords. It explores alignment’s role in creating a wake-word system for general phrases, comparing three approaches: alignment-based training with frame-wise cross-entropy, alignment-free training using Connectionist Temporal Classification (CTC), and a hybrid approach combining aligned and unaligned data. Results show that the alignment-free system performs better for the target operating point, and the hybrid model, trained with a small portion of data (20%), meets performance criteria effectively.
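What "alignment-free" buys you can be seen in how CTC output is decoded: repeated frame labels are collapsed and blanks dropped, so no frame-level alignment between audio and keyword is ever needed. The sketch below is a toy greedy decoder, not the paper's system; the character labels, blank symbol, and substring-trigger rule are invented here for illustration.

```python
def ctc_greedy_decode(frame_labels, blank="-"):
    """Collapse a per-frame CTC labelling: merge repeated labels,
    then drop blanks. No frame-level alignment is required."""
    decoded, prev = [], None
    for lab in frame_labels:
        if lab != prev and lab != blank:
            decoded.append(lab)
        prev = lab
    return "".join(decoded)

def wake_word_fired(frame_labels, wake_word="hey"):
    """Trigger when the collapsed output contains the keyword."""
    return wake_word in ctc_greedy_decode(frame_labels)

# 13 frames of (toy) per-frame predictions for the phrase "hey"
frames = list("--hh-ee--yy--")
print(ctc_greedy_decode(frames))   # "hey"
print(wake_word_fired(frames))     # True
```

Many different frame sequences collapse to the same keyword, which is exactly why a CTC-trained system can be built from unaligned data, while the hybrid approach in the paper still uses a small aligned portion (around 20%) to reach its operating point.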

For more information, read the full paper here. 

MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation

Meta has unveiled a new benchmark called MuAViC (Multilingual Audio-Visual Corpus) that incorporates audio-visual learning to achieve highly accurate speech translation. Building on its previous AV-HuBERT and RAVen models, which use visual information to improve English speech recognition, Meta AI has used MuAViC to train its AV-HuBERT model to deliver superior speech translation in challenging noisy environments.

The model can effortlessly handle noise, with the visual modality being relied upon more heavily if the audio modality is distorted. The models were tested in noisy and noise-free environments against a top-performing model for speech recognition and X-En speech translation tasks.

Read the full paper here.  

ESPnet-SE++: Speech Enhancement for Robust Speech Recognition, Translation, and Understanding

The paper discusses recent advancements in integrating speech separation and enhancement (SSE) into the ESPnet toolkit. Notable improvements over prior ESPnet-SE work are highlighted, incorporating state-of-the-art speech enhancement models with associated training and evaluation methods. A novel interface has been devised, enabling the flexible combination of speech enhancement with other tasks like automatic speech recognition (ASR), speech translation (ST), and spoken language understanding (SLU).

The study includes experiments on specially curated synthetic datasets for noisy-reverberant multichannel ST and SLU tasks, which are introduced as reference datasets for future research. Additionally, the established CHiME-4 and WSJ0-2Mix datasets are utilised to assess both multi- and single-channel SE techniques. The findings emphasise the promising potential of integrating SE front-ends with tasks beyond ASR, particularly in multi-channel settings.

Take a look at the complete paper here.


Shritama Saha
Shritama Saha is a technology journalist keen to learn about AI and analytics. A graduate in mass communication, she is passionate about exploring the influence of data science on fashion, drug development, films, and art.
