Why Speech Separation is Such a Difficult Problem to Solve

Researchers are making great progress in speech separation and recognition using various methods, but the biggest challenge remains inferring overlapping sounds as separate sources of speech rather than as a single speaker.

You are talking on the phone, recording audio, or just speaking to a voice assistant like Google Assistant, Cortana, or Alexa. But the person on the other side of the call cannot hear you because you are in a crowded place, the recording is full of background noise, or the “Hey, Alexa” call wasn’t picked up by your device because someone else started speaking.

All of these problems related to separating voices, informally referred to as the “cocktail party problem”, have been addressed using artificial intelligence and deep learning methods in recent years. But separating and inferring multiple simultaneous voices is still a difficult problem to solve completely. Why is that?

To start, speech separation is the task of extracting the speech of the “wanted speaker”, or “speaker of interest”, from an overlapping mixture of speech from other speakers, which is treated as ‘noise’. In recent years, advances in automatic speech recognition (ASR) technology have made speech separation an important field of research.

Challenges and approaches

The origin of speech recognition dates back to 1952, when Bell Laboratories researchers Stephen Balashek, R. Biddulph, and K. H. Davis released the first voice recognition device, called “Audrey”, which could recognise digits spoken by a single voice. By the 1980s, plenty of progress had been made in this field with the introduction of the n-gram language model, a probabilistic language model that predicts the next item in a text or speech sequence from the preceding n−1 items, i.e. an (n−1)th-order Markov model.
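
For intuition, here is a minimal bigram (n = 2) sketch of that idea in Python; the toy corpus, the predict_next helper, and the resulting probabilities are purely illustrative and have nothing to do with the historical Bell Labs system.

```python
from collections import Counter, defaultdict

# Toy bigram (n = 2) model: the next word is predicted from the
# previous word alone, i.e. a first-order Markov assumption.
corpus = "call me maybe call me later call you later".split()

bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def predict_next(word):
    """Return the most likely next word and its estimated probability."""
    counts = bigram_counts[word]
    best, freq = counts.most_common(1)[0]
    return best, freq / sum(counts.values())

print(predict_next("call"))  # ('me', 0.666...) on this toy corpus
```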

When researchers applied these techniques to multiple sound sources, two major challenges stood in the way of accurate separation:

  • Non-stationarity of the speech signals and the surrounding environment.
  • Reverberation in the acoustic setting.

The three most common approaches to tackling these challenges are:

  • Beamforming: This technique uses spatial information from a linear array of microphones to estimate the direction of each speaker relative to the microphone array.
  • Blind source separation: BSS is based on independent component analysis (ICA) and therefore relies on the statistical independence of the signals (see the sketch after this list).
  • Single-channel speech separation: SCSS is a highly complicated technique that aims to separate and deconvolve independent, individual sources from a single-channel mixture.
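
To make the BSS idea concrete, below is a minimal sketch using scikit-learn’s FastICA on synthetic signals; the sine and square waves stand in for two statistically independent “speakers”, and the mixing matrix and variable names are illustrative assumptions rather than a real recording setup.

```python
import numpy as np
from sklearn.decomposition import FastICA  # assumes scikit-learn is installed

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 8000)

# Two independent sources stand in for two speakers.
s1 = np.sin(2 * np.pi * 220 * t)            # source 1: sine wave
s2 = np.sign(np.sin(2 * np.pi * 150 * t))   # source 2: square wave
sources = np.c_[s1, s2]

# Two "microphones" each record a different linear mixture of the sources.
mixing = np.array([[1.0, 0.6],
                   [0.4, 1.0]])
mixtures = sources @ mixing.T
mixtures += 0.01 * rng.standard_normal(mixtures.shape)  # a little sensor noise

# ICA recovers the sources up to permutation and scaling,
# relying only on their statistical independence.
ica = FastICA(n_components=2, random_state=0)
estimated = ica.fit_transform(mixtures)
print(estimated.shape)  # (8000, 2) -- one column per recovered source
```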

Speech separation is essentially an advanced case of sound source separation. Humans appear to have an “innate” ability to separate sound sources from childhood. Though it looks intrinsic, it is actually the result of conditioning, or training, from birth to pick out the desired speech from background noise. This is why humans can focus on a single sound source simply by turning towards it, even in the presence of background noise.

Applications and results

Recently, Meta AI researcher Eliya Nachmani and his team introduced SepIt, a deep neural network that can separate speech from 2, 3, 5, and 10 speakers. It takes the SCSS approach and uses a general upper bound as a stopping criterion, obtained from the Cramér-Rao bound under an assumption about the nature of short speech segments.

A similar approach was adopted by Yi Luo in 2018 with Conv-TasNet, a fully convolutional, end-to-end time-domain speech separation network built around a learned encoder, a mask-estimating separation module, and a decoder.
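
The PyTorch skeleton below sketches that encoder-mask-decoder pattern; the layer sizes, the simplistic two-layer separator, and the class name are placeholders for illustration only, not the published Conv-TasNet architecture (which uses a deep temporal convolutional network as the separator).

```python
import torch
import torch.nn as nn

class TimeDomainSeparator(nn.Module):
    """Illustrative skeleton of the encoder-mask-decoder idea behind
    time-domain separation networks such as Conv-TasNet."""

    def __init__(self, n_speakers=2, n_filters=256, kernel=16, stride=8):
        super().__init__()
        self.n_speakers = n_speakers
        # Learned analysis transform replacing the STFT.
        self.encoder = nn.Conv1d(1, n_filters, kernel, stride=stride, bias=False)
        # Stand-in separator; the real model uses a deep temporal conv net (TCN).
        self.separator = nn.Sequential(
            nn.Conv1d(n_filters, n_filters, 1), nn.ReLU(),
            nn.Conv1d(n_filters, n_filters * n_speakers, 1), nn.Sigmoid(),
        )
        # Learned synthesis transform back to the waveform domain.
        self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel, stride=stride, bias=False)

    def forward(self, mixture):            # mixture: (batch, 1, time)
        feats = self.encoder(mixture)      # (batch, n_filters, frames)
        masks = self.separator(feats)      # one mask per speaker
        masks = masks.view(mixture.size(0), self.n_speakers, -1, feats.size(-1))
        # Apply each speaker's mask to the encoded mixture, then decode.
        return torch.stack(
            [self.decoder(feats * masks[:, s]) for s in range(self.n_speakers)], dim=1
        )

model = TimeDomainSeparator()
separated = model(torch.randn(4, 1, 16000))  # 4 one-second mixtures at 16 kHz
print(separated.shape)                        # (4, 2, 1, 16000)
```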

MIT-IBM Watson AI Lab researcher Chuang Gan and his team developed an AI tool that matches the sound and visuals of the same source in order to separate similar sounds. In the project, the researchers used synchronised audio-video tracks of musicians playing the piano to recreate how humans use multiple senses to infer information. “Multi-sensory processing is the precursor to embodied intelligence and AI systems that can perform more complicated tasks,” said MIT professor Antonio Torralba.

In his paper on supervised speech separation, DeLiang Wang discusses how training a model in a manner similar to text-to-image models, which are trained with CNNs on pairs of images and descriptive text, could greatly improve speech separation as well. This argument makes a case for solving the cocktail party problem by comparing each model’s ASR scores and separation capabilities against human speech intelligibility under the same conditions.

Wang proposes that instead of studying speech separation as a traditional signal processing problem, it should be treated as a supervised learning problem. If a model is trained on discriminative features and patterns of speakers and their speech, it might be possible to infer and remove noise from a recording. A paper by P. Nancy of Parisutham Institute of Technology also points out the need to train models on a wider range of acoustic conditions.
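
As a concrete illustration of this supervised framing, the snippet below computes an ideal ratio mask, a common training target in the supervised separation literature; the toy spectrograms, shapes, and function name are illustrative assumptions, as real inputs would come from the STFT of actual recordings.

```python
import numpy as np

def ideal_ratio_mask(clean_stft, noise_stft, eps=1e-8):
    """For each time-frequency bin, estimate how much of the mixture's
    energy belongs to the target speech (a supervised training target)."""
    speech_power = np.abs(clean_stft) ** 2
    noise_power = np.abs(noise_stft) ** 2
    return np.sqrt(speech_power / (speech_power + noise_power + eps))

# Toy complex spectrograms (freq bins x frames); real ones would come from an STFT.
rng = np.random.default_rng(0)
clean = rng.normal(size=(257, 100)) + 1j * rng.normal(size=(257, 100))
noise = 0.5 * (rng.normal(size=(257, 100)) + 1j * rng.normal(size=(257, 100)))

mask = ideal_ratio_mask(clean, noise)
# A network trained to predict this mask from the mixture can then
# recover the target speech as mask * mixture_stft.
print(mask.min(), mask.max())  # values lie between 0 and 1
```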

Why is the problem difficult to solve?

Meta AI’s SepIt ranks highest on the speech separation leaderboard, but the size of the model indicates that there is still room for improvement and further research in the field. Though researchers are making great progress in speech separation and recognition using various methods, the biggest challenge remains inferring overlapping sounds as separate sources of speech rather than as a single speaker.

Training speech separation algorithms and models to draw on multiple sensory inputs, much as humans do and as seen in the approaches mentioned above, could drive further progress. This, in turn, could improve the efficiency and accuracy of speech recognition and separation-related innovations such as voice assistants and even hearing aids.

Every source of sound has a different frequency, volume, and waveform through which it can be identified and separated. But this is far easier said than done without sacrificing accuracy. Separating two overlapping voices is far more challenging than recognising the speech of a single speaker, since the possible combinations are almost infinite.

Mohit Pandey
