Why Speech Separation is Such a Difficult Problem to Solve

Researchers are making great progress in speech separation and recognition using various methods, but the key to a solution, and still the biggest challenge, is inferring sounds as separate sources of speech rather than as a single speaker.

You are talking on the phone, recording audio, or just speaking to a voice assistant like Google Assistant, Cortana, or Alexa. But the person on the other side of the call cannot hear you because you are in a crowded place, the recorded audio has a lot of background noise, or the “Hey, Alexa” call wasn’t picked up by your device because someone else started speaking.

All of these problems related to separating voices, informally referred to as the “cocktail party problem”, have been addressed using artificial intelligence and deep learning methods in recent years. But still, separating and inferring multiple simultaneous voices is a difficult problem to completely solve. Why is that?

To start, speech separation is the extraction of the speech of the “wanted speaker” or “speaker of interest” from an overlapping mixture of speech from other speakers, also referred to as ‘noise’. In recent years, advances in automatic speech recognition (ASR) technology have made speech separation an important field of research.

Challenges and approaches

The origin of speech recognition dates back to 1952, when Bell Laboratories researchers Stephen Balashek, R. Biddulph, and K. H. Davis released the first voice recognition device, called “Audrey”, which could recognise digits from a single voice. By the 1980s, plenty of progress had been made in this field with the introduction of the n-gram language model, a probabilistic language model that predicts the next item in a text or speech sequence from the previous n-1 items, that is, using a Markov model of order n-1.
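
To make the n-gram idea concrete, here is a minimal sketch for the case n = 2 (a bigram model), where the next word is predicted from the single preceding word. The toy corpus and the function names `train_bigram` and `predict_next` are illustrative assumptions and are not taken from any of the systems mentioned in this article.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, prev):
    """Return (most likely next word, estimated probability) given the previous word."""
    followers = counts[prev]
    total = sum(followers.values())
    if total == 0:
        # The previous word was never seen with a follower in training.
        return None, 0.0
    word, freq = followers.most_common(1)[0]
    return word, freq / total

# Toy corpus (hypothetical); a real recogniser would train on far more text.
corpus = "call mom now call mom later call dad now".split()
model = train_bigram(corpus)
print(predict_next(model, "call"))
```

In this corpus “call” is followed by “mom” twice and “dad” once, so the model picks “mom” with an estimated probability of two thirds. Larger values of n condition on more context at the cost of needing much more training data.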

When researchers applied these techniques to multiple sound sources, two major challenges stood in the way of accurate separation:

  • Non-stationarity of the speech signals and the surrounding environment.
  • Reverberation in the acoustic setting.

The three most common approaches to tackling these challenges are:

  • Beamforming: This technique uses spatial information from linearly arranged microphones to estimate the direction of each speaker relative to the microphone array.
  • Blind source separation: BSS is based on independent component analysis (ICA) and therefore depends on statistical independence of signals.
  • Single channel speech separation: SCSS is a highly complicated technique that aims to separate and deconvolve independent and individual sources from a single-channel mixture.
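
As a rough illustration of the first approach, the sketch below implements delay-and-sum, one of the simplest forms of beamforming: each microphone signal is time-aligned for the wanted direction and the channels are averaged, so the wanted speaker adds up coherently while sound from other directions does not. Everything here is an assumption made for illustration: two microphones, integer sample delays, sinusoids standing in for speech, and delays picked so that the interferer cancels exactly. Real arrays only attenuate interference.

```python
import math

def delay_and_sum(channels, delays):
    """Advance each channel by its integer sample delay, then average the channels."""
    n = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[i + d] for ch, d in zip(channels, delays)) / len(channels)
        for i in range(n)
    ]

def target(i):
    # Wanted speaker: a slow sinusoid (period of 50 samples).
    return math.sin(2 * math.pi * i / 50)

def interferer(i):
    # Competing source: a sinusoid with a period of 8 samples.
    return math.sin(2 * math.pi * i / 8)

N = 200
# Assumed geometry: the target reaches mic 1 three samples after mic 0,
# while the interferer reaches mic 1 seven samples after mic 0.
mic0 = [target(i) + interferer(i) for i in range(N)]
mic1 = [target(i - 3) + interferer(i - 7) for i in range(N)]

# Steer toward the target by undoing its 3-sample delay on mic 1.
steered = delay_and_sum([mic0, mic1], delays=[0, 3])

# Largest deviation of the beamformer output from the clean target.
residual = max(abs(steered[i] - target(i)) for i in range(len(steered)))
print(f"max residual interference: {residual:.2e}")
```

After alignment the two copies of the target coincide, while the two copies of the interferer end up four samples apart, which is half of its 8-sample period, so they cancel when averaged. With real speech and real rooms, the non-stationarity and reverberation listed above are exactly what prevent such clean cancellation.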

Speech separation is essentially an advanced case of sound source separation. Humans appear to have an “innate” ability to separate sound sources. Though it looks intrinsic, it is actually the result of conditioning, or training, since birth to separate the desired speech from background noise. This is why humans can focus on a single sound source by merely turning towards it, even in the presence of background noise.

Applications and results

Recently, Meta AI researcher Eliya Nachmani and his team introduced SepIt, which can separate speech from 2, 3, 5, and 10 sources or speakers. It is a deep neural network that takes the SCSS approach and uses a general upper bound as a stopping criterion; the bound is obtained from the Cramér-Rao bound under an assumption about the nature of segments of speech.

Yi Luo took a similar approach in 2018 with Conv-TasNet, a fully convolutional, end-to-end, time-domain sound separation network with an encoder and a decoder.

MIT-IBM Watson AI Lab researcher Chuang Gan with his team developed an AI tool that essentially matches sound and visuals of the same source to separate similar sounds. In the project, the researchers used synchronised audio-video tracks of musicians playing piano to recreate how humans use multiple sensors to infer information. “Multi-sensory processing is the precursor to embodied intelligence and AI systems that can perform more complicated tasks,” said MIT professor Antonio Torralba.

DeLiang Wang, in his paper on supervised speech separation, discusses how training a separation model in a manner similar to how text-to-image models are trained, with CNNs and paired descriptive text, could greatly improve speech separation as well. The argument suggests approaching the cocktail party problem by comparing each model's ASR scores and separation capabilities against human speech intelligibility under the same conditions.

Wang proposes that speech separation should be treated as a supervised learning problem instead of being studied in the traditional way, as a signal processing problem. If a model is trained on the discriminative features and patterns of speakers and their speech, it might be possible to infer and remove noise from a recording. A paper by P. Nancy of Parisutham Institute of Technology also pointed out the need to train models on more acoustic conditions.
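
One way to see what “supervised” means here: during training, the clean speech and the noise are both known, so a target can be computed for the model to learn. The sketch below uses a simple ratio mask, a kind of training target commonly used in the supervised separation literature (published variants often apply an exponent to the ratio). The article does not name a specific target, so this choice, the toy energy values, and the function name `ratio_mask` are assumptions for illustration only.

```python
def ratio_mask(speech_energy, noise_energy, eps=1e-12):
    """Per-bin mask in [0, 1]: the fraction of energy that belongs to speech."""
    return [s / (s + n + eps) for s, n in zip(speech_energy, noise_energy)]

# Toy energies for four time-frequency bins (hypothetical values).
speech = [4.0, 0.5, 0.0, 9.0]
noise = [1.0, 0.5, 2.0, 1.0]

# Training target: computable only because clean speech and noise are known.
mask = ratio_mask(speech, noise)

# At test time a trained model would predict the mask from the mixture alone.
# Here the ideal mask is applied directly, assuming energies add in each bin.
mixture = [s + n for s, n in zip(speech, noise)]
estimate = [m * x for m, x in zip(mask, mixture)]

print([round(m, 2) for m in mask])
print([round(e, 2) for e in estimate])
```

Bins dominated by speech get a mask value close to 1 and bins dominated by noise get a value close to 0, so multiplying the mixture by the mask keeps mostly speech. The learning problem is to predict such a mask when only the mixture is available.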

Why is the problem difficult to solve?

Meta AI’s ‘SepIt’ ranks highest on the speech separation leaderboard, but the size of the model indicates that improvements and further research are still needed. Though researchers are making great progress in speech separation and recognition using various methods, the key to a solution, and still the biggest challenge, is inferring sounds as separate sources of speech rather than as a single speaker.

Training speech separation algorithms and models on multiple sensory inputs, much as humans use them and as the approaches above illustrate, could bring further progress. This could in turn improve the efficiency and accuracy of speech recognition and of separation-related innovations such as voice assistants or even hearing aids.

Every source of sound has a different frequency, volume, and waveform by which it can be identified and separated. But this is far easier said than done without sacrificing accuracy. Separating two overlapping voices is far more challenging than understanding the speech of one speaker, since the possible combinations are almost infinite.
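
The point about near-infinite combinations can be shown with a few numbers. In a single-channel recording, the mixture is the sample-wise sum of the sources, and many different pairs of sources add up to exactly the same recording. The sketch below uses made-up sample values; the names `mix`, `speaker_a`, `speaker_b`, and `shift` are hypothetical.

```python
def mix(a, b):
    """A single-channel recording is the sample-wise sum of its sources."""
    return [x + y for x, y in zip(a, b)]

# One possible pair of sources (toy sample values).
speaker_a = [0.2, -0.1, 0.4, 0.0]
speaker_b = [0.1, 0.3, -0.2, 0.5]

# Move an arbitrary signal from one source to the other.
shift = [0.05, -0.2, 0.1, 0.3]
alt_a = [x + s for x, s in zip(speaker_a, shift)]
alt_b = [y - s for y, s in zip(speaker_b, shift)]

original = mix(speaker_a, speaker_b)
alternative = mix(alt_a, alt_b)

# Both decompositions explain the recording equally well (up to rounding).
same = all(abs(p - q) < 1e-12 for p, q in zip(original, alternative))
print(same)  # prints True
```

Because any choice of `shift` yields another valid decomposition, the mixture alone does not determine the sources. A separation model has to rely on learned knowledge of what speech from one speaker sounds like in order to pick the plausible answer.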

Mohit Pandey
Mohit is a technology journalist who dives deep into the Artificial Intelligence and Machine Learning world to bring out information in simple and explainable words for the readers. He also holds a keen interest in photography, filmmaking, and the gaming industry.
