You are talking on the phone, recording audio, or just speaking to a voice assistant like Google Assistant, Cortana, or Alexa. But the person on the other side of the call cannot hear you because you are in a crowded place, the recorded audio has a lot of background noise, or the “Hey, Alexa” call wasn’t picked up by your device because someone else started speaking.
All of these problems related to separating voices, informally referred to as the “cocktail party problem”, have been addressed using artificial intelligence and deep learning methods in recent years. But still, separating and inferring multiple simultaneous voices is a difficult problem to completely solve. Why is that?
To start, speech separation means extracting the speech of the “wanted speaker” or “speaker of interest” from an overlapping mixture that includes the speech of other speakers, also referred to as ‘noise’. In recent years, advances in automatic speech recognition (ASR) technology have made speech separation an important field of research.
Challenges and approaches
The origin of speech recognition dates back to 1952, when Bell Laboratories researchers Stephen Balashek, R. Biddulph, and K. H. Davis released the first voice recognition device, called “Audrey”, which could recognise digits from a single voice. By the 1980s, plenty of progress had been made in this field with the introduction of the n-gram language model, a probabilistic language model that predicts the next item in a text or speech sequence from the previous n-1 items, that is, using a Markov model of order n-1.
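To make the n-gram idea concrete, here is a minimal sketch of the simplest useful case, a bigram model (n = 2), where the next item depends only on the one before it. The toy corpus of spoken digits and the function names `train_bigram` and `next_token_probs` are assumptions made for illustration; they are not taken from any historical system.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count how often each token follows each one-token context."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def next_token_probs(counts, prev):
    """Maximum-likelihood estimate of P(next | prev) from the raw counts."""
    total = sum(counts[prev].values())
    if total == 0:
        return {}  # context never seen in training
    return {tok: c / total for tok, c in counts[prev].items()}


# Hypothetical toy corpus, for illustration only.
corpus = "one two three one two four one two three".split()
model = train_bigram(corpus)

probs = next_token_probs(model, "two")
prediction = max(probs, key=probs.get)
print(prediction, probs)
```

A larger n conditions on a longer history, at the cost of needing far more data to see each context often enough.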
When researchers applied these techniques to multiple sound sources, two major challenges stood in the way of accurate separation:
- Non-stationarity of the speech signals and the surrounding environment.
- Reverberation in the acoustic setting.
The three most common approaches to tackling these challenges are:
- Beamforming: This technique uses spatial information from linearly arranged microphones to estimate the direction of each speaker relative to the microphone array.
- Blind source separation: BSS is based on independent component analysis (ICA) and therefore depends on statistical independence of signals.
- Single channel speech separation: SCSS is a highly complicated technique that aims to separate and deconvolve independent and individual sources from a single-channel mixture.
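As a rough illustration of the first approach, the sketch below implements a delay-and-sum beamformer, one of the simplest beamforming schemes. Everything in it is an assumption chosen for clarity: four microphones, two pure tones standing in for a target speaker and an interfering speaker, and integer sample delays that are known in advance. The interferer's delays are picked so that its misaligned copies cancel exactly; in a real room the interferer would only be attenuated.

```python
import math


def delay(signal, d):
    """Delay a signal by d samples (d >= 0), zero-padding the start."""
    return [0.0] * d + signal[:len(signal) - d]


def delay_and_sum(channels, steering_delays):
    """Advance each channel by its steering delay, then average the channels."""
    n = len(channels[0])
    out = [0.0] * n
    for ch, d in zip(channels, steering_delays):
        for t in range(n - d):
            out[t] += ch[t + d]
    return [v / len(channels) for v in out]


def power(x):
    """Mean squared value of a sequence."""
    return sum(v * v for v in x) / len(x)


n = 400
target = [math.sin(2 * math.pi * 0.05 * t) for t in range(n)]
interferer = [math.sin(2 * math.pi * 0.125 * t) for t in range(n)]

# Assumed arrival delays (in samples) at each of the four microphones.
target_delays = [0, 1, 2, 3]
interferer_delays = [0, 3, 6, 9]

channels = [
    [a + b for a, b in zip(delay(target, dt), delay(interferer, di))]
    for dt, di in zip(target_delays, interferer_delays)
]

# Steer the array towards the target: its copies line up, the interferer's do not.
output = delay_and_sum(channels, target_delays)

interior = slice(10, n - 10)  # skip edge samples affected by zero-padding
residual_before = power([c - s for c, s in zip(channels[0][interior], target[interior])])
residual_after = power([o - s for o, s in zip(output[interior], target[interior])])
print(residual_before, residual_after)
```

The interference power left in a single microphone (`residual_before`) is large, while the beamformed output (`residual_after`) is close to the clean target. The sketch assumes a static scene with no echoes, so it sidesteps both challenges listed above, non-stationarity and reverberation.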
Speech separation is essentially an advanced case of sound source separation. Humans appear to have an “innate” ability to separate sound sources from childhood. However, though it looks intrinsic, it is actually a case of conditioning, or training, since birth to separate the desired speech from the background noise. This is why humans can focus on a single sound source by merely turning towards it, even in the presence of background noise.
Applications and results
Recently, Meta AI researcher Eliya Nachmani and his team introduced SepIt, which can separate speech from 2, 3, 5, and 10 sources or speakers. It is a deep neural network that takes the SCSS approach and uses a general upper bound as a stopping criterion; the bound is obtained from the Cramér-Rao bound under an assumption about the nature of segments of speech.
Yi Luo adopted a similar approach in 2018 with Conv-TasNet, a fully convolutional, end-to-end time-domain sound separation network with an encoder and decoder.
MIT-IBM Watson AI Lab researcher Chuang Gan with his team developed an AI tool that essentially matches sound and visuals of the same source to separate similar sounds. In the project, the researchers used synchronised audio-video tracks of musicians playing piano to recreate how humans use multiple sensors to infer information. “Multi-sensory processing is the precursor to embodied intelligence and AI systems that can perform more complicated tasks,” said MIT professor Antonio Torralba.
DeLiang Wang, in his paper on supervised speech separation, argues that training a model much as text-to-image models are trained, with CNNs and a combination of images and descriptive text, could greatly improve speech separation as well. The argument makes a case for tackling the cocktail party problem by comparing the ASR scores and separation capabilities of each model against human speech intelligibility in the same conditions.
Wang proposes that speech separation should be treated as a supervised learning problem rather than studied in the traditional way, as a signal processing problem. If the model is trained using discriminative features and patterns of speakers and their speech, it might be possible to infer and remove noise from a recording. A paper by P. Nancy of Parisutham Institute of Technology also pointed out the need to train models on more acoustic conditions.
Why is the problem difficult to solve?
Meta AI’s ‘SepIt’ ranks highest on the speech separation leaderboard, but the size of the model indicates that improvements and further research are still needed. Though researchers are making great progress in speech separation and recognition using various methods, the biggest challenge is still inferring sounds as separate sources of speech rather than as a single speaker.
Training speech separation models on multiple sensory inputs, much as humans do and as the approaches above illustrate, could bring further progress. This could in turn improve the efficiency and accuracy of speech recognition and of separation-related innovations like voice assistants or even hearing aids.
Every source of sound has a different frequency, volume, and waveform through which it can be identified and separated. But this is far easier said than done without sacrificing accuracy. Separating two overlapping voices is far more challenging than understanding the speech of one speaker, since the possible combinations are almost infinite.
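The easy version of this idea can be shown in a few lines. The sketch below mixes two pure tones, which stand in for two sources, and separates them by keeping only the low or the high frequency bins of the mixture. The signal length, tone frequencies, and cutoff are all assumptions chosen so that the two sources never share a frequency bin, which is exactly what real overlapping voices do not allow.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (fine for short signals)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse DFT, returning the real part of the reconstructed signal."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


n = 64
source_a = [math.sin(2 * math.pi * 4 * t / n) for t in range(n)]         # low tone
source_b = [0.5 * math.sin(2 * math.pi * 10 * t / n) for t in range(n)]  # high, quieter tone
mixture = [a + b for a, b in zip(source_a, source_b)]

spectrum = dft(mixture)
cutoff = 7  # assumed boundary between the two sources, in frequency bins


def is_low(k):
    """Bins k and n - k represent the same frequency for a real signal."""
    return min(k, n - k) < cutoff


low_part = [v if is_low(k) else 0 for k, v in enumerate(spectrum)]
high_part = [0 if is_low(k) else v for k, v in enumerate(spectrum)]

estimate_a = idft(low_part)
estimate_b = idft(high_part)

max_err_a = max(abs(e - s) for e, s in zip(estimate_a, source_a))
max_err_b = max(abs(e - s) for e, s in zip(estimate_b, source_b))
print(max_err_a, max_err_b)
```

Here both reconstruction errors are negligible because the sources occupy disjoint frequencies. Two human voices overlap heavily in frequency and change from moment to moment, so no fixed cutoff exists, which is why learned, data-driven models are used instead.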