
This New AI Model Can Convert Silent Words Into Audible Speech

Ambika Choudhury

Recently, researchers from UC Berkeley introduced a new AI model that converts silently mouthed words into audible speech. The task of digitally voicing silent speech is based on electromyography (EMG) sensor measurements that capture muscle impulses. The researchers claim to be the first to train on EMG collected during silently articulated speech.

According to the researchers, digitally voicing silent speech has a broad array of potential applications. They introduced a method of training on silent EMG by transferring audio targets from vocalised to silent signals. Using muscular sensor measurements of speech articulator movement, they aimed to capture silent speech – utterances articulated without producing sound.
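Transferring audio targets from vocalised to silent recordings requires aligning the two signals in time. The article does not describe the alignment procedure, but one standard way to align two feature sequences is dynamic time warping; the sketch below is an illustrative assumption, not the authors' implementation, and the feature representation is left abstract.

```python
import numpy as np

def dtw_align(silent_feats, vocalised_feats):
    """Align two feature sequences with dynamic time warping.

    Returns, for each silent frame, the index of a matching vocalised
    frame, so that audio targets can be transferred across recordings.
    Both inputs are (num_frames, feature_dim) arrays.
    """
    n, m = len(silent_feats), len(vocalised_feats)
    # Pairwise Euclidean distances between frames.
    cost = np.linalg.norm(
        silent_feats[:, None, :] - vocalised_feats[None, :, :], axis=-1
    )
    # Accumulated cost using the standard DTW recurrence.
    acc = np.full((n, m), np.inf)
    acc[0, 0] = cost[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best_prev = min(
                acc[i - 1, j] if i > 0 else np.inf,
                acc[i, j - 1] if j > 0 else np.inf,
                acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf,
            )
            acc[i, j] = cost[i, j] + best_prev
    # Backtrack from the end to recover the warping path.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0:
            candidates.append((acc[i - 1, j], (i - 1, j)))
        if j > 0:
            candidates.append((acc[i, j - 1], (i, j - 1)))
        if i > 0 and j > 0:
            candidates.append((acc[i - 1, j - 1], (i - 1, j - 1)))
        _, (i, j) = min(candidates, key=lambda c: c[0])
        path.append((i, j))
    path.reverse()
    # Keep the first matched vocalised frame for each silent frame.
    mapping = {}
    for i, j in path:
        mapping.setdefault(i, j)
    return [mapping[i] for i in range(n)]
```

With the frame mapping in hand, the speech features of the matched vocalised frames can serve as training targets for the silent EMG frames.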

How This Research Is Different 

Several researchers have previously tried to convert EMG signals to speech. However, that work focused on the artificial task of recovering audio from EMG recorded during vocalised speech, rather than the end-goal task of generating audio from silent speech. The new research differs in that it generates audible speech from silently articulated speech. The researchers stated, “In particular, we focus on the task which we call digital voicing, or generating synthetic speech to be transmitted or played back.”

The Mechanism Behind

According to the researchers, silent speech is detected using electromyography (EMG). They collected EMG measurements during both vocalised speech – normal speech production with voicing, friction and other speech sounds – and silent speech – speech-like articulations that produce no sound.

In order to capture information about articulator movement, the researchers used surface electromyography (EMG), which uses electrodes placed on the skin to measure electrical potentials caused by nearby muscle activity. By placing electrodes around the face and neck, they were able to capture signals from the muscles of the speech articulators.
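Raw surface-EMG signals are noisy and contain low-frequency motion artefacts, so a common preprocessing step before feature extraction is band-pass filtering. The article does not specify the preprocessing used; the sampling rate and cutoff frequencies below are illustrative assumptions typical for surface EMG.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def bandpass_emg(signal, fs=1000.0, low=20.0, high=450.0, order=4):
    """Zero-phase band-pass filter for one raw surface-EMG channel.

    fs   : sampling rate in Hz (assumed 1 kHz here)
    low  : high-pass cutoff, removes drift and motion artefacts
    high : low-pass cutoff, removes high-frequency noise
    """
    nyq = fs / 2.0
    # Normalise cutoffs to the Nyquist frequency for butter().
    b, a = butter(order, [low / nyq, high / nyq], btype="band")
    # filtfilt applies the filter forward and backward (zero phase).
    return filtfilt(b, a, signal)
```

Each of the electrode channels around the face and neck would be filtered this way before computing the EMG features fed to the model.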

The researchers created a new dataset of silent and vocalised facial EMG measurements for this project: EMG signals and time-aligned audio collected from a single speaker during both silent and vocalised speech, totalling nearly 20 hours of facial EMG recordings.

They stated, “To our knowledge, the largest public EMG-speech dataset previously available contains just two hours of data, and many papers continue to use private datasets.” They added, “We hope that this public release will encourage development on the task and allow for fair comparisons between methods.”

The Tech Behind

According to the researchers, the method is built around a recurrent neural transduction model that maps EMG features to time-aligned speech features. To generate audio from the predicted speech features, they used a WaveNet decoder, which generates audio sample by sample, conditioned on Mel-frequency cepstral coefficient (MFCC) speech features.
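The sample-by-sample generation loop of such a conditional vocoder can be sketched generically. The predictor function below stands in for a trained WaveNet-style network, and the number of audio samples per MFCC frame is an illustrative assumption; this is the control flow, not the network itself.

```python
import numpy as np

def autoregressive_vocoder(mfcc_frames, predict_next_sample,
                           samples_per_frame=256):
    """Generate audio one sample at a time, conditioned on MFCC frames.

    `predict_next_sample(history, cond)` is a placeholder for a trained
    WaveNet-style network: given the audio generated so far and the
    current conditioning frame, it returns the next audio sample.
    """
    audio = []
    for frame in mfcc_frames:
        # Each conditioning frame governs a fixed window of samples.
        for _ in range(samples_per_frame):
            audio.append(predict_next_sample(audio, frame))
    return np.array(audio)
```

In the real system the predictor is a deep dilated-convolution network; the loop structure, however, is what makes generation autoregressive.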


The first step in converting EMG input signals to audio outputs is a bidirectional LSTM that converts between feature representations of the signals. The model consists of three bidirectional LSTM layers with 1024 hidden units, followed by a linear projection to the speech feature dimension.
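The architecture described above – three bidirectional LSTM layers with 1024 hidden units followed by a linear projection – can be sketched in PyTorch. The EMG and speech feature dimensions below are placeholder assumptions, since the article does not state them.

```python
import torch
import torch.nn as nn

class EMGToSpeechFeatures(nn.Module):
    """Maps EMG features to time-aligned speech features (e.g. MFCCs)."""

    def __init__(self, emg_dim=112, speech_dim=26, hidden=1024):
        super().__init__()
        # Three stacked bidirectional LSTM layers, 1024 units each.
        self.lstm = nn.LSTM(
            input_size=emg_dim,
            hidden_size=hidden,
            num_layers=3,
            bidirectional=True,
            batch_first=True,
        )
        # Project the concatenated forward/backward states down to
        # the speech feature dimension.
        self.proj = nn.Linear(2 * hidden, speech_dim)

    def forward(self, emg_feats):
        # emg_feats: (batch, time, emg_dim)
        out, _ = self.lstm(emg_feats)
        return self.proj(out)  # (batch, time, speech_dim)
```

The bidirectional layers let each predicted speech frame draw on EMG context both before and after it, which suits an offline transduction setting.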

Applications Of This Model

The researchers stated that the model improved the intelligibility of audio generated from silent EMG compared to a baseline trained only on vocalised data.

This AI model has several important applications, such as:

  • The model can enable speech-like communication without any produced sound. 
  • It can be used to create a device analogous to a Bluetooth headset that allows people to carry on phone conversations without disrupting those around them.
  • It can be useful in settings where the environment is too loud to capture audible speech or where maintaining silence is important.
  • The tool can also be used by people who are no longer able to produce audible speech, such as individuals whose larynx has been removed due to trauma or disease.
  • Digital voicing for silent speech can be useful as a component technology for creating silent speech-to-text systems, making silent speech accessible to devices and digital assistants by leveraging existing high-quality audio-based speech-to-text systems.

Read the paper here.

Copyright Analytics India Magazine Pvt Ltd