Recently, researchers from UC Berkeley introduced a new AI model that can convert silently mouthed words into audible speech. The task, called digitally voicing silent speech, relies on electromyography (EMG) sensor measurements that capture muscle impulses as words are mouthed. The researchers claim to be the first to train on EMG collected during silently articulated speech.
According to the researchers, digitally voicing silent speech has a broad array of potential applications. They introduced a method of training on silent EMG by transferring audio targets from vocalised recordings to silent ones. By measuring the movement of the speech articulators with muscle sensors, they aimed to capture silent speech: utterances articulated without producing sound.
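Transferring audio targets from vocalised to silent signals requires aligning two sequences of different lengths. The paper's exact alignment procedure aside, a minimal dynamic time warping (DTW) sketch illustrates the core idea; the function name, toy feature tracks, and distance choice below are illustrative assumptions, not the authors' code.

```python
import numpy as np

def dtw_align(a, b):
    """Return a monotonic alignment path between feature sequences
    a (Ta, D) and b (Tb, D) minimising cumulative Euclidean cost."""
    Ta, Tb = len(a), len(b)
    cost = np.full((Ta + 1, Tb + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],
                                 cost[i, j - 1],
                                 cost[i - 1, j - 1])
    # Backtrack from the end to recover the warping path.
    i, j, path = Ta, Tb, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# Toy example: a 7-frame "silent" feature track aligned to a 5-frame
# "vocalised" one; each aligned pair lets a vocalised audio target be
# transferred onto a silent frame.
silent = np.linspace(0, 1, 7).reshape(-1, 1)
vocal = np.linspace(0, 1, 5).reshape(-1, 1)
path = dtw_align(silent, vocal)
print(path[0], path[-1])  # → (0, 0) (6, 4)
```

The key property is that the path is monotonic in both sequences, so targets are transferred in order without skipping backwards.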
How This Research Is Different
Previously, several researchers have tried to convert EMG signals to speech. However, those efforts focused on the artificial task of recovering audio from EMG recorded during vocalised speech, rather than the end goal of generating audio from silent speech. This new research differs in that it generates audible speech from EMG recorded while the speaker mouths words silently. The researchers stated, “In particular, we focus on the task which we call digital voicing, or generating synthetic speech to be transmitted or played back.”
The Mechanism Behind
According to the researchers, silent speech is detected using electromyography (EMG). They collected EMG measurements during both vocalised speech (normal speech production with voicing, friction, and other speech sounds) and silent speech (speech-like articulations that do not produce any sound).
To capture information about articulator movement, the researchers used surface electromyography (EMG), which uses electrodes placed on top of the skin to measure electrical potentials caused by nearby muscle activity. By placing electrodes around the face and neck, they were able to capture signals from the muscles of the speech articulators.
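Before being fed to a model, raw EMG waveforms are typically cut into short overlapping windows and summarised with simple per-window statistics. The snippet below is a generic illustration of that kind of featurisation; the window length, hop size, and the two chosen statistics are assumptions for the sketch, not the paper's exact feature set.

```python
import numpy as np

def frame_features(signal, win=32, hop=16):
    """Slice a 1-D EMG channel into overlapping windows and compute two
    classic time-domain features per window: mean absolute value and
    zero-crossing count. Returns an array of shape (frames, 2)."""
    frames = []
    for start in range(0, len(signal) - win + 1, hop):
        w = signal[start:start + win]
        mav = np.mean(np.abs(w))               # mean absolute value
        zc = np.sum(np.diff(np.sign(w)) != 0)  # zero crossings
        frames.append([mav, zc])
    return np.array(frames)

# One second of fake EMG at 1 kHz: background noise with a burst of
# "muscle activity" in the middle.
rng = np.random.default_rng(0)
emg = rng.normal(0, 0.1, 1000)
emg[400:600] += rng.normal(0, 1.0, 200)
feats = frame_features(emg)
print(feats.shape)  # → (61, 2)
```

Frames overlapping the burst show a clearly higher mean absolute value, which is the kind of activity signature a downstream model can exploit.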
For this project, the researchers created a new dataset of silent and vocalised facial EMG measurements: EMG signals and time-aligned audio collected from a single speaker during both silent and vocalised speech. In total, the dataset contains nearly 20 hours of facial EMG signals.
They stated, “To our knowledge, the largest public EMG-speech dataset previously available contains just two hours of data, and many papers continue to use private datasets.” They added, “We hope that this public release will encourage development on the task and allow for fair comparisons between methods.”
The Tech Behind
According to the researchers, the method is built around a recurrent neural transduction model that maps EMG features to time-aligned speech features. To generate audio from the predicted speech features, they used a WaveNet decoder, which generates audio sample by sample, conditioned on Mel-frequency cepstral coefficient (MFCC) speech features.
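The phrase "sample by sample conditioned on speech features" describes autoregressive generation: each output sample depends on previously generated samples plus the current conditioning frame. The toy loop below shows only that control flow; the linear-plus-tanh predictor with random weights is a placeholder assumption, not WaveNet (which uses trained dilated causal convolutions).

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for a trained conditional decoder: predicts the next sample
# from a short history of past samples plus the current conditioning
# (MFCC-like) frame. Weights are random placeholders for illustration.
HIST, COND_DIM = 16, 13
w_hist = rng.normal(0, 0.1, HIST)
w_cond = rng.normal(0, 0.1, COND_DIM)

def next_sample(history, cond_frame):
    return np.tanh(history @ w_hist + cond_frame @ w_cond)

def generate(cond_frames, samples_per_frame=80):
    """Autoregressive generation: each conditioning frame drives
    samples_per_frame output samples, one at a time."""
    audio = [0.0] * HIST  # zero-padded warm-up history
    for frame in cond_frames:
        for _ in range(samples_per_frame):
            h = np.array(audio[-HIST:])
            audio.append(float(next_sample(h, frame)))
    return np.array(audio[HIST:])

mfcc_like = rng.normal(0, 1, (10, COND_DIM))  # 10 fake conditioning frames
audio = generate(mfcc_like)
print(audio.shape)  # → (800,)
```

The inner loop is what makes WaveNet-style decoding slow at inference time: every sample must wait for the one before it.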
The first step in converting the EMG input signals to audio outputs is a bidirectional LSTM that converts between featurised versions of the signals. The model consists of three bidirectional LSTM layers with 1,024 hidden units each, followed by a linear projection to the speech feature dimension.
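The shape-level structure of this stack can be sketched with a single toy bidirectional layer in plain numpy; the tiny dimensions stand in for the paper's three layers of 1,024 units, and all weights are random placeholders rather than trained parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_pass(x, W, U, b, reverse=False):
    """Run one LSTM direction over x of shape (T, D); returns (T, H)."""
    T = x.shape[0]
    H = U.shape[0]
    h, c = np.zeros(H), np.zeros(H)
    steps = range(T - 1, -1, -1) if reverse else range(T)
    out = np.zeros((T, H))
    for t in steps:
        z = x[t] @ W + h @ U + b            # all four gates at once
        i, f, g, o = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        c = f * c + i * np.tanh(g)
        h = o * np.tanh(c)
        out[t] = h
    return out

rng = np.random.default_rng(2)
T, D, H, SPEECH_DIM = 50, 8, 16, 26         # toy sizes, not the paper's

def params():
    return (rng.normal(0, 0.1, (D, 4 * H)),
            rng.normal(0, 0.1, (H, 4 * H)),
            np.zeros(4 * H))

emg_feats = rng.normal(0, 1, (T, D))        # featurised EMG input
fwd = lstm_pass(emg_feats, *params())
bwd = lstm_pass(emg_feats, *params(), reverse=True)
hidden = np.concatenate([fwd, bwd], axis=1)  # (T, 2H) bidirectional state
proj = rng.normal(0, 0.1, (2 * H, SPEECH_DIM))
speech_feats = hidden @ proj                 # linear projection
print(speech_feats.shape)  # → (50, 26)
```

Because the forward and backward passes are concatenated, each output frame can draw on both past and future EMG context before the projection to speech features.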
Applications Of This Model
The researchers stated that the AI model improves the intelligibility of audio generated from silent EMG compared to a baseline trained only on vocalised data.
This AI model has several important applications, such as:
- The model can enable speech-like communication without any produced sound.
- It can be used to create a device analogous to a Bluetooth headset that allows people to carry on phone conversations without disrupting those around them.
- It can be useful in settings where the environment is too loud to capture audible speech or where maintaining silence is important.
- Also, this AI tool can be used by people who are no longer able to produce audible speech, such as individuals whose larynx has been removed due to trauma or disease.
- Digital voicing for silent speech can be useful as a component technology for creating silent speech-to-text systems, making silent speech accessible to devices and digital assistants by leveraging existing high-quality audio-based speech-to-text systems.
Read the paper here.