The annual Conference of the International Speech Communication Association (INTERSPEECH 2023) is underway in Dublin from 20-24 August, and Google is one of the notable contributors to the event. Natural Language Processing (NLP) has become the silent powerhouse in communication and understanding. From chatbots like ChatGPT to understanding the intricate threads of medical data for precise diagnoses, NLP’s influence is everywhere.
The researchers at Google will be presenting over two dozen research papers at the 24th edition of the conference. We’ve cherry-picked the best of the research the tech giant will be presenting at the event. Here goes…
The paper presents DeePMOS, a deep neural network approach for estimating speech signal quality. Unlike traditional methods, which output only a single score, DeePMOS predicts a distribution of mean opinion scores (MOS), characterized by its mean and spread. Robust training on limited, noisy human listener data is achieved through a mix of maximum-likelihood learning, stochastic gradient noise, and a student-teacher setup.
On standard metrics, DeePMOS performs comparably to existing methods that provide only point estimates, and the researchers report that their analysis underscores the method’s effectiveness.
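The core idea of predicting a per-utterance mean and spread can be sketched with a Gaussian negative log-likelihood objective; a minimal sketch, assuming the network head outputs a mean and a log-variance per utterance (the function names and toy numbers below are illustrative, not from the paper):

```python
import numpy as np

def gaussian_nll(mu, log_var, mos):
    """Negative log-likelihood of observed MOS labels under a predicted
    Gaussian N(mu, exp(log_var)), one distribution per utterance."""
    var = np.exp(log_var)
    return 0.5 * (log_var + (mos - mu) ** 2 / var + np.log(2 * np.pi))

# Toy example: two utterances, predicted distributions vs. listener labels.
mu = np.array([3.8, 2.1])        # predicted mean opinion scores
log_var = np.array([-1.0, 0.2])  # predicted log-variances (uncertainty)
mos = np.array([4.0, 2.5])       # noisy human listener ratings

loss = gaussian_nll(mu, log_var, mos).mean()
```

Minimizing this loss rewards the model both for predicting the right score and for reporting honest uncertainty on noisy labels.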
Authors: Xinyu Liang, Christian Schüldt, et al.
Re-investigating the Efficient Transfer Learning of Speech Foundation Model Using Feature Fusion Methods
The research examines speech foundation models for adapting to specific speech recognition tasks. Efficient fine-tuning methods are employed to adjust the models, and a feature fusion approach is proposed for enhanced transfer learning. Results show a 31.7% reduction in parameters and a 13.4% reduction in computational memory usage without compromising task quality.
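One common lightweight recipe for fusing foundation-model features is a learned softmax-weighted sum over layer outputs, where the fusion weights are the only new parameters. The sketch below illustrates that general idea, not the paper's exact method:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_layer_features(layer_feats, fusion_logits):
    """Fuse hidden states from several foundation-model layers into one
    representation via a softmax-weighted sum over layers."""
    weights = softmax(fusion_logits)              # shape: (num_layers,)
    # layer_feats: (num_layers, time, dim) -> fused: (time, dim)
    return np.tensordot(weights, layer_feats, axes=1)

# Toy example: 4 layers of 10-frame, 8-dim features.
rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 10, 8))
logits = np.zeros(4)              # uniform weights before any training
fused = fuse_layer_features(feats, logits)
```

With untrained (uniform) weights the fusion is just the layer average; training the logits lets the downstream task pick which layers matter.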
Authors: Zhouyuan Huo, Khe Chai Sim, Dongseong Hwang, Tsendsuren Munkhdalai, Tara N. Sainath, Pedro Moreno.
In the paper, researchers introduce LanSER, a method to enhance Speech Emotion Recognition (SER) models. The approach leverages large language models (LLMs) to deduce emotion labels from unlabeled data through weakly-supervised learning. The team further used a textual entailment approach to select the best emotion label for a speech transcript, maintaining a specific emotion taxonomy.
The experiments reveal that models pre-trained with this weak supervision surpass other baselines on standard SER datasets after fine-tuning, showcasing improved label efficiency. Surprisingly, these models capture speech prosody (how something is said), even though trained primarily on text-derived labels. This method addresses the challenge of costly labeled data in scaling SER to broader speech datasets and nuanced emotions.
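The entailment-based label selection can be sketched as scoring the transcript against one hypothesis per taxonomy label and keeping the best-entailed one. In the sketch below, a toy word-overlap scorer stands in for a real textual-entailment (NLI) model; all names and phrasings are illustrative:

```python
def select_emotion(transcript, taxonomy, entail_score):
    """Pick the taxonomy label whose hypothesis is most entailed by the
    transcript; `entail_score` stands in for a real NLI model."""
    hypotheses = {label: f"this person feels {label}" for label in taxonomy}
    return max(taxonomy, key=lambda lab: entail_score(transcript, hypotheses[lab]))

# Toy stand-in scorer based on word overlap; a real system would score
# entailment with an NLI model instead.
def word_overlap(premise, hypothesis):
    return len(set(premise.lower().split()) & set(hypothesis.lower().split()))

label = select_emotion("I am so happy about this news",
                       ["happy", "sad", "angry"], word_overlap)
```

Constraining the choice to a fixed taxonomy is what keeps the LLM-derived labels usable as weak supervision for a standard SER classifier.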
Authors: Josh Belanich, Krishna Somandepalli, Arsha Nagrani, et al.
The research introduces MD3, a new dataset of conversational speech representing English as spoken in India, Nigeria, and the United States. MD3 comprises more than 20 hours of audio and over 200,000 transcribed tokens.
Unlike prior datasets, MD3 combines free-flowing conversation and task-based dialogues, allowing for cross-dialect comparisons without limiting dialect features. The dataset sheds light on distinct syntax and discourse marker usage across dialects, providing valuable insights.
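As a toy illustration of the kind of cross-dialect comparison such a corpus supports, one can measure discourse-marker frequency per 1,000 tokens in each dialect's transcripts; the marker set below is illustrative, not from the paper:

```python
from collections import Counter

# Illustrative discourse markers; a real study would use a curated set.
MARKERS = {"like", "actually", "so", "well"}

def marker_rate(tokens):
    """Discourse-marker frequency per 1,000 tokens, a simple statistic
    for comparing transcripts across dialects."""
    counts = Counter(t.lower() for t in tokens)
    hits = sum(counts[m] for m in MARKERS)
    return 1000 * hits / len(tokens)

tokens = "so I was like totally going there".split()
rate = marker_rate(tokens)
```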
Authors: Jacob Eisenstein, Vinodkumar Prabhakaran, Clara Rivera, et al.
Accurately recognizing specific categories like names and dates is critical in Automatic Speech Recognition (ASR), and handling this personal information ethically, from collection to evaluation, is equally important. While redacting Personally Identifiable Information (PII) protects privacy, it can hurt ASR accuracy. In this study, the researchers boosted PII recognition by injecting fake substitutes into the training data, improving ‘Name’ and ‘Date’ recall in medical notes as well as overall Word Error Rate. For alphanumeric sequences, Character Error Rate and Sentence Accuracy also improve.
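A minimal sketch of the injection idea, assuming PII spans in transcripts are marked with placeholder tags; the tag format and substitute pools below are hypothetical, not the paper's actual procedure:

```python
import random
import re

# Illustrative substitute pools (hypothetical, not from the paper).
FAKE_NAMES = ["Alex Morgan", "Priya Patel", "Sam Okafor"]
FAKE_DATES = ["March 3rd", "July 19th", "December 1st"]

def inject_fake_pii(transcript, rng=random):
    """Replace <name>/<date> placeholders with synthetic substitutes so the
    ASR model sees many PII-like examples without real personal data."""
    transcript = re.sub(r"<name>", lambda _: rng.choice(FAKE_NAMES), transcript)
    transcript = re.sub(r"<date>", lambda _: rng.choice(FAKE_DATES), transcript)
    return transcript

augmented = inject_fake_pii("patient <name> was seen on <date>")
```

Sampling a fresh substitute per occurrence gives the model varied name and date surface forms while keeping real PII out of training.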
Authors: Yochai Blau, Rohan Agrawal, Lior Madmony, et al.
The paper introduces an advanced model that transcribes speech into the International Phonetic Alphabet (IPA) for any language. Despite being trained on a comparatively small dataset, the model achieves comparable or better results and nearly matches human annotator quality for universal speech-to-IPA conversion.
This eases the time-consuming language documentation process, especially for endangered languages. Building on the previous Wav2Vec2Phoneme model, it’s based on wav2vec 2.0 and fine-tuned to predict IPA from audio. Training data from CommonVoice 11.0, covering seven languages, was semi-automatically transcribed into IPA, resulting in a smaller yet higher-quality dataset compared to Wav2Vec2Phoneme.
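Downstream of such a CTC-trained model, frame-level predictions are collapsed into an IPA string with standard greedy CTC decoding (merge repeated symbols, then drop blanks). A toy sketch with an illustrative vocabulary, not the model's actual symbol inventory:

```python
def ctc_greedy_decode(frame_ids, id_to_ipa, blank_id=0):
    """Collapse frame-level CTC predictions into an IPA string:
    merge consecutive repeats, then drop blank symbols."""
    collapsed = []
    prev = None
    for i in frame_ids:
        if i != prev:
            collapsed.append(i)
        prev = i
    return "".join(id_to_ipa[i] for i in collapsed if i != blank_id)

# Toy vocabulary and per-frame argmax predictions (0 is the CTC blank).
vocab = {0: "", 1: "h", 2: "ə", 3: "l", 4: "oʊ"}
frames = [1, 1, 0, 2, 2, 3, 0, 3, 4, 4]
ipa = ctc_greedy_decode(frames, vocab)  # -> "həlloʊ"
```

The blank between the two 3s is what lets CTC emit a genuinely doubled symbol rather than collapsing it.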
Authors: Chihiro Taguchi, Parisa Haghani, et al.
You can find the entire list of papers presented by Google at INTERSPEECH 2023 here.