AI has transformed synthesized speech from the monotone of robocalls and decades-old GPS navigation systems to the polished tone of virtual assistants in smartphones and smart speakers. But there is still a gap between AI-synthesized speech and the human speech we hear in daily conversation, because people speak with complex rhythm, intonation and timbre that is challenging for AI to emulate. Researchers from NVIDIA are building models and tools for high-quality, controllable speech synthesis that capture the richness of human speech without audio artefacts. These models can help voice automated customer service lines for banks and retailers, bring book or video game characters to life, and provide real-time speech synthesis for digital avatars.
Bryan Catanzaro, VP of Applied Deep Learning Research at NVIDIA, held a press briefing to share the NVIDIA projects that will be showcased at the Interspeech 2021 conference, from 30 August to 3 September 2021. Interspeech is a technical conference devoted to speech processing and applications, focusing on interdisciplinary approaches to all elements of speech science and technology, from fundamental theories to sophisticated applications.
Here is a brief summary of the papers presented:
Paper 1: Scene Agnostic Multi-Microphone Speech Dereverberation
When a sound wave travels through an acoustic enclosure, it is reflected by the room’s facets as well as the objects within it. As a result, a microphone in that room captures both the direct-path signal and the accompanying reflections. This phenomenon, known as reverberation, harms speech quality and, in extreme circumstances, intelligibility, making speech difficult to understand for hard-of-hearing people and automated speech recognition (ASR) systems. The paper presented a neural network (NN) architecture that can cope with microphone arrays in which the number and positions of the microphones are unknown, and demonstrated its applicability to the speech dereverberation task.
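To make the idea of reverberation concrete, here is a minimal sketch that models a single microphone signal as the dry speech convolved with a room impulse response. The impulse response values are hypothetical and chosen only for illustration; this shows the distortion that dereverberation tries to undo, not the network from the paper.

```python
def reverberate(dry, impulse_response):
    """Convolve a dry signal with a room impulse response.

    The first tap of the impulse response stands for the direct path;
    later taps stand for delayed, attenuated reflections.
    """
    out = [0.0] * (len(dry) + len(impulse_response) - 1)
    for n, sample in enumerate(dry):
        for k, tap in enumerate(impulse_response):
            out[n + k] += sample * tap
    return out


# Hypothetical room: direct path, then two weaker echoes 2 and 4 samples later.
impulse_response = [1.0, 0.0, 0.5, 0.0, 0.25]
dry = [1.0, 0.0, 0.0, 2.0]

wet = reverberate(dry, impulse_response)
print(wet)
```

Each input sample is smeared over later samples in the output, which is why reverberant speech sounds blurred to listeners and ASR systems alike.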
Paper 2: Compressing 1D Time-Channel Separable Convolutions using Sparse Random Ternary Matrices
Speech and command recognition tasks require low latency, which is difficult to achieve with deep neural networks. Edge devices also have bandwidth and energy constraints, so running such networks on them is costly in practice. Sparsity and quantization are effective options for reducing model size and computing cost while remaining compatible with current hardware. The paper introduced random ternary 1×1 convolutions that improve the efficiency of speech residual networks, significantly reducing memory and computational cost during training and inference.
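As a rough illustration of why ternary weights are cheap, the sketch below builds a sparse random matrix with entries in {-1, 0, +1} and applies it as a 1×1 (pointwise) convolution using only additions and subtractions. The function names, the `sparsity` parameter and the seeding scheme are assumptions made for this example; the paper's exact construction is not described here.

```python
import random


def random_ternary_matrix(rows, cols, sparsity, seed=0):
    """Draw entries from {-1, 0, +1}; `sparsity` is the probability of a zero."""
    rng = random.Random(seed)
    return [
        [0 if rng.random() < sparsity else rng.choice((-1, 1)) for _ in range(cols)]
        for _ in range(rows)
    ]


def pointwise_conv(weights, frames):
    """Apply a 1x1 convolution: mix channels independently at each time step.

    frames[t][c] is input channel c at time t.
    weights[o][c] connects input channel c to output channel o.
    """
    out = []
    for frame in frames:
        mixed = []
        for w_row in weights:
            acc = 0.0
            for w, x in zip(w_row, frame):
                if w == 1:
                    acc += x  # no multiplication needed
                elif w == -1:
                    acc -= x
                # w == 0: the connection is pruned and costs nothing
            mixed.append(acc)
        out.append(mixed)
    return out


weights = random_ternary_matrix(rows=4, cols=8, sparsity=0.5, seed=1)
frames = [[float(c) for c in range(8)] for _ in range(3)]  # 3 time steps, 8 channels
outputs = pointwise_conv(weights, frames)
```

Because the matrix can be regenerated from its seed, a scheme like this would only need to store the seed rather than the weights, which is one plausible source of the memory savings.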
Paper 3: SPGISpeech: 5,000 hours of transcribed financial audio for fully formatted end-to-end speech recognition
In the English speech-to-text (STT) task, acoustic model output typically lacks orthography (the conventional spelling and formatting system of a language). These models usually render uncased Latin characters and omit standard features of English text such as punctuation, capitalization, and other formatting information. This poses problems for downstream NLP tasks such as natural language understanding, neural machine translation, and summarization. The presented paper introduced a new end-to-end task of fully formatted speech recognition, in which the acoustic model learns to predict complete English orthography.
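The difference between the two kinds of transcript can be seen with a small sketch that strips the orthography from a formatted sentence. The sentence and the helper name `strip_orthography` are made up for illustration and are not taken from the SPGISpeech data or tooling.

```python
import string


def strip_orthography(text):
    """Lowercase the text and drop ASCII punctuation.

    This approximates an unformatted transcript; a fully formatted
    model would be trained to predict the original string instead.
    """
    kept = [ch for ch in text.lower() if ch not in string.punctuation]
    return " ".join("".join(kept).split())


formatted = "Yes, Dr. Smith joined Acme Corp. in March."
unformatted = strip_orthography(formatted)
print(unformatted)  # yes dr smith joined acme corp in march
```

Going from the formatted to the unformatted string is trivial, but the reverse direction requires knowing where sentences end and which words are names or abbreviations, which is the information downstream NLP systems lose.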
Paper 4: TalkNet 2
Researchers proposed TalkNet, a non-autoregressive convolutional neural model for speech synthesis with explicit pitch and duration prediction. The model has three feed-forward convolutional networks.
- The first network predicts grapheme durations. An input text is then expanded by repeating each symbol according to the predicted duration.
- The second network predicts a pitch value for every mel frame.
- The third network generates a mel-spectrogram from the expanded text conditioned on the predicted pitch.
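The expansion step between the first and third networks can be sketched as follows. The durations and pitch values are hand-written stand-ins for what the predictor networks would output, and the function names are hypothetical rather than part of TalkNet's code.

```python
def expand_by_duration(symbols, durations):
    """Repeat each input symbol for its predicted number of mel frames."""
    if len(symbols) != len(durations):
        raise ValueError("need exactly one duration per symbol")
    expanded = []
    for symbol, frames in zip(symbols, durations):
        expanded.extend([symbol] * frames)
    return expanded


def attach_pitch(expanded, pitch):
    """Pair every expanded frame with its pitch value (one per mel frame)."""
    if len(expanded) != len(pitch):
        raise ValueError("need exactly one pitch value per frame")
    return list(zip(expanded, pitch))


# Stand-in predictions: 'h' lasts 2 frames, 'i' lasts 3 frames.
expanded = expand_by_duration(list("hi"), [2, 3])
# Stand-in pitch contour in Hz, one value per frame.
conditioned = attach_pitch(expanded, [110.0, 112.0, 120.0, 125.0, 123.0])
print(expanded)  # ['h', 'h', 'i', 'i', 'i']
```

After expansion the text and the spectrogram have the same length, so a spectrogram generator can produce every frame in parallel, which is what makes a model of this kind non-autoregressive.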
Paper 5: Hi-Fi Multi-Speaker English TTS Dataset
Neural network (NN) based text-to-speech (TTS) systems can synthesize speech that sounds very close to natural speech. However, TTS models usually require a few dozen hours of speech per speaker, recorded in a professional studio, to produce high-quality speech. The dataset introduced in the paper is based on Project Gutenberg texts and LibriVox audiobooks, both available in the public domain. The new dataset contains about 292 hours of speech from ten speakers, with at least 17 hours per speaker, sampled at 44.1 kHz.
Paper 6: NeMo Inverse Text Normalization: From Development To Production
The research paper introduced an open-source Python library for inverse text normalization (ITN) based on weighted finite-state transducers (WFSTs), enabling a seamless path from development to production. The researchers describe the specification of ITN grammar rules for English; the library can also be adapted for other languages and used for written-to-spoken text normalization. The research evaluated the NeMo ITN library using a modified version of the Google Text Normalization dataset. NeMo is an open-source Python toolkit for GPU-accelerated conversational AI that gives researchers, developers, and creators a head start in experimenting with and fine-tuning speech models for their applications.
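To show what inverse text normalization does, here is a toy converter that rewrites spoken-form cardinal numbers below 100 as digits. It uses plain Python dictionaries rather than WFST grammars, and the function `inverse_normalize` is a hypothetical name for this example; it is not the NeMo API.

```python
# Lookup tables for the spoken forms this toy example understands.
UNITS = {
    word: value
    for value, word in enumerate(
        "zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split()
    )
}
TENS = {
    word: 10 * (index + 2)
    for index, word in enumerate(
        "twenty thirty forty fifty sixty seventy eighty ninety".split()
    )
}


def inverse_normalize(spoken):
    """Replace spoken cardinals 0-99 with digits; leave other tokens as-is."""
    tokens = spoken.split()
    written = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token in TENS:
            value = TENS[token]
            # A following unit word (one..nine) joins the tens word.
            if i + 1 < len(tokens) and 1 <= UNITS.get(tokens[i + 1], 0) <= 9:
                value += UNITS[tokens[i + 1]]
                i += 1
            written.append(str(value))
        elif token in UNITS:
            written.append(str(UNITS[token]))
        else:
            written.append(token)
        i += 1
    return " ".join(written)


print(inverse_normalize("the meeting starts in twenty three minutes"))
# the meeting starts in 23 minutes
```

A production grammar has to cover dates, currency, ordinals and context-dependent cases (this toy version would turn "one" into "1" everywhere), which is the kind of rule set the paper specifies for English.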