Last updated November 11, 2021

What To Expect From NVIDIA At Interspeech 2021

Researchers from NVIDIA are building models and tools for high-quality, controllable speech synthesis that capture the richness of human speech without audio artefacts.

Published on September 1, 2021
by Kumar Gandharv

AI has transformed synthesized speech from the monotone of robocalls and decades-old GPS navigation systems to the polished tone of virtual assistants in smartphones and smart speakers. But there is still a gap between AI-synthesized speech and the human speech we hear in daily conversation, because people speak with complex rhythm, intonation and timbre that is challenging for AI to emulate. Researchers from NVIDIA are building models and tools for high-quality, controllable speech synthesis that capture the richness of human speech without audio artefacts. These models can help voice automated customer service lines for banks and retailers, bring book or video game characters to life, and provide real-time speech synthesis for digital avatars.

Bryan Catanzaro, VP of Applied Deep Learning Research at NVIDIA, held a press briefing to share the NVIDIA projects that will be showcased at the Interspeech 2021 conference, held from 30 August to 3 September 2021. Interspeech is a technical conference devoted to speech processing and applications, focusing on interdisciplinary approaches to all elements of speech science and technology, from fundamental theories to sophisticated applications.

Here is a brief overview of the papers presented:

Paper 1: Scene Agnostic Multi-Microphone Speech Dereverberation

When a sound wave travels through an acoustic enclosure, it is reflected by the room’s facets as well as the objects within it. As a result, a microphone in that room captures both the direct-path signal and the accompanying reflections. This phenomenon, known as reverberation, harms speech quality and, in extreme circumstances, intelligibility, making speech difficult to understand for hard-of-hearing people and automatic speech recognition (ASR) systems. The paper presented a neural network architecture that can cope with microphone arrays whose number and positions are unknown, and demonstrated its applicability to the speech dereverberation task.
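To make the reverberation effect concrete, here is a minimal sketch that uses the standard signal-processing model in which the microphone signal is the source signal convolved with a room impulse response. It illustrates the problem only, not the paper's dereverberation method; the sample values and the names `dry`, `room` and `wet` are made up for illustration.

```python
def convolve(signal, impulse_response):
    """Full discrete convolution of two sequences of samples."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out

# Hypothetical "dry" source signal: two samples of direct sound.
dry = [1.0, 0.5]

# Hypothetical room impulse response: the direct path at delay 0,
# followed by two weaker reflections at delays of 2 and 3 samples.
room = [1.0, 0.0, 0.5, 0.25]

# What the microphone records: the direct sound smeared by reflections.
wet = convolve(dry, room)
print(wet)  # [1.0, 0.5, 0.5, 0.5, 0.125]
```

The recorded signal is longer than the source and its later samples mix the direct sound with delayed copies, which is the smearing a dereverberation model tries to undo.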

Paper 2: Compressing 1D Time-Channel Separable Convolutions using Sparse Random Ternary Matrices

Speech and command recognition tasks require low latency, which is difficult to achieve with deep neural networks. Edge devices also have bandwidth and energy constraints, so deploying such networks on them is costly in practice. Sparsity and quantization approaches are effective options for lowering the model size and computing cost while remaining compatible with current hardware. The paper introduced random ternary 1×1-convolutions that improve the efficiency of speech residual networks, significantly reducing memory and computational cost during training and inference.
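The idea of a sparse random ternary 1×1 convolution can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the function names, the sparsity level and the seed-based generation are assumptions made here. A 1×1 convolution mixes channels independently at each time step, and with weights restricted to -1, 0 and +1 the mixing needs only additions and subtractions.

```python
import random

def random_ternary_matrix(out_channels, in_channels, sparsity, seed):
    """Build a matrix with entries in {-1, 0, +1}.

    `sparsity` is the probability that an entry is zero (an assumed knob).
    """
    rng = random.Random(seed)
    matrix = []
    for _ in range(out_channels):
        row = []
        for _ in range(in_channels):
            if rng.random() < sparsity:
                row.append(0)
            else:
                row.append(rng.choice((-1, 1)))
        matrix.append(row)
    return matrix

def pointwise_conv(weights, frames):
    """Apply a 1x1 convolution: mix channels at each time step.

    `frames` is a list of time steps, each a list of channel values.
    Ternary weights mean no multiplications are needed.
    """
    output = []
    for frame in frames:
        mixed = []
        for row in weights:
            total = 0.0
            for w, x in zip(row, frame):
                if w == 1:
                    total += x
                elif w == -1:
                    total -= x
            mixed.append(total)
        output.append(mixed)
    return output

weights = random_ternary_matrix(out_channels=4, in_channels=8, sparsity=0.5, seed=0)
frames = [[0.1 * c for c in range(8)] for _ in range(3)]  # 3 time steps, 8 channels
features = pointwise_conv(weights, frames)
```

Because the matrix here is fully determined by its seed, it could in principle be regenerated on demand instead of stored. Whether and how the paper exploits this is not stated in this article.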

Paper 3: SPGISpeech: 5,000 hours of transcribed financial audio for fully formatted end-to-end speech recognition

In the English speech-to-text (STT) task, acoustic model output typically lacks orthography (the conventional spelling system of a language). These models usually render uncased Latin characters, and standard features of English text such as punctuation, capitalization, and other formatting information are omitted. This poses problems for downstream NLP tasks such as natural language understanding, neural machine translation, and summarization. The paper introduced a new end-to-end task of fully formatted speech recognition, in which the acoustic model learns to predict complete English orthography.
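To show what is lost when orthography is dropped, the sketch below reduces a formatted sentence to the uncased, unpunctuated form described above. The function name `strip_orthography` and the example sentence are invented for illustration, and the snippet is unrelated to the paper's model; a fully formatted recognizer aims to produce the first string directly rather than the second.

```python
import string

def strip_orthography(text):
    """Lowercase the text and drop ASCII punctuation, collapsing whitespace."""
    lowered = text.lower()
    kept = "".join(ch for ch in lowered if ch not in string.punctuation)
    return " ".join(kept.split())

formatted = "Good morning, everyone. Is the outlook for Europe still positive?"
plain = strip_orthography(formatted)
print(plain)  # good morning everyone is the outlook for europe still positive
```

The sentence boundary, the question mark and the capitalized proper noun all disappear, and a downstream system would have to guess them back.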

Paper 4: TalkNet 2

Researchers proposed TalkNet, a non-autoregressive convolutional neural model for speech synthesis with explicit pitch and duration prediction. The model has three feed-forward convolutional networks.

The first network predicts grapheme durations. The input text is then expanded by repeating each symbol according to its predicted duration.
The second network predicts a pitch value for every mel frame.
The third network generates a mel-spectrogram from the expanded text conditioned on the predicted pitch.
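The expansion step between the first and third networks can be sketched as follows. This is a minimal illustration, assuming durations are integer frame counts; the function name `expand_by_duration` and the duration values are hypothetical and do not come from the TalkNet code.

```python
def expand_by_duration(symbols, durations):
    """Repeat each input symbol by its predicted number of mel frames."""
    if len(symbols) != len(durations):
        raise ValueError("need exactly one duration per symbol")
    expanded = []
    for symbol, frames in zip(symbols, durations):
        expanded.extend([symbol] * frames)
    return expanded

graphemes = list("hi")
predicted_durations = [2, 3]  # hypothetical output of the duration network

expanded = expand_by_duration(graphemes, predicted_durations)
print(expanded)  # ['h', 'h', 'i', 'i', 'i']
```

After expansion the sequence has one entry per mel frame, so the per-frame pitch values from the second network line up with it one to one.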

Paper 5: Hi-Fi Multi-Speaker English TTS Dataset

Neural network (NN) based text-to-speech (TTS) systems can synthesize speech that sounds very close to natural speech. However, TTS models typically require a few dozen hours of speech per speaker, recorded in a professional studio, to produce high-quality speech. The dataset introduced in the paper is based on Project Gutenberg texts and LibriVox audiobooks, both available in the public domain. It contains about 292 hours of speech from ten speakers, with at least 17 hours per speaker, sampled at 44.1 kHz.

Paper 6: NeMo Inverse Text Normalization: From Development To Production

The paper introduced an open-source Python library for inverse text normalization (ITN) based on weighted finite-state transducers (WFSTs), enabling a seamless path from development to production. The researchers describe the specification of ITN grammar rules for English, though the library can be adapted for other languages. It can also be used for written-to-spoken text normalization. The researchers evaluated the NeMo ITN library using a modified version of the Google Text Normalization dataset. NeMo is an open-source Python toolkit for GPU-accelerated conversational AI that gives researchers, developers, and creators a head start in experimenting with and fine-tuning speech models for their applications.
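Inverse text normalization converts spoken-form text, as an ASR system would emit it, into written form. The toy sketch below shows the idea for numbers below 100 using plain dictionary lookups. It is not a WFST and does not use the NeMo library or its API; the function name `inverse_normalize` and the word tables are assumptions made for illustration.

```python
# Hypothetical word tables covering only a small subset of English numbers.
UNITS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}

def inverse_normalize(spoken):
    """Rewrite spelled-out numbers such as 'twenty three' as digits."""
    tokens = spoken.split()
    written = []
    i = 0
    while i < len(tokens):
        word = tokens[i]
        if word in TENS:
            value = TENS[word]
            # Absorb a following unit word: "twenty" + "three" -> 23.
            if i + 1 < len(tokens) and tokens[i + 1] in UNITS:
                value += UNITS[tokens[i + 1]]
                i += 1
            written.append(str(value))
        elif word in UNITS:
            written.append(str(UNITS[word]))
        else:
            written.append(word)
        i += 1
    return " ".join(written)

print(inverse_normalize("we shipped twenty three units"))  # we shipped 23 units
```

A lookup like this has no notion of context, so it would also rewrite the "one" in "no one knows". Handling such cases is part of what a full set of grammar rules is for.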


Kumar Gandharv

Kumar Gandharv, PGD in English Journalism (IIMC, Delhi), is setting out on a journey as a tech journalist at AIM. He is a keen observer of national and IR-related news.
