Microsoft’s DeepSinger Generates Voices That Can Sing In English and Chinese

A team of researchers from Microsoft and Zhejiang University recently developed a multi-lingual multi-singer singing voice synthesis (SVS) system known as DeepSinger. The system is built from scratch using singing training data mined from music websites.

With the advancement of deep neural networks, singing voice synthesis (SVS), which generates singing voices from lyrics, has attracted considerable attention from both the research and industrial communities in recent years. The technique is similar to text-to-speech, which enables machines to speak.

Traditional SVS relies mostly on human recording and annotation. It requires a large number of high-quality singing recordings as training data, as well as strict alignment between lyrics and singing audio for accurate singing modelling. As a result, data labelling is costly, which impedes research and product development in this area. These ongoing challenges led to the development of a new SVS system, DeepSinger.

Behind DeepSinger

DeepSinger is a singing voice synthesis system built from scratch using singing training data mined from music websites. Its pipeline consists of several data mining and modelling steps:

  • Data Crawling: To obtain a large number of songs from the Internet, the researchers crawled tens of thousands of songs by top singers, along with their lyrics, from a music website. The songs cover three languages: Chinese, Cantonese and English.
  • Singing and Accompaniment Separation: Spleeter, a popular music separation tool, is used to separate singing voices from song accompaniments.
  • Lyrics-to-Singing Alignment: An alignment model is built to segment the audio into sentences and extract the singing duration of each phoneme in the lyrics.
  • Data Filtration: The aligned lyrics and singing voices are then filtered according to their alignment confidence scores.
  • Singing Modelling: A singing model based on FastSpeech, a feed-forward Transformer, is built. It leverages a reference encoder to handle noisy data.
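
The data filtration step can be pictured as a simple threshold on alignment confidence. The following is a minimal sketch, not the researchers' code: the `AlignedSegment` record, its fields, the assumption that confidence lies between 0 and 1, and the 0.8 threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AlignedSegment:
    """Hypothetical record for one aligned sentence of singing audio."""
    song_id: str
    lyrics: str
    confidence: float  # alignment confidence; assumed here to lie in [0, 1]


def filter_segments(segments: List[AlignedSegment], threshold: float) -> List[AlignedSegment]:
    """Keep only the segments whose alignment confidence meets the threshold."""
    return [seg for seg in segments if seg.confidence >= threshold]


segments = [
    AlignedSegment("song_a", "first line", 0.93),
    AlignedSegment("song_a", "second line", 0.41),
    AlignedSegment("song_b", "another line", 0.80),
]
kept = filter_segments(segments, threshold=0.8)
print([seg.lyrics for seg in kept])
```

In practice the threshold trades dataset size against alignment quality, so it would need to be tuned on the mined data.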

The researchers designed a lyrics-to-singing alignment model based on automatic speech recognition. It automatically extracts the duration of each phoneme in the lyrics, working from the coarse-grained sentence level down to the fine-grained phoneme level.
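
Once an alignment assigns a phoneme to every audio frame, the duration of each phoneme is simply the length of its run of consecutive frames. A minimal sketch under that assumption follows. The frame-level input format, the function name `phoneme_durations` and the 10 ms hop size are illustrative and not details taken from the paper.

```python
from itertools import groupby
from typing import List, Tuple


def phoneme_durations(frame_labels: List[str], hop_ms: float = 10.0) -> List[Tuple[str, int, float]]:
    """Collapse per-frame phoneme labels into (phoneme, n_frames, milliseconds)."""
    durations = []
    for phoneme, run in groupby(frame_labels):
        n_frames = sum(1 for _ in run)  # length of this run of identical labels
        durations.append((phoneme, n_frames, n_frames * hop_ms))
    return durations


# Hypothetical frame-level alignment for the word "hello".
frames = ["HH", "HH", "AH", "AH", "AH", "AH", "L", "OW", "OW", "OW"]
result = phoneme_durations(frames)
for phoneme, n_frames, ms in result:
    print(f"{phoneme}: {n_frames} frames ({ms:.0f} ms)")
```

Note that this run-length approach merges two adjacent occurrences of the same phoneme into one, so a real system would track phoneme positions rather than labels alone.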

Further, they designed a multi-lingual multi-singer singing model based on FastSpeech, a feed-forward Transformer, to generate linear spectrograms directly from lyrics. The voices are then synthesised with Griffin-Lim, a popular algorithm for reconstructing audio from linear spectrograms.
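
FastSpeech-style models bridge the gap between a short phoneme sequence and a much longer spectrogram by repeating each phoneme's representation for as many frames as its duration. The article does not spell out this step for DeepSinger, so the sketch below is an assumption about how extracted durations could feed such a model. The function name `expand_by_duration` is hypothetical, and plain strings stand in for the hidden vectors a real model would use.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def expand_by_duration(items: Sequence[T], durations: Sequence[int]) -> List[T]:
    """Repeat each item by its duration (in frames) to build a frame-level sequence."""
    if len(items) != len(durations):
        raise ValueError("items and durations must have the same length")
    if any(d < 0 for d in durations):
        raise ValueError("durations must be non-negative")
    expanded: List[T] = []
    for item, n_frames in zip(items, durations):
        expanded.extend([item] * n_frames)
    return expanded


phonemes = ["HH", "AH", "L", "OW"]
frame_sequence = expand_by_duration(phonemes, [2, 4, 1, 3])
print(len(frame_sequence), frame_sequence)
```

The expanded sequence has one entry per output frame, which is what lets a feed-forward model predict all spectrogram frames in parallel.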

Advantages of DeepSinger

DeepSinger has a number of advantages over previous SVS systems:

  • According to the researchers, DeepSinger is time-efficient because it mines training data directly from music websites.
  • It requires no human effort for alignment labelling, which keeps costs down.
  • DeepSinger is simpler and more efficient than previous SVS systems.
  • It can synthesise singing voices in several languages and for multiple singers.

Contributions of This Research

The contributions of the paper are as follows:

  • DeepSinger is the first SVS system built from data directly mined from the web, without any high-quality singing data recorded by humans.
  • The lyrics-to-singing alignment model requires no human effort for alignment labelling and greatly reduces labelling cost.
  • The FastSpeech-based singing model is simple and efficient: it removes the complicated acoustic feature modelling used in parametric synthesis and leverages a reference encoder to capture the timbre of a singer from noisy singing data.
  • DeepSinger can synthesise high-quality singing voices in multiple languages and for multiple singers.

Wrapping Up

To evaluate the effectiveness of DeepSinger, the researchers used a singing dataset mined entirely from the web, comprising 92 hours of data from 89 singers in three languages. According to the researchers, the experimental results showed that DeepSinger can synthesise high-quality singing voices in terms of both pitch accuracy and voice naturalness.

Read the paper here.

Ambika Choudhury
A Technical Journalist who loves writing about Machine Learning and Artificial Intelligence. A lover of music, writing and learning something out of the box.
