Microsoft’s DeepSinger Generates Voices That Can Sing In English and Chinese

Ambika Choudhury

A team of researchers from Microsoft and Zhejiang University recently developed a multi-lingual multi-singer singing voice synthesis (SVS) system known as DeepSinger. The system is built from scratch using singing training data mined from music websites.

Singing voice synthesis (SVS) generates singing voices from lyrics. With the advancement of deep neural networks, it has attracted considerable attention from both the research and industrial communities in recent years. The technique is similar to text-to-speech, which enables machines to speak.

Traditional SVS mostly relies on human recordings and annotations. It requires a large number of high-quality singing recordings as training data, as well as strict alignment between lyrics and singing audio for accurate singing modelling. This increases the cost of data labelling and impedes research and product development in this area. These ongoing challenges led to the development of a new SVS system, DeepSinger.

Behind DeepSinger

DeepSinger is a singing voice synthesis system built from scratch using singing training data mined from music websites. Its pipeline consists of several data mining and modelling steps:

  • Data crawling: To obtain a large number of songs, the researchers crawled tens of thousands of songs by top singers, along with their lyrics, from a music website. The songs cover three languages: Chinese, Cantonese and English.
  • Singing and Accompaniment Separation: Spleeter, a popular music separation tool, is used to separate the singing voices from the song accompaniments.
  • Lyrics-to-Singing Alignment: An alignment model is built to segment the audio into sentences and extract the singing duration of each phoneme in lyrics.
  • Data Filtration: The aligned lyrics and singing voices are then filtered according to their confidence scores in alignment.
  • Singing Modelling: A singing model based on FastSpeech, a feed-forward Transformer, is built. It leverages a reference encoder to handle noisy data.
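The data filtration step above can be illustrated with a minimal sketch. The article does not say how confidence scores are computed or which cut-off is used, so the `AlignedSegment` type, the 0 to 1 score scale and the `min_confidence` threshold below are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AlignedSegment:
    """One aligned sentence of lyrics and singing (hypothetical structure)."""
    song_id: str
    lyrics: str
    confidence: float  # assumed alignment confidence in [0, 1], higher is better


def filter_segments(segments: List[AlignedSegment],
                    min_confidence: float = 0.8) -> List[AlignedSegment]:
    """Keep only segments whose alignment confidence meets the threshold.

    The 0.8 default is an illustrative value, not one taken from the paper.
    """
    return [seg for seg in segments if seg.confidence >= min_confidence]


if __name__ == "__main__":
    mined = [
        AlignedSegment("song-001", "first line of lyrics", 0.93),
        AlignedSegment("song-001", "badly aligned line", 0.41),
        AlignedSegment("song-002", "another clean line", 0.87),
    ]
    for seg in filter_segments(mined):
        print(seg.song_id, seg.lyrics, seg.confidence)
```

Filtering on a single score keeps the step cheap enough to run over tens of thousands of crawled songs, at the cost of discarding some usable audio whenever the threshold is set conservatively.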

The researchers designed a lyrics-to-singing alignment model, based on automatic speech recognition, that automatically extracts the duration of each phoneme in the lyrics. It works from the coarse-grained sentence level down to the fine-grained phoneme level.
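To make the idea of extracting phoneme durations more concrete, here is a minimal sketch of one common approach: given a matrix of frame-to-phoneme alignment weights, find the best monotonic path with dynamic programming and count the frames assigned to each phoneme. The article does not describe the exact procedure DeepSinger uses, so the function name `extract_durations`, the input layout and the scoring rule are assumptions for illustration.

```python
from typing import List


def extract_durations(attn: List[List[float]]) -> List[int]:
    """Return the number of frames assigned to each phoneme.

    attn[t][p] is the (assumed) weight that audio frame t belongs to
    phoneme p. The path must start at the first phoneme, end at the last,
    and either stay on a phoneme or advance by one at every frame.
    """
    num_frames, num_phonemes = len(attn), len(attn[0])
    if num_frames < num_phonemes:
        raise ValueError("need at least one frame per phoneme")

    neg_inf = float("-inf")
    # best[t][p]: best total weight of a monotonic path ending at (t, p)
    best = [[neg_inf] * num_phonemes for _ in range(num_frames)]
    # advanced[t][p]: True if the best path reached (t, p) from phoneme p - 1
    advanced = [[False] * num_phonemes for _ in range(num_frames)]
    best[0][0] = attn[0][0]

    for t in range(1, num_frames):
        for p in range(min(t + 1, num_phonemes)):
            stay = best[t - 1][p]
            advance = best[t - 1][p - 1] if p > 0 else neg_inf
            if advance > stay:
                best[t][p] = advance + attn[t][p]
                advanced[t][p] = True
            else:
                best[t][p] = stay + attn[t][p]

    # Walk back from the last frame and last phoneme, counting frames.
    durations = [0] * num_phonemes
    p = num_phonemes - 1
    for t in range(num_frames - 1, -1, -1):
        durations[p] += 1
        if advanced[t][p]:
            p -= 1
    return durations


if __name__ == "__main__":
    weights = [
        [0.9, 0.1, 0.0],
        [0.8, 0.2, 0.0],
        [0.1, 0.8, 0.1],
        [0.0, 0.2, 0.8],
        [0.0, 0.1, 0.9],
    ]
    print(extract_durations(weights))
```

The monotonic constraint reflects the fact that lyrics are sung in order, so every phoneme receives at least one frame and the durations always sum to the number of frames.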

Further, they designed a multi-lingual multi-singer singing model based on FastSpeech, a feed-forward Transformer, that generates linear spectrograms directly from lyrics. The voices are then synthesised with Griffin-Lim, a popular vocoder that reconstructs audio from linear spectrograms.
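FastSpeech-style models bridge the gap between a short phoneme sequence and a much longer spectrogram by repeating each phoneme's representation for as many frames as its duration, a step usually called a length regulator. The sketch below shows that expansion on plain Python lists. How DeepSinger wires this step internally is not covered in the article, so the function name `length_regulate` and the example values are assumptions.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def length_regulate(phoneme_features: Sequence[T],
                    durations: Sequence[int]) -> List[T]:
    """Repeat each phoneme feature for its duration in frames.

    In a real model each feature would be a hidden-state vector; any
    Python object works for this illustration.
    """
    if len(phoneme_features) != len(durations):
        raise ValueError("one duration is required per phoneme")
    frames: List[T] = []
    for feature, duration in zip(phoneme_features, durations):
        if duration < 0:
            raise ValueError("durations must be non-negative")
        frames.extend([feature] * duration)
    return frames


if __name__ == "__main__":
    # Hypothetical phonemes and frame counts, for illustration only.
    print(length_regulate(["n", "i", "h", "ao"], [2, 5, 1, 6]))
```

Because the output length is fixed by the durations, all frames can be generated in parallel rather than one at a time, which is the main reason feed-forward models of this kind are fast.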

Advantages of DeepSinger

DeepSinger has a number of advantages over previous SVS systems:

  • According to the researchers, DeepSinger is time-efficient as it directly mines training data from music websites.
  • It requires no human effort for alignment labelling, which keeps labelling costs down.
  • DeepSinger is simpler and more efficient than previous SVS systems.
  • It can synthesise singing voices in several languages and for multiple singers.

Contributions of This Research

The contributions of this paper are as follows:

  • DeepSinger is the first SVS system built from data directly mined from the web, without any high-quality singing data recorded by humans.
  • The lyrics-to-singing alignment model removes the need for human alignment labelling and greatly reduces labelling cost.
  • The FastSpeech-based singing model is simple and efficient: it removes the complicated acoustic feature modelling of parametric synthesis and leverages a reference encoder to capture the timbre of a singer from noisy singing data.
  • DeepSinger can synthesise high-quality singing voices in multiple languages and for multiple singers.

Wrapping Up

To evaluate the effectiveness of DeepSinger, the researchers used a singing dataset mined purely from the web, comprising 92 hours of data from 89 singers in three languages. According to the researchers, the experimental results showed that DeepSinger can synthesise high-quality singing voices in terms of both pitch accuracy and voice naturalness.

Read the paper here.
