Google Unveils MusicLM, A Music DALL-E

The team also released MusicCaps, a high-quality dataset of 5.5k music-text pairs prepared by musicians.

Google has released MusicLM, a generative model for creating high-fidelity music from text descriptions, such as “a calming violin melody supported by a distorted guitar riff”. MusicLM generates music at 24 kHz that remains consistent over several minutes by casting conditional music generation as a hierarchical sequence-to-sequence modelling task.

According to the researchers' experiments, MusicLM outperforms previous systems in both audio quality and adherence to the text description. It can also be conditioned on both text and a melody, transforming whistled and hummed melodies according to the style described in a text caption.

Google also unveiled MusicCaps, the first evaluation dataset collected specifically for the task of text-to-music generation. It is a hand-curated, high-quality dataset of 5.5k music-text pairs prepared by musicians.

Read the full paper here.

Key Features

MusicLM can create music from a text description. If the audio of a melody is also provided, it can generate new music that follows that melody in the style requested by the prompt. In one example, it turned a hummed rendition of ‘Bella Ciao’ into an a cappella chorus. It can also generate audio that progresses through a sequence of prompts like a story, and produce music from the text descriptions of paintings.

Training Process

MusicLM generates audio in stages, and each stage is modelled as a sequence-to-sequence task using decoder-only Transformers.

During training, MuLan audio tokens, semantic tokens, and acoustic tokens are extracted from the audio-only training set.

In the semantic modelling stage, semantic tokens are predicted using MuLan audio tokens as conditioning.

In the subsequent acoustic modelling stage, the model predicts acoustic tokens conditioned on both MuLan audio tokens and semantic tokens.

During inference, MuLan text tokens computed from the text prompt are used as the conditioning signal, and the generated audio tokens are converted to waveforms using the SoundStream decoder.
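The flow of data through these stages at inference time can be illustrated with a toy sketch. It is not Google's implementation: every function below is a hypothetical stand-in that uses simple arithmetic in place of a trained model, and the function names, token ranges, and sequence lengths are assumptions chosen only to show which stage feeds which.

```python
from typing import List

Tokens = List[int]


def mulan_text_tokens(prompt: str) -> Tokens:
    """Hypothetical stand-in for MuLan's text side: prompt -> conditioning tokens."""
    return [ord(ch) % 16 for ch in prompt][:8]


def semantic_stage(conditioning: Tokens, length: int) -> Tokens:
    """Toy stage 1: 'predicts' semantic tokens from the MuLan tokens alone."""
    seed = sum(conditioning)
    return [(seed + i) % 32 for i in range(length)]


def acoustic_stage(conditioning: Tokens, semantic: Tokens) -> Tokens:
    """Toy stage 2: 'predicts' acoustic tokens from MuLan and semantic tokens."""
    seed = sum(conditioning)
    return [(seed * 3 + s) % 64 for s in semantic]


def soundstream_decode(acoustic: Tokens) -> List[float]:
    """Toy stand-in for the SoundStream decoder: tokens -> samples in [-1, 1]."""
    return [(a / 63.0) * 2.0 - 1.0 for a in acoustic]


def generate(prompt: str, length: int = 10) -> List[float]:
    """Chain the stages in the order the article describes."""
    conditioning = mulan_text_tokens(prompt)
    semantic = semantic_stage(conditioning, length)
    acoustic = acoustic_stage(conditioning, semantic)
    return soundstream_decode(acoustic)


if __name__ == "__main__":
    waveform = generate("a calming violin melody", length=10)
    print(len(waveform), "toy samples generated")
```

In the real system each stand-in would be a learned model; the sketch only preserves the ordering, in which the text conditioning reaches both stages and the semantic tokens feed the acoustic stage.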


Some limitations of the method are inherited from MuLan, in that the model misunderstands negations and does not adhere to the precise temporal ordering described in the text. 

The Music DALL-E

Just as DALL-E 2 uses CLIP for text encoding, MusicLM relies on a joint music-text embedding model, MuLan, for the same purpose. But unlike DALL-E 2, which uses a diffusion model as a decoder, MusicLM’s decoder is based on AudioLM.
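The idea behind a joint embedding model is that text and audio are mapped into the same vector space, so that a description lies close to the music it describes. The sketch below illustrates that idea with cosine similarity. The three-dimensional vectors and clip names are invented for illustration and are not outputs of any real model.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_match(text_vec: List[float], audio_vecs: Dict[str, List[float]]) -> str:
    """Return the name of the audio clip whose embedding is closest to the text."""
    return max(audio_vecs, key=lambda name: cosine_similarity(text_vec, audio_vecs[name]))


# Hypothetical embeddings: in a joint space, matching text and audio point the same way.
text_embedding = [0.9, 0.1, 0.0]  # imagined embedding of "a calming violin melody"
audio_embeddings = {
    "violin_clip": [0.8, 0.2, 0.1],
    "drum_clip": [0.0, 0.3, 0.9],
}

if __name__ == "__main__":
    print(best_match(text_embedding, audio_embeddings))
```

Because both modalities share one space, a model trained on audio embeddings can be steered by a text embedding at inference time, which is the role the article attributes to the music-text model.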

Two weeks ago, Microsoft released VALL-E, a new language model approach for text-to-speech synthesis (TTS) that uses audio codec codes as intermediate representations. It demonstrated in-context learning capabilities in zero-shot scenarios after being pre-trained on 60,000 hours of English speech data.

However, Google has announced it will not make MusicLM available to the public due to potential risks. These include biases in the training data that could lead to the underrepresentation of some cultures and to cultural appropriation, technical errors, and the unauthorized use of creative content.

Shritama Saha
Shritama (she/her) is a technology journalist at AIM who is passionate about exploring the influence of AI on different domains, including fashion, healthcare and banking.
