Meta’s MusicGen is an AI model that uses the Transformer architecture to generate new pieces of music from text prompts. It can also align the generated music with an existing melody, making it a versatile tool for music composition.
Like a language model, MusicGen predicts the next token in a sequence, except that its tokens represent short segments of audio rather than words in a sentence. This autoregressive approach lets it generate coherent, structured compositions.
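The loop below is a toy illustration of that idea, not MusicGen's actual code: a made-up `toy_step` function stands in for the trained Transformer, and each new audio token is sampled conditioned on all the tokens generated so far.

```python
import random

def generate_tokens(model_step, prompt, n_new, seed=0):
    """Autoregressive generation: repeatedly sample the next audio token
    given everything generated so far (the same loop shape a language
    model uses for words)."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(n_new):
        probs = model_step(tokens)  # distribution over the token vocabulary
        nxt = rng.choices(range(len(probs)), weights=probs)[0]
        tokens.append(nxt)
    return tokens

# Stand-in "model": strongly prefers the token after the last one, modulo 4.
def toy_step(tokens):
    probs = [0.05, 0.05, 0.05, 0.05]
    probs[(tokens[-1] + 1) % 4] = 0.85
    return probs

print(generate_tokens(toy_step, [0], 5))
```

A real model would replace `toy_step` with a Transformer forward pass, but the sampling loop is structurally the same.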
The audio used for training is compressed into discrete tokens by Meta’s EnCodec audio tokeniser, which represents each moment of sound with several parallel codebooks. MusicGen interleaves these codebooks so the model can predict tokens from all of them in a single step, making generation efficient and fast.
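As a rough sketch of why this parallelism helps, the snippet below reproduces the "delay" interleaving pattern described in the MusicGen paper on plain Python lists; the token values and the `PAD` placeholder are made up for illustration.

```python
PAD = -1  # placeholder where a codebook has not started yet / has already finished

def delay_pattern(codes):
    """Offset codebook k by k steps so that, at each generation step, the
    model predicts one token from every codebook at once instead of K
    sequential tokens per time step."""
    K, T = len(codes), len(codes[0])
    steps = T + K - 1
    out = [[PAD] * steps for _ in range(K)]
    for k in range(K):
        for t in range(T):
            out[k][t + k] = codes[k][t]
    return out

codes = [[10, 11, 12],   # codebook 0
         [20, 21, 22],   # codebook 1
         [30, 31, 32]]   # codebook 2
for row in delay_pattern(codes):
    print(row)
# [10, 11, 12, -1, -1]
# [-1, 20, 21, 22, -1]
# [-1, -1, 30, 31, 32]
```

Flattening the three codebooks sequentially would take K×T = 9 steps; the delayed layout needs only T + K − 1 = 5.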
MusicGen was trained on 20,000 hours of licensed music, including 10,000 high-quality tracks from an internal dataset as well as music from Shutterstock and Pond5. This extensive dataset gives the model exposure to a diverse range of musical styles and compositions.
One of the key features of MusicGen is its ability to handle both text and music prompts. The text prompt sets the basic style, which is then matched with the melody from the audio file. For example, by combining a text prompt describing a specific style of music with the melody of a famous composition, MusicGen can generate a new piece of music that reflects the desired style.
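To condition on a melody, MusicGen does not use the raw audio but a chromagram, i.e. the melody reduced to the 12 pitch classes over time. Below is a minimal sketch of that reduction, assuming the melody is already given as one dominant frequency per frame; this is illustrative, not Meta's implementation.

```python
import math

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def pitch_class(freq_hz):
    """Map a frequency to one of 12 pitch classes (semitones relative to A4 = 440 Hz)."""
    semitones = round(12 * math.log2(freq_hz / 440.0))
    return (semitones + 9) % 12  # shift so that index 0 is C

def chroma_sequence(freqs):
    """Reduce a melody (one dominant frequency per frame) to pitch-class names."""
    return [PITCH_CLASSES[pitch_class(f)] for f in freqs]

melody = [261.63, 329.63, 392.00, 523.25]  # C4, E4, G4, C5
print(chroma_sequence(melody))  # ['C', 'E', 'G', 'C']
```

Because the chromagram discards timbre and octave information, the same conditioning signal can be paired with very different text prompts, which is what lets a familiar melody reappear in a new style.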
It is important to note that while MusicGen can follow a prompt as a rough guideline, it does not offer precise control over how closely the output tracks the reference melody, nor a reliable way to hear one melody rendered in different styles. The generated output is a creative interpretation rather than an exact replication.
In terms of performance, the researchers experimented with different sizes of the model, ranging from 300 million to 3.3 billion parameters. They found that larger models generally produced higher quality audio, but the 1.5 billion parameter model was rated the best by human evaluators. The 3.3 billion parameter model excelled in accurately matching text input with audio output.
When compared with other music models such as Riffusion, Mousai, MusicLM, and Noise2Music, MusicGen performs better on both objective and subjective metrics, which evaluate how well the music matches the text description and how plausible the composition sounds. Overall, MusicGen ranks above Google’s MusicLM, and it could very well be the Stable Diffusion moment for music.
Meta has released the code and models for MusicGen as open source on GitHub, allowing researchers and commercial users to access and build on the technology. This move encourages further development, collaboration, and innovation in AI-generated music. A demo of MusicGen is also available on Hugging Face, offering a hands-on look at its capabilities.