OpenAI, one of the best-known AI research labs, has been highly active in artificial intelligence, particularly in neural networks and reinforcement learning. Just a few days ago, the lab introduced Microscope for AI enthusiasts who are interested in exploring how neural networks work.
Now OpenAI's audio team has introduced Jukebox, a new machine learning model that generates music, including singing, in the raw audio domain. The model takes genre, artist, and lyrics as input and produces new music samples from scratch.
Over the past few years, generative modelling has made groundbreaking progress. One of its central goals is to capture the important features of the data and create new instances that are indistinguishable from the true data.
In this work, the researchers used state-of-the-art deep generative models to build a single system capable of generating diverse, high-fidelity music in the raw audio domain, with long-range coherence spanning multiple minutes. The researchers stated, “We chose to work on music because we want to continue to push the boundaries of generative models.”
Jukebox is a neural network model that generates music, including rudimentary singing, as raw audio in a variety of genres and artist styles. Unlike other music generation models, it models music directly as raw audio. Generating music at the audio level is challenging because the sequences involved are very long.
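To give a sense of how long these sequences are, the short calculation below counts the audio samples in a clip, assuming the 44.1 kHz sample rate mentioned later in this article and a single (mono) channel. The function name `timesteps` is only illustrative.

```python
SAMPLE_RATE_HZ = 44_100  # sample rate quoted in the article; mono audio assumed


def timesteps(duration_seconds: float, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Number of audio samples (timesteps) in a clip of the given length."""
    return int(round(duration_seconds * sample_rate_hz))


one_minute = timesteps(60)
four_minutes = timesteps(4 * 60)

print(one_minute)    # 2646000
print(four_minutes)  # 10584000
```

A four-minute track therefore runs to more than ten million timesteps, which is why modelling raw audio directly is so demanding.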
One way to mitigate the problem of long inputs is to use an autoencoder that compresses raw audio to a lower-dimensional space by discarding some of the perceptually irrelevant information. Jukebox’s autoencoder compresses audio to a discrete space using a quantisation-based approach called VQ-VAE (Vector Quantised Variational Autoencoder).
VQ-VAE downsamples extremely long inputs to a shorter discrete latent encoding using vector quantisation. Jukebox uses a hierarchical VQ-VAE architecture to compress audio into a discrete space, along with a loss function designed to retain the maximum amount of musical information.
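The vector quantisation step at the heart of a VQ-VAE can be sketched as a nearest-neighbour lookup: each continuous latent vector is replaced by the index of the closest entry in a codebook. The snippet below is a minimal illustration of that lookup only. The codebook values, latent vectors, and function names are made up for the example, and it does not reproduce Jukebox's encoder, decoder, hierarchy, or training loss.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantise(latents: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each latent vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
        for z in latents
    ]


def dequantise(codes: Sequence[int], codebook: Sequence[Vector]) -> List[List[float]]:
    """Replace each discrete code with its codebook vector."""
    return [list(codebook[k]) for k in codes]


# Hypothetical 2-D codebook with three entries, and four encoder outputs.
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0)]
latents = [(0.1, -0.2), (0.9, 1.2), (-0.8, 0.7), (0.6, 0.6)]

codes = quantise(latents, codebook)
print(codes)  # [0, 1, 2, 1]
print(dequantise(codes, codebook))
```

The sequence of integer codes is what a downstream generative model would work with, in place of the far longer raw waveform.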
According to the researchers, while previous work has generated raw audio music in the 20–30 second range, this new model can generate pieces that are multiple minutes long, with recognisable singing in natural-sounding voices.
To train the Jukebox model, the researchers crawled the web to curate a new dataset of 1.2 million songs, of which 600,000 are in English. The songs were paired with the corresponding lyrics and metadata from LyricWiki, where the metadata includes artist, album genre, and year of the songs, along with common moods or playlist keywords associated with each song. The model is trained on 32-bit, 44.1 kHz raw audio, and data augmentation is performed by randomly downmixing the right and left channels to produce mono audio.
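The article does not say exactly how the channels are mixed. One plausible reading, shown here purely as an assumption, is to draw a random weight for each training clip and blend the left and right channels with it. The function name `random_downmix` and the sample values are hypothetical.

```python
import random
from typing import List, Sequence


def random_downmix(
    left: Sequence[float], right: Sequence[float], rng: random.Random
) -> List[float]:
    """Blend stereo channels into mono with a random weight (assumed scheme)."""
    if len(left) != len(right):
        raise ValueError("left and right channels must have the same length")
    weight = rng.random()  # in [0, 1): share of the left channel in the mix
    return [weight * l + (1.0 - weight) * r for l, r in zip(left, right)]


rng = random.Random(0)  # seeded only so the example is repeatable
left = [1.0, 0.0, -1.0, 0.5]
right = [0.0, 1.0, 1.0, 0.5]

mono = random_downmix(left, right, rng)
print(mono)
```

Because the weight changes from clip to clip, the same stereo recording yields slightly different mono training examples each time it is seen.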
Limitations of This Model
The researchers noted that there is still a significant gap between the generated music and human-created music. Some of the limitations are listed below:
- The generated songs show local musical coherence, follow traditional chord patterns, and can even feature impressive solos, but they lack familiar larger musical structures such as choruses that repeat
- The downsampling and upsampling process introduces discernible noise. However, improving the VQ-VAE to capture more musical information would help reduce this issue
- Because sampling is autoregressive, generation is slow. According to the researchers, it takes approximately 9 hours to fully render one minute of audio, so the models cannot yet be used in interactive applications
- Currently, the model is trained on English lyrics and mostly Western music; songs in other languages have yet to be included
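To put the rendering time from the list above in perspective, the rough calculation below converts the quoted figure of about 9 hours per minute of audio into a slowdown relative to real time. It assumes, for illustration only, that rendering time scales linearly with clip length.

```python
RENDER_HOURS_PER_AUDIO_MINUTE = 9  # approximate figure quoted by the researchers


def slowdown_factor(hours_per_minute: int = RENDER_HOURS_PER_AUDIO_MINUTE) -> int:
    """How many times slower than real time the rendering is."""
    return hours_per_minute * 60  # minutes of compute per minute of audio


def render_hours(audio_minutes: int, hours_per_minute: int = RENDER_HOURS_PER_AUDIO_MINUTE) -> int:
    """Estimated hours to render a clip, assuming linear scaling with length."""
    return audio_minutes * hours_per_minute


print(slowdown_factor())  # 540
print(render_hours(4))    # 36
```

At roughly 540 times slower than real time, a four-minute song would take about a day and a half to render under this assumption.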
OpenAI has been working for a few years on generating audio samples conditioned on different kinds of priming information. With Jukebox, the researchers hope to improve the musicality of samples with unique lyrics and thus give musicians more control over the generations. They have released the model weights and code, along with a tool for exploring the generated samples.
This is not the first time that the San Francisco-based AI research laboratory has applied AI to create music. Last year, OpenAI introduced MuseNet, a deep neural network that can generate 4-minute musical compositions with 10 different instruments and combine styles from country to Mozart and the Beatles.