The study of the science of harmony can be traced back to Pythagoras. The Pythagorean tuning system is commonly credited to him, while the Greek mathematician Euclid documented numerous experiments on rational tuning, and Euler, too, took an interest in the aesthetics of chords. The creative ingenuity behind the most remarkable pieces of music can be discussed in terms of string lengths and other proposals, but why a certain sound pleases the ear remains a mystery. While physicists, mathematicians and philosophers try to work out a universal theory of harmony, the artificial intelligence community has taken on the ambitious task of not only understanding composition but building algorithms that compose music. A couple of years ago, Google released a machine learning model called Coconet, trained on more than 300 works by the German composer Bach; the model was fed incomplete scores as input and learned to fill in the missing notes.
Music composition (or generation) is the process of creating or writing a new piece of music. To compose the way humans do, a deep learning model has to harmonize melodies, create smooth transitions, rewrite music, and compose from scratch. But to accomplish anything remotely close to what musicians do, researchers first have to ask themselves a few fundamental questions:
- Are the current DL models capable of generating music with a certain level of creativity?
- What is the best neural network architecture to perform music composition?
- Can a DL model generate entire structured music pieces?
- Can DL models compose music that is totally different from training data?
- How much data do DL models need for music generation?
- Should neural networks compose music by following the same logic and process as humans do?
To answer these questions, researchers from the University of Zaragoza, Spain, compared the human composition process with the deep learning music generation process, along with the artistic and creative characteristics of the generated music.
The researchers liken music composition to the uniquely human capacity to understand and produce an indefinitely large number of sentences in a language, including sentences never encountered or spoken before. This perspective gave them a starting point for designing an AI-based music composition algorithm.

But before diving deep into AI-based composition, let's look at how a human composer arrives at a melody. In classical music, the researchers observe, one starts with a small unit of one or two bars called a motif and develops it into a melody or musical phrase; in styles such as pop or jazz, it is more common to take a chord progression and compose or improvise a melody over it.
Music is usually defined as a succession of pitches or rhythms, or both, arranged in definite patterns. A musical score can then be treated as a three-dimensional object: in a four-voice score, the music for each voice can be represented as a two-dimensional array with time extending horizontally and pitches laid out vertically.
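As a rough illustration of this representation (a minimal sketch, not taken from the paper; the voice count, pitch range and note values are arbitrary assumptions), a four-voice score can be stored as a stack of two-dimensional piano-roll arrays:

```python
import numpy as np

# Hypothetical piano-roll encoding: 4 voices, 128 MIDI pitches, 32 time steps.
# score[v, p, t] == 1 means voice v sounds MIDI pitch p at time step t, so each
# voice slice has pitch on the vertical axis and time on the horizontal axis.
N_VOICES, N_PITCHES, N_STEPS = 4, 128, 32
score = np.zeros((N_VOICES, N_PITCHES, N_STEPS), dtype=np.int8)

# Example: the soprano (voice 0) holds C5 (MIDI 72) for the first four steps,
# while the bass (voice 3) holds C3 (MIDI 48) underneath.
score[0, 72, 0:4] = 1
score[3, 48, 0:4] = 1

soprano_roll = score[0]   # a 2-D array of shape (128 pitches, 32 time steps)
print(soprano_roll.shape)
```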
To build a model, these arrays can be used to construct a probability distribution over notes. According to the researchers, the polyphonic melodies generated by early neural networks (RNNs and LSTMs) lacked quality harmonic content. Networks trained to generate music could not capture the many intricacies of the language of music, and encoding that information in tokens to be fed into a neural network is a challenge in itself.
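To make the tokenisation challenge concrete, here is a minimal, hypothetical sketch (not from the paper) that maps a short monophonic melody to note tokens and estimates a naive next-token probability distribution from counts; real systems additionally have to encode rests, ties, dynamics and multiple voices, which is where much of the difficulty lies:

```python
from collections import Counter, defaultdict

# Hypothetical melody as (MIDI pitch, duration in beats) pairs.
melody = [(60, 1.0), (62, 0.5), (64, 0.5), (62, 1.0), (60, 2.0)]

# Naive tokenisation: one token per note, combining pitch and duration.
tokens = [f"NOTE_{pitch}_DUR_{dur}" for pitch, dur in melody]

# Estimate a simple bigram distribution P(next token | current token).
bigram_counts = defaultdict(Counter)
for cur, nxt in zip(tokens, tokens[1:]):
    bigram_counts[cur][nxt] += 1

for cur, counter in bigram_counts.items():
    total = sum(counter.values())
    probs = {nxt: round(c / total, 2) for nxt, c in counter.items()}
    print(cur, "->", probs)
```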
When the Transformer architecture (GPT-2) was used, the results were more coherent, and generative models such as GANs also showed promise. The researchers believe that attention-based architectures such as Transformers and generative models such as GANs are the best shot yet at DL-based music composition. For example, the report stated that MusicVAE and other DL models for music generation showed that new music can be composed without imitating existing music or committing plagiarism. However, models such as MusicVAE also demonstrated that AI for music composition needs tremendous amounts of data: MusicVAE was trained on 3.7 million melodies, 4.6 million drum patterns and 116 thousand trios, while the GPT-2-based model relies on large amounts of text data. “This leads us to affirm that DL models for music generation do need lots of data, especially when training Generative Models or Transformers,” explained the researchers.
Setting aside the technical challenges, there are strong arguments that people prefer music created by people, and that the role of AI in composition may amount to little more than producing stock music. One might even argue that tinkering with AI to create art is a doorway to artificial general intelligence (AGI). Instead, a better, more neutral premise would be to explore how AI can be used to build tools that human composers eventually use. “In the near future, it would be likely that AI could compose structured music from scratch, but the question here is whether AI models for music generation will be used to compose entire music pieces from scratch or whether these models would be more useful as an aid to composers, and thus, as an interaction between humans and AI,” concluded the researchers.