Microsoft recently developed a large-scale pre-trained model for symbolic music understanding called MusicBERT. Symbolic music understanding refers to understanding music from symbolic data (for example, the MIDI format). It covers many music applications, such as emotion classification, genre classification, and music piece matching.
To develop MusicBERT, Microsoft used the OctupleMIDI encoding method, a bar-level masking strategy, and a large-scale symbolic music corpus of more than 1 million music tracks.
OctupleMIDI is a novel music encoding method that encodes each note into a tuple of eight elements, each representing a different characteristic of the note: instrument, tempo, bar, position, time signature, pitch, duration, and velocity.
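The eight-element note tuple described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `OctupleNote` class, the `encode_notes` helper, the field order, the fixed grid of 16 steps per bar, and the constant tempo and time signature are all assumptions made for the example.

```python
from dataclasses import dataclass, astuple
from typing import List, Tuple


@dataclass(frozen=True)
class OctupleNote:
    """One note as an 8-element tuple (hypothetical field order)."""
    time_signature: str  # e.g. "4/4"
    tempo: int           # beats per minute
    bar: int             # index of the bar containing the note
    position: int        # onset inside the bar, in grid steps
    instrument: int      # MIDI program number
    pitch: int           # MIDI pitch, 0-127
    duration: int        # note length, in grid steps
    velocity: int        # MIDI velocity, 0-127


def encode_notes(
    raw_notes: List[Tuple[int, int, int, int, int]],
    steps_per_bar: int = 16,      # assumed fixed grid, for illustration only
    time_signature: str = "4/4",  # assumed constant for the whole piece
    tempo: int = 120,             # assumed constant for the whole piece
) -> List[OctupleNote]:
    """Turn (onset, instrument, pitch, duration, velocity) into one tuple per note."""
    encoded = []
    for onset, instrument, pitch, duration, velocity in sorted(raw_notes):
        # Split the absolute onset into a bar index and a position in that bar.
        bar, position = divmod(onset, steps_per_bar)
        encoded.append(
            OctupleNote(time_signature, tempo, bar, position,
                        instrument, pitch, duration, velocity)
        )
    return encoded


raw = [(0, 0, 60, 4, 80), (20, 0, 64, 4, 80)]
for note in encode_notes(raw):
    print(astuple(note))
```

Each input note becomes exactly one 8-element token, so the sequence length equals the number of notes. In this example the second note, with onset step 20, lands in bar 1 at position 4.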
Here are some of the advantages of OctupleMIDI:
- It reduces the length of a music sequence (4x shorter than with REMI). Because music sequences are very long, this makes them easier for a Transformer to model.
- It is note-centric. Every note has the same eight-element structure and carries enough information to express various kinds of music, including features such as time signature and long note durations, so OctupleMIDI is much simpler to work with.
- It is more universal than previous encoding methods, since the same 8-tuple structure can express music of different genres.
Different encoding methods for symbolic music understanding (Source: arXiv)
The authors of the study found that it is challenging to apply NLP techniques directly to symbolic music because it differs greatly from natural text data. They identify the following challenges:
- Music is more structural and diverse than natural language, making it more difficult to encode.
- Because the encoding of symbolic music is complicated, there is a higher chance of information leakage in pre-training.
- Pre-training for music understanding is limited by the lack of large-scale symbolic music corpora.
To address these challenges, researchers Mingliang Zeng, Xu Tan, Rui Wang, Zeqian Ju, Tao Qin, and Tie-Yan Liu have developed MusicBERT, a large-scale pre-trained model with its own music encoding method and masking strategy for music understanding. The model is evaluated on four symbolic music understanding tasks: melody completion, accompaniment suggestion, style classification and genre classification.
Besides OctupleMIDI, MusicBERT uses a bar-level masking strategy. The masking strategy in the original BERT for NLP tasks randomly masks individual tokens, which causes information leakage in music pre-training: elements such as time signature or tempo are usually the same for neighbouring notes in a bar, so a masked one can easily be copied from them. In the bar-level masking strategy used in MusicBERT, all the tokens of the same type (for example, time signature, instrument, pitch, etc.) within a bar are masked together, which avoids information leakage and supports better representation learning.
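The idea of masking every token of one type within a bar can be sketched as below. This is a simplified illustration under stated assumptions, not MusicBERT's actual pre-training code: the `bar_level_mask` function, the dictionary-based note format, the `"[MASK]"` placeholder, and the 15% default masking probability are all hypothetical choices for the example.

```python
import random
from typing import Dict, List, Set, Tuple

FIELDS = ["time_signature", "tempo", "bar", "position",
          "instrument", "pitch", "duration", "velocity"]
MASK = "[MASK]"  # hypothetical placeholder for a masked element


def bar_level_mask(
    notes: List[Dict[str, object]],
    mask_prob: float = 0.15,  # assumed value, for illustration only
    seed: int = 0,
) -> Tuple[List[Dict[str, object]], Set[Tuple[object, str]]]:
    """Mask whole (bar, element type) units instead of individual tokens."""
    rng = random.Random(seed)
    bars = sorted({n["bar"] for n in notes})
    # Decide once per (bar, field) pair, so every note in that bar
    # loses the same element type at the same time.
    chosen = {(b, f) for b in bars for f in FIELDS if rng.random() < mask_prob}
    masked = [
        {f: (MASK if (n["bar"], f) in chosen else n[f]) for f in FIELDS}
        for n in notes
    ]
    return masked, chosen


notes = [
    {"time_signature": "4/4", "tempo": 120, "bar": 0, "position": 0,
     "instrument": 0, "pitch": 60, "duration": 4, "velocity": 80},
    {"time_signature": "4/4", "tempo": 120, "bar": 0, "position": 4,
     "instrument": 0, "pitch": 62, "duration": 4, "velocity": 80},
    {"time_signature": "4/4", "tempo": 120, "bar": 1, "position": 0,
     "instrument": 0, "pitch": 64, "duration": 4, "velocity": 80},
]
masked_notes, masked_units = bar_level_mask(notes, mask_prob=0.5, seed=1)
for note in masked_notes:
    print(note)
```

Because the decision is made per (bar, element type) pair, a masked pitch or tempo cannot be recovered by copying it from another note in the same bar, which is the leakage that token-level random masking would allow.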
In addition, MusicBERT uses a large-scale and diverse symbolic music dataset called the Million MIDI Dataset (MMD). It contains more than 1 million songs in different genres, including Rock, Classical, Rap, Electronic, Jazz, etc. It is one of the most extensive datasets in the current literature: with 1,524,557 songs and about two billion notes, it is ten times larger in terms of the number of songs than the previous largest dataset, LMD (148,403 songs and 535 million notes). This dataset significantly benefits representation learning for music understanding.
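The "ten times larger" comparison can be checked with a quick calculation, assuming the 148,403 songs and 535 million notes refer to LMD and treating "two billion notes" as a round figure. The variable names are illustrative only.

```python
# Dataset sizes as quoted in the article.
lmd_songs = 148_403
mmd_songs = 1_524_557
lmd_notes = 535_000_000      # "535 million"
mmd_notes = 2_000_000_000    # "two billion" (rounded figure)

song_ratio = mmd_songs / lmd_songs
note_ratio = mmd_notes / lmd_notes

print(f"MMD has {song_ratio:.1f}x the songs and {note_ratio:.1f}x the notes of LMD")
```

The tenfold figure holds for the song count (about 10.3x), while the note count grows by a smaller factor (about 3.7x).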
Model structure of MusicBERT (Source: arXiv)
Further, the model is fine-tuned on the four tasks (melody completion, accompaniment suggestion, style classification and genre classification) and compared against baseline models such as melody2vec, tonnetz, pianoroll, PiRhDy and others. Both the small and base versions of MusicBERT show a marked improvement over these baselines.
The table below shows the results of MusicBERT versus other models.
MusicBERT achieves state-of-the-art performance on all four evaluated symbolic music understanding tasks. In the coming months, the team plans to apply MusicBERT to other music understanding tasks, such as structure analysis and chord recognition, to boost performance on them.