Top music generation datasets in 2022

CompMusic catalogues datasets for Indian art music.

Artificial intelligence has been used for synthetic music generation for some time now. However, the watershed moment came when music informatics met AI. Now, music AI researchers are taking advantage of the latest developments in AI, ML and analytics to build models that create music on par with human composers.

The strength of any AI/ML model is predicated on the data it’s fed. Below, we look at the major music generation datasets doing the rounds in 2022.



NSynth

The dataset is probably one of the largest collections of instrumental notes, with 305,979 musical notes, each with a unique pitch, timbre and envelope. The notes were collected from 1,006 instruments from commercial sample libraries and are annotated by source (acoustic, electronic or synthetic), instrument family and sonic qualities. Instrument families including bass, flute, guitar, keyboard, mallet, organ, reed, string, synth lead and vocal are covered in this dataset.


MAESTRO

The MAESTRO (MIDI and Audio Edited for Synchronous Tracks and Organisation) dataset has over 200 hours of paired audio and MIDI recordings from ten years of the International Piano-e-Competition. The MIDI data includes key strike velocities and sustain/sostenuto/una corda pedal positions.
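
To make the MIDI side concrete, here is a minimal Python sketch of how key strike velocities and pedal positions appear as raw MIDI channel messages. It relies only on MIDI 1.0 conventions (note-on status 0x90, control change status 0xB0, and controllers 64, 66 and 67 for the sustain, sostenuto and soft/una corda pedals). The `describe_message` helper and the sample bytes are illustrative assumptions, not part of any MAESTRO tooling.

```python
# Controller numbers defined by the MIDI 1.0 specification for piano pedals.
PEDAL_CONTROLLERS = {64: "sustain", 66: "sostenuto", 67: "una corda"}


def describe_message(status, data1, data2):
    """Decode one 3-byte MIDI channel message into a small dict.

    Hypothetical helper for illustration; real files are usually read
    with a MIDI library rather than by hand.
    """
    kind = status & 0xF0      # upper nibble: message type
    channel = status & 0x0F   # lower nibble: channel 0-15
    if kind == 0x90 and data2 > 0:
        return {"type": "note_on", "channel": channel,
                "pitch": data1, "velocity": data2}
    if kind == 0x80 or (kind == 0x90 and data2 == 0):
        # A note-on with velocity 0 is treated as a note-off.
        return {"type": "note_off", "channel": channel, "pitch": data1}
    if kind == 0xB0 and data1 in PEDAL_CONTROLLERS:
        return {"type": "pedal", "channel": channel,
                "pedal": PEDAL_CONTROLLERS[data1], "position": data2}
    return {"type": "other", "channel": channel}


# Made-up sample: sustain pedal half down, middle C struck, then released.
messages = [(0xB0, 64, 64), (0x90, 60, 100), (0x80, 60, 0)]
decoded = [describe_message(*m) for m in messages]
for d in decoded:
    print(d)
```

Velocity and pedal position are both 7-bit values (0 to 127), so a model trained on such data can treat them as continuous performance controls rather than simple on/off flags.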

URMP (University of Rochester Multi-Modal Musical Performance) 

URMP is a dataset intended to facilitate audio-visual analysis of musical performances. It contains several simple multi-instrument musical pieces assembled from separately recorded performances of individual tracks. For each piece, a musical score is provided in MIDI format.

Bach Doodle Dataset

The dataset consists of 21.6 million harmonisations submitted through the Bach Doodle, along with metadata about each composition, such as the country of origin and user feedback. It also includes the MIDI of the user-entered melody and the MIDI of the generated harmonisation. An exploration of the melodies in the dataset shows the most repeated melodies from each country, or the regional hits.

The Lakh MIDI Dataset v0.1

This dataset contains 176,581 unique MIDI files, 45,129 of which have been matched and aligned to entries in the Million Song Dataset. The dataset is mainly meant to facilitate large-scale music information retrieval, both symbolic and audio content-based.

Music21

Music21 contains music performances from 21 categories. It is a set of tools to help scholars quickly find answers to questions like “I wonder how often Bach does that?” or “Which band used these chords for the first time?” or “How can I learn more about Renaissance counterpoint, Indian ragas, post-tonal pitch structures or the form of minuets?”

Datasets for Indian Music

CompMusic catalogues datasets for Indian art music. The project aims to advance the automatic description of music by emphasising cultural specificity, carrying out research in music information processing with a domain-knowledge approach. It focuses mainly on five music traditions of the world: Hindustani (North India), Carnatic (South India), Turkish-makam (Turkey), Arab-Andalusian (Maghreb), and Beijing Opera (China).

Indian Music Tonic Dataset

The dataset contains 597 commercially available audio music recordings of Indian art music, both Hindustani and Carnatic music, where each recording is manually annotated with the tonic of the lead artist.
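
A tonic annotation is typically used to express pitch relative to the lead artist's tonic instead of in absolute hertz. The sketch below shows the standard conversion to cents, 1200 · log2(f / tonic). The function name and the example frequencies are assumptions chosen for illustration and do not come from the dataset.

```python
import math


def hz_to_cents(frequency_hz, tonic_hz):
    """Return the interval between a pitch and the tonic in cents.

    1200 cents make one octave, so a frequency twice the tonic gives 1200.
    """
    if frequency_hz <= 0 or tonic_hz <= 0:
        raise ValueError("frequencies must be positive")
    return 1200.0 * math.log2(frequency_hz / tonic_hz)


# Hypothetical tonic of 146.83 Hz (roughly D3) and a few pitch values in Hz.
tonic = 146.83
for f in (146.83, 220.245, 293.66):
    print(f"{f:8.3f} Hz -> {hz_to_cents(f, tonic):7.1f} cents above tonic")
```

Working in cents relative to the tonic makes recordings by artists with different tonics directly comparable, which is why a manually annotated tonic is useful ground truth.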

Carnatic Varnam Dataset

The dataset has 28 solo vocal recordings, recorded for research on the intonation analysis of Carnatic ragas. In addition to the audio recordings, the dataset contains time-aligned tala cycle annotations and swara notations in a machine-readable format.

Carnatic Music Rhythm Dataset

This dataset is a sub-collection of 176 excerpts in four taalas of Carnatic music, with audio, tala-related metadata and time-aligned markers that indicate the progress through the tala cycles.

Hindustani Music Rhythm Dataset

The dataset is a sub-collection of 151 excerpts in four taals of Hindustani music, with audio, taal-related metadata and time-aligned markers that indicate the progress through the taal cycles.
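
Time-aligned cycle markers like those described for the two rhythm datasets make it straightforward to measure how long each tala or taal cycle lasts. The following sketch assumes a hypothetical annotation format, a plain list of cycle-start times in seconds, and a 16-beat cycle (as in teentaal). The actual file layout of the datasets may differ.

```python
def cycle_durations(cycle_starts):
    """Return the duration in seconds of each complete cycle."""
    return [later - earlier
            for earlier, later in zip(cycle_starts, cycle_starts[1:])]


def mean_tempo_bpm(cycle_starts, beats_per_cycle):
    """Estimate the average tempo in beats per minute from cycle start times."""
    durations = cycle_durations(cycle_starts)
    if not durations:
        raise ValueError("need at least two cycle markers")
    mean_cycle = sum(durations) / len(durations)
    return 60.0 * beats_per_cycle / mean_cycle


# Made-up cycle-start markers (seconds) for an excerpt in a 16-beat cycle.
starts = [0.0, 8.0, 16.0, 24.0]
print(cycle_durations(starts))
print(mean_tempo_bpm(starts, 16))
```

The same two helpers apply to Carnatic excerpts by changing `beats_per_cycle` to match the tala in question.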

Mridangam Stroke Dataset

The dataset contains 7,162 audio examples of individual Mridangam strokes, covering ten different strokes played on Mridangams with six different tonic values.

Mridangam Tani-avarthanam Dataset

The dataset is a transcribed collection of two tani-avarthanams played by Mridangam maestro Padmavibhushan Umayalpuram K. Sivaraman. The audio was recorded at IIT Madras and annotated by Carnatic percussionists. The dataset contains 24 minutes of audio and 8,800 strokes.

Saraga: Research datasets of Indian Art Music

The repository contains time aligned melody, rhythm and structural annotations for two large open datasets of Indian Art Music (Carnatic and Hindustani music).

Tabla Solo Dataset

The dataset is a transcribed collection of Tabla solo audio recordings spanning compositions from six different Gharanas of Tabla, played by Pt. Arvind Mulgaonkar. It consists of audio and time aligned bol transcriptions.

Poornima Nataraj
Poornima Nataraj has worked in the mainstream media as a journalist for 12 years and is always eager to learn anything new and evolving. Witnessing a revolution in the world of analytics, she thinks she is in the right place at the right time.
