
A Guide to Omnizart: A General Toolbox for Automatic Music Transcription


Music transcription is the activity of notating, reproducing, or otherwise memorizing existing pieces of music; it can involve melodies, chords, basslines, entire orchestral arrangements, and more. In this post, we will take a look at Yu-Te Wu et al.'s recently published Python-based toolbox, Omnizart, which transcribes a given audio music file into the kinds of content mentioned above. The following are the important points to be discussed in this article.

Table of Contents

  1. What is Music Transcription? 
  2. Transcription by Omnizart
  3. Transcribing YouTube music using Omnizart

Let’s start the discussion by understanding what music transcription is.

What is Music Transcription?

In music, transcription is the process of notating a piece or a sound that was previously unnotated or unavailable as written music, such as a jazz improvisation or a video game soundtrack. When a musician is tasked with creating sheet music from a recording, he or she creates a musical transcription by writing down the notes that make up the composition in music notation.

Transcription also refers to the practice of adapting a solo or ensemble piece of music for an instrument or instruments other than those originally intended. In this context, transcription and arrangement are sometimes used interchangeably; strictly speaking, however, transcriptions are faithful adaptations, whereas arrangements change significant characteristics of the original composition.

Transcription by Omnizart

Omnizart is a new Python library that offers a streamlined solution for automatic music transcription (AMT). It includes modules that cover the whole life-cycle of deep-learning-based AMT and is designed for ease of use through a compact command-line interface (a short command-line example follows the feature list below). Omnizart is the first transcription toolkit to cover such a wide range of targets, including solo instruments, instrument ensembles, percussion, and vocals, as well as models for chord recognition and beat/downbeat tracking, two MIR tasks closely related to AMT. Omnizart includes the following features:

  • Pre-trained models for frame-level and note-level transcription of multiple pitched instruments, vocal melody, and drum events.
  • Chord recognition and beat/downbeat tracking models that have been pre-trained.
  • The main functionalities in the life-cycle of AMT research, ranging from dataset downloading to feature pre-processing, model training, and transcription result sonification.
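As a quick orientation, here is what the command-line workflow for a single audio file might look like. The download-checkpoints command also appears in the setup section later in this post; the "module transcribe" pattern is an assumption based on the module names used in the Python walkthrough below, so check it against the Omnizart documentation, and the audio filename is a placeholder.

# Download the pre-trained checkpoints once (this also appears in the setup below)
!omnizart download-checkpoints
 
# Transcribe a single audio file with the drum module; "song.wav" is a
# placeholder, and the "<module> transcribe" pattern (music, chord, drum,
# vocal, beat) should be verified against the Omnizart documentation
!omnizart drum transcribe song.wav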

Let’s briefly discuss how it handles each of these six transcription types.

Piano Transcription

In Omnizart, the piano solo transcription model is a U-net that generates a time-pitch representation with a time resolution of 20 ms and a pitch resolution of 25 cents (a quarter semitone). The output time-pitch representation has three 2-D channels: the pitch activation (i.e. piano roll) channel, the onset channel, and the offset channel. These output channels are used to derive the MIDI transcription result.
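To make the role of these channels concrete, below is a minimal, illustrative sketch of how a pitch-activation channel and an onset channel could be decoded into note events. This is not Omnizart's internal code; the array shapes, thresholds, and the 20 ms frame hop are assumptions taken from the description above.

import numpy as np
 
FRAME_SEC = 0.02  # 20 ms time resolution, as described above
 
def decode_notes(activation, onset, act_thresh=0.5, onset_thresh=0.5):
    """Turn (frames x pitches) activation/onset maps into (pitch, start, end) notes."""
    notes = []
    num_frames, num_pitches = activation.shape
    for pitch in range(num_pitches):
        frame = 0
        while frame < num_frames:
            if onset[frame, pitch] >= onset_thresh:
                # A note starts where the onset channel fires...
                start = frame
                frame += 1
                # ...and sustains while the activation stays above threshold
                while frame < num_frames and activation[frame, pitch] >= act_thresh:
                    frame += 1
                notes.append((pitch, start * FRAME_SEC, frame * FRAME_SEC))
            else:
                frame += 1
    return notes
 
# Purely illustrative usage with random scores
rng = np.random.default_rng(0)
act, ons = rng.random((500, 88)), rng.random((500, 88))
print(len(decode_notes(act, ons)), "notes decoded")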

Multi-Instrument Transcription

The multi-instrument transcription model is identical to the piano solo model, but its output covers the 11 instrument classes of the MusicNet training dataset: piano, violin, viola, cello, flute, horn, bassoon, clarinet, harpsichord, contrabass, and oboe. By default, this model supports the challenging instrument-agnostic transcription scenario, in which the instrument classes present in the test music piece are unknown.

To achieve multi-instrument transcription, the model generates 11 channels of piano rolls, each representing a distinct type of instrument. This model has the same time and pitch resolutions as the piano solo transcription model.
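In the Python API, switching to the multi-instrument model is just a matter of pointing the music module at a different checkpoint; the walkthrough at the end of this post maps the "music-assemble" mode to the "Stream" checkpoint, so a minimal sketch looks like this ("song.wav" is a placeholder filename):

from omnizart.music import app as mapp
 
# "Stream" is the checkpoint name that the walkthrough below associates with
# the multi-instrument ("music-assemble") mode; "song.wav" is a placeholder
midi = mapp.transcribe("song.wav", model_path="Stream")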

Drum Transcription

The model is built on a convolutional neural network (CNN) and is designed to predict the onsets of percussive events from audio input. It has five convolutional layers, one attention layer, and three fully-connected layers, with roughly 9.4 million parameters in total.

Because the onsets of percussive events are strongly correlated with beats, the input spectrogram is processed with an automatic beat tracker in the data pre-processing pipeline. The processed input, which carries rich beat information, is then fed to the model for onset prediction.
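In the Python API, this whole pipeline sits behind a single call, which is also what the walkthrough at the end of this post uses for drum mode. A minimal sketch, assuming the pre-processing described above is handled inside transcribe() and with "song.wav" as a placeholder:

from omnizart.drum import app as dapp
 
# The default pre-trained drum checkpoint is used when model_path is omitted;
# "song.wav" is a placeholder audio file
midi = dapp.transcribe("song.wav")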

Vocal Transcription

The vocal transcription model is a hybrid network that takes a multi-channel feature, consisting of the spectrum, the generalized cepstrum, and the generalized cepstrum of spectrum derived from the input audio, and outputs the transcribed MIDI result.

Chord Recognition

Omnizart’s chord recognition feature is built on the Harmony Transformer (HT), a deep-learning model for harmony analysis. The HT model identifies the chord changes and the chord progression of the input music using an encoder-decoder architecture.

The encoder performs chord segmentation on the input, and the decoder recognizes the chord progression based on the segmentation result. With this joint approach, the HT has demonstrated promising harmony-recognition capability.

Beat/Downbeat Tracking

Most open-source beat/downbeat tracking tools, such as madmom and librosa, accept only audio input; the approach adopted here instead takes MIDI data as input and produces beat and downbeat positions in seconds with a 10 ms time precision.

The model is based on a bidirectional LSTM (BLSTM) recurrent neural network with an optional attention mechanism and a fully connected layer. Piano roll, spectral flux, and inter-onset intervals are extracted from the MIDI and used as input features.

By default, the BLSTM network’s hidden units have a dimension of 25. The model uses a multi-task learning (MTL) architecture to predict the probability of a beat and of a downbeat at each time step.
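Because the input is MIDI rather than audio, this module is naturally chained after one of the transcription models above. Below is a hedged sketch that assumes the beat module's transcribe() accepts a path to a MIDI file, as the description above suggests; "input.mid" is a placeholder.

from omnizart.beat import app as bapp
 
# Assumption: the beat module takes a MIDI file (not audio) as input, per the
# description above; "input.mid" is a placeholder path
beat_midi = bapp.transcribe("input.mid")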

Transcribing YouTube music using Omnizart

In this section, we take a piece of music from YouTube and transcribe it using one of the modes that Omnizart offers.

  1. Setting up the environment 
# Install Omnizart and fetch the pre-trained checkpoints
!pip install -U pip
!pip install git+https://github.com/Music-and-Culture-Technology-Lab/omnizart
!omnizart download-checkpoints
 
# FluidSynth is used later to synthesize the transcribed MIDI back into audio
!apt install fluidsynth
!pip install pyfluidsynth
 
# yt-dlp downloads the audio track from YouTube
!curl -L https://github.com/yt-dlp/yt-dlp/releases/latest/download/yt-dlp -o /usr/local/bin/yt-dlp
!chmod a+rx /usr/local/bin/yt-dlp
  2. Now enter the desired YouTube link when prompted; running the following snippet will download an audio version of the music you have chosen.
import os
from google.colab import files
from IPython import display as dsp
 
url = input("Enter your YouTube link: ")
 
# Embed the video in the notebook so the link can be double-checked
try:
  id = url.split("watch?v=")[1].split("&")[0]
  vid = dsp.YouTubeVideo(id)
  dsp.display(vid)
except Exception:
  pass
 
print("Downloading...")
 
# Extract the audio track as MP3 and record the original filename
!yt-dlp -x --audio-format mp3 --no-playlist "$url"
!yt-dlp --get-filename --no-playlist "$url" > tmp
 
# Strip the extension to get the base name, then convert the MP3 to WAV for Omnizart
uploaded_audio = os.path.splitext(open("tmp").readline().strip())[0]
!ffmpeg -i "$uploaded_audio".mp3 "$uploaded_audio".wav &> /dev/null
 
print(f"Finished: {uploaded_audio}")

Below is the audio version of the file that I have chosen.

  3. Now that we have loaded the desired song, we can transcribe it in any of the available modes; here I’ll go with drum transcription.
# Available modes for transcription
transcription_modes = ["music-piano", "music-piano-v2", "music-assemble", "chord", "drum", "vocal", "vocal-contour", "beat"]
 
mode = "drum"
model = ""
# "music-*" modes are split into the module name ("music") and the
# checkpoint suffix (e.g. "piano-v2", "assemble")
if mode.startswith("music"):
  mode_list = mode.split("-")
  mode = mode_list[0]
  model = "-".join(mode_list[1:])
 
 
from omnizart.music import app as mapp
from omnizart.chord import app as capp
from omnizart.drum import app as dapp
from omnizart.vocal import app as vapp
from omnizart.vocal_contour import app as vcapp
from omnizart.beat import app as bapp
 
# Pick the Omnizart application object for the selected mode
app = {
    "music": mapp,
    "chord": capp,
    "drum": dapp,
    "vocal": vapp,
    "vocal-contour": vcapp,
    "beat": bapp
}[mode]
 
# Map the checkpoint suffix to the corresponding pre-trained checkpoint name
model_path = {
    "piano": "Piano",
    "piano-v2": "PianoV2",
    "assemble": "Stream",
    "pop-song": "Pop",
    "": None
}[model]
 
midi = app.transcribe(f"{uploaded_audio}.wav", model_path=model_path)
 
# Synthesize MIDI and play
import scipy.io.wavfile as wave
from omnizart.remote import download_large_file_from_google_drive
 
SF2_FILE = "general_soundfont.sf2"
if not os.path.exists(SF2_FILE):
  print("Downloading soundfont...")
  download_large_file_from_google_drive(
      "16RM-dWKcNtjpBoo7DFSONpplPEg5ruvO",
      file_length=31277462,
      save_name=SF2_FILE
    )
 
out_name = f"{uploaded_audio}_synth.wav"
if mode == "vocal-contour":
  # vocal-contour writes a synthesized wav during transcription, so just rename it
  os.rename(f"{uploaded_audio}_trans.wav", out_name)
else:
  print("Synthesizing MIDI...")
  raw_wav = midi.fluidsynth(fs=44100, sf2_path=SF2_FILE)
  wave.write(out_name, 44100, raw_wav)
 
!ffmpeg -i "$out_name" "tmp_synth.mp3" &>/dev/null
!mv tmp_synth.mp3 "$uploaded_audio"_synth.mp3
 
out_name = out_name.replace(".wav", ".mp3")
print(f"Finished: {out_name}")
dsp.Audio(out_name)

Now, here is the transcribed output.
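If you also want to keep the raw transcription rather than just the synthesized audio, the object returned by app.transcribe() exposes pretty_midi-style methods (it provides fluidsynth() in the snippet above). Assuming it is a pretty_midi.PrettyMIDI instance, it can be saved as a standard MIDI file; the output filename here is just an example.

# Assumption: `midi` behaves like a pretty_midi.PrettyMIDI object
midi.write(f"{uploaded_audio}_trans.mid")
 
# Optional: download the file from Colab (uses the `files` import above)
files.download(f"{uploaded_audio}_trans.mid")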

Final Words

Through this post, we have discussed what music transcription is and, in that context, looked at the newly launched Python toolbox for music transcription called Omnizart. It can transcribe a given audio file in several modes, all of which we discussed above. Lastly, we walked through a practical implementation in which we took music from YouTube and transcribed it in drum mode, and the results are quite impressive.

