Music transcription is the activity of notating, reproducing, or otherwise capturing existing pieces of music — melodies, chords, basslines, drum parts, even entire orchestral arrangements. In this post, we will take a look at Omnizart, a recently published Python-based toolbox by Yu-Te Wu et al. that transcribes a given audio music file in the various modes described below. The following are the important points to be discussed in this article.
Table of Contents
- What is Music Transcription?
- Transcription by Omnizart
- Transcribing YouTube music using Omnizart
Let’s start the discussion by understanding what music transcription is.
What is Music Transcription?
Transcription, in music, is the process of notating a piece or a sound that was previously unnotated as written music, such as a jazz improvisation or a video game soundtrack. When a musician is tasked with creating sheet music from a recording, he or she creates a musical transcription by writing down the notes that make up the piece in music notation.
Transcription also refers to the practice of adapting a solo or ensemble piece of music for an instrument or instruments other than those originally intended. In this context, transcription and arrangement are sometimes used interchangeably; strictly speaking, however, transcriptions are faithful adaptations, whereas arrangements change significant characteristics of the original composition.
Transcription by Omnizart
Omnizart is a new Python library that offers a streamlined solution for automatic music transcription (AMT). Omnizart includes modules covering the full life-cycle of deep-learning-based AMT and is designed for ease of use through a simple command-line interface. Omnizart is the first transcription toolkit to include models for such a wide range of targets — solo instruments, instrument ensembles, percussion, and vocals — as well as models for chord recognition and beat/downbeat tracking, two MIR tasks closely related to AMT. Omnizart includes the following features:
- Pre-trained models for frame-level and note-level transcription of multiple pitched instruments, vocal melody, and drum events.
- Chord recognition and beat/downbeat tracking models that have been pre-trained.
- The main functionalities in the life-cycle of AMT research, ranging from dataset downloading to feature pre-processing, model training, and transcription result sonification.
Let’s briefly discuss how it transcribes music in each of the six modes.
Piano Transcription
In Omnizart, the piano solo transcription model is a U-net that generates a time-pitch representation with a time resolution of 20 ms and a pitch resolution of 25 cents (1/4 semitone). The output time-pitch representation has three 2-D channels: the pitch activation (i.e. piano roll) channel, the onset channel, and the offset channel. These output channels are used to derive the MIDI transcription results.
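To make the channel-to-MIDI step concrete, below is a minimal, hypothetical sketch of how three such time-pitch channels could be decoded into note events with simple thresholding. This is not Omnizart's actual decoding code — the `decode_notes` helper, the threshold value, and the decoding rule are all illustrative assumptions; only the 20 ms frames and 25-cent bins come from the description above.

```python
import numpy as np

def decode_notes(activation, onset, offset, threshold=0.5,
                 frame_sec=0.02, cents_per_bin=25):
    """Toy decoder: turn three (time, pitch) channels into note events.

    A note starts where the onset channel exceeds the threshold and ends
    when the activation drops below it (or an offset peak is reached).
    """
    notes = []
    n_frames, n_bins = activation.shape
    for p in range(n_bins):
        t = 0
        while t < n_frames:
            if onset[t, p] >= threshold:  # note starts here
                end = t + 1
                while (end < n_frames and activation[end, p] >= threshold
                       and offset[end, p] < threshold):
                    end += 1
                notes.append({
                    "onset_sec": t * frame_sec,        # 20 ms frames
                    "offset_sec": end * frame_sec,
                    "pitch_cents": p * cents_per_bin,  # 25-cent bins
                })
                t = end
            else:
                t += 1
    return notes

# Synthetic example: one note spanning frames 5-10 in pitch bin 3
act = np.zeros((20, 8)); ons = np.zeros((20, 8)); off = np.zeros((20, 8))
act[5:10, 3] = 1.0
ons[5, 3] = 1.0
off[10, 3] = 1.0
notes = decode_notes(act, ons, off)
print(notes)  # one note: onset 0.1 s, offset 0.2 s, pitch 75 cents
```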
Multi-Instrument Transcription
The multi-instrument transcription model is identical to the piano solo model, but its output covers 11 instrument classes from the MusicNet training dataset: piano, violin, viola, cello, flute, horn, bassoon, clarinet, harpsichord, contrabass, and oboe. By default, this model supports the challenging instrument-agnostic transcription scenario, in which the instrument classes present in the test music piece are unknown.
To achieve multi-instrument transcription, the model generates 11 channels of piano rolls, each representing a distinct type of instrument. This model has the same time and pitch resolutions as the piano solo transcription model.
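A small sketch of what the instrument-agnostic setting means in practice: given 11 per-instrument piano-roll channels, which instruments are present must be inferred from the activations themselves. The `active_instruments` helper, the threshold, and the channel ordering below are all assumptions for illustration, not Omnizart's actual post-processing.

```python
import numpy as np

# Assumed channel order; the real model's ordering may differ.
INSTRUMENTS = ["piano", "violin", "viola", "cello", "flute", "horn",
               "bassoon", "clarinet", "harpsichord", "contrabass", "oboe"]

def active_instruments(rolls, threshold=0.5):
    """rolls: array of shape (11, time, pitch) of per-instrument piano rolls.
    Returns the names of instruments with any activation above the threshold,
    since in the instrument-agnostic scenario nothing is known in advance."""
    present = rolls.max(axis=(1, 2)) >= threshold
    return [name for name, p in zip(INSTRUMENTS, present) if p]

rolls = np.zeros((11, 100, 88))
rolls[0, 10:20, 40] = 0.9   # some piano activation
rolls[3, 30:40, 36] = 0.8   # some cello activation
print(active_instruments(rolls))  # ['piano', 'cello']
```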
Drum Transcription
The model is built on a convolutional neural network (CNN) and is designed to predict the onsets of percussive events from audio input. It has five convolutional layers, one attention layer, and three fully-connected layers, with roughly 9.4 million parameters in total.
Because the onsets of percussive events are strongly correlated with beats, the input spectrogram is processed with an automatic beat tracker in the data pre-processing pipeline. The processed input, which carries rich beat information, is then fed to the model for onset prediction.
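Onset prediction of this kind typically ends with a peak-picking step over the model's frame-wise activation curve. The following is a minimal sketch of such a step, not Omnizart's actual post-processing; the `pick_onsets` helper, its threshold, and the 10 ms frame length are illustrative assumptions.

```python
import numpy as np

def pick_onsets(activation, threshold=0.5, frame_sec=0.01):
    """Toy onset picker: a frame counts as an onset if it exceeds the
    threshold and is a local maximum of the activation curve."""
    onsets = []
    for t in range(1, len(activation) - 1):
        if (activation[t] >= threshold
                and activation[t] > activation[t - 1]
                and activation[t] >= activation[t + 1]):
            onsets.append(t * frame_sec)
    return onsets

# Synthetic drum activation with peaks at frames 10 and 30
act = np.zeros(50)
act[9], act[10], act[11] = 0.3, 0.9, 0.3
act[29], act[30], act[31] = 0.2, 0.8, 0.2
print(pick_onsets(act))  # two onsets, near 0.1 s and 0.3 s
```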
Vocal Transcription
The vocal transcription model is a hybrid network that receives a multi-channel feature — consisting of the spectrum, the generalized cepstrum, and the generalized cepstrum of spectrum derived from the input audio — and outputs the transcribed MIDI result.
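To give a rough sense of these three channels, here is an illustrative sketch of computing them for a single audio frame. The generalized cepstrum replaces the usual log compression with a power law (exponent `gamma`; `gamma -> 0` approaches the standard cepstrum). The `feature_channels` helper and the `gamma` value are assumptions for illustration, not Omnizart's exact feature-extraction code.

```python
import numpy as np

def feature_channels(frame, gamma=0.6):
    """Sketch of the three feature channels for one audio frame:
    spectrum, generalized cepstrum (GC), and GC of spectrum (GCoS)."""
    spec = np.abs(np.fft.rfft(frame))            # magnitude spectrum
    gc = np.abs(np.fft.irfft(spec ** gamma))     # power-law "cepstrum"
    gcos = np.abs(np.fft.rfft(gc ** gamma))      # same transform applied again
    return {"spectrum": spec, "gen_cepstrum": gc, "gcos": gcos}

# One 1024-sample frame of a 440 Hz sine at 44.1 kHz
frame = np.sin(2 * np.pi * 440 * np.arange(1024) / 44100)
feats = feature_channels(frame)
for name, ch in feats.items():
    print(name, ch.shape)
```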
Chord Recognition
The Harmony Transformer (HT), a deep learning model for harmony analysis, is used to construct Omnizart’s harmony recognition feature. The HT model identifies the chord changes and chord progression of input music using an encoder-decoder architecture.
The encoder conducts chord segmentation on the input, and the decoder detects the chord progression based on the outcome of the segmentation. The HT exhibited its potential capability of harmony recognition with this unique technique.
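The two-stage output — chord-change boundaries plus the chords between them — can be illustrated with a toy example. The sketch below only shows the target representation (segmenting frame-wise chord labels into a progression); it is not the Harmony Transformer, and `segment_chords` and its frame length are hypothetical.

```python
def segment_chords(frame_labels, frame_sec=0.1):
    """Collapse frame-wise chord labels into (start_sec, end_sec, chord)
    segments, i.e. chord changes plus the resulting progression."""
    segments = []
    start = 0
    for t in range(1, len(frame_labels) + 1):
        # A segment closes at the end of input or when the label changes.
        if t == len(frame_labels) or frame_labels[t] != frame_labels[start]:
            segments.append((start * frame_sec, t * frame_sec,
                             frame_labels[start]))
            start = t
    return segments

labels = ["C:maj"] * 8 + ["G:maj"] * 8 + ["A:min"] * 4
print(segment_chords(labels))  # three segments: C:maj, G:maj, A:min
```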
Beat/Downbeat Tracking
Most open-source beat/downbeat tracking tools, such as madmom and librosa, only accept audio input; the approach adopted here instead accepts MIDI data as input and produces beat and downbeat positions in seconds with a 10 ms time precision.
The model is based on a bidirectional LSTM (BLSTM) recurrent neural network (RNN) with an optional attention mechanism and a fully connected layer. Piano roll, spectral flux, and inter-onset interval are all collected from MIDI and used as input characteristics.
By default, the BLSTM network’s hidden units have a dimension of 25. To predict the probability of a beat and of a downbeat at each time step, the model uses a multi-task learning (MTL) architecture.
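Two of those MIDI-derived input features can be sketched directly. The code below builds a binary piano roll and a per-frame inter-onset-interval feature from a note list, using the 10 ms frames stated above; the `midi_features` helper and its exact feature definitions are illustrative assumptions, not Omnizart's feature code.

```python
import numpy as np

def midi_features(notes, n_frames, frame_sec=0.01, n_pitches=128):
    """Toy features from MIDI notes: a binary piano roll and, per frame,
    the time elapsed since the most recent note onset (inter-onset interval).
    notes: list of (onset_sec, offset_sec, midi_pitch)."""
    roll = np.zeros((n_frames, n_pitches))
    onset_frames = sorted({int(round(on / frame_sec)) for on, _, _ in notes})
    for on, off, pitch in notes:
        roll[int(round(on / frame_sec)):int(round(off / frame_sec)), pitch] = 1.0
    ioi = np.zeros(n_frames)
    last = None
    for t in range(n_frames):
        if t in onset_frames:
            ioi[t] = 0.0 if last is None else (t - last) * frame_sec
            last = t
        elif last is not None:
            ioi[t] = (t - last) * frame_sec
    return roll, ioi

# Two notes: C4 at 0.0 s and E4 at 0.5 s
notes = [(0.0, 0.4, 60), (0.5, 0.9, 64)]
roll, ioi = midi_features(notes, n_frames=100)
print(roll.sum(), ioi[50])  # ioi[50] is ~0.5 s since the previous onset
```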
Transcribing YouTube music using Omnizart
In this section, we take a piece of music from YouTube and transcribe it in one of the modes that Omnizart offers.
- Setting up the environment
!pip install -U pip
!pip install git+https://github.com/Music-and-Culture-Technology-Lab/omnizart
!omnizart download-checkpoints
!apt install fluidsynth
!pip install pyfluidsynth
!curl -L https://github.com/yt-dlp/yt-dlp/releases/latest/download/yt-dlp -o /usr/local/bin/yt-dlp
!chmod a+rx /usr/local/bin/yt-dlp
- Now enter the desired music link when prompted; by running the following snippet you will get an audio version of the music you have chosen.
import os
from google.colab import files
from IPython import display as dsp

url = input("Enter your YouTube link: ")
try:
    id = url.split("watch?v=")[1].split("&")[0]
    vid = dsp.YouTubeVideo(id)
    dsp.display(vid)
except Exception:
    pass

print("Downloading...")
!yt-dlp -x --audio-format mp3 --no-playlist "$url"
!yt-dlp --get-filename --no-playlist "$url" > tmp
uploaded_audio = os.path.splitext(open("tmp").readline().strip())[0]
!ffmpeg -i "$uploaded_audio".mp3 "$uploaded_audio".wav &> /dev/null
print(f"Finished: {uploaded_audio}")
Below is the audio version of the file that I have chosen.
- Now that we have loaded the desired song, we can transcribe it in any of the modes; here I’ll go with drum transcription.
# Available modes for transcription
transcription_modes = [
    "music-piano", "music-piano-v2", "music-assemble",
    "chord", "drum", "vocal", "vocal-contour", "beat"
]
mode = "drum"

# Modes prefixed with "music" select a sub-model, e.g. "music-piano-v2"
model = ""
if mode.startswith("music"):
    mode_list = mode.split("-")
    mode = mode_list[0]
    model = "-".join(mode_list[1:])

from omnizart.music import app as mapp
from omnizart.chord import app as capp
from omnizart.drum import app as dapp
from omnizart.vocal import app as vapp
from omnizart.vocal_contour import app as vcapp
from omnizart.beat import app as bapp

app = {
    "music": mapp,
    "chord": capp,
    "drum": dapp,
    "vocal": vapp,
    "vocal-contour": vcapp,
    "beat": bapp
}[mode]

model_path = {
    "piano": "Piano",
    "piano-v2": "PianoV2",
    "assemble": "Stream",
    "pop-song": "Pop",
    "": None
}[model]

midi = app.transcribe(f"{uploaded_audio}.wav", model_path=model_path)

# Synthesize the transcribed MIDI and play it back
import scipy.io.wavfile as wave
from omnizart.remote import download_large_file_from_google_drive

SF2_FILE = "general_soundfont.sf2"
if not os.path.exists(SF2_FILE):
    print("Downloading soundfont...")
    download_large_file_from_google_drive(
        "16RM-dWKcNtjpBoo7DFSONpplPEg5ruvO",
        file_length=31277462,
        save_name=SF2_FILE
    )

out_name = f"{uploaded_audio}_synth.wav"
if mode == "vocal-contour":
    # vocal-contour writes its own rendered wav; just rename it
    os.rename(f"{uploaded_audio}_trans.wav", out_name)
else:
    print("Synthesizing MIDI...")
    raw_wav = midi.fluidsynth(fs=44100, sf2_path=SF2_FILE)
    wave.write(out_name, 44100, raw_wav)

!ffmpeg -i "$out_name" "tmp_synth.mp3" &> /dev/null
!mv tmp_synth.mp3 "$uploaded_audio"_synth.mp3
out_name = out_name.replace(".wav", ".mp3")
print(f"Finished: {out_name}")
dsp.Audio(out_name)
Here is the transcribed output.
Final Words
Through this post, we have discussed what music transcription is and, in that context, looked at Omnizart, a newly launched Python toolbox for music transcription. It transcribes given audio files in more than six modes, all of which we have discussed. Lastly, we saw a practical implementation in which we took music from YouTube and transcribed it in drum mode, and the results are quite impressive.