Hearing is a vital sense for us human beings. The pitch of a sound is a perceptual, approximate measure of its frequency: a high pitch corresponds to a high frequency, and a low pitch to a low frequency. Our auditory system tracks the relative differences in pitch and thereby recognizes sounds with different characteristics. A good example is listening to a song, where we can tell its melodies apart.
Today's article is about Pitch Recognition, also known as Pitch Estimation. This domain has received considerable attention in the past few decades because of its importance in fields ranging from music information retrieval to speech analysis. Traditionally, pitch estimators were handcrafted signal-processing algorithms working in either the time domain or the frequency domain. Models trained on data posed one problem: the need for annotated data. Obtaining annotations with the frequency and temporal resolution required for training is a tedious and laborious task.
Marco Tagliasacchi, a research scientist at Google Research, presented a solution to this lack of annotated data in November 2019. In simple terms, the approach estimates the relative pitch difference between sounds rather than the absolute pitch. SPICE (Self-supervised PItCh Estimation) was designed around this idea and presented in the accompanying research paper.
The model consists of a convolutional encoder that produces a single scalar embedding, which maps linearly to the pitch. Two versions of the same signal, pitch-shifted by random but known amounts, are fed to the encoder. The author works in the constant-Q transform (CQT) domain for convenience, because there a pitch shift becomes a simple translation along the log-spaced frequency axis.
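To see why a constant-Q (log-frequency) axis is convenient here, the sketch below maps frequencies to fractional bin indices. The values of FMIN and BINS_PER_OCTAVE are illustrative assumptions, not the settings used to train SPICE.

```python
import math

# Illustrative assumptions, not the settings used by the SPICE authors.
FMIN = 10.0            # lowest frequency of the constant-Q axis, in Hz
BINS_PER_OCTAVE = 12   # one bin per semitone

def cqt_bin_to_hz(k):
    """Center frequency of bin k on a log-spaced frequency axis."""
    return FMIN * 2.0 ** (k / BINS_PER_OCTAVE)

def hz_to_cqt_bin(freq):
    """Fractional bin index of a frequency in Hz."""
    return BINS_PER_OCTAVE * math.log2(freq / FMIN)

# Raising a tone by one octave moves it by the same number of bins,
# no matter where on the axis it started.
shift_low = hz_to_cqt_bin(220.0) - hz_to_cqt_bin(110.0)
shift_high = hz_to_cqt_bin(880.0) - hz_to_cqt_bin(440.0)
print(shift_low, shift_high)
```

Both shifts come out at roughly 12 bins, which is why a pitch shift can be treated as a plain translation in this domain.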
A loss function was devised that forces the difference between the two scalar embeddings to stay proportional to the known pitch difference between the two inputs.
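A minimal sketch of such a loss on a single pair of embeddings is given below. The Huber form and the names sigma (proportionality constant) and tau (Huber threshold) are assumptions made for illustration; the exact formulation is in the research paper.

```python
def huber(error, tau=1.0):
    """Huber penalty: quadratic for small errors, linear for large ones."""
    a = abs(error)
    if a <= tau:
        return 0.5 * a * a
    return tau * (a - 0.5 * tau)

def relative_pitch_loss(y1, y2, k1, k2, sigma, tau=1.0):
    """Penalize embeddings whose difference is not proportional to the shift.

    y1, y2: scalar embeddings produced by the encoder (hypothetical values)
    k1, k2: known pitch shifts applied to the two inputs
    sigma:  assumed proportionality constant between shift and embedding
    """
    error = (y1 - y2) - sigma * (k1 - k2)
    return huber(error, tau)

# Embeddings that differ by exactly sigma * (k1 - k2) incur almost no loss.
print(relative_pitch_loss(0.5, 0.4, k1=3, k2=1, sigma=0.05))
# Identical shifts but different embeddings are penalized.
print(relative_pitch_loss(0.5, 0.4, k1=1, k2=1, sigma=0.05))
```

Note that only the difference between the shifts enters the loss, so no absolute pitch label is needed anywhere.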
Pitch, as we know, is well defined only when the sound is harmonic; that is, when it contains components at integer multiples of the fundamental frequency. An important function of the model is therefore determining when its output is reliable and meaningful. SPICE has been designed to learn the confidence level of its pitch estimate, also in a self-supervised manner.
The model was evaluated on publicly available datasets and outperformed handcrafted baselines, even though it had no access to ground-truth pitch labels during training. For example, on the MIR-1k dataset, SPICE outperformed CREPE (Convolutional Representation for Pitch Estimation) and SWIPE (Sawtooth Waveform Inspired Pitch Estimator) under four noise conditions: clean, 20 dB, 10 dB and 0 dB.
Let's look at a code implementation in the following sections. It follows the official implementation.
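As a small illustration of what "harmonic" means here, the sketch below checks whether a set of partials sits at integer multiples of a fundamental. The function name and the tolerance are hypothetical and are not part of SPICE.

```python
def is_harmonic(partials_hz, f0, tolerance=0.01):
    """Return True if every partial is close to an integer multiple of f0."""
    for f in partials_hz:
        ratio = f / f0
        nearest = round(ratio)
        if nearest < 1 or abs(ratio - nearest) > tolerance:
            return False
    return True

# A tone with partials at 1x, 2x and 3x the fundamental is harmonic.
print(is_harmonic([220.0, 440.0, 660.0], f0=220.0))
# Detuned partials break the harmonic structure, so pitch is less well defined.
print(is_harmonic([220.0, 450.0, 700.0], f0=220.0))
```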
Code Implementation of Pitch Recognition Using SPICE
Imports and Dependencies
# syntax for installing multiple libraries
# timidity is a lightweight package for playing MIDI files
# libsndfile for reading and writing audio files
!sudo apt-get install -q -y timidity libsndfile1

# All the imports to deal with sound data:
# pydub for manipulating audio files
# numba for fast machine code, parallelising python code
# librosa for audio and sound analysis
# music21 is a python toolkit for computer-aided musicology (CAM)
!pip install -q pydub numba==0.48 librosa music21
MIDI (Musical Instrument Digital Interface) files don't contain any actual audio data, unlike WAV or MP3 files, and are therefore much smaller in size.
# logging, statistics, system access and math operations
import logging
import statistics
import sys
import math

# reading wav files, and playing them back in the Google Colab notebook cell itself
from scipy.io import wavfile
from IPython.display import Audio, Javascript

# AudioSegment for loading and converting audio files
from pydub import AudioSegment
# for computer-aided musicology (study of music/sounds)
import music21

# tensorflow for leveraging the model
import tensorflow as tf
import tensorflow_hub as hub

# manipulation of numeric data and sound data, as well as plotting graphs for audio
import librosa
from librosa import display as librosadisplay
import numpy as np
import matplotlib.pyplot as plt

# decoding base64 (ASCII) text back into binary data
from base64 import b64decode

# the built-in logging module writes status messages to a file or any
# other output stream; only errors are shown here
logger = logging.getLogger()
logger.setLevel(logging.ERROR)

# checking the tf and librosa versions
print("librosa: %s" % librosa.__version__)
print("tensorflow: %s" % tf.__version__)
Audio Input
NOTE: The following JavaScript snippet has been taken from the official GitHub repository; it builds an interface for recording your own input.
RECORD = """
const sleep = time => new Promise(resolve => setTimeout(resolve, time))
const b2text = blob => new Promise(resolve => {
  const reader = new FileReader()
  reader.onloadend = e => resolve(e.srcElement.result)
  reader.readAsDataURL(blob)
})
var record = time => new Promise(async resolve => {
  stream = await navigator.mediaDevices.getUserMedia({ audio: true })
  recorder = new MediaRecorder(stream)
  chunks = []
  recorder.ondataavailable = e => chunks.push(e.data)
  recorder.start()
  await sleep(time)
  recorder.onstop = async ()=>{
    blob = new Blob(chunks)
    text = await b2text(blob)
    resolve(text)
  }
  recorder.stop()
})
"""

def record(sec=5):
    # recording only works inside Google Colab
    try:
        from google.colab import output
    except ImportError:
        print('Not possible to import output from google.colab')
        return ''
    else:
        print('Recording')
        display(Javascript(RECORD))
        # the browser returns the recording as a base64 data URL
        s = output.eval_js('record(%d)' % (sec*1000))
        fname = 'recorded_audio.wav'
        print('Saving to', fname)
        b = b64decode(s.split(',')[1])
        with open(fname, 'wb') as f:
            f.write(b)
        return fname
The following cell lets you choose between different inputs, which are listed below:
- Recording audio with the interface right here in the Colab notebook.
- Uploading a file from the local system.
- Using a file from Google Drive (mount the drive on the notebook).
- Downloading a file from the Internet.
# Input source: 'RECORD', 'UPLOAD', a './drive/...' path or a URL
INPUT_SRC = 'https://storage.googleapis.com/download.tensorflow.org/data/c-scale-metronome.wav'
print('Selected', INPUT_SRC)

# condition to check for the kind of audio input
if INPUT_SRC == 'RECORD':
    # function for recording own voice, choose the duration as suitable but keep it short
    uploaded_file_name = record(5)
elif INPUT_SRC == 'UPLOAD':
    try:
        # upload helper from Google Colab
        from google.colab import files
    except ImportError:
        print("ImportError")
    else:
        uploaded = files.upload()
        for fn in uploaded.keys():
            print('Uploaded file "{name}" of length {length} bytes'.format(
                name=fn, length=len(uploaded[fn])))
        uploaded_file_name = next(iter(uploaded))
        print('Uploaded file: ' + uploaded_file_name)
elif INPUT_SRC.startswith('./drive/'):
    try:
        from google.colab import drive
    except ImportError:
        print("ImportError")
    else:
        # mount google drive for local audio
        drive.mount('/content/drive')
        gdrive_audio_file = 'name provided by you.wav'
        uploaded_file_name = INPUT_SRC
elif INPUT_SRC.startswith('http'):
    !wget --no-check-certificate 'https://storage.googleapis.com/download.tensorflow.org/data/c-scale-metronome.wav' -O c-scale.wav
    uploaded_file_name = 'c-scale.wav'
Audio Data Preparation
SAMPLE_RATE = 16000

# Function to convert the user-created audio to the format the model expects:
# it should have one channel and a 16k sample rate
def convert_audio(user_file, output_file='converted_audio_file.wav'):
    # variable aud for audio from user file input
    aud = AudioSegment.from_file(user_file)
    aud = aud.set_frame_rate(SAMPLE_RATE).set_channels(1)
    # export the audio file in wav format so that we can listen to it
    aud.export(output_file, format="wav")
    return output_file

# Converting to the expected format for the model;
# the uploaded file name is in the variable uploaded_file_name,
# which can be set accordingly
converted = convert_audio(uploaded_file_name)

# Load audio samples from the wav file:
sample_rate, audio_samples = wavfile.read(converted)

# printing some basic information about the audio.
duration = len(audio_samples)/sample_rate
# sample rate set to 16k earlier
print(f'Sample rate: {sample_rate} Hz')
# string formatting for duration with 2 decimal places
print(f'Total duration: {duration:.2f}s')
print(f'Size of the input: {len(audio_samples)}')

# listen to the wav file.
Audio(audio_samples, rate=sample_rate)

# visualize the audio as a waveform.
_ = plt.plot(audio_samples)

Function for getting the spectrogram:

# largest absolute value of a 16-bit sample, used to scale samples to [-1, 1]
MAX_ABS_INT16 = 32768.0

# function for plotting the spectrogram
def plot_spect(x, sample_rate, show_black_and_white=False):
    # magnitude of the short-time Fourier transform
    x_stft = np.abs(librosa.stft(x, n_fft=2048))
    # matplotlib figure
    fig, ax = plt.subplots()
    fig.set_size_inches(20, 10)
    # convert amplitude to decibels
    x_stft_db = librosa.amplitude_to_db(x_stft, ref=np.max)
    if show_black_and_white:
        # spectrogram plot in grayscale
        librosadisplay.specshow(data=x_stft_db, y_axis='log',
                                sr=sample_rate, cmap='gray_r')
    else:
        librosadisplay.specshow(data=x_stft_db, y_axis='log', sr=sample_rate)
    # color bar showing the dB scale
    plt.colorbar(format='%+2.0f dB')

plot_spect(audio_samples / MAX_ABS_INT16, sample_rate=SAMPLE_RATE)
plt.show()
Model Execution
# Loading the SPICE model
mod = hub.load("https://tfhub.dev/google/spice/2")

# feed the audio to the SPICE tf.hub model to obtain pitch and uncertainty outputs as tensors.
# the int16 samples are scaled to floats in [-1, 1] first
model_out = mod.signatures["serving_default"](
    tf.constant(audio_samples / MAX_ABS_INT16, tf.float32))

# pitch output for estimation
pitch_out = model_out["pitch"]
uncertainty_out = model_out["uncertainty"]

# 'Uncertainty' means the inverse of confidence.
confidence_out = 1.0 - uncertainty_out

# plot the pitch and confidence outputs
fig, ax = plt.subplots()
fig.set_size_inches(20, 10)
plt.plot(pitch_out, label='pitch')
plt.plot(confidence_out, label='confidence')
plt.legend(loc="lower right")
plt.show()
Next, we drop the pitch estimates with low confidence and plot the remaining ones.
# store the values in a list
confidence_out = list(confidence_out)
# convert each pitch value to a Python float
pitch_out = [ float(x) for x in pitch_out]

# one index per output frame
indices = range(len(pitch_out))
# keep only the frames with confidence of at least 0.9
confident_pitch_out = [ (i,p)
    for i, p, c in zip(indices, pitch_out, confidence_out) if c >= 0.9 ]
confident_pitch_out_x, confident_pitch_out_y = zip(*confident_pitch_out)
Output
# plotting the pitch values that passed the confidence threshold
fig, ax = plt.subplots()
fig.set_size_inches(20, 10)
ax.set_ylim([0, 1])
plt.scatter(confident_pitch_out_x, confident_pitch_out_y, c="r")
plt.show()
The pitch values returned by SPICE are in the range 0 to 1; we have to convert them to absolute pitch values in Hertz.
def out2hz(pitch_out):
    # These constants have been taken from https://tfhub.dev/google/spice/2
    PT_OFFSET = 25.58
    PT_SLOPE = 63.07
    FMIN = 10.0
    BINS_PER_OCTAVE = 12.0
    # map the model output to a CQT bin, then the bin to a frequency
    cqt_bin = pitch_out * PT_SLOPE + PT_OFFSET
    return FMIN * 2.0 ** (1.0 * cqt_bin / BINS_PER_OCTAVE)

confident_pitch_value_hz = [ out2hz(p) for p in confident_pitch_out_y ]
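To get a feel for this mapping, the self-contained sketch below repeats the conversion with the same constants and adds an inverse. The names out2hz_example and hz2out_example are hypothetical helpers introduced only for this illustration.

```python
import math

# Constants as listed at https://tfhub.dev/google/spice/2
PT_OFFSET = 25.58
PT_SLOPE = 63.07
FMIN = 10.0
BINS_PER_OCTAVE = 12.0

def out2hz_example(pitch_out):
    """Model output in [0, 1] -> frequency in Hz."""
    cqt_bin = pitch_out * PT_SLOPE + PT_OFFSET
    return FMIN * 2.0 ** (cqt_bin / BINS_PER_OCTAVE)

def hz2out_example(freq):
    """Hypothetical inverse: frequency in Hz -> model output scale."""
    cqt_bin = BINS_PER_OCTAVE * math.log2(freq / FMIN)
    return (cqt_bin - PT_OFFSET) / PT_SLOPE

# The two ends of the output range, and where concert A (440 Hz) falls.
print(out2hz_example(0.0), out2hz_example(1.0))
print(hz2out_example(440.0))
```

With these constants, the output range 0 to 1 spans roughly 44 Hz to about 1.7 kHz, and equal steps in the model output correspond to equal musical intervals rather than equal steps in Hertz.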
We can check how good the prediction is by overlaying the predicted pitches on the original spectrogram. The spectrogram is rendered in black and white for better visibility.
plot_spect(audio_samples / MAX_ABS_INT16, sample_rate=SAMPLE_RATE, show_black_and_white=True)
# Conveniently, since the plot is in log scale, the pitch outputs
# also get converted to the log scale automatically by matplotlib.
plt.scatter(confident_pitch_out_x, confident_pitch_value_hz, c="r")
plt.show()
Conversion to Musical Notes
We have to take care of the frames where there is no singing, and of the size of each note (different offsets).
# we have to put zero where there is no singing.
pitch_output_and_rest = [
    out2hz(p) if c >= 0.9 else 0
    for i, p, c in zip(indices, pitch_out, confidence_out)
]
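Once rests are marked with zeros, consecutive non-zero frames can be grouped into sung segments. The helper below is a minimal sketch of that grouping; voiced_segments is a hypothetical name and is not part of the official implementation.

```python
def voiced_segments(pitches_hz):
    """Group consecutive non-zero frames into (start, end) index pairs.

    `end` is exclusive, so pitches_hz[start:end] holds the sung frames.
    """
    segments = []
    start = None
    for i, p in enumerate(pitches_hz):
        if p != 0 and start is None:
            start = i                      # a sung segment begins
        elif p == 0 and start is not None:
            segments.append((start, i))    # a rest closes the segment
            start = None
    if start is not None:
        segments.append((start, len(pitches_hz)))
    return segments

# Zeros stand for rests, as in the list built above.
example = [0, 220.0, 221.0, 0, 0, 440.0]
print(voiced_segments(example))
```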
Adding note offsets
A4 = 440
C0 = A4 * pow(2, -4.75)
note_ = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def hz2offset(freq):
    # Measures the quantization error for a single note.
    if freq == 0:  # Rests always have zero error.
        return None
    # Quantized note.
    h = round(12 * math.log2(freq / C0))
    return 12 * math.log2(freq / C0) - h

# The ideal offset is the mean quantization error for all the notes
# (excluding rests):
offsets = [hz2offset(p) for p in pitch_output_and_rest if p != 0]
print("offsets: ", offsets)

ideal_offset = statistics.mean(offsets)
print("ideal offset: ", ideal_offset)
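With the same C0 reference, a frequency can also be turned into a note name. The sketch below is one way to do that; hz_to_note_name is a hypothetical helper, and its offset argument is meant to receive a tuning offset in semitones such as the mean quantization error computed above.

```python
import math

A4 = 440
C0 = A4 * pow(2, -4.75)
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def hz_to_note_name(freq, offset=0.0):
    """Return the nearest note name (e.g. 'A4') for a frequency in Hz.

    `offset` is a tuning correction in semitones that is subtracted
    before rounding; zero means standard tuning.
    """
    if freq == 0:
        return "Rest"
    # number of semitones above C0, rounded to the nearest note
    h = round(12 * math.log2(freq / C0) - offset)
    octave, n = divmod(h, 12)
    return NOTE_NAMES[n] + str(octave)

print(hz_to_note_name(440.0))
print(hz_to_note_name(261.63))
print(hz_to_note_name(0))
```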
Open Sheet Music Display
NOTE: The following JavaScript snippet is taken from the official GitHub repository; it builds an on-screen interface for displaying the sheet music.
from IPython.display import display, HTML, Javascript
import json

def showScore(score):
    # write the music21 score to a MusicXML file and read it back as text
    xml = open(score.write('musicxml')).read()
    showMusicXML(xml)

def showMusicXML(xml):
    DIV_ID = "OSMD_div"
    display(HTML('<div id="'+DIV_ID+'">loading OpenSheetMusicDisplay</div>'))
    script = """
    var div_id = {{DIV_ID}};
    function loadOSMD() {
        return new Promise(function(resolve, reject){
            if (window.opensheetmusicdisplay) {
                return resolve(window.opensheetmusicdisplay)
            }
            // OSMD script has a 'define' call which conflicts with requirejs
            var _define = window.define // save the define object
            window.define = undefined // now the loaded script will ignore requirejs
            var s = document.createElement( 'script' );
            s.setAttribute( 'src', "https://cdn.jsdelivr.net/npm/opensheetmusicdisplay@0.7.6/build/opensheetmusicdisplay.min.js" );
            //s.setAttribute( 'src', "/custom/opensheetmusicdisplay.js" );
            s.onload=function(){
                window.define = _define
                resolve(opensheetmusicdisplay);
            };
            document.body.appendChild( s ); // browser will try to load the new script tag
        })
    }
    loadOSMD().then((OSMD)=>{
        window.openSheetMusicDisplay = new OSMD.OpenSheetMusicDisplay(div_id, {
            drawingParameters: "compacttight"
        });
        openSheetMusicDisplay
            .load({{data}})
            .then(
                function() {
                    openSheetMusicDisplay.render();
                }
            );
    })
    """.replace('{{DIV_ID}}',DIV_ID).replace('{{data}}',json.dumps(xml))
    display(Javascript(script))
    return
EndNote
We can easily listen back to the audio files once they are converted to WAV format. I recommend experimenting with different recordings, longer durations and different sounds from online resources. In this article, we saw how SPICE avoids the annotated-data problem of earlier approaches and used its self-supervised model for Pitch Estimation.