
Sound Pitch Recognition Using SPICE


Hearing is a vital sense for us human beings. The pitch of a sound is an approximate perceptual measure of its frequency: a high pitch corresponds to a higher frequency, and a low pitch to a lower one. Our auditory system tracks relative differences in pitch, which is how we recognize different sounds with their own characteristics. A perfect example is listening to a song, where we can distinguish between its melodies.
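
To hear this relationship for yourself, the short snippet below (a toy illustration, not part of the SPICE tutorial; the tone names and the 220/440 Hz choices are mine) generates two pure tones with NumPy. The 440 Hz tone is perceived as higher in pitch than the 220 Hz one.

 import numpy as np
 from IPython.display import Audio

 sr = 16000                               # sample rate in Hz
 t = np.linspace(0, 1.0, sr, endpoint=False)
 low_tone = np.sin(2 * np.pi * 220 * t)   # lower frequency -> lower pitch
 high_tone = np.sin(2 * np.pi * 440 * t)  # higher frequency -> higher pitch
 Audio(np.concatenate([low_tone, high_tone]), rate=sr)  # play one after the other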

Today’s article is about pitch recognition, also known as pitch estimation. This domain has received a great deal of attention over the past few decades because of its importance in fields ranging from music information retrieval to speech analysis. Traditionally, pitch estimation was implemented with handcrafted techniques operating in either the time domain or the frequency domain. These handcrafted models posed one problem: the need for annotated data. Obtaining labels at the frequency and temporal resolution required for training is a tedious and laborious task.

In November 2019, Marco Tagliasacchi, a research scientist at Google Research, presented a solution to the missing-annotation problem described above. In basic terms, the approach estimates the relative pitch difference between sounds rather than their absolute pitch. SPICE (Self-supervised PItCh Estimation) was designed around this idea and presented in the accompanying research paper.

The model consists of a convolutional encoder that produces a single scalar embedding which maps linearly to pitch. Two signals are fed to the encoder (one reference and one pitch-shifted at random), and the inputs are represented in the constant-Q transform domain, where a pitch shift conveniently corresponds to a simple translation along the frequency axis.

A loss function is devised that forces the difference between the two scalar embeddings to stay proportional to the already-known pitch difference between the inputs.
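
As a rough illustration of this idea, here is a minimal sketch in TensorFlow. The toy_encoder, the relative_pitch_loss helper, the layer sizes and the sigma scale are all assumptions for illustration, and a plain squared error stands in for the paper's loss; this is not the official SPICE architecture or training code.

 import tensorflow as tf

 # toy convolutional encoder: maps one CQT frame of shape (batch, num_bins, 1)
 # to a single scalar in [0, 1] (hypothetical layer sizes, not the real model)
 toy_encoder = tf.keras.Sequential([
     tf.keras.layers.Conv1D(8, 3, strides=2, activation="relu"),
     tf.keras.layers.Conv1D(16, 3, strides=2, activation="relu"),
     tf.keras.layers.GlobalAveragePooling1D(),
     tf.keras.layers.Dense(1, activation="sigmoid"),
 ])

 def relative_pitch_loss(frame_a, frame_b, shift_in_bins, sigma=1e-2):
   # frame_a and frame_b are pitch-shifted versions of the same audio; the loss
   # forces the difference of the two scalar embeddings to be proportional to
   # the known shift, measured in CQT bins (plain squared error for simplicity)
   y_a = toy_encoder(frame_a)
   y_b = toy_encoder(frame_b)
   shift = tf.reshape(tf.cast(shift_in_bins, tf.float32), (-1, 1))
   return tf.reduce_mean(tf.square((y_a - y_b) - sigma * shift))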

Pitch, as we know, is well defined only when the signal is harmonic, that is, when it contains components at integer multiples of a fundamental frequency. An important function of the model is therefore to determine when its output is reliable and meaningful: SPICE is designed to learn the level of confidence of its pitch estimate, again in a self-supervised manner.
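
For intuition about harmonicity, a harmonic tone can be synthesised as a sum of sinusoids at integer multiples of a fundamental frequency. The snippet below is again a toy example outside the official tutorial (the 220 Hz fundamental and 1/k amplitudes are arbitrary choices); the resulting tone has a well-defined pitch at 220 Hz.

 import numpy as np
 from IPython.display import Audio

 sr, f0 = 16000, 220.0                       # sample rate and fundamental (Hz)
 t = np.linspace(0, 1.0, sr, endpoint=False)
 # components at integer multiples of f0 (harmonics), decaying in amplitude
 harmonic_tone = sum(np.sin(2 * np.pi * k * f0 * t) / k for k in range(1, 6))
 Audio(harmonic_tone, rate=sr)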

The model was evaluated on publicly available datasets and outperformed handcrafted baselines which, like SPICE, had no access to true labels or absolute pitch values. For example, SPICE outperformed CREPE (Convolutional Representation for Pitch Estimation) and SWIPE (Sawtooth Waveform Inspired Pitch Estimator) on the MIR-1k dataset under four conditions: clean, 20 dB, 10 dB and 0 dB.

Let’s look at a code implementation in the following sections. The implementation below is based on the official implementation.

Code Implementation of Pitch Recognition Using SPICE

Imports and Dependencies
 # install the required system packages
 # timidity is a lightweight package for playing and rendering MIDI files
 # libsndfile is a library for reading and writing audio files
 !sudo apt-get install -q -y timidity libsndfile1
 '''
 All the imports to deal with sound data:
 pydub for manipulating audio files
 numba for compiling Python to fast machine code and parallelising it
 librosa for audio and music analysis
 music21 is a Python toolkit for computer-aided musicology (CAM)
 '''
 !pip install -q pydub numba==0.48 librosa music21

MIDI (Musical Instrument Digital Interface) files do not contain any actual audio data the way WAV or MP3 files do; hence they are much smaller in size.
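
Although this walkthrough feeds WAV audio to the model, timidity (installed above) can render a MIDI file to WAV for listening. The file names below are placeholders, not files produced by this tutorial.

 # render a placeholder MIDI file to WAV with timidity (-Ow selects RIFF WAVE output)
 !timidity example.mid -Ow -o example.wav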

 # logging, statistics, system access and math utilities
 import logging
 import statistics
 import sys
 import math
 # wavfile for reading WAV files; Audio and Javascript for display inside the Colab notebook cell
 from scipy.io import wavfile
 from IPython.display import Audio, Javascript
 # AudioSegment from pydub for loading and converting audio files
 from pydub import AudioSegment
 # music21 for computer-aided musicology (the study of music/sounds)
 import music21
 # tensorflow and tensorflow_hub for loading and running the SPICE model
 import tensorflow as tf
 import tensorflow_hub as hub
 # manipulation of numeric data and sound data as well as plotting different graphs for audio
 import librosa
 from librosa import display as librosadisplay
 import numpy as np
 import matplotlib.pyplot as plt
 # base64 for decoding the browser-recorded audio (ASCII back to binary)
 from base64 import b64decode
 # the built-in logging module allows writing status messages to a file or
 # other output streams; here we only show errors
 logger = logging.getLogger()
 logger.setLevel(logging.ERROR)
 # checking the tf and librosa versions
 print("librosa: %s" % librosa.__version__)
 print("tensorflow: %s" % tf.__version__) 
Audio Input

NOTE: The following JavaScript snippet has been taken from the official GitHub repository; it builds an interface for recording your own input.

 RECORD = """
 const sleep  = time => new Promise(resolve => setTimeout(resolve, time))
 const b2text = blob => new Promise(resolve => {
   const reader = new FileReader()
   reader.onloadend = e => resolve(e.srcElement.result)
   reader.readAsDataURL(blob)
 })
 var record = time => new Promise(async resolve => {
   stream = await navigator.mediaDevices.getUserMedia({ audio: true })
   recorder = new MediaRecorder(stream)
   chunks = []
   recorder.ondataavailable = e => chunks.push(e.data)
   recorder.start()
   await sleep(time)
   recorder.onstop = async ()=>{
     blob = new Blob(chunks)
     text = await b2text(blob)
     resolve(text)
   }
   recorder.stop()
 })
 """
 def record(sec=5):
   try:
     from google.colab import output
   except ImportError:
     print('Could not import output from google.colab')
     return ''
   else:
     print('Recording')
     display(Javascript(RECORD))
     s = output.eval_js('record(%d)' % (sec*1000))
     fname = 'recorded_audio.wav'
     print('Saving to', fname)
     b = b64decode(s.split(',')[1])
     with open(fname, 'wb') as f:
       f.write(b)
     return fname 

The following block lets us choose among different input sources:

  1. Record audio with the interface right here in the Colab notebook.
  2. Upload a file from the local system.
  3. Use a file from Google Drive (mount the drive on the notebook).
  4. Download a file from the Internet.
 # Input source: a URL, 'RECORD', 'UPLOAD' or a './drive/...' path
 INPUT_SOURCE = 'https://storage.googleapis.com/download.tensorflow.org/data/c-scale-metronome.wav'
 print('Selected', INPUT_SOURCE)
 # branch on the selected input source
 if INPUT_SOURCE == 'RECORD':
   # record your own voice; choose a suitable duration in seconds, but keep it short
   uploaded_file_name = record(5)
 elif INPUT_SOURCE == 'UPLOAD':
   try:
     # file-upload helper from google.colab
     from google.colab import files
   except ImportError:
     print("ImportError")
   else:
     uploaded = files.upload()
     for fn in uploaded.keys():
       print('Uploaded file "{name}" of length {length} bytes'.format(
           name=fn, length=len(uploaded[fn])))
     uploaded_file_name = next(iter(uploaded))
     print('Uploaded file: ' + uploaded_file_name)
 elif INPUT_SOURCE.startswith('./drive/'):
   try:
     from google.colab import drive
   except ImportError:
     print("ImportError")
   else:
     # mount google drive for local audio
     drive.mount('/content/drive')
     gdrive_audio_file = 'name provided by you.wav'
     uploaded_file_name = INPUT_SOURCE
 elif INPUT_SOURCE.startswith('http'):
   !wget --no-check-certificate 'https://storage.googleapis.com/download.tensorflow.org/data/c-scale-metronome.wav' -O c-scale.wav
   uploaded_file_name = 'c-scale.wav' 
Audio Data Preparation 
 SAMPLE_RATE = 16000
 # Function to convert the user-created audio to the format the model expects:
 # one channel at a 16 kHz sample rate
 def convert_audio(user_file, output_file='converted_audio_file.wav'):
   # load the audio from the user-supplied file
   aud = AudioSegment.from_file(user_file)
   aud = aud.set_frame_rate(SAMPLE_RATE).set_channels(1)
   # export the audio file in wav format so that we can listen to it
   aud.export(output_file, format="wav")
   return output_file
 # Converting to the expected format for the model,
 # the uploaded file name is at
 # the variable uploaded_file_name which can be set accordingly
 converted = convert_audio(uploaded_file_name)
 # Load audio samples from the wav file:
 sample_rate, audio_samples = wavfile.read(converted)
 # printing some basic information about the audio.
 duration = len(audio_samples)/sample_rate
 # sample rate set to 16k earlier
 print(f'Sample rate: {sample_rate} Hz')
 # string formatting for duration at 2 decimal point
 print(f'Total duration: {duration:.2f}s')
 print(f'Size of the input: {len(audio_samples)}')
 # listen to the wav file.
 Audio(audio_samples, rate=sample_rate)
Function for plotting the spectrogram:
 # visualize the audio as a waveform.
 _ = plt.plot(audio_samples)
 MAX_ABS_INT16 = 32768.0
 # function for plotting spectrogram
 def plot_stft(x, sample_rate, show_black_and_white=False):
   # start for plot
   x_stft = np.abs(librosa.stft(x, n_fft=2048))
   # matplotlib 
   fig, ax = plt.subplots()
   # set a large figure size for readability
   fig.set_size_inches(20, 10)
   # setting amplitude function
   x_stft_db = librosa.amplitude_to_db(x_stft, ref=np.max)
   if(show_black_and_white):
     # get the spect. plot
     librosadisplay.specshow(data=x_stft_db, y_axis='log', 
                              sr=sample_rate, cmap='gray_r')
   else:
     librosadisplay.specshow(data=x_stft_db, y_axis='log', sr=sample_rate)
   # color bar is necessary
   plt.colorbar(format='%+2.0f dB')
 plot_stft(audio_samples / MAX_ABS_INT16, sample_rate=SAMPLE_RATE)
 plt.show() 
Model Execution
 # Loading the SPICE model 
 mod = hub.load("https://tfhub.dev/google/spice/2")
 # feed the audio to the SPICE tf.hub model to obtain pitch and uncertainty outputs as tensors.
 # the model expects float samples in the range [-1, 1], hence the normalisation
 model_out = mod.signatures["serving_default"](
     tf.constant(audio_samples / MAX_ABS_INT16, tf.float32))
 # pitch output for estimation
 pitch_out = model_out["pitch"]
 uncertainty_out = model_out["uncertainty"]
 # 'Uncertainty' means the inverse of confidence.
 confidence_out = 1.0 - uncertainty_out
 # again a plot for above uncertainty and confidence
 fig, ax = plt.subplots()
 fig.set_size_inches(20, 10)
 plt.plot(pitch_out, label='pitch')
 plt.plot(confidence_out, label='confidence')
 plt.legend(loc="lower right")
 plt.show() 

Now we have to remove the pitch estimates with low confidence scores and plot the remaining ones.

 # store the values in a list
 confidence_out = list(confidence_out)
 # traverse the list
 pitch_out = [ float(x) for x in pitch_out]
 # indexing through the length
 indices = range(len(pitch_out))
 confident_pitch_out = [ (i,p)  
   for i, p, c in zip(indices, pitch_out, confidence_out) if  c >= 0.9  ]
 confident_pitch_out_x, confident_pitch_out_y = zip(*confident_pitch_out) 
Output
 # plotting graph for higher pitch scores
 fig, ax = plt.subplots()
 fig.set_size_inches(20, 10)
 ax.set_ylim([0, 1])
 plt.scatter(confident_pitch_out_x, confident_pitch_out_y, c="r")
 plt.show() 

The pitch values returned by SPICE are in the range of 0 – 1; we have to convert them to absolute pitch values in Hertz.

 def out2hz(pitch_out):
   # These Constants have been taken from https://tfhub.dev/google/spice/2
   PT_OFFSET = 25.58
   PT_SLOPE = 63.07
   FMIN = 10.0
   BINS_PER_OCTAVE = 12.0
   # convert the model output to a CQT bin, then to a frequency in Hz
   cqt_bin = pitch_out * PT_SLOPE + PT_OFFSET
   return FMIN * 2.0 ** (1.0 * cqt_bin / BINS_PER_OCTAVE)
 confident_pitch_value_hz = [ out2hz(p) for p in confident_pitch_out_y ] 

Let’s check how good the predictions are by overlaying the predicted pitches on the original spectrogram. The spectrogram is rendered in black and white for better visibility.

 plot_stft(audio_samples / MAX_ABS_INT16,
           sample_rate=SAMPLE_RATE, show_black_and_white=True)
 # Conveniently, since the plot is in log scale, the pitch outputs 
 # also get converted to the log scale automatically by matplotlib.
 plt.scatter(confident_pitch_out_x, confident_pitch_value_hz, c="r")
 plt.show() 
Conversion to Musical Notes
We need to handle the parts where there is no singing and account for the size of each note (the different offsets).
 # we have to put zero where there is no singing.
 pitch_output_and_rest = [
     out2hz(p) if c >= 0.9 else 0
     for i, p, c in zip(indices, pitch_out, confidence_out)
 ] 

Adding note offsets

 A4 = 440
 C0 = A4 * pow(2, -4.75)
 note_ = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
 def hz2offset(freq):
   # Measures the quantization error for a single note.
   if freq == 0:  
     # Rests always have zero error.
     return None
   # Quantized note.
   h = round(12 * math.log2(freq / C0))
   return 12 * math.log2(freq / C0) - h
 # The ideal offset is the mean quantization error for all the notes
 # (excluding rests):
 offsets = [hz2offset(p) for p in pitch_output_and_rest if p != 0]
 print("offsets: ", offsets)
 ideal_offset = statistics.mean(offsets)
 print("ideal offset: ", ideal_offset) 
Open Sheet Music Display
NOTE: The following is a JavaScript snippet from the official GitHub repository; it builds an on-screen interface for rendering the sheet music with OpenSheetMusicDisplay.
 from IPython.core.display import display, HTML, Javascript
 import json, random
 def showScore(score):
     xml = open(score.write('musicxml')).read()
     showMusicXML(xml)
 def showMusicXML(xml):
     DIV_ID = "OSMD_div"
     display(HTML('<div id="'+DIV_ID+'">loading OpenSheetMusicDisplay</div>'))
     script = """
     var div_id = {{DIV_ID}};
     function loadOSMD() { 
         return new Promise(function(resolve, reject){
             if (window.opensheetmusicdisplay) {
                 return resolve(window.opensheetmusicdisplay)
             }
             // OSMD script has a 'define' call which conflicts with requirejs
             var _define = window.define // save the define object 
             window.define = undefined // now the loaded script will ignore requirejs
             var s = document.createElement( 'script' );
             s.setAttribute( 'src', "https://cdn.jsdelivr.net/npm/opensheetmusicdisplay@0.7.6/build/opensheetmusicdisplay.min.js" );
             //s.setAttribute( 'src', "/custom/opensheetmusicdisplay.js" );
             s.onload=function(){
                 window.define = _define
                 resolve(opensheetmusicdisplay);
             };
             document.body.appendChild( s ); // browser will try to load the new script tag
         }) 
     }
     loadOSMD().then((OSMD)=>{
         window.openSheetMusicDisplay = new OSMD.OpenSheetMusicDisplay(div_id, {
           drawingParameters: "compacttight"
         });
         openSheetMusicDisplay
             .load({{data}})
             .then(
               function() {
                 openSheetMusicDisplay.render();
               }
             );
     })
     """.replace('{ {DIV_ID} }',DIV_ID).replace('{ {data} }',json.dumps(xml))
     display(Javascript(script))
     return 
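
As a quick sanity check of the display helper, a few hard-coded music21 notes can be rendered. This is a hypothetical usage example of mine; the full official notebook builds the score from the model's note sequence instead.

 # hypothetical usage example: render three hard-coded notes with showScore
 sc = music21.stream.Stream()
 for name in ["C4", "D4", "E4"]:
   sc.append(music21.note.Note(name, quarterLength=1))
 showScore(sc)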

EndNote

We can easily listen back to the audio files by converting them to WAV format. I recommend experimenting with different inputs: record your own audio, try longer durations, and use different sounds from online resources. In this article, we worked around the annotation problem of traditional handcrafted approaches and applied a self-supervised technique for pitch estimation.

