Guide to YAMNet: Sound Event Classifier

Transfer learning is a popular machine learning technique in which a model is trained by reusing knowledge learned by an existing model. You have probably come across its common applications in the vision domain (classifying images, detecting objects) or the text domain (sentiment analysis, question answering), and the list goes on.

In this article, we will learn how to apply transfer learning to a relatively new type of data, audio, by building a sound classifier. Sound classification has many important use cases, such as tracking whales and other creatures that depend on sound to navigate, or protecting wildlife from poaching and encroachment.

With YAMNet, we can create a sound classifier in a few simple steps!

YAMNet (Yet Another Mobile Network, yes, that really is the full form) is a pretrained acoustic event detection model trained by Dan Ellis on the AudioSet dataset, which contains labelled audio from more than 2 million YouTube videos. It employs the MobileNet_v1 depthwise-separable convolution architecture. The pretrained model is readily available on TensorFlow Hub, along with TFLite (a lite model for mobile) and TF.js (for running on the web) versions.
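Before diving in, it helps to know the model's interface: it takes a 1-D float32 waveform sampled at 16 kHz and returns per-frame scores over the 521 AudioSet classes, a 1024-dimensional embedding per frame, and a log-mel spectrogram. The minimal sketch below (the variable name yamnet and the one-second silent test clip are purely illustrative) shows these inputs and outputs:

 # Minimal sketch of the YAMNet interface (illustrative only).
 import numpy as np
 import tensorflow_hub as hub

 yamnet = hub.load('https://tfhub.dev/google/yamnet/1')
 silence = np.zeros(16000, dtype=np.float32)  # one second of silence at 16 kHz
 scores, embeddings, spectrogram = yamnet(silence)
 print(scores.shape)       # (num_frames, 521)  - one score per AudioSet class per frame
 print(embeddings.shape)   # (num_frames, 1024) - per-frame embeddings, reusable for transfer learning
 print(spectrogram.shape)  # (num_spectrogram_frames, 64) - log-mel spectrogram

The embeddings output, in particular, is what you would feed to your own classifier when transferring YAMNet to a new audio task.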

Let’s understand this impressive model better with a practical use case and a hands-on Python implementation.

 Importing the Dependencies 

We import tensorflow_hub to leverage the pre-trained model, wavfile to read audio files, and scipy.signal to resample audio when required.

IPython.display lets us play audio right here in the notebook.

 import tensorflow as tf
 import tensorflow_hub as hub
 import numpy as np
 import csv
 import matplotlib.pyplot as plt
 import scipy.signal  # needed by ensure_sample_rate below for resampling
 from IPython.display import Audio
 from scipy.io import wavfile

Loading the Model

We load the pre-trained model into a variable using the hub.load method so it can be used in the cells below. The labels file is shipped with the model assets and is available at model.class_map_path(); we will need it later to populate the class_names variable.

 # Load the model.
 model = hub.load('https://tfhub.dev/google/yamnet/1')  

Helper_Function_1

This helper function finds the name of the class with the top score when the scores are mean-aggregated across frames.

 # Find the name of the class with the top score when mean-aggregated across frames.
 def class_names_from_csv(class_map_csv_text):
   """Returns list of class names corresponding to score vector."""
   class_names = []
   with tf.io.gfile.GFile(class_map_csv_text) as csvfile:
     reader = csv.DictReader(csvfile)
     for row in reader:
       class_names.append(row['display_name'])
   return class_names
 class_map_path = model.class_map_path().numpy()
 class_names = class_names_from_csv(class_map_path) 
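As an optional sanity check (not part of the original walkthrough), you can confirm that the label map contains the 521 AudioSet class names and peek at the first few entries:

 # Optional sanity check on the loaded label map.
 print(len(class_names))   # expected: 521 AudioSet classes
 print(class_names[:3])    # the first few labels, e.g. 'Speech', ...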

Helper_Function_2

We need this helper to verify the sample rate of the loaded audio and, if necessary, convert it to the 16 kHz the model expects. The model documentation states this requirement, and feeding audio at a different sample rate can adversely affect the results.

 def ensure_sample_rate(original_sample_rate, waveform,
                        desired_sample_rate=16000):
   """Resample waveform if required."""
   if original_sample_rate != desired_sample_rate:
     desired_length = int(round(float(len(waveform)) /
                                original_sample_rate * desired_sample_rate))
     waveform = scipy.signal.resample(waveform, desired_length)
   return desired_sample_rate, waveform 
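To see the helper in action before touching real data, here is a purely illustrative example that resamples a synthetic one-second 440 Hz tone from 44.1 kHz down to 16 kHz:

 # Illustrative check: resample a synthetic 44.1 kHz tone down to 16 kHz.
 t = np.linspace(0, 1, 44100, endpoint=False)
 tone = np.sin(2 * np.pi * 440 * t)           # one second of a 440 Hz sine wave
 sr, resampled = ensure_sample_rate(44100, tone)
 print(sr, len(resampled))                    # 16000 16000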

Downloading and Preparing the sound file 

The Colab notebook will have all links required; you just have to run the notebook provided.

 !curl -O https://storage.googleapis.com/audioset/speech_whistling2.wav
 !curl -O https://storage.googleapis.com/audioset/miaow_16k.wav 

We can listen to one of the downloaded audio files and check its properties with the following snippet, which prints some basic information about the file and then plays it.

 # wav_file_name = 'speech_whistling2.wav'
 wav_file_name = 'miaow_16k.wav'
 sample_rate, wav_data = wavfile.read(wav_file_name)
 sample_rate, wav_data = ensure_sample_rate(sample_rate, wav_data)
 # Show some basic information about the audio.
 duration = len(wav_data)/sample_rate
 print(f'Sample rate: {sample_rate} Hz')
 print(f'Total duration: {duration:.2f}s')
 print(f'Size of the input: {len(wav_data)}')
 # Listening to the wav file.
 Audio(wav_data, rate=sample_rate) 

Running the Model 

We normalise the wave data to the [-1.0, 1.0] range expected by the pre-trained model and run it through the model, which returns scores, embeddings, and a spectrogram that we can display later. Using Helper_Function_1, the inferred class comes out as “Animal”, which is the label with the highest mean score across frames for this clip (a cat’s meow).

 waveform = wav_data / tf.int16.max
 # Run the model, check the output.
 scores, embeddings, spectrogram = model(waveform)
 scores_np = scores.numpy()
 spectrogram_np = spectrogram.numpy()
 inferred_class = class_names[scores_np.mean(axis=0).argmax()]
 print(f'The main sound is: {inferred_class}') 
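If you are curious about more than a single label, a small optional addition like the one below prints the few highest-scoring classes along with their mean scores:

 # Optional: look at the five highest-scoring classes, not just the top one.
 mean_scores = scores_np.mean(axis=0)
 top5 = np.argsort(mean_scores)[::-1][:5]
 for i in top5:
     print(f'{class_names[i]}: {mean_scores[i]:.3f}')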

Plotting the Output

We now plot the waveform along with two of the model outputs: the log-mel spectrogram and the scores for the top-scoring classes.

 plt.figure(figsize=(10, 6))
 # Plot the waveform.
 plt.subplot(3, 1, 1)
 plt.plot(waveform)
 plt.xlim([0, len(waveform)])
 # Plot the log-mel spectrogram (returned by the model).
 plt.subplot(3, 1, 2)
 plt.imshow(spectrogram_np.T, aspect='auto', interpolation='nearest', origin='lower')
 # Plot and label the model output scores for the top-scoring classes.
 mean_scores = np.mean(scores, axis=0)
 top_n = 10
 top_class_indices = np.argsort(mean_scores)[::-1][:top_n]
 plt.subplot(3, 1, 3)
 plt.imshow(scores_np[:, top_class_indices].T, aspect='auto', interpolation='nearest', cmap='gray_r')
 # patch_padding = (PATCH_WINDOW_SECONDS / 2) / PATCH_HOP_SECONDS
 # values from the model documentation
 patch_padding = (0.025 / 2) / 0.01
 plt.xlim([-patch_padding-0.5, scores.shape[0] + patch_padding-0.5])
 # Label the top_N classes.
 yticks = range(0, top_n, 1)
 plt.yticks(yticks, [class_names[top_class_indices[x]] for x in yticks])
 _ = plt.ylim(-0.5 + np.array([top_n, 0])) 

After running the above snippets, a figure with the waveform, the log-mel spectrogram and the per-frame scores of the top classes will be displayed.

And there you have it: a sound classifier built on the pretrained YAMNet model. I recommend trying the model on different open-source datasets, including those linked in this blog; a small wrapper like the one shown below makes it easy to test new files. It is an unusual but rewarding use case that will broaden your deep learning skillset.
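For convenience, here is a hypothetical wrapper (the classify_wav name is my own, and it assumes 16-bit mono WAV input like the sample files) that chains the steps above for any audio file:

 # Hypothetical convenience wrapper that reuses the pieces defined above.
 def classify_wav(path):
     """Return YAMNet's top-scoring class for a 16-bit mono WAV file."""
     sample_rate, wav_data = wavfile.read(path)
     sample_rate, wav_data = ensure_sample_rate(sample_rate, wav_data)
     waveform = wav_data / tf.int16.max
     scores, _, _ = model(waveform)
     return class_names[scores.numpy().mean(axis=0).argmax()]

 print(classify_wav('miaow_16k.wav'))  # expected: 'Animal'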

The Google Colab notebook is present here for reference. 

Mudit Rustagi
Mudit is experienced in machine learning and deep learning. He is an undergraduate in Mechatronics and has worked as a team lead (ML team) on several projects. He has a strong interest in building state-of-the-art ML projects and writing blogs on data science and machine learning.
