A Guide To Audio Data Preparation Using TensorFlow

Data science deals with many formats of data. At the basic level, we start with CSV and Excel files, but as we dig deeper, almost every source of information can be treated as data, whatever form it takes. Images, video and audio can all tell us a great deal. In audio data analysis, we perform operations such as automatic speech recognition, digital signal processing and music classification. In this article, we analyze audio data; audio is an unstructured form of data, and to work with it we first need to give it structure.

Introduction to Audio Data 

Listening to audio is part of our daily life: our minds work with the information the sound carries and make decisions based on it. In computer science, however, audio data comes in many formats; some common examples are listed below (a short decoding sketch follows the list):

  • MP3 format
  • WAV format
  • WMA format
  • FLAC format
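
Each of these formats needs its own decoder. As a quick illustration, core TensorFlow can decode WAV files directly; below is a minimal sketch, assuming a local file sample.wav exists (the path is a hypothetical placeholder, not part of this tutorial):

 import tensorflow as tf

 # Read the raw bytes of a WAV file and decode them into a float tensor
 contents = tf.io.read_file('sample.wav')  # hypothetical placeholder path
 waveform, sample_rate = tf.audio.decode_wav(contents)
 print(waveform.shape, sample_rate)  # (samples, channels) and e.g. 16000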

One of the biggest challenges in working with audio data is preparing the audio file. A signal has two complementary views: the time domain and the frequency domain. In the time domain, time is the independent variable, and the graph shows how the signal changes over time. In the frequency domain, frequency is the independent variable, and the graph shows how much of the signal lies in each frequency band over a range of frequencies. Moving between these two views is part of what makes audio data analysis challenging.
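
To make the two views concrete, here is a minimal sketch that moves a signal from the time domain to the frequency domain with a fast Fourier transform; the 440 Hz sine wave is a synthetic signal assumed purely for illustration:

 import numpy as np
 import tensorflow as tf

 # Time domain: a 440 Hz sine wave sampled at 16 kHz for one second
 sample_rate = 16000
 t = np.arange(sample_rate) / sample_rate
 signal = np.sin(2 * np.pi * 440 * t).astype(np.float32)

 # Frequency domain: magnitude of the real FFT of the same signal
 spectrum = tf.abs(tf.signal.rfft(signal))
 print(tf.argmax(spectrum).numpy())  # 440: each bin spans exactly 1 Hz here

In the time-domain view we would see the oscillating wave itself; the frequency-domain view collapses it into a single peak at 440 Hz.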


The TensorFlow ecosystem provides the TensorFlow I/O (tensorflow-io) package for preparing audio data.

Getting started with the Code Implementation

In this article, we take a FLAC-format audio file, brooklyn.flac, and give it structure using TensorFlow. The file is publicly available via Google Cloud at the following address:


gs://cloud-samples-tests/speech/brooklyn.flac

Setting up the Google Colab environment:

Installing the required package:

!pip install tensorflow-io

Importing libraries:

 import tensorflow as tf
 import tensorflow_io as tfio
 from IPython.display import Audio
 import matplotlib.pyplot as plt 

Read the brooklyn.flac audio file:

Input:

 audio = tfio.audio.AudioIOTensor('gs://cloud-samples-tests/speech/brooklyn.flac')
 print(audio) 

Output:

The file read above is a mono-channel audio file containing 28979 samples of type int16. Its contents can only be read by converting the AudioIOTensor to a tensor, either with to_tensor() or by slicing.

Input :

 audio_slice = audio[100:]  # slice away the first 100 samples
 # remove the last (channel) dimension
 audio_tensor = tf.squeeze(audio_slice, axis=[1])
 print(audio_tensor)

Output:
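
Alternatively, the whole file can be converted in one call with to_tensor(); a small sketch equivalent to the slicing above:

 # Convert the entire AudioIOTensor into a regular tensor at once
 full_tensor = audio.to_tensor()
 print(full_tensor.shape)  # (28979, 1): all samples, one channel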

The audio can be played with:

Input :

 from IPython.display import Audio
 Audio(audio_tensor.numpy(), rate=audio.rate.numpy()) 

Output:

To understand the audio, it helps to plot the waveform, i.e. the amplitude of the signal over time. We can do this using matplotlib.pyplot:

Input :

 # normalize the int16 samples to floats in [-1.0, 1.0]
 tensor = tf.cast(audio_tensor, tf.float32) / 32768.0
 plt.figure()
 plt.plot(tensor.numpy())

Output:

In the graph, we can see the amplitude of the signal changing over time with the loudness of the speech.

Let’s trim the noise in the audio.

Noise is an unwanted, unpleasant sound in audio data. It can be trimmed using the tfio.audio.trim API, which returns the start and stop positions of the segment whose energy exceeds the epsilon threshold:

Input :

 # find the start and stop positions of the non-silent segment
 position = tfio.audio.trim(tensor, axis=0, epsilon=0.1)
 print(position)
 start = position[0]
 stop = position[1]
 print(start, stop)
 # keep only the samples between start and stop
 processed = tensor[start:stop]
 plt.figure()
 plt.plot(processed.numpy())
 Audio(processed.numpy(), rate=audio.rate.numpy())

Output :

Here we can see that the silent, noisy portions at the beginning and end of the audio have been removed.

Fade in and fade out

In audio analysis, fade-in and fade-out are techniques for gradually increasing or decreasing the volume of the audio. Using TensorFlow I/O, it can be done by:

Input :

 # fade in over the first 1000 samples and out over the last 2000
 fade = tfio.audio.fade(
     processed, fade_in=1000, fade_out=2000, mode="logarithmic")
 plt.figure()
 plt.plot(fade.numpy())

Output :

After the fade is applied, the audio rises gradually in volume at the start and falls away at the end.


Spectrogram

A spectrogram is a graph that shows how the energy of an audio signal is distributed across frequencies over time. Brighter colors mark frequencies where the sound is concentrated; darker colors mark frequencies where there is almost no sound.
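
Under the hood, a spectrogram is produced by a short-time Fourier transform (STFT): the signal is cut into overlapping windows and an FFT is taken of each one. A rough core-TensorFlow equivalent, sketched here under the assumption that it approximates (but is not exactly) what tfio computes:

 # STFT of the faded audio: 512-sample frames with a hop of 256
 stft = tf.signal.stft(fade, frame_length=512, frame_step=256, fft_length=512)
 magnitude = tf.abs(stft)  # shape: (frames, 257), i.e. fft_length / 2 + 1 bins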

To make a spectrogram of the audio file, we use tfio.audio.spectrogram:

Input:

 # Convert to spectrogram: 512-sample windows with a hop (stride) of 256
 spectrogram = tfio.audio.spectrogram(
     fade, nfft=512, window=512, stride=256)
 plt.figure()
 plt.imshow(tf.math.log(spectrogram).numpy())

Output:

The mel scale is a scale of pitches judged by listeners to be equally spaced from one another. A mel spectrogram is a spectrogram whose frequencies have been converted to the mel scale, and a dB-scale mel spectrogram additionally converts its magnitudes to a logarithmic decibel scale. In this step, we create a mel spectrogram and a dB-scale mel spectrogram of our audio.
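
For reference, a widely used formula for the mel scale (the HTK convention; whether tfio.audio.melscale uses exactly this variant is an assumption) maps a frequency f in Hz to mels as sketched below:

 import math

 def hz_to_mel(f):
     # HTK-style mel scale: mel = 2595 * log10(1 + f / 700)
     return 2595.0 * math.log10(1.0 + f / 700.0)

 print(hz_to_mel(1000))  # close to 1000 mels by construction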

Input :

 # Convert to mel-spectrogram
 mel_spectrogram = tfio.audio.melscale(
     spectrogram, rate=16000, mels=128, fmin=0, fmax=8000)
 plt.figure()
 plt.imshow(tf.math.log(mel_spectrogram).numpy())
 # Convert to db scale mel-spectrogram
 dbscale_mel_spectrogram = tfio.audio.dbscale(
     mel_spectrogram, top_db=80)
 plt.figure()
 plt.imshow(dbscale_mel_spectrogram.numpy()) 

Output:

In audio data analysis, removal of noise is a required practice, and frequency masking and time masking are two approaches that work well for it.

Frequency masking

In frequency masking, we suppress a band of frequencies in the spectrogram; tfio.audio.freq_mask masks a random range of up to param consecutive frequency channels.

Input :

 # Frequency masking: mask a random band of frequency channels
 freq_mask = tfio.audio.freq_mask(dbscale_mel_spectrogram, param=10)
 plt.figure()
 plt.imshow(freq_mask.numpy())

Output:

Time masking

Similarly, quieter sounds lying at the same time step in an audio file can be difficult to judge. Time masking suppresses a range of time steps in the spectrogram; tfio.audio.time_mask masks a random range of up to param consecutive time steps.

Input:

 # Time masking: mask a random range of time steps
 time_mask = tfio.audio.time_mask(dbscale_mel_spectrogram, param=10)
 plt.figure()
 plt.imshow(time_mask.numpy())

Output:

In this article, we have seen what an audio file is, how to analyze the frequency and pitch of an audio file by making different spectrograms, and how and why to apply frequency masking and time masking using the TensorFlow ecosystem package TensorFlow I/O.
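
Putting the steps together, the whole preparation pipeline from this article can be collected into one function; this is a sketch that simply chains the snippets above with the same parameter values, not an official recipe:

 import tensorflow as tf
 import tensorflow_io as tfio

 def prepare_audio(path):
     # Load, normalize, trim, fade, and convert to a dB-scale mel spectrogram
     audio = tfio.audio.AudioIOTensor(path)
     tensor = tf.squeeze(audio.to_tensor(), axis=[-1])
     tensor = tf.cast(tensor, tf.float32) / 32768.0
     start, stop = tfio.audio.trim(tensor, axis=0, epsilon=0.1)
     tensor = tensor[start:stop]
     tensor = tfio.audio.fade(tensor, fade_in=1000, fade_out=2000, mode="logarithmic")
     spectrogram = tfio.audio.spectrogram(tensor, nfft=512, window=512, stride=256)
     mel = tfio.audio.melscale(spectrogram, rate=16000, mels=128, fmin=0, fmax=8000)
     return tfio.audio.dbscale(mel, top_db=80)

 features = prepare_audio('gs://cloud-samples-tests/speech/brooklyn.flac')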

