Data science works with many data formats. At the basic level we start with CSV and Excel files, but as we dig deeper, any source of information can be treated as data. Images, video, and audio can all carry a great deal of information. Audio data analysis covers tasks such as automatic speech recognition, digital signal processing, and music classification. Audio is an unstructured form of data, so to work with it we first need to give it structure. This article walks through that process.
Introduction to Audio Data
We listen to audio in our environment every day, and our minds use the information it carries to make decisions. In computing, audio data is stored in many formats. Some examples are:
- MP3 format
- WAV format
- WMA format
- FLAC format
One of the biggest challenges in working with audio data is preparing the audio file. An audio signal can be represented in two ways: in the time domain and in the frequency domain. In the time domain, time is the independent variable, and the graph shows how the signal changes over time. In the frequency domain, frequency is the independent variable, and the graph shows how much of the signal lies in each frequency band over a range of frequencies. Having to work with both representations is part of what makes audio data analysis difficult.
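To make the two domains concrete, here is a minimal pure-Python sketch. It builds a short sine wave (the time-domain view) and applies a naive discrete Fourier transform to find where its energy sits (the frequency-domain view). The sample rate and tone frequency are assumed values chosen for illustration, and the naive transform is for explanation only; it is not how TensorFlow computes spectra.

```python
import cmath
import math

def dft_magnitudes(signal):
    """Naive O(n^2) DFT; returns the magnitudes of bins 0..n//2."""
    n = len(signal)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(signal))
        mags.append(abs(acc))
    return mags

sample_rate = 200  # Hz (assumed for this illustration)
tone_hz = 25       # frequency of the test tone (assumed)

# Time domain: one second of samples, amplitude as a function of time.
signal = [math.sin(2 * math.pi * tone_hz * t / sample_rate)
          for t in range(sample_rate)]

# Frequency domain: energy as a function of frequency.
mags = dft_magnitudes(signal)
peak_bin = max(range(len(mags)), key=mags.__getitem__)
peak_hz = peak_bin * sample_rate / len(signal)
print(peak_hz)
```

The time-domain list says nothing directly about pitch, while the frequency-domain magnitudes peak at the bin corresponding to the tone.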
The TensorFlow ecosystem provides the tensorflow-io package for preparing audio data.
Getting started with the Code Implementation
In this article, we use TensorFlow to give structure to brooklyn.flac, a FLAC audio file that is publicly available via Google Cloud. The address for the file is gs://cloud-samples-tests/speech/brooklyn.flac.
Setting up the Google Colab environment:
Installing the required package:
!pip install tensorflow-io
import tensorflow as tf
import tensorflow_io as tfio
from IPython.display import Audio
import matplotlib.pyplot as plt
Read the brooklyn.flac audio file:
audio = tfio.audio.AudioIOTensor('gs://cloud-samples-tests/speech/brooklyn.flac')
print(audio)
The output shows that the file is a mono-channel audio file with 28979 samples stored as int16. The content is read lazily: it is only loaded when the AudioIOTensor is converted to a tensor with to_tensor() or sliced.
audio_slice = audio[100:]
# remove last dimension
audio_tensor = tf.squeeze(audio_slice, axis=[-1])
print(audio_tensor)
The audio can be played with:
from IPython.display import Audio
Audio(audio_tensor.numpy(), rate=audio.rate.numpy())
To get a feel for the audio, it helps to plot its waveform, that is, the amplitude of the signal over time. We can do this with matplotlib.pyplot after scaling the int16 samples to floats.
tensor = tf.cast(audio_tensor, tf.float32) / 32768.0
plt.figure()
plt.plot(tensor.numpy())
The graph shows how the amplitude of the signal, which corresponds to loudness, changes over time.
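The division by 32768.0 in the plotting code rescales 16-bit integer samples into floats between -1 and 1. The short sketch below shows that mapping in plain Python; the helper name int16_to_float is hypothetical and exists only for this illustration.

```python
INT16_SCALE = 32768.0  # 2**15, the magnitude of the most negative int16 value

def int16_to_float(sample):
    """Map an int16 sample in [-32768, 32767] to a float in [-1.0, 1.0)."""
    return sample / INT16_SCALE

lowest = int16_to_float(-32768)   # most negative int16 value
highest = int16_to_float(32767)   # most positive int16 value
silence = int16_to_float(0)
print(lowest, silence, highest)
```

Because the positive int16 range stops at 32767, the largest scaled value falls just short of 1.0.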
Let’s trim the noise in the audio.
Noise is unwanted sound in audio data. Here we trim it from the beginning and end of the clip using the tfio.audio.trim API, which returns the start and stop positions of the part of the signal that exceeds the epsilon threshold.
position = tfio.audio.trim(tensor, axis=0, epsilon=0.1)
print(position)
# position holds the start and stop indices of the non-noise region
start = position[0]
stop = position[1]
print(start, stop)
processed = tensor[start:stop]
plt.figure()
plt.plot(processed.numpy())
# play the trimmed clip
Audio(processed.numpy(), rate=audio.rate.numpy())
Here we can see that the low-amplitude segments at the start and end of the audio have been removed.
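The idea behind threshold-based trimming can be sketched in a few lines of plain Python. This is a simplified illustration, not tfio's implementation: the helper trim_positions and the sample values are hypothetical, and it assumes that any sample whose absolute value is at or below epsilon counts as noise.

```python
def trim_positions(samples, epsilon):
    """Return (start, stop) so that samples[start:stop] drops the
    leading and trailing samples with |x| <= epsilon."""
    loud = [i for i, x in enumerate(samples) if abs(x) > epsilon]
    if not loud:
        return 0, 0  # everything is below the threshold
    return loud[0], loud[-1] + 1

# Hypothetical samples: quiet edges around a louder middle section.
samples = [0.0, 0.05, 0.3, -0.5, 0.2, 0.02, 0.0]
start, stop = trim_positions(samples, epsilon=0.1)
trimmed = samples[start:stop]
print(start, stop, trimmed)
```

Only the edges are cut: a quiet sample in the middle of the clip would be kept, because the slice runs from the first loud sample to the last one.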
Fade in and fade out.
In audio analysis, fade in and fade out are techniques that gradually increase or decrease the volume of the audio. With TensorFlow, this can be done by:
fade = tfio.audio.fade(
    processed, fade_in=1000, fade_out=2000, mode="logarithmic")
plt.figure()
plt.plot(fade.numpy())
After fading, the clip starts quietly and rises to full volume, then tapers off toward the end.
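To see what a fade does to individual samples, here is a minimal sketch in plain Python. It uses a linear ramp rather than the logarithmic mode used above, purely to keep the arithmetic easy to follow; the helper linear_fade and its input are hypothetical.

```python
def linear_fade(samples, fade_in, fade_out):
    """Scale the first fade_in samples by a ramp rising from 0 and the
    last fade_out samples by a ramp falling to 0."""
    n = len(samples)
    out = list(samples)
    for i in range(min(fade_in, n)):
        out[i] *= i / fade_in            # gain rises: 0, 1/fade_in, ...
    for i in range(min(fade_out, n)):
        out[n - 1 - i] *= i / fade_out   # gain falls to 0 at the last sample
    return out

# A constant-amplitude signal makes the gain envelope directly visible.
faded = linear_fade([1.0] * 10, fade_in=4, fade_out=2)
print(faded)
```

The samples in the middle are untouched; only the gain at the two ends changes, which is why a fade affects loudness and not pitch.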
A spectrogram is a graph that shows how the energy of the audio is distributed across frequencies over time. Brighter colors in the spectrogram mark frequencies and times where the sound is more concentrated, while darker colors mark regions where the sound is nearly absent.
To make a spectrogram of the audio file, we use tfio.audio.spectrogram:
# Convert to spectrogram
spectrogram = tfio.audio.spectrogram(
    fade, nfft=512, window=512, stride=256)
plt.figure()
plt.imshow(tf.math.log(spectrogram).numpy())
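The nfft, window, and stride values determine the resolution of the spectrogram. The arithmetic below is a rough sketch of what they imply, assuming a 16000 Hz sample rate (the rate passed to melscale later in this article) and a one-sided FFT, which is the usual convention for real-valued signals.

```python
rate = 16000   # Hz; assumed sample rate of the clip
nfft = 512     # FFT size
window = 512   # samples per analysis window
stride = 256   # samples between the starts of consecutive windows

freq_bins = nfft // 2 + 1            # one-sided FFT keeps nfft/2 + 1 bins
bin_width_hz = rate / nfft           # frequency spacing between bins
window_ms = 1000 * window / rate     # duration covered by one window
hop_ms = 1000 * stride / rate        # time step between spectrogram columns
overlap = 1 - stride / window        # fraction shared by adjacent windows

print(freq_bins, bin_width_hz, window_ms, hop_ms, overlap)
```

A larger nfft gives finer frequency resolution but coarser time resolution, so these parameters are a trade-off rather than a fixed recipe.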
The mel scale is a scale of pitches that listeners perceive to be equally spaced from one another. A mel spectrogram is a spectrogram whose frequencies are converted to the mel scale. The dB-scale mel spectrogram expresses the magnitudes of the mel spectrogram in decibels, a logarithmic scale. In this step, we make both a mel spectrogram and a dB-scale mel spectrogram of our audio.
# Convert to mel-spectrogram
mel_spectrogram = tfio.audio.melscale(
    spectrogram, rate=16000, mels=128, fmin=0, fmax=8000)
plt.figure()
plt.imshow(tf.math.log(mel_spectrogram).numpy())

# Convert to db scale mel-spectrogram
dbscale_mel_spectrogram = tfio.audio.dbscale(
    mel_spectrogram, top_db=80)
plt.figure()
plt.imshow(dbscale_mel_spectrogram.numpy())
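The two conversions can be illustrated with a few lines of plain Python. This sketch uses one widely used mel formula, 2595 * log10(1 + f / 700), and a 10 * log10 decibel conversion with a top_db floor. These are assumptions made for illustration; the exact filterbank and conventions inside tfio may differ, and the helper names are hypothetical.

```python
import math

def hz_to_mel(hz):
    """Convert frequency in Hz to mels (assumed 2595 * log10 formula)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def power_to_db(values, top_db=80.0, amin=1e-10):
    """Convert power values to decibels, clipping everything more than
    top_db below the loudest value."""
    db = [10.0 * math.log10(max(v, amin)) for v in values]
    floor = max(db) - top_db
    return [max(d, floor) for d in db]

mel_1k = hz_to_mel(1000.0)               # close to 1000 mels by design
round_trip = mel_to_hz(hz_to_mel(8000.0))
db = power_to_db([1.0, 0.1, 1e-12])      # the last value hits the top_db floor
print(mel_1k, round_trip, db)
```

The mel mapping is roughly linear at low frequencies and compresses high ones, while the decibel step compresses the dynamic range so quiet detail stays visible in the plot.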
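Beyond preparation, tensorflow-io also provides spectrogram augmentations, most notably frequency masking and time masking. Rather than removing noise, these hide part of the spectrogram, a technique commonly used to make models trained on the data more robust.
In frequency masking, a band of consecutive frequency channels is masked out across all time steps. The param argument bounds the width of the band.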
# Freq masking
freq_mask = tfio.audio.freq_mask(dbscale_mel_spectrogram, param=10)
plt.figure()
plt.imshow(freq_mask.numpy())
Time masking is the counterpart along the other axis: a span of consecutive time steps is masked out across all frequency channels, and param bounds the length of the span.
# Time masking
time_mask = tfio.audio.time_mask(dbscale_mel_spectrogram, param=10)
plt.figure()
plt.imshow(time_mask.numpy())
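The effect of both kinds of masking on a spectrogram laid out as [time][frequency] can be sketched in plain Python. The helper mask_band is hypothetical and takes the band position explicitly so the result is reproducible, whereas tfio chooses the position and width at random within the limit set by param.

```python
def mask_band(spec, axis, start, width):
    """Return a copy of a [time][freq] spectrogram with one band zeroed.
    axis=0 masks time steps; axis=1 masks frequency channels."""
    out = [row[:] for row in spec]
    for t, row in enumerate(out):
        for f in range(len(row)):
            idx = t if axis == 0 else f
            if start <= idx < start + width:
                row[f] = 0.0
    return out

# Hypothetical 4 time steps x 5 frequency channels, all ones.
spec = [[1.0] * 5 for _ in range(4)]

freq_masked = mask_band(spec, axis=1, start=1, width=2)  # channels 1 and 2
time_masked = mask_band(spec, axis=0, start=2, width=1)  # time step 2
print(freq_masked)
print(time_masked)
```

In the plots, a frequency mask therefore appears as a blank stripe along one axis and a time mask as a blank stripe along the other.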
In this article, we have seen what an audio file is, how to analyze its frequency and pitch content using different spectrograms, and how and why to apply frequency masking and time masking with a TensorFlow package called tensorflow-io.