The VGGSound dataset, developed by the Visual Geometry Group (VGG) at the Department of Engineering Science, University of Oxford, UK, has set a benchmark for audio recognition with visuals. It contains more than 210k videos with both visual and audio tracks, covering 310 categories and over 550 hours of video, and is available to download for commercial/research purposes. Each video and audio segment in the dataset is 10 seconds long.
Andrew Zisserman and Andrea Vedaldi are principal researchers of the Visual Geometry Group (VGG). The VGG-Sound researchers Honglie Chen, Weidi Xie, Andrea Vedaldi and Andrew Zisserman are core members of VGG, Department of Engineering Science, University of Oxford, UK; the work was published at ICASSP 2020.
Paper: https://www.robots.ox.ac.uk/~vgg/publications/2020/Chen20/chen20.pdf
Image from the original VGG-Sound source.
What is Audio Recognition?
Audio recognition is the task of classifying sounds. In-home digital assistants such as Google Assistant, Amazon Alexa and Apple Siri all implement voice recognition software to interact with users.
Download Dataset:
The dataset is available as a CSV file that contains the YouTube URL of each audio-video clip; click here to download it to your computer.
Download Size: 8 MB
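If you prefer to fetch the file from a script, a minimal sketch looks like this (the URL below is an assumption based on the VGG-Sound project page; substitute the official download link):

import urllib.request

# Hypothetical URL; replace with the official link from the VGG-Sound page
url = "https://www.robots.ox.ac.uk/~vgg/data/vggsound/vggsound.csv"
urllib.request.urlretrieve(url, "vggsound.csv")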
Now, let's write some code to explore the dataset and the ratio of training to testing data across categories.
Visualization of Data
Import the pandas, matplotlib and seaborn libraries and load the dataset using the code below.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# The CSV has no header row, so name the columns explicitly
df = pd.read_csv("vggsound.csv", header=None,
                 names=["youtube_id", "start_seconds", "label", "split"])
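As a quick sanity check (a sketch using the column names defined above), you can inspect the shape, the first few rows and the number of distinct classes:

print(df.shape)               # (number of clips, 4 columns)
print(df.head())              # youtube_id, start_seconds, label, split
print(df["label"].nunique())  # should report 310 distinct classes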
The dataset contains 310 classes, which are broadly categorised as follows (a short sketch after this list shows the most common individual classes):
- People
- Animals
- Music
- Sports
- Nature
- Vehicle
- Tools
- Instruments
- Mammals
- Others
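Before plotting everything, it helps to see how clips are distributed over individual classes; a minimal sketch using the label column defined above:

# The ten classes with the most clips
print(df["label"].value_counts().head(10))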
If you plot a pie chart of all the classes at once, it looks messy, like this:
df.groupby('label').size().plot(kind='pie', autopct='%.2f')
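With 310 slices, the pie chart is unreadable. As a sketch of a cleaner alternative (not from the original article), you can restrict the plot to the 20 most frequent classes as a horizontal bar chart:

df["label"].value_counts().head(20).sort_values().plot(kind="barh", figsize=(8, 6))
plt.xlabel("number of clips")
plt.tight_layout()
plt.show()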
Now visualize the train and test data.
sns.catplot(x="1", y="test", data=df)
Plot the ratio of the training and test sets:
df.groupby('split').size().plot(kind='pie', autopct='%.2f')
The dataset contains 92.25 per cent training data and 7.75 per cent test data, as shown in the pie chart.
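You can confirm these percentages directly from the split column:

# Fraction of clips per split, expressed as percentages
print((df["split"].value_counts(normalize=True) * 100).round(2))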
Implementation of VGG-Sound
Using PyTorch:
import os
import csv
import numpy as np
import soundfile as sf
from scipy import signal
from torch.utils.data import Dataset
from torchvision import transforms


class GetAudioVideoDataset(Dataset):

    def __init__(self, args, mode='train', transforms=None):
        classes = []
        data = []
        data2class = {}

        # Read the list of class names
        with open(args.csv_path + 'stat.csv') as f:
            csv_reader = csv.reader(f)
            for row in csv_reader:
                classes.append(row[0])

        # Keep only clips whose class is known and whose .wav file exists on disk
        with open(args.csv_path + args.test) as f:
            csv_reader = csv.reader(f)
            for item in csv_reader:
                if item[1] in classes and os.path.exists(args.data_path + item[0][:-3] + 'wav'):
                    data.append(item[0])
                    data2class[item[0]] = item[1]

        self.audio_path = args.data_path
        self.mode = mode
        self.transforms = transforms
        self.classes = sorted(classes)
        self.data2class = data2class

        # Initialize audio transform
        self._init_atransform()

        # Retrieve list of audio and video files
        self.video_files = []
        for item in data:
            self.video_files.append(item)
        print('# of audio files = %d ' % len(self.video_files))
        print('# of classes = %d' % len(self.classes))

    def _init_atransform(self):
        self.aid_transform = transforms.Compose([transforms.ToTensor()])

    def __len__(self):
        return len(self.video_files)

    def __getitem__(self, idx):
        wav_file = self.video_files[idx]

        # Audio
        samples, samplerate = sf.read(self.audio_path + wav_file[:-3] + 'wav')

        # Repeat in case the audio is too short, then crop to a fixed length
        resamples = np.tile(samples, 10)[:160000]
        resamples[resamples > 1.] = 1.
        resamples[resamples < -1.] = -1.

        # Log-spectrogram, normalised to zero mean and unit variance
        frequencies, times, spectrogram = signal.spectrogram(
            resamples, samplerate, nperseg=512, noverlap=353)
        spectrogram = np.log(spectrogram + 1e-7)
        mean = np.mean(spectrogram)
        std = np.std(spectrogram)
        spectrogram = np.divide(spectrogram - mean, std + 1e-9)

        return spectrogram, resamples, self.classes.index(self.data2class[wav_file]), wav_file
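A minimal usage sketch, assuming an args object whose csv_path, data_path and test attributes point at your local copies of the metadata and the extracted .wav files (the paths below are hypothetical):

from types import SimpleNamespace
from torch.utils.data import DataLoader

# Hypothetical local paths; adjust to your setup
args = SimpleNamespace(csv_path="./data/", data_path="./audio/",
                       test="vggsound_test.csv")

dataset = GetAudioVideoDataset(args, mode='test')
loader = DataLoader(dataset, batch_size=16, shuffle=False, num_workers=2)

for spectrogram, waveform, label_idx, filename in loader:
    print(spectrogram.shape, label_idx[:4])
    break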
To implement the VGG-Sound model, refer to the GitHub repo here.
Conclusion:
We have learned about the VGGSound dataset and how to download it from the source, plotted the class distribution, visualised the train and test data with their categories, and walked through a PyTorch data-loading implementation for audio recognition with a ResNet model.
All the source code used to plot the visuals is available here.