
Guide To VGG-SOUND Datasets For Visual-Audio Recognition

The VGGSound dataset was developed by the Visual Geometry Group (VGG), Department of Engineering Science, University of Oxford, UK. It has set a benchmark for audio recognition with visuals: it contains more than 210k videos with both visual and audio tracks, spanning over 310 categories and 550 hours of video, and each audio-visual segment is 10 seconds long. The dataset is available to download for commercial and research purposes.

Andrew Zisserman and Andrea Vedaldi are principal researchers of VGG (Visual Geometry Group). The VGG-Sound researchers Honglie Chen, Weidi Xie, Andrea Vedaldi and Andrew Zisserman are core members of VGG, Department of Engineering Science, University of Oxford, UK; the work was published at ICASSP 2020.




Image from original source of VGG-Sound.

What Is Audio Recognition?

Audio recognition addresses the problem of classifying sounds. In-home digital assistants such as Google Assistant, Amazon Alexa and Apple Siri all implement voice recognition software to interact with users.

Download the Dataset:

The dataset is available as a CSV file containing the YouTube URL (video ID and start time) of each audio-video clip; click here to download it locally on your computer.

Download Size: 8 MB
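As a hedged sketch of what the CSV holds: the file has no header row, and each row lists a YouTube ID, a start second, a class label, and a train/test split flag (the column names below are our own choice, not part of the file). The rows here are illustrative stand-ins; on the real download, you would pass the file path instead of the in-memory sample.

```python
import io
import pandas as pd

# Two illustrative rows standing in for the real vggsound.csv file.
sample = io.StringIO(
    "---g-f_I2yQ,1,people marching,train\n"
    "--0PQM4-hqg,30,waterfall burbling,test\n"
)
df = pd.read_csv(sample, header=None,
                 names=["youtube_id", "start_seconds", "label", "split"])

# Reconstruct a watchable YouTube URL for each 10-second clip.
df["url"] = ("https://www.youtube.com/watch?v=" + df["youtube_id"]
             + "&t=" + df["start_seconds"].astype(str) + "s")
print(df["url"].iloc[0])
```

Giving the columns explicit names avoids the pitfall of pandas silently promoting the first data row to a header.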

Now, let's write some code to explore the dataset and the ratio of training to testing data across its categories.

Visualization of Data

Import the pandas, matplotlib and seaborn libraries and load the dataset using the code below.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Note: vggsound.csv has no header row, so pandas promotes the first
# data row to column names (hence odd names like 'people marching' below).
df = pd.read_csv("vggsound.csv")

The dataset contains more than 310 classes, which are broadly categorised as:

  1. People
  2. Animals
  3. Music
  4. Sports
  5. Nature
  6. Vehicle
  7. Tools
  8. Instruments
  9. Mammals
  10. Others

Plotting all the classes in a single pie chart looks messy:

# Group by the class-label column (named 'people marching' because the
# headerless CSV's first row became the header) and plot class sizes.
df.groupby('people marching').size().plot(kind='pie', autopct='%.2f')
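With 310+ classes, a bar chart of the most frequent classes is far easier to read than a pie over every slice. The sketch below uses a few made-up rows in place of the real CSV; on the actual data, load vggsound.csv with explicit column names instead.

```python
import io
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Made-up rows standing in for the real vggsound.csv file.
sample = io.StringIO(
    "id1,0,dog barking,train\n"
    "id2,0,dog barking,train\n"
    "id3,0,playing piano,train\n"
    "id4,0,waterfall burbling,test\n"
)
df = pd.read_csv(sample, header=None,
                 names=["youtube_id", "start_seconds", "label", "split"])

# Count clips per class and plot only the ten most frequent ones.
top = df["label"].value_counts().head(10)
top.plot(kind="bar")
plt.tight_layout()
plt.savefig("top_classes.png")
```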

Now visualize the train and test data:

# '1' and 'test' are column names inherited from the first CSV row
# (the start-second and train/test-split columns respectively).
sns.catplot(x="1", y="test", data=df)

Plot the ratio of the training and test sets:

# Group by the train/test-split column (named 'test' for the same
# header reason) to get the split ratio.
df.groupby('test').size().plot(kind='pie', autopct='%.2f')

As the pie chart shows, the dataset contains 92.25 per cent training data and 7.75 per cent test data.
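The same split percentages can be computed numerically instead of read off a chart. This is a hedged sketch using a few made-up rows; on the real vggsound.csv (loaded with explicit column names) the printed values should match the 92.25/7.75 split quoted above.

```python
import io
import pandas as pd

# Made-up rows standing in for the real vggsound.csv file.
sample = io.StringIO(
    "id1,0,dog barking,train\n"
    "id2,10,dog barking,train\n"
    "id3,20,playing piano,train\n"
    "id4,30,waterfall burbling,test\n"
)
df = pd.read_csv(sample, header=None,
                 names=["youtube_id", "start_seconds", "label", "split"])

# Fraction of rows per split, expressed as percentages.
pct = df["split"].value_counts(normalize=True) * 100
print(pct.round(2))
```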

Implementation of VGG-Sound

Using PyTorch:

import os
import cv2
import json
import torch
import csv
import numpy as np
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms, utils
import time
from PIL import Image
import glob
import sys
from scipy import signal
import random
import soundfile as sf

class GetAudioVideoDataset(Dataset):
    def __init__(self, args, mode='train', transforms=None):
        data2path = {}
        classes = []
        classes_ = []
        data = []
        data2class = {}
        # Read the list of class names.
        with open(args.csv_path + 'stat.csv') as f:
            csv_reader = csv.reader(f)
            for row in csv_reader:
                classes.append(row[0])
        # Keep only clips whose class is known and whose .wav file exists.
        with open(args.csv_path + args.test) as f:
            csv_reader = csv.reader(f)
            for item in csv_reader:
                if item[1] in classes and os.path.exists(args.data_path + item[0][:-3] + 'wav'):
                    data.append(item[0])
                    data2class[item[0]] = item[1]
        self.audio_path = args.data_path
        self.mode = mode
        self.transforms = transforms
        self.classes = sorted(classes)
        self.data2class = data2class
        # Initialize audio transform
        self._init_atransform()
        # Retrieve list of audio and video files
        self.video_files = []
        for item in data:
            self.video_files.append(item)
        print('# of audio files = %d ' % len(self.video_files))
        print('# of classes = %d' % len(self.classes))

    def _init_atransform(self):
        self.aid_transform = transforms.Compose([transforms.ToTensor()])

    def __len__(self):
        return len(self.video_files)

    def __getitem__(self, idx):
        wav_file = self.video_files[idx]
        # Audio
        samples, samplerate = sf.read(self.audio_path + wav_file[:-3] + 'wav')
        # Repeat in case the audio is too short, then crop to 10 s at 16 kHz.
        resamples = np.tile(samples, 10)[:160000]
        # Clip amplitudes to [-1, 1].
        resamples[resamples > 1.] = 1.
        resamples[resamples < -1.] = -1.
        frequencies, times, spectrogram = signal.spectrogram(resamples, samplerate, nperseg=512, noverlap=353)
        # Log-scale and normalize the spectrogram.
        spectrogram = np.log(spectrogram + 1e-7)
        mean = np.mean(spectrogram)
        std = np.std(spectrogram)
        spectrogram = np.divide(spectrogram - mean, std + 1e-9)
        return spectrogram, resamples, self.classes.index(self.data2class[wav_file]), wav_file
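The audio preprocessing inside `__getitem__` can be demonstrated standalone. The sketch below substitutes a synthetic 2-second tone for a decoded .wav file (the tone and sample rate are our own assumptions) and applies the same tile, clip, log-spectrogram and normalize steps.

```python
import numpy as np
from scipy import signal

# Synthetic 2-second, 220 Hz tone standing in for a decoded .wav clip.
samplerate = 16000
samples = 1.5 * np.sin(2 * np.pi * 220 *
                       np.linspace(0, 2, 2 * samplerate, endpoint=False))

# Tile short clips and crop so every example is 10 s at 16 kHz (160000 samples).
resamples = np.tile(samples, 10)[:160000]
# Clip amplitudes to [-1, 1].
resamples = np.clip(resamples, -1.0, 1.0)

# Log-spectrogram, normalized to zero mean and unit variance.
_, _, spectrogram = signal.spectrogram(resamples, samplerate,
                                       nperseg=512, noverlap=353)
spectrogram = np.log(spectrogram + 1e-7)
spectrogram = (spectrogram - spectrogram.mean()) / (spectrogram.std() + 1e-9)
print(spectrogram.shape)  # (frequency bins, time frames)
```

With nperseg=512, the spectrogram has 257 frequency bins, and the normalization leaves it with zero mean and unit variance, which is what the model consumes.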

To implement the VGG-Sound model, refer to the GitHub repo here.


We have learned about the VGGSound dataset and how to download it from the source, plotted its categories and the train/test split, and walked through a PyTorch data-loading implementation used with a ResNet model for audio recognition.

All the source code used to plot the visuals is referenced here.

Amit Singh
Amit Singh is Data Scientist, graduated in Computer Science and Engineering. Data Science writer at Analytics India Magazine.
