Guide To VoxCeleb Datasets For Audio-Visual of Human Speech

Audio-visual datasets are used across industry, for example for automatic speech recognition in services such as Alexa Voice Service. Other uses include health care, where a patient's speech is routed through a speech-recognition system or read from lip movements; military systems such as high-performance fighter aircraft and digital detection systems; silent lip reading for people who cannot speak; and second-language learning.

The VoxCeleb dataset was developed by the Visual Geometry Group (VGG), Department of Engineering Science, University of Oxford, UK. To visit the VGG dataset page, click here. The primary researchers who created the VoxCeleb dataset from YouTube are introduced below.

Arsha Nagrani is a Research Scientist at Google AI Research, focused on machine learning for video understanding. The other primary researchers are Joon Son Chung and Weidi Xie; Weidi Xie is a research fellow at the Visual Geometry Group, where he works on computer vision, deep learning and biomedical image analysis.


VoxCeleb contains over 1 million utterances for over 7,000 celebrities, extracted from videos uploaded to YouTube.

The VoxCeleb datasets are of two kinds: one is a large-scale speaker identification dataset, and the other targets large-scale speaker verification in the wild. VoxCeleb1 contains over 100,000 utterances from 1,251 celebrities, and VoxCeleb2 contains over a million utterances from 6,112 identities.

By gender, the dataset is 61 percent male and 39 percent female.

Image from the original dataset source.

As the pie chart shows, five countries contribute the largest numbers of speakers:

  1. U.S.A
  2. U.K
  3. Germany
  4. India
  5. France

Speakers from many other countries are also present in the dataset, but the chart shows only a broad breakdown.

The dataset is already split into train and test sets, so you don't have to create the split yourself.

To download the dataset, visit the following page. You need to submit a request form to get access, and the dataset is available for research purposes only.
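The PyTorch class shown later in this article reads a CSV file with a `wav_name` column holding paths relative to a base audio folder. Once the audio is downloaded, such an index could be built roughly as follows. This is a minimal sketch, not part of the original implementation: the function name `build_wav_index`, the extra `speaker_id` column, and the assumed folder layout `<speaker_id>/<video_id>/<utterance>.wav` are illustrative assumptions.

```python
import csv
from pathlib import Path


def build_wav_index(base_audio_path, csv_file):
    """Write a CSV with one row per .wav file found under base_audio_path.

    Hypothetical helper: the `wav_name` column matches what the article's
    AudioDataset class reads; `speaker_id` is an extra, assumed column.
    """
    base = Path(base_audio_path)
    # Paths are stored relative to the base folder, with forward slashes.
    wav_names = sorted(p.relative_to(base).as_posix() for p in base.rglob("*.wav"))
    with open(csv_file, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["wav_name", "speaker_id"])
        for name in wav_names:
            # Assumed layout: <speaker_id>/<video_id>/<utterance>.wav
            writer.writerow([name, name.split("/")[0]])
    return wav_names
```

Calling `build_wav_index("path/to/wav", "index.csv")` would then give a file that can be passed as `csv_file`, with `"path/to/wav"` as `base_audio_path`. Both paths here are placeholders.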





According to the source, the original implementation is provided in MATLAB. You can also implement it using PyTorch.

Using PyTorch:


import os

import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset

import utility as ut  # helper module from the source implementation (not shown here)


class AudioDataset(Dataset):
    """Audio dataset"""

    def __init__(self, csv_file, base_audio_path, stft_transform=None):
        self._base_audio_path = base_audio_path
        self._table = pd.read_csv(csv_file)
        self._audio_data = {}  # cache of loaded audio, keyed by file path
        self._stft_transform = stft_transform
        # remove samples from the *.csv whose *.wav files are not available
        indices_to_remove = []
        for idx in range(len(self._table)):
            wav_name = os.path.join(self._base_audio_path, self._table.wav_name[idx])
            if not os.path.exists(wav_name):
                indices_to_remove.append(idx)
        self._table = self._table.drop(indices_to_remove)
        self._table = self._table.reset_index()

    def __len__(self):
        return len(self._table)

    def __getitem__(self, idx):
        wav_name = os.path.join(self._base_audio_path, self._table.wav_name[idx])
        # reuse cached audio if this file was loaded before, otherwise load and cache it
        if wav_name in self._audio_data:
            wav_data = self._audio_data[wav_name]
        else:
            wav_data = ut.load_audio_sample(wav_name)
            self._audio_data[wav_name] = wav_data
        # create sample
        wav_data = ut.create_audio_sample(wav_data)
        audio_stft = ut.extract_spectrum(wav_data)
        # stack the real and imaginary parts of the complex spectrum
        audio_stft = np.vstack((audio_stft.real, audio_stft.imag))
        if self._stft_transform:
            audio_stft = self._stft_transform(audio_stft)
        # add a leading channel dimension
        audio_stft = audio_stft.reshape((1, *audio_stft.shape))
        audio_stft = torch.from_numpy(audio_stft.astype(dtype=np.float32))
        # the sample index is returned as the label
        labels = torch.from_numpy(np.array([idx]).astype(dtype=np.float32))
        return (audio_stft, labels)

For the full implementation of VoxCeleb using PyTorch, please visit the link here.
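The helper `ut.extract_spectrum` is not shown in this article, so the shape of the data returned by `__getitem__` can be hard to picture. The sketch below uses a hypothetical stand-in for it, a plain NumPy short-time Fourier transform with assumed values for `n_fft`, `hop_length` and the 16 kHz sample rate, and then applies the same stacking and reshaping steps as the class. The input is a synthetic tone, not VoxCeleb audio.

```python
import numpy as np


def extract_spectrum(wav_data, n_fft=512, hop_length=160):
    """Hypothetical stand-in for ut.extract_spectrum: a complex STFT.

    Returns a complex array of shape (n_fft // 2 + 1, n_frames).
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wav_data) - n_fft) // hop_length
    # Cut the signal into overlapping, windowed frames.
    frames = np.stack([
        wav_data[i * hop_length:i * hop_length + n_fft] * window
        for i in range(n_frames)
    ])
    # Real FFT of each frame, transposed to (frequency, time).
    return np.fft.rfft(frames, axis=1).T


# One second of synthetic audio: a 440 Hz tone at an assumed 16 kHz sample rate.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
wav_data = np.sin(2 * np.pi * 440.0 * t).astype(np.float32)

audio_stft = extract_spectrum(wav_data)
# Stack real and imaginary parts along the frequency axis, as __getitem__ does.
stacked = np.vstack((audio_stft.real, audio_stft.imag))
# Add a leading channel axis so the array resembles a one-channel image.
model_input = stacked.reshape((1, *stacked.shape)).astype(np.float32)
print(audio_stft.shape, stacked.shape, model_input.shape)
```

With these assumed settings the complex spectrum has shape (257, 97), and stacking the real and imaginary parts doubles the first axis, giving a model input of shape (1, 514, 97). The real helper in the source implementation may use different parameters.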


VoxCeleb1 and VoxCeleb2 are widely used for the following applications.

1. Audio-Visual Speech Recognition

2. Speech Separation

3. Cross-Modal Transfer between Face and Voice Recognition

4. Speaker Recognition

5. Emotion Recognition
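To make the speaker recognition use case more concrete, here is a minimal sketch of how a verification decision is often scored: two utterances are mapped to fixed-length embeddings, and the pair is accepted as the same speaker when their cosine similarity reaches a threshold. This is an illustration under assumptions, not the method of the VoxCeleb authors: the embeddings are random vectors standing in for the output of a trained model, and the function names and the threshold of 0.5 are hypothetical.

```python
import numpy as np


def cosine_score(emb_a, emb_b):
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))


def same_speaker(emb_a, emb_b, threshold=0.5):
    """Accept the trial when the score reaches the (hypothetical) threshold."""
    return cosine_score(emb_a, emb_b) >= threshold


# Synthetic embeddings: each utterance is its speaker's vector plus small noise.
rng = np.random.default_rng(0)
speaker_a = rng.normal(size=256)
speaker_b = rng.normal(size=256)
utt_a1 = speaker_a + 0.1 * rng.normal(size=256)
utt_a2 = speaker_a + 0.1 * rng.normal(size=256)
utt_b1 = speaker_b + 0.1 * rng.normal(size=256)

print("same speaker pair:", same_speaker(utt_a1, utt_a2))
print("different speaker pair:", same_speaker(utt_a1, utt_b1))
```

In practice the threshold would be tuned on held-out trial pairs rather than fixed in advance.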

If you want to participate in the VoxCeleb challenge, then visit their official website here.


We have learned about the VoxCeleb dataset and how to download it from the source, the two versions of the dataset and the researchers behind them, and the breakdown of the data by nationality and gender. We also covered the implementation in PyTorch and MATLAB for audio-visual speaker recognition, the applications of the VoxCeleb datasets, and how to participate in the VoxCeleb challenge.

Amit Singh
Amit Singh is a Data Scientist and a graduate in Computer Science and Engineering. He is a data science writer at Analytics India Magazine.
