Audio-visual datasets are used across industry: for example, the Alexa voice service relies on automatic speech recognition; in healthcare, speech recognition combined with lip reading can help clinicians understand patients; the military applies the technology in high-performance fighter aircraft and digital detection systems; lip reading enables voiceless communication for people who cannot speak; and it supports learning a second language.
The VoxCeleb dataset was developed by the Visual Geometry Group (VGG), Department of Engineering Science, University of Oxford, UK. To visit the VGG dataset page, click here. The following are the primary researchers who created the VoxCeleb dataset from YouTube.
Arsha Nagrani is a Research Scientist at Google AI Research, focused on machine learning for video understanding. Joon Son Chung is another core author, and Weidi Xie is a research fellow at the Visual Geometry Group, where he works on computer vision, deep learning and biomedical image analysis.
Dataset:
VoxCeleb contains over 1 million utterances for over 7,000 celebrities, extracted from videos uploaded to YouTube.
VoxCeleb comes in two versions: VoxCeleb1, a large-scale speaker identification dataset, and VoxCeleb2, a large-scale speaker verification dataset collected in the wild. VoxCeleb1 contains over 100,000 utterances for 1,251 celebrities, while VoxCeleb2 contains over a million utterances for 6,112 identities.
By gender, the speakers are roughly 61 percent male and 39 percent female.
Image from the original dataset source.
As shown in the pie chart, five countries contribute the largest numbers of speakers (see the sketch after this list for how to reproduce these statistics), namely:
- U.S.A
- U.K
- Germany
- India
- France
Speakers from many other countries are also present in the dataset, but the chart groups them broadly.
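If you have the speaker metadata, you can reproduce these statistics yourself. Below is a minimal sketch, assuming the tab-separated vox1_meta.csv file from the dataset page with Gender and Nationality columns (the exact file name and column names are assumptions based on the published metadata):

import pandas as pd

# Load the speaker metadata (assumed tab-separated, one row per speaker).
meta = pd.read_csv("vox1_meta.csv", sep="\t")
meta.columns = [c.strip() for c in meta.columns]  # guard against stray whitespace

# Gender ratio of speakers (roughly 61% male, 39% female).
print(meta["Gender"].value_counts(normalize=True))

# The five nationalities with the most speakers.
print(meta["Nationality"].value_counts().head(5))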
The dataset is already split into train and test sets, so you don't have to create your own split.
To download the dataset, visit the official VoxCeleb page. You need to submit a request form to get access, and the dataset is available for research purposes only.
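Once you have access, the official split can be read directly. A minimal sketch, assuming the VoxCeleb1 identification split file iden_split.txt lists one utterance per line as "<set_id> <relative/path.wav>", with 1 = train, 2 = validation and 3 = test (the file name and format are assumptions based on the VoxCeleb1 release):

# Parse the official train/validation/test split (format assumed as above).
splits = {1: [], 2: [], 3: []}
with open("iden_split.txt") as f:
    for line in f:
        set_id, wav_path = line.split()
        splits[int(set_id)].append(wav_path)

train_files, val_files, test_files = splits[1], splits[2], splits[3]
print(len(train_files), len(val_files), len(test_files))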
Paper 1: https://www.robots.ox.ac.uk/~vgg/publications/2017/Nagrani17/nagrani17.pdf
Paper 2: https://www.robots.ox.ac.uk/~vgg/publications/2018/Chung18a/chung18a.pdf
Paper 3: https://www.robots.ox.ac.uk/~vgg/publications/2019/Nagrani19/nagrani19.pdf
Implementation:
As per the source, the original implementation is provided in MATLAB. You can also implement it in PyTorch.
Using PyTorch:
Dataloader:
import os

import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset

import utility as ut


class AudioDataset(Dataset):
    """Audio dataset."""

    def __init__(self, csv_file, base_audio_path, stft_transform=None):
        self._base_audio_path = base_audio_path
        self._table = pd.read_csv(csv_file)
        self._audio_data = {}
        self._stft_transform = stft_transform
        # Remove samples from the *.csv whose *.wav files are not available.
        indices_to_remove = []
        for idx in range(len(self._table)):
            wav_name = os.path.join(self._base_audio_path, self._table.wav_name[idx])
            if not os.path.exists(wav_name):
                indices_to_remove.append(idx)
        self._table = self._table.drop(indices_to_remove)
        # drop=True avoids keeping the old index as an extra column.
        self._table = self._table.reset_index(drop=True)

    def __len__(self):
        return len(self._table)

    def __getitem__(self, idx):
        wav_name = os.path.join(self._base_audio_path, self._table.wav_name[idx])
        # Cache decoded audio in memory so each file is read from disk only once.
        if wav_name in self._audio_data:
            wav_data = self._audio_data[wav_name]
        else:
            wav_data = ut.load_audio_sample(wav_name)
            self._audio_data[wav_name] = wav_data
        # Crop a sample and compute its short-time Fourier transform (STFT).
        wav_data = ut.create_audio_sample(wav_data)
        audio_stft = ut.extract_spectrum(wav_data)
        # Stack the real and imaginary parts into a single real-valued array.
        audio_stft = np.vstack((audio_stft.real, audio_stft.imag))
        if self._stft_transform:
            audio_stft = self._stft_transform(audio_stft)
        # Add a channel dimension and convert to float32 tensors.
        audio_stft = audio_stft.reshape((1, *audio_stft.shape))
        audio_stft = torch.from_numpy(audio_stft.astype(dtype=np.float32))
        labels = torch.from_numpy(np.array(self._table.target[idx]).astype(dtype=np.float32))
        return (audio_stft, labels)

For a full PyTorch implementation of speaker recognition on VoxCeleb, please visit the link here.
https://github.com/qqueing/DeepSpeaker-pytorch
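To check that the Dataloader above works end to end, you can wrap it in a standard PyTorch DataLoader. A minimal sketch; the CSV path and audio directory are placeholders for your own copies of the data:

from torch.utils.data import DataLoader

# Placeholders: point these at your own label CSV and extracted wav folder.
dataset = AudioDataset(csv_file="train_labels.csv",
                       base_audio_path="data/voxceleb1/wav")
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=4)

for audio_stft, labels in loader:
    # audio_stft: (batch, 1, freq_bins, time_frames); labels: (batch,)
    print(audio_stft.shape, labels.shape)
    break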
Applications:
VoxCeleb1 and VoxCeleb2 are widely used for the following applications.
1. Audio-Visual Speech Recognition
2. Speech Separation
3. Cross-Modal Transfer between Face and Voice Recognition
4. Speaker Recognition (see the verification sketch after this list)
5. Emotion Recognition
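For speaker verification in particular, a common recipe is to embed two utterances with a trained network and compare the embeddings with cosine similarity. A minimal sketch; embedding_model, the input tensors and the threshold are hypothetical stand-ins, not part of the official VoxCeleb code:

import torch
import torch.nn.functional as F

def verify(embedding_model, utt_a, utt_b, threshold=0.7):
    # Returns (decision, score): same speaker if cosine similarity >= threshold.
    with torch.no_grad():
        emb_a = embedding_model(utt_a)  # shape: (1, embedding_dim)
        emb_b = embedding_model(utt_b)
    score = F.cosine_similarity(emb_a, emb_b).item()
    return score >= threshold, score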
If you want to participate in the VoxCeleb challenge, then visit their official website here.
Conclusion:
We have learned about the VoxCeleb dataset and how to download it from the source; the two versions of the dataset and the researchers behind them; the plots of the train and test data by nationality and gender; the PyTorch and MATLAB implementations for audio-visual speaker recognition; and the applications of the VoxCeleb datasets, along with how to participate in the VoxCeleb audio-visual challenge.