Meta gives away a free video dataset of 846 hours

The Casual Conversations dataset comprises 45,000 videos totalling 846 hours, each averaging about a minute in length.

On February 1, Meta announced a new resource to advance fairness in speech recognition: the company's AI team released a research paper on a project called 'Casual Conversations', an exhaustive dataset with manual transcriptions to help researchers evaluate the accuracy of audio models.

Machine learning models are only as good as their data. When a model reliably recognises the voice patterns of one group but falters on a particular community, race or gender, it points to a gap in its training data. In the context of ML, fairness refers to efforts to correct these biases in the underlying data. Ideally, data should be equally representative of communities regardless of disability, ethnicity and gender.

Research on fairness in Automatic Speech Recognition (ASR) systems pales in comparison to the body of studies on facial recognition.


Past studies

According to a 2020 Stanford study, the speech recognition systems of the biggest tech companies (Amazon, Apple, Google, Microsoft and IBM) failed to identify 19% of the words when the user was white and 35% when the user was Black. Only two companies responded to the study: Amazon said it was constantly improving its speech recognition service, while Google acknowledged the shortcomings and said it had been taking a long, hard look at its models' flaws.
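Figures like "failed to identify 19% of the words" are typically measured as word error rate (WER): the word-level edit distance between what was said and what the system transcribed, divided by the number of words actually spoken. A minimal sketch (the function name `wer` is illustrative, not from the study):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # word deleted
                          d[i][j - 1] + 1,        # word inserted
                          d[i - 1][j - 1] + cost) # word substituted
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deletion in six words
```

A WER of 0.19 therefore means roughly one word in five was transcribed wrongly, which is the sense in which the study's 19% and 35% figures are comparable across user groups.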

In 2014, Google researchers wrote a paper detailing the reason behind such biases. Titled 'Discriminative Pronunciation Modelling for Dialectal Speech Recognition', the paper described how African American Vernacular English (AAVE), a dialect mostly used by African Americans in casual speech, differs from Standard American English (SAE) in pronunciation and vocabulary. The accuracy of an ASR system dropped for a specific dialect because of that dialect's lack of representation in the training data.


Diverse dataset 

The dataset's 45,000 videos capture more than 3,000 participants of varying ages, ethnicities and genders speaking on unscripted subjects. In addition, the researchers categorised the collected speech by the speakers' skin tone. While skin tone is a more important variable in computer vision, a participant's skin tone could correlate with variables in speech.

The researchers built several speech recognition models, including a LibriSpeech model, a supervised video model, a semi-supervised video model and a semi-supervised teacher video model. The results showed large accuracy gaps across gender but not across age groups. Skin tone, as it turned out, was an important factor driving performance differences among subgroups. The study concluded that the larger and more varied the dataset, the lower the ASR model's comparative error rates, and that a dataset must represent a diverse range of attributes across subgroups to achieve more evenly distributed accuracy.
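The subgroup comparison described above amounts to computing an error rate per group and looking at the spread between the best- and worst-served groups. A minimal sketch, with invented illustrative counts (not the paper's numbers) and hypothetical group labels:

```python
# Hypothetical per-subgroup error counts, for illustration only.
results = {
    "group_a": {"errors": 190, "ref_words": 1000},  # 19% WER
    "group_b": {"errors": 350, "ref_words": 1000},  # 35% WER
}

def subgroup_wer(counts: dict) -> dict:
    """Word error rate per subgroup: errors / reference word count."""
    return {g: c["errors"] / c["ref_words"] for g, c in counts.items()}

def max_gap(wers: dict) -> float:
    """Gap between the worst and best subgroup WER, one simple fairness indicator."""
    vals = list(wers.values())
    return max(vals) - min(vals)

wers = subgroup_wer(results)
print(wers)           # {'group_a': 0.19, 'group_b': 0.35}
print(max_gap(wers))  # ~0.16: a large gap suggests one group is under-served
```

Shrinking this gap, rather than only the overall average error, is the sense in which a more diverse dataset yields "more evenly distributed" accuracy.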


Last October, Speechmatics, a UK-based speech recognition company, said its speech recognition system had an accuracy of 83% for African American users. Speechmatics comfortably outperformed Microsoft (73%), Amazon and Google (69% each), IBM (62%) and Apple (55%). The company's model failed to recognise 17% of the words spoken by Black speakers, compared with 31% for Amazon and Google.

Speechmatics said it had trained its ML models on reams of unlabelled data from podcasts and social media to expose the software to different accents, styles and grammar. “It would be good if people were open-sourcing test sets that let you evaluate how well you’re doing on this front,” Will Williams, the company’s vice-president of ML, said. 


Poulomi Chatterjee
Poulomi is a Technology Journalist with Analytics India Magazine. Her fascination with tech and eagerness to dive into new areas led her to the dynamic world of AI and data analytics.
