
New Research Suggests Speech Recognition Technology May Be Racist

Speech recognition is the process by which machines decode human speech. The technology has gained significant traction among big tech companies in recent years. With advances in deep learning and natural language processing (NLP), it has become widespread in virtual assistants, hands-free computing, digital dictation platforms, and automated subtitling for video content, among other applications. According to reports, the overall voice and speech recognition market is expected to grow at a CAGR of 17.2% from 2019 to 2025, reaching $26.8 billion.

However, research has revealed that Automatic Speech Recognition (ASR) technology exhibits racial bias against some subgroups of the population. According to researchers at Stanford University, ASR systems do not work equally well for all subgroups.



The researchers examined the ability of five state-of-the-art ASR systems, developed by Amazon, Apple, Google, IBM, and Microsoft, to transcribe structured interviews conducted with 42 white speakers and 73 black speakers. They found that all five systems exhibited substantial racial disparities, with an average word error rate of 0.35 for black speakers compared with 0.19 for white speakers.

Dataset Used

The analysis performed by the researchers is based on two datasets of conversational speech. The first is the Corpus of Regional African American Language (CORAAL), a collection of sociolinguistic interviews with black individuals speaking African American Vernacular English (AAVE). The second is Voices of California (VOC), a compilation of interviews recorded in both rural and urban areas of the state. In total, the corpus spans five US cities and consists of 19.8 hours of audio, matched on the age and gender of the speakers.

The performance of the ASR systems is evaluated in terms of the word error rate (WER). Despite variation in transcription quality across systems, the researchers found that error rates for black speakers were roughly twice those for white speakers on every system tested.
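WER is the word-level Levenshtein (edit) distance between a reference transcript and the ASR hypothesis, divided by the number of words in the reference. The following is a minimal sketch of that computation; the function name and the example transcripts are illustrative, not taken from the study:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / len(ref)

# One dropped word out of six reference words:
print(wer("the cat sat on the mat", "the cat sat on mat"))  # ≈ 0.167
```

A WER of 0.35 therefore means that, on average, roughly one in three words spoken by black participants was transcribed incorrectly.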

The Analysis

To analyse the ASR systems, the researchers applied data filtering, standardization, and matching procedures, and then compared the systems in several ways.

First, the researchers computed the average word error rates of machine transcriptions across matched audio snippets of white and black speakers. On this measure, Apple's ASR showed the worst overall performance.

The researchers then investigated two mechanisms that could account for the racial disparities: a performance gap in the 'language models' (models of lexicon and grammar) underlying modern ASR systems, and a performance gap in the acoustic models underlying these systems. They found evidence of a gap in the acoustic models, but not in the language models.
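A language-model check of this kind typically asks whether the text of what each group said is equally predictable under a language model, usually measured by perplexity: if transcripts from one group were much harder for the model to predict, the language model itself would be a source of the error gap. The toy sketch below uses an add-one-smoothed bigram model; the function and sample sentences are hypothetical and not the commercial models or data used in the study:

```python
import math
from collections import Counter

def bigram_perplexity(train_sents, test_sents):
    """Train an add-one-smoothed bigram model on train_sents and
    return its perplexity on test_sents (lower = more predictable)."""
    vocab = {w for s in train_sents for w in s.split()} | {"<s>", "</s>"}
    V = len(vocab)
    unigrams, bigrams = Counter(), Counter()
    for s in train_sents:
        toks = ["<s>"] + s.split() + ["</s>"]
        unigrams.update(toks[:-1])          # context counts
        bigrams.update(zip(toks, toks[1:]))  # bigram counts
    log_prob, n_tokens = 0.0, 0
    for s in test_sents:
        toks = ["<s>"] + s.split() + ["</s>"]
        for a, b in zip(toks, toks[1:]):
            # Add-one (Laplace) smoothed bigram probability
            p = (bigrams[(a, b)] + 1) / (unigrams[a] + V)
            log_prob += math.log(p)
            n_tokens += 1
    return math.exp(-log_prob / n_tokens)
```

Comparing perplexities on the two groups' transcripts would expose a language-model gap if one existed; the Stanford study found that the groups' transcripts were comparably predictable, pointing instead at the acoustic models.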

The Outcome

The researchers found that all five ASR systems exhibited substantial racial disparities, with an average WER of 0.35 for black speakers compared with 0.19 for white speakers. They noted that the exact language models underlying commercial ASR systems are not publicly available, which limits direct inspection of the source of these disparities.

The findings indicate that the racial disparities arise primarily from a performance gap in the acoustic models, suggesting that the systems struggle with the phonological, phonetic, or prosodic characteristics of African American Vernacular English rather than its grammar or vocabulary. The researchers suspect the cause is an insufficient amount of audio data from black speakers in the training sets.

Wrapping Up

To mitigate the issue, the researchers proposed strategies such as training on more diverse datasets that include African American Vernacular English. This would help reduce the performance gap and make speech recognition technology more inclusive.

Read the paper here.


Ambika Choudhury
A Technical Journalist who loves writing about Machine Learning and Artificial Intelligence. A lover of music, writing and learning something out of the box.
