Speech recognition is the process by which machines decode human speech. It has been gaining traction among big tech companies in recent times. With advances in deep learning and natural language processing (NLP), the technique has become widespread in virtual assistants, hands-free computing, digital dictation platforms, and automated subtitling for video content, among others. According to reports, the overall voice and speech recognition market is expected to grow at a CAGR of 17.2% from 2019 to 2025 to reach $26.8 billion.
However, research has revealed that automatic speech recognition (ASR) technology exhibits racial bias against some subgroups of people. According to researchers at Stanford University, ASR does not work equally well for all subgroups of the population.
The researchers examined the ability of five state-of-the-art ASR systems, developed by Amazon, Apple, Google, IBM, and Microsoft, to transcribe structured interviews conducted with 42 white speakers and 73 black speakers. They found that all five systems exhibited substantial racial disparities, with an average word error rate of 0.35 for black speakers compared with 0.19 for white speakers.
The analysis is based on two datasets of conversational speech. The first is the Corpus of Regional African American Language (CORAAL), a collection of socio-linguistic interviews with black individuals speaking African American Vernacular English (AAVE). The second is Voices of California (VOC), a compilation of interviews recorded in both rural and urban areas of the state. In total, the corpus spans five US cities and consists of 19.8 hours of audio, matched on the age and gender of the speakers.
The performance of the ASR systems is evaluated in terms of word error rate (WER). Despite variation in transcription quality across systems, the researchers found that error rates for black speakers were roughly twice as large as those for white speakers in every case.
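To make the metric concrete, here is a minimal Python sketch of how WER is commonly computed: the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the machine transcript, divided by the number of words in the reference. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; this is not the researchers' own evaluation code, which also involved standardizing the transcripts.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative sketch only: tokens are split on whitespace and
    compared exactly, with no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Under this definition, a WER of 0.35 corresponds to roughly 35 word-level errors for every 100 words in the reference transcript.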
To analyze the ASR systems, the researchers applied data filtering, standardization, and matching procedures, and then compared the systems in several ways.
First, the researchers computed the average word error rates of the machine transcriptions across matched audio snippets of white and black speakers. On this measure, Apple's ASR showed the worst overall performance.
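The averaging step can be sketched in a few lines of Python. The function `average_wer_by_group`, the record layout, and the system names below are hypothetical, and the numbers are made up purely for illustration; they are not figures from the study.

```python
from collections import defaultdict
from statistics import mean


def average_wer_by_group(records):
    """Average per-snippet WER for each (system, speaker group) pair.

    `records` is an iterable of (system, group, wer) tuples, one per
    transcribed audio snippet. This layout is an assumption.
    """
    buckets = defaultdict(list)
    for system, group, wer in records:
        buckets[(system, group)].append(wer)
    return {key: mean(values) for key, values in buckets.items()}


# Made-up per-snippet error rates, for illustration only.
records = [
    ("system_a", "black", 0.5),
    ("system_a", "black", 0.25),
    ("system_a", "white", 0.25),
    ("system_a", "white", 0.125),
]

averages = average_wer_by_group(records)
for (system, group), avg in sorted(averages.items()):
    print(f"{system} {group}: {avg:.3f}")
```

Comparing the per-group averages within each system is what exposes a disparity of the kind the researchers report.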
The researchers concluded by investigating two mechanisms that could account for the racial disparities: a performance gap in the ‘language models’ (models of lexicon and grammar) underlying modern ASR systems, and a performance gap in the acoustic models underlying these systems. They found evidence of a gap in the acoustic models, but not in the language models.
All five ASR systems showed the disparity, with an average WER of 0.35 for black speakers against 0.19 for white speakers. The researchers note that the exact language models underlying commercial ASR systems are not readily available.
The findings indicate that the racial disparities arise primarily from a performance gap in the acoustic models, suggesting that the systems may be confused by the phonological, phonetic, or prosodic characteristics of African American Vernacular English rather than its grammatical or lexical characteristics. The suspected cause is an insufficient amount of audio data from black speakers in the training of the models.
The researchers proposed strategies such as using more diverse training datasets that include African American Vernacular English. Such measures could mitigate the issue, reduce the performance differences, and make speech recognition technology more inclusive.
Read the paper here.