OpenAI’s Whisper is Revolutionary but (a Little) Flawed

Trained on 680k hours of audio data, Whisper offers everything from real-time speech recognition to multilingual translation

Speech recognition in machine learning has always been one of the most difficult tasks to perfect. The first speech-recognition software was developed in the 1950s, and we’ve come a long way since. 

Recently, OpenAI took a leap in the domain by introducing Whisper. The company says it “approaches human level robustness and accuracy on English speech recognition” and can automatically recognise, transcribe, and translate other languages like Spanish, Italian, and Japanese.

Whisper is widely seen as outperforming commercial ASR (automatic speech recognition) systems such as those behind Alexa, Siri, and Google Assistant. OpenAI, the company that usually does not do justice to its name, decided to open source this model. The digital experience will change radically for many people, but is the model revolutionary?

Here’s what you need to know

As with almost every major new AI model these days, Whisper brings both advantages and potential risks. Under the ‘Broader Implications’ section of Whisper’s model card, OpenAI warns that it could be used to automate surveillance or identify individual speakers in a conversation, but the company hopes it will be used “primarily for beneficial purposes”.

Conversations have also surfaced on the internet about the challenges faced by the early users of this revolutionary transformer model. As a side note, OpenAI researchers chose the original transformer architecture because they wanted to prove that high-quality supervised ASR is possible if enough data is available.

The main challenge is that your laptop may not be as powerful as the computers of professional transcription services. For example, Mitchell Clarke fed the audio from a 24-minute-long interview into Whisper, running on an M1 MacBook Pro. It took almost an hour to transcribe the file. By contrast, Otter completed the transcription within eight minutes.
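To put those numbers on a common scale, here is a minimal sketch of the real-time factor (processing time divided by audio length) for both runs. It assumes “almost an hour” means roughly 60 minutes and “within eight minutes” means 8 minutes; the function name is only illustrative.

```python
def real_time_factor(processing_minutes: float, audio_minutes: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    if audio_minutes <= 0:
        raise ValueError("audio_minutes must be positive")
    return processing_minutes / audio_minutes


AUDIO_MINUTES = 24  # length of the interview mentioned above

# Assumption: "almost an hour" is rounded to 60 minutes.
whisper_rtf = real_time_factor(60, AUDIO_MINUTES)
# Assumption: "within eight minutes" is taken as 8 minutes.
otter_rtf = real_time_factor(8, AUDIO_MINUTES)

print(f"Whisper on the M1 MacBook Pro: {whisper_rtf:.2f}x real time")
print(f"Otter: {otter_rtf:.2f}x real time")
print(f"Whisper took about {whisper_rtf / otter_rtf:.1f} times as long")
```

Under these assumptions, the laptop run is slower than real time while the hosted service is faster than real time, which is the gap the anecdote describes.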

Secondly, installing Whisper is not really a user-friendly process for everyone. Journalist Peter Sterne teamed up with GitHub developer advocate Christina Warren to try and fix the issue by creating a “free, secure, and easy-to-use transcription app for journalists” based on Whisper’s ML model. Sterne said that he decided the program, dubbed Stage Whisper, should exist after he ran some interviews through it and determined that it was “the best transcription he’d ever used, with the exception of human transcribers”.

Another red flag is that timestamp predictions are often biased towards integer values. Users have observed that these tend to be less accurate; blurring the predicted distribution may help, but no conclusive study has been done yet. The timestamp decoding heuristics are a bit naïve and could be improved, along with word-level timestamping.
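One rough way to check for this bias in your own transcripts is to measure how many segment boundaries land exactly on a whole second. The sketch below uses made-up boundary values, not real Whisper output, and the helper name is hypothetical.

```python
def integer_share(timestamps, tol=1e-6):
    """Fraction of timestamps (in seconds) that sit on a whole second."""
    if not timestamps:
        return 0.0
    on_integer = sum(1 for t in timestamps if abs(t - round(t)) <= tol)
    return on_integer / len(timestamps)


# Made-up segment boundaries in seconds, for illustration only.
boundaries = [0.0, 3.0, 3.0, 7.44, 7.44, 12.0, 12.0, 15.82]

share = integer_share(boundaries)
print(f"{share:.1%} of boundaries fall on a whole second")
```

If boundaries were spread evenly over a 0.02-second grid (an assumption about how the timestamps are quantised), only about one in fifty would be expected to land on a whole second, so a much larger share would hint at the bias users describe.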

Peculiar failure case?

Whisper has also been described as a ‘peculiar failure case’, because the model sometimes exhibits failures in recognition quality.


When Whisper, Talon, and Nemo were tested against the exact same test sets with the same text normalization, all of the large models performed well at general dictation. However, Whisper was painfully slow compared to the other models tested. In GPU tests, much higher throughput can be achieved with the largest Talon 1B model and the Nemo xlarge (600M) model than with any Whisper model, including Whisper Tiny (39M).
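Comparisons like this usually rest on word error rate (WER) computed after every model’s output has passed through the same normaliser. The following is a minimal sketch of that idea; the normaliser is a simplified stand-in rather than the one used in the tests described above, and the sample sentences are invented.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, split into words (simplified stand-in)."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)
    return text.split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping one row of the table at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Invented example: a compound word split counts as two word errors.
wer = word_error_rate("Turn on the flashlight, please.",
                      "turn on the flash light please")
print(f"WER: {wer:.2f}")
```

Because both strings go through the same `normalize` step, differences in casing and punctuation do not count as errors, while the split of “flashlight” into two words does.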

Whisper is very good at producing coherent text, even when it is completely incorrect about what was said. In an analysis of some ‘worst case’ outputs, neither the Talon nor the Nemo models produced anything like this. Most of Talon’s errors in this test set were compound word splits.

An analysis of the paper found that, in general, at least for Indian languages, translations are better. Transcriptions, however, suffer from catastrophic failures.

In conclusion, Whisper is a very neat set of models and capabilities, especially for the multilingual and translation use cases. It will also probably be a great tool to supervise the training of other models. However, given the observed failure cases, users may not want to use Whisper in production without a second model to double-check the output.
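As a rough illustration of such a double-check, the sketch below compares the transcript from a primary model with the transcript from a second model and flags segments where the two disagree too much. The function names, the 0.8 threshold, and the sample sentences are all assumptions made for illustration; the transcripts are plain strings rather than output from any real model.

```python
from difflib import SequenceMatcher


def agreement(primary: str, secondary: str) -> float:
    """Word-level similarity between two transcripts, from 0.0 to 1.0."""
    a = primary.lower().split()
    b = secondary.lower().split()
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def needs_review(primary: str, secondary: str, threshold: float = 0.8) -> bool:
    """Flag a segment when the two transcripts agree less than the threshold."""
    return agreement(primary, secondary) < threshold


# Invented transcript pairs: (primary model, second model).
segments = [
    ("please send the report by friday", "please send the report by friday"),
    ("thank you for watching the video", "turn the volume down"),
]

flags = [needs_review(first, second) for first, second in segments]
for (first, _), flagged in zip(segments, flags):
    print(f"{'REVIEW' if flagged else 'ok':6} {first}")
```

A fluent but wrong transcript is unlikely to match an independent model’s output word for word, so low agreement is a cheap signal that a human should look at that segment.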

Tasmia Ansari
Tasmia is a tech journalist at AIM, looking to bring a fresh perspective to emerging technologies and trends in data science, analytics, and artificial intelligence.
