Speech recognition in machine learning has always been one of the most difficult tasks to perfect. The first speech-recognition software was developed in the 1950s, and we’ve come a long way since.
Recently, OpenAI took a leap in the domain by introducing Whisper. The company says it “approaches human level robustness and accuracy on English speech recognition” and can automatically recognise, transcribe, and translate other languages like Spanish, Italian, and Japanese.
There’s little doubt that Whisper works better than commercial ASR (automatic speech recognition) systems such as Alexa, Siri, and Google Assistant. OpenAI, the company that usually does not do justice to its name, decided to open-source this model. The digital experience will change radically for many people, but is the model revolutionary?
Here’s what you need to know
As with almost every major new AI model these days, Whisper brings advantages as well as potential risks. In the ‘Broader Implications’ section of Whisper’s model card, OpenAI warns that the model could be used to automate surveillance or identify individual speakers in a conversation, but the company hopes it will be used “primarily for beneficial purposes”.
Conversations have also surfaced on the internet about the challenges faced by the early users of this revolutionary transformer model. As a side note, OpenAI researchers chose the original transformer architecture because they wanted to prove that high-quality supervised ASR is possible if enough data is available.
The main challenge is that your laptop may not be as powerful as the computers used by professional transcription services. For example, Mitchell Clarke fed the audio from a 24-minute-long interview into Whisper running on an M1 MacBook Pro, and it took almost an hour to transcribe the file. By contrast, Otter completed the transcription within eight minutes.
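For reference, the open-source Python package exposes a simple API, and picking a smaller checkpoint is the usual way to trade some accuracy for speed on a laptop. A minimal sketch (the audio file path is illustrative):

```python
import whisper  # pip install -U openai-whisper (also requires ffmpeg)

# Smaller checkpoints such as "tiny" or "base" run far faster on a laptop
# than "large", at the cost of some accuracy.
model = whisper.load_model("base")

# Transcribe a local audio file; the path here is illustrative.
result = model.transcribe("interview.mp3")
print(result["text"])
```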
Secondly, installing Whisper is not a particularly user-friendly process. Journalist Peter Sterne teamed up with GitHub developer advocate Christina Warren to fix that by creating a “free, secure, and easy-to-use transcription app for journalists” based on Whisper’s ML model. Sterne said he decided the program, dubbed Stage Whisper, should exist after he ran some interviews through Whisper and determined that it was “the best transcription he’d ever used, with the exception of human transcribers”.
I'm working on a new project with @film_girl to create a free, secure, and easy-to-use transcription app for journalists, powered by @openai's whisper ML model. If you're interested in contributing to the project, please let us know and we'll add you to the github repo.
— Peter Sterne (@petersterne) September 22, 2022
Another red flag is that the predicted timestamps are often biased towards integer values, which users have observed tend to be less accurate. Blurring the predicted distribution may help, but no conclusive study has been done yet. The timestamp decoding heuristic is somewhat naïve and could be improved, along with word-level timestamping.
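The “blurring” idea has not been validated, but as a rough sketch of what it could look like, the snippet below smooths a timestamp probability vector with a Gaussian kernel so mass piled on round-number timestamps spreads onto neighbouring values. The function name and the hook into Whisper’s decoder are assumptions, not part of the released code:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def blur_timestamp_probs(timestamp_probs: np.ndarray, sigma: float = 2.0) -> np.ndarray:
    """Spread probability mass piled on round-number timestamps onto neighbours.

    `timestamp_probs` is assumed to be the model's probability over the ordered
    timestamp vocabulary (one entry per 0.02 s step). This is an untested
    illustration of the blurring idea, not Whisper's actual decoding code.
    """
    blurred = gaussian_filter1d(timestamp_probs, sigma=sigma)
    return blurred / blurred.sum()  # renormalise to a valid distribution
```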
Peculiar failure case?
Whisper has also been described as a ‘peculiar failure case’: the model sometimes produces striking failures in recognition quality.
(Credits: https://docs.google.com/spreadsheets/d/1xdaK-RJZ2ftMKBME45aAeEmMHSJSxb3wW8-GzT1whgg/edit?usp=sharing)
While testing Whisper, Talon, and Nemo against the exact same test sets with the same text normalization, all of the large models performed well at general dictation. However, Whisper was painfully slow compared to the other models tested. In GPU tests, the largest Talon 1B model and the Nemo xlarge (600M) model achieved much higher throughput than any Whisper model, including Whisper Tiny (39M).
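For an apples-to-apples comparison like that, word error rate has to be computed after the same text normalization is applied to every model’s output. A minimal sketch, assuming the `jiwer` package; the normalization rules here are illustrative, not the ones used in the cited tests:

```python
import re
import jiwer  # pip install jiwer

def normalize(text: str) -> str:
    """Shared normalisation so every model is scored on the same text form."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)     # drop punctuation
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

def scored_wer(reference: str, hypothesis: str) -> float:
    """Word error rate after applying the shared normalisation to both sides."""
    return jiwer.wer(normalize(reference), normalize(hypothesis))

# Illustrative check with made-up strings
print(scored_wer("Set a timer for ten minutes.", "set a timer for 10 minutes"))
```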
Whisper is very good at producing coherent output even when it is completely wrong about what was said. When some ‘worst case’ outputs were analysed, neither the Talon nor the Nemo models showed worst-case results anything like Whisper’s; most of Talon’s errors in this test set were compound-word splits.
An analysis of the paper found that, at least for Indian languages, translations generally fare better, while transcriptions suffer from catastrophic failures.
In conclusion, Whisper is a very neat set of models and capabilities, especially for the multilingual and translation use cases. It will probably also be a great tool for supervising the training of other models. However, given the observed failure cases, users may not want to run Whisper in production without a second model to double-check the output.
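One lightweight way to do that double-checking is to measure how much a second model’s transcript disagrees with Whisper’s and route large disagreements to a human. A hedged sketch, with a hypothetical function name and an illustrative threshold:

```python
import jiwer  # pip install jiwer

def needs_review(primary_text: str, backup_text: str, max_disagreement: float = 0.15) -> bool:
    """Flag a transcript for human review when two ASR systems disagree too much.

    One system's output is used as the "reference" purely to measure disagreement;
    the 0.15 threshold is illustrative and would need tuning for each use case.
    """
    return jiwer.wer(primary_text.lower(), backup_text.lower()) > max_disagreement
```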