Speech recognition remains a challenge in AI, but OpenAI has moved one step closer to solving it. In a blog post last week, OpenAI introduced Whisper, a multilingual automatic speech recognition (ASR) system that has been open-sourced and that, according to the company, approaches human-level robustness and accuracy on English speech recognition.
Numerous organisations, including Google, Meta and Amazon, have developed highly capable speech recognition systems, but OpenAI claims that Whisper stands out. The model is trained on 680,000 hours of multilingual and multitask supervised data collected from the web. According to OpenAI, using such a large and diverse dataset makes the model more robust to background noise, accents and technical jargon.
The company’s open-sourced models and inference code are intended to serve as a foundation for building useful applications and for further research on robust speech processing.
Source: Introducing Whisper, OpenAI
An excerpt from the blog reads, “The Whisper architecture is a simple end-to-end approach, implemented as an encoder-decoder Transformer. Input audio is split into 30-second chunks, converted into a log-Mel spectrogram, and then passed into an encoder. A decoder is trained to predict the corresponding text caption, intermixed with special tokens that direct the single model to perform tasks such as language identification, phrase-level timestamps, multilingual speech transcription, and to-English speech translation.”
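The first step in that excerpt, splitting audio into 30-second chunks, can be illustrated with a minimal sketch. This is not OpenAI's code: the function name `split_into_chunks`, the 16 kHz sampling rate and the choice to zero-pad the final chunk are assumptions made for illustration, and the log-Mel spectrogram step is left out.

```python
# Illustrative sketch only; not taken from the Whisper codebase.
SAMPLE_RATE = 16_000   # assumed sampling rate in Hz
CHUNK_SECONDS = 30     # chunk length quoted in the blog excerpt


def split_into_chunks(samples, sample_rate=SAMPLE_RATE, chunk_seconds=CHUNK_SECONDS):
    """Split a sequence of audio samples into fixed-length chunks.

    The last chunk is padded with silence (zeros) so that every chunk
    has exactly sample_rate * chunk_seconds samples (an assumption here).
    """
    chunk_len = sample_rate * chunk_seconds
    chunks = []
    for start in range(0, len(samples), chunk_len):
        chunk = list(samples[start:start + chunk_len])
        # Pad a short final chunk up to the full chunk length.
        chunk.extend([0.0] * (chunk_len - len(chunk)))
        chunks.append(chunk)
    return chunks


# 75 seconds of dummy audio yields three 30-second chunks.
audio = [0.1] * (75 * SAMPLE_RATE)
chunks = split_into_chunks(audio)
print(len(chunks), len(chunks[0]))  # prints: 3 480000
```

In a full pipeline, each chunk would then be converted to a log-Mel spectrogram before being passed to the encoder, as the excerpt describes.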
The company says that other existing approaches frequently use smaller, more closely paired audio-text training datasets, or broad but unsupervised audio pretraining. Because Whisper was trained on a large, diverse dataset (about a third of which is non-English audio) without being fine-tuned to any specific benchmark, it does not beat models that specialise in LibriSpeech performance.
However, when measured across many diverse datasets, Whisper’s zero-shot performance is considerably more robust, with 50% fewer errors than those specialised models. OpenAI hopes that the model’s ease of use and high accuracy will allow developers to add voice interfaces to a wider set of applications.
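To make a figure such as "50% fewer errors" concrete, here is a small sketch of how a relative error reduction could be computed. It assumes that errors are measured as word error rate (WER), which the article does not state explicitly; the function names and the example sentences are hypothetical.

```python
# Illustrative sketch; the metric (WER) and all names are assumptions.
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_error_reduction(baseline_wer, new_wer):
    """Fraction by which new_wer is lower than baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer


reference = "please turn off the kitchen lights"
baseline = word_error_rate(reference, "please turn of the chicken lights")  # 2 errors
improved = word_error_rate(reference, "please turn off the chicken lights")  # 1 error
print(relative_error_reduction(baseline, improved))
```

With these made-up transcripts, the second system makes one word error where the first makes two, so its error rate is half as large.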
The paper, model card and additional details on Whisper are available through OpenAI’s “Introducing Whisper” blog post.