OpenAI Open-Sources ‘Whisper’ — a Multilingual Speech Recognition System

The company’s open-sourced models and inference code serve as a foundation for building useful applications and boost further research on robust speech processing.
Speech recognition remains a challenging problem in AI, but OpenAI has just moved a step closer to solving it. In a blog post last week, the company introduced Whisper, a multilingual automatic speech recognition (ASR) system that has been trained and open-sourced to approach human-level robustness and accuracy on English speech recognition.

Numerous organisations such as Google, Meta and Amazon have developed highly capable speech recognition systems, but OpenAI claims that Whisper stands out. The model is trained on 680,000 hours of multilingual and multitask supervised data collected from the web. According to the company, this large and diverse dataset improves robustness to background noise, unique accents and technical jargon.

An excerpt from the blog reads, “The Whisper architecture is a simple end-to-end approach, implemented as an encoder-decoder Transformer. Input audio is split into 30-second chunks, converted into a log-Mel spectrogram, and then passed into an encoder. A decoder is trained to predict the corresponding text caption, intermixed with special tokens that direct the single model to perform tasks such as language identification, phrase-level timestamps, multilingual speech transcription, and to-English speech translation.”
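
To make this pipeline concrete, the following is a minimal sketch of the lower-level usage pattern shown in the open-sourced repository, in which audio is padded or trimmed to the 30-second window, converted to a log-Mel spectrogram, and then decoded; the model size and file name here are placeholder choices.

```python
import whisper

# Load one of the released checkpoints ("base" is a placeholder choice).
model = whisper.load_model("base")

# Load the audio and pad/trim it to the 30-second window the model expects.
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)

# Convert the waveform into a log-Mel spectrogram on the model's device.
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# The same model identifies the spoken language...
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# ...and decodes the spectrogram into a text caption.
options = whisper.DecodingOptions()
result = whisper.decode(model, mel, options)
print(result.text)
```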

The company says that other existing approaches frequently use either smaller, closely paired audio-text training datasets or broad but unsupervised audio pretraining. Because Whisper was trained on a large, diverse dataset (about a third of which is non-English audio) without being fine-tuned to any specific benchmark, it does not beat models that specialise in LibriSpeech performance.

When measured across many diverse datasets, however, Whisper’s zero-shot performance proves far more robust, making about 50% fewer errors than those specialised models. OpenAI hopes that the model’s ease of use and high accuracy will allow developers to add voice interfaces to a wider set of applications.
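
As an illustration of that ease of use, a single call to the open-sourced Python package is enough to transcribe a file, and the same model can be asked to translate non-English speech into English; the model size and file name below are placeholders.

```python
import whisper

model = whisper.load_model("base")

# Transcribe a local audio file in its original language.
result = model.transcribe("audio.mp3")
print(result["text"])

# Ask the same model to translate the speech into English text instead.
result = model.transcribe("audio.mp3", task="translate")
print(result["text"])
```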

The paper, model card and code for Whisper are available on OpenAI’s website and in the accompanying open-source repository.

Bhuvana Kamath