OpenAI Open-Sources ‘Whisper’ — a Multilingual Speech Recognition System

The company’s open-sourced models and inference code serve as a foundation for building useful applications and for further research on robust speech processing.

Speech recognition remains a challenge in AI, but OpenAI has just moved a step closer to solving it. In a blog post last week, the company introduced Whisper, an open-sourced multilingual automatic speech recognition (ASR) system trained to approach human-level robustness and accuracy on English speech recognition.

Numerous organisations such as Google, Meta and Amazon have developed highly capable speech recognition systems, but OpenAI claims that Whisper stands out. The model is trained on 680,000 hours of multilingual and multitask supervised data collected from the web, and the company says this large, diverse dataset improves the system’s recognition of speech amid background noise, unique accents and technical jargon.

Source: Introducing Whisper, OpenAI

An excerpt from the blog reads, “The Whisper architecture is a simple end-to-end approach, implemented as an encoder-decoder Transformer. Input audio is split into 30-second chunks, converted into a log-Mel spectrogram, and then passed into an encoder. A decoder is trained to predict the corresponding text caption, intermixed with special tokens that direct the single model to perform tasks such as language identification, phrase-level timestamps, multilingual speech transcription, and to-English speech translation.”
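
The open-sourced inference code exposes each of these steps directly. The following is a minimal sketch using the whisper Python package from OpenAI’s repository (assuming the package is installed; “audio.mp3” stands in as a placeholder input file):

```python
import whisper

# Load one of the released model sizes (tiny/base/small/medium/large)
model = whisper.load_model("base")

# Load the audio and pad or trim it to the model's 30-second input window
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)

# Convert the waveform into a log-Mel spectrogram for the encoder
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# The special tokens let the same model identify the spoken language...
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# ...and decode the corresponding text
options = whisper.DecodingOptions()
result = whisper.decode(model, mel, options)
print(result.text)
```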


According to the company, existing approaches frequently use either smaller, more closely paired audio-text training datasets, or broad but unsupervised audio pretraining. Because Whisper was trained on a large, diverse dataset (about a third of which is non-English audio) without being fine-tuned to any specific benchmark, it does not beat models that specialise in LibriSpeech performance.

However, when Whisper’s zero-shot performance is measured across many diverse datasets, it proves far more robust, making 50% fewer errors than those specialised models. OpenAI hopes that the model’s ease of use and high accuracy will allow developers to add voice interfaces to a much wider set of applications.
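
That ease of use comes down to a single high-level call in the open-sourced package. A minimal sketch, again assuming the whisper package is installed and using a placeholder file name:

```python
import whisper

# transcribe() handles the 30-second windowing internally and
# returns the recognised text for the whole file
model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
print(result["text"])
```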

The paper, model card and additional details on Whisper are available on OpenAI’s blog.

Bhuvana Kamath
