
OpenAI Open-Sources ‘Whisper’ — a Multilingual Speech Recognition System

The company’s open-sourced models and inference code serve as a foundation for building useful applications and for further research on robust speech processing.


Speech recognition remains a challenge in AI, but OpenAI has just moved a step closer to solving it. In a blog post last week, the company introduced Whisper, a multilingual automatic speech recognition (ASR) system trained to approach human-level robustness and accuracy on English speech recognition, and released it as open source.

Numerous organisations such as Google, Meta and Amazon have developed highly capable speech recognition systems, but OpenAI claims that Whisper stands out. The model is trained on 680,000 hours of multilingual and multitask supervised data collected from the web, and OpenAI says this large, diverse dataset gives it improved recognition of background noise, unique accents, and technical jargon.


Source: Introducing Whisper, OpenAI

An excerpt from the blog reads, “The Whisper architecture is a simple end-to-end approach, implemented as an encoder-decoder Transformer. Input audio is split into 30-second chunks, converted into a log-Mel spectrogram, and then passed into an encoder. A decoder is trained to predict the corresponding text caption, intermixed with special tokens that direct the single model to perform tasks such as language identification, phrase-level timestamps, multilingual speech transcription, and to-English speech translation.”
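Those steps map directly onto the inference code OpenAI released. The following is a minimal sketch using the open-sourced whisper Python package (installable from the openai/whisper GitHub repository); audio.mp3 is a hypothetical input file:

```python
import whisper

# load one of the released checkpoints (tiny, base, small, medium, large)
model = whisper.load_model("base")

# load audio and pad/trim it to the 30-second window the encoder expects
audio = whisper.load_audio("audio.mp3")  # hypothetical input file
audio = whisper.pad_or_trim(audio)

# convert the waveform into a log-Mel spectrogram on the model's device
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# the special tokens let the same model first identify the spoken language
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# then decode the corresponding text caption
result = whisper.decode(model, mel, whisper.DecodingOptions())
print(result.text)
```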

The company says that other existing approaches frequently use either smaller, more closely paired audio-text training datasets or broad but unsupervised audio pretraining. Since Whisper was trained on a large, diverse dataset (about a third of which is non-English audio) without being fine-tuned to any specific benchmark, it does not beat models that specialise in LibriSpeech performance.

When measured across many diverse datasets, however, Whisper’s zero-shot performance proves far more robust, making 50% fewer errors than those specialised models. OpenAI hopes that the model’s ease of use and high accuracy will allow developers to add voice interfaces to a much wider set of applications.
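That ease of use shows in the package’s high-level API. As a sketch, again assuming the same hypothetical audio.mp3 file, a single transcribe() call handles chunking and decoding, and the task option switches the model to to-English translation:

```python
import whisper

model = whisper.load_model("base")

# transcribe() handles loading, 30-second chunking and decoding in one call
result = model.transcribe("audio.mp3")  # hypothetical input file
print(result["text"])

# the same model performs to-English speech translation when asked
result = model.transcribe("audio.mp3", task="translate")
print(result["text"])
```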

The paper, model card, and additional details on Whisper are available on OpenAI’s blog.


Bhuvana Kamath

I am fascinated by technology and AI’s implementation in today’s dynamic world. Being a technophile, I am keen on exploring the ever-evolving trends around applied science and innovation.