NVIDIA NeMo has unveiled Parakeet, a family of automatic speech recognition (ASR) models developed in collaboration with Suno.ai.
Ranging from 0.6 billion to 1.1 billion parameters, the models mark a notable milestone in conversational AI, delivering accurate transcription of spoken English.
Parakeet has excelled in comparative benchmarks, outperforming OpenAI’s Whisper v3. The models are designed for straightforward integration into diverse projects, shipping as user-friendly pre-trained checkpoints that contribute to their versatility in the evolving domain of speech recognition.
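For a sense of what that integration looks like, here is a minimal sketch of loading a Parakeet checkpoint through the NeMo toolkit and transcribing an audio file. It assumes NeMo's ASR collection is installed (pip install "nemo_toolkit[asr]"), that the checkpoint name matches one NVIDIA has published, and that sample.wav is a hypothetical 16 kHz mono recording:

```python
# Minimal sketch: load a pre-trained Parakeet checkpoint with the NeMo
# toolkit and transcribe a local audio file. sample.wav is a placeholder
# for your own recording.
import nemo.collections.asr as nemo_asr

# Pull a published Parakeet checkpoint (CTC variant shown here).
asr_model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="nvidia/parakeet-ctc-1.1b"
)

# Transcribe one or more audio files; the exact return format can vary
# slightly across NeMo versions and decoder types (CTC vs. RNNT).
transcripts = asr_model.transcribe(["sample.wav"])
print(transcripts[0])
```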
Released under the CC BY 4.0 license, Parakeet distinguishes itself through its training on 64,000 hours of audio spanning a wide range of accents, vocal ranges, and acoustic environments.
Noteworthy is its resilience to non-speech audio such as music and silence, a significant advancement in ASR technology.
NVIDIA’s open-source speech recognition models have set a new standard, demonstrating human-level robustness in speech-to-text conversion. This extends to handling different accents and dialects, making them applicable in a global context.
Notably, the models are also robust to background noise, a common challenge in speech recognition, ensuring accurate transcription even in less-than-ideal acoustic conditions.
The support for such a wide range of accents and speaking styles significantly broadens their utility, and NVIDIA’s decision to release the models under the permissive CC BY 4.0 license fosters innovation and accessibility in the field.
Benchmark tests on datasets such as the widely recognized LibriSpeech corpus underscore the superior performance of NVIDIA’s models compared to Whisper v3. This represents a substantial stride in ASR technology and offers promising indications of real-world applicability.
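Benchmarks like these are typically scored by word error rate (WER): the number of word substitutions, insertions, and deletions needed to turn a model’s output into the reference transcript, divided by the number of reference words. As a rough illustration (not the exact evaluation harness behind these results, and using made-up sentences), the open-source jiwer package computes WER like this:

```python
# Illustrative WER computation with the open-source jiwer package
# (pip install jiwer). This mirrors how ASR benchmarks such as
# LibriSpeech are commonly scored; the sentences below are invented.
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

# WER = (substitutions + insertions + deletions) / reference word count
error_rate = jiwer.wer(reference, hypothesis)
print(f"WER: {error_rate:.2%}")  # lower is better
```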

