As ChatGPT (built on the GPT-3.5 architecture) continues to make waves across the globe, OpenAI has quietly launched the second version of Whisper, its open-source multilingual speech recognition model.
The new model is trained for more epochs with regularisation and shows improved performance over the previous version, though it retains the same architecture as the original large model. The team said that it would be updating its research paper soon.
The source code for OpenAI Whisper V2 is available on GitHub.
In October, AI research and development company OpenAI released Whisper, which can translate and transcribe speech in 97 languages. Whisper is trained on over 680,000 hours of multilingual data collected from the web. However, the training dataset for Whisper has been kept private.
Since Whisper's first version was trained on a comparatively large and diverse dataset, it wasn't fine-tuned to any specific dataset. As a result, it didn't surpass models specialised for LibriSpeech, one of the most widely used benchmarks for judging speech recognition.
OpenAI in its blog stated that it hoped that Whisper would serve as a foundation for building useful applications and for further research on robust speech processing.
Currently, the company is experimenting across various offerings, including DALL·E 2, which can produce art from text, the recently released ChatGPT, and the much-awaited GPT-4. However, using Whisper only to translate and transcribe audio under-utilises its scope to do much more.
Among the major challenges is that a typical user's laptop is far less powerful than the hardware used by professional transcription services. Installing the model is also not very user-friendly. Another drawback is that the model's timestamp predictions are often biased towards integer values.
Users have observed that these integer-biased timestamps tend to be less accurate; blurring the predicted distribution may help, but no conclusive study has been done yet.
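The blurring idea can be sketched in a few lines. This is a hypothetical illustration, not Whisper's actual decoding code: it assumes a score for each candidate timestamp, smooths the scores with a small Gaussian kernel so that mass piled up on an integer second spreads to its neighbours, and then picks the peak of the smoothed distribution.

```python
import math

def gaussian_kernel(radius, sigma):
    # Normalised Gaussian weights at integer offsets [-radius, radius].
    w = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-radius, radius + 1)]
    s = sum(w)
    return [x / s for x in w]

def blur_scores(scores, radius=2, sigma=1.0):
    # Convolve the per-timestamp scores with the kernel (edges clipped).
    k = gaussian_kernel(radius, sigma)
    out = []
    for i in range(len(scores)):
        acc = 0.0
        for j, kj in enumerate(k):
            idx = i + j - radius
            if 0 <= idx < len(scores):
                acc += kj * scores[idx]
        out.append(acc)
    return out

# Toy scores over ten candidate timestamps: a sharp spike at the
# integer second (index 0) next to a broad, slightly lower plateau
# around indices 4-6, which is where the true boundary lies.
scores = [0.30, 0.05, 0.05, 0.20, 0.28, 0.28, 0.28, 0.05, 0.05, 0.05]
raw_best = max(range(len(scores)), key=scores.__getitem__)
blurred = blur_scores(scores)
blurred_best = max(range(len(blurred)), key=blurred.__getitem__)
print(raw_best, blurred_best)
```

With the raw scores, the isolated integer-second spike wins; after blurring, the broader plateau accumulates more mass and its peak is selected instead, which is the intuition behind the proposed fix.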
While there are a host of advantages to using the model, there are also potential risks and disadvantages.
On GitHub, under the ‘Broader Implications’ section of the model card, OpenAI warns that it could be used to automate surveillance or identify individual speakers in a conversation, but the company hopes it will be used “primarily for beneficial purposes”.