Just yesterday, OpenAI released the ChatGPT and Whisper APIs to the world. Now developers and companies, clones included, can officially integrate the models into their apps and products. But one question remains: why is OpenAI, the leading AI research organisation, not building a text-to-speech model?
Of late, several models have been able to synthesise text in the voices of notable personalities such as Joe Biden, Donald Trump, Barack Obama, or George Bush with eerie accuracy. Predictably, the internet used the technology to make these people spew nonsense, if they were not doing that anyway. This led to considerable controversy and uproar about the ethical implications of such models.
Imagine—what if ChatGPT suddenly had a voice?
Sounds optimistic and scary at the same time. Let’s assume the real reason why OpenAI doesn’t have a text-to-speech model is that they’re afraid of the consequences. Yes, you heard it here first. OpenAI might just be scared of what their AI will say if they give it a voice.
Now that the ChatGPT API is available, users can easily integrate it with any text-to-speech application to build something that narrates the AI-generated text. Earlier, this task was handled by browser extensions such as ‘chatGPT auto speech’ or ‘Talk-to-ChatGPT’. The problem is that, being extensions, they are neither automatic nor handy, and the audio they generate sounds similar to Cortana, Siri, or Google Assistant: dull and monotonous.
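To make the integration concrete, here is a minimal sketch of the glue a developer might write: ask ChatGPT a question over the new API, then hand the answer to an offline TTS engine. The request shape follows OpenAI's chat completions endpoint; the `pyttsx3` engine is just one example of a "robotic"-sounding local backend, and the function names are our own.

```python
# Sketch: wire the ChatGPT API to a local text-to-speech engine.
# Payload shape follows OpenAI's /v1/chat/completions endpoint.

def build_chat_request(prompt, model="gpt-3.5-turbo"):
    """Build the JSON payload for a single-turn chat completion."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def ask_chatgpt(prompt, api_key):
    """POST the prompt to the chat completions API and return the reply text."""
    import requests  # deferred so the sketch loads without the dependency
    resp = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {api_key}"},
        json=build_chat_request(prompt),
        timeout=30,
    )
    return resp.json()["choices"][0]["message"]["content"]

def narrate(text):
    """Speak the reply with pyttsx3, an offline TTS engine (one option of many)."""
    import pyttsx3  # deferred for the same reason
    engine = pyttsx3.init()
    engine.say(text)
    engine.runAndWait()
```

A caller would chain the two: `narrate(ask_chatgpt("Explain Whisper in one sentence.", api_key))`. The point of the sketch is how little code the API-plus-TTS combination now takes, not the particular engine chosen.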
With the new APIs, developers are already pulling their socks up to build further. HackerFM, an AI-generated HackerNews podcast built on the ChatGPT API, reads out the latest AI developments published on the blog, and the voices sound like two people in an actual conversation.
What Should OpenAI Do?
The Sam Altman-led organisation could take notes from other research organisations here. ElevenLabs recently released ‘Voice Design’, a generative AI for audio, building on its research in speech synthesis and voice cloning. Last month, it also released ‘This Voice Doesn’t Exist’, which, similar to ‘This Face Does Not Exist’, allows users to design entirely new synthetic voices.
While these two models are interesting, another one from ElevenLabs, ‘The First AI that can laugh’, is what makes the most sense. Released late last year, this voice-generation model is trained on 500K hours of data and infers emotion from the text alone, relying on punctuation, syntax, and, most importantly, context; nothing else influences the output.
If this technology can be integrated into ChatGPT, it would be able to generate almost lifelike voices, with emotions that resemble those of a real person. This is more than just a deepfake. Obviously, there are ethical implications, which the company is increasingly concerned about, but this could be the one piece of the puzzle that OpenAI is currently (hopefully) trying to solve with its AGI roadmap.
Or Maybe, OpenAI Doesn’t Care
Text-to-speech models are everywhere these days. You can find them in your phones, computers, and even your cars. But they sound robotic. The new technology could certainly improve on this, and OpenAI would be a worthy contender. So, apart from the ethical reasons, why hasn’t OpenAI jumped on the bandwagon? Well, they have another good reason: they might find it stupid.
Yes, you read that right. Stupid. OpenAI is focused on developing something bigger than a simple text-to-speech model. They’re working on creating AI systems that can understand and interpret language like humans do. That’s no small feat, and it requires a lot more resources and brainpower than creating a text-to-speech model. Or does it?
Another important thing to note: OpenAI is funded by Microsoft, and the big-tech giant already has a text-to-speech model, ‘VALL-E’, a language model for text-to-speech synthesis (TTS). Given just a three-second recording of a person’s voice, the model can synthesise high-quality speech for any input text.
Of course, OpenAI’s decision not to develop a text-to-speech model does not mean they are unaware of its potential benefits. Moreover, they have Microsoft’s VALL-E, whose development likely had plenty of support from OpenAI. OpenAI may also revisit this technology in the future and develop its own version that surpasses existing models, including Microsoft’s, in quality and versatility.
Wait, There is More
Looking back, assistants like Alexa or Cortana (Microsoft’s failed voice assistant) would take our spoken commands, browse the web, and talk back with the desired output. The technology has since improved significantly and could certainly be applied to better this field.
For example, ‘Whisper’ is a state-of-the-art model that converts human speech into written text, while VALL-E does the inverse. Combine these with the speech-synthesis capabilities of something like ElevenLabs, and we might be able to converse with a machine through voice alone, with it responding the same way. There would be no visible exchange of text at all; the model could process it in the background. It sounds like ‘Terminator’ or Joaquin Phoenix’s ‘Her’, dystopian or utopian depending on your taste.
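The voice-only loop described above can be sketched in a few lines: Whisper transcribes the user's speech, ChatGPT replies, and a TTS step would speak the answer back. The OpenAI calls below follow the openai Python library as it stood at the API launch; the audio file path and the final TTS step are placeholders, and the helper names are our own.

```python
# Sketch of one turn of a voice-only conversation:
# speech -> text (Whisper) -> reply (ChatGPT) -> speech (TTS, placeholder).

def add_turn(history, role, content):
    """Append one chat turn to the running conversation and return it."""
    history.append({"role": role, "content": content})
    return history

def voice_turn(audio_path, history):
    import openai  # deferred so the sketch loads without the dependency

    # 1. Speech -> text with the Whisper API
    with open(audio_path, "rb") as audio:
        heard = openai.Audio.transcribe("whisper-1", audio)["text"]

    # 2. Text -> reply with the ChatGPT API, keeping conversation state
    add_turn(history, "user", heard)
    reply = openai.ChatCompletion.create(
        model="gpt-3.5-turbo", messages=history
    )["choices"][0]["message"]["content"]
    add_turn(history, "assistant", reply)

    # 3. Text -> speech would go here (e.g. an ElevenLabs-style TTS call),
    #    so no text ever needs to surface to the user.
    return reply
```

Run in a loop with a microphone recorder feeding `audio_path`, this is the "no exchange of text at all" scenario: the intermediate text exists only inside the pipeline.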