“We will have a digital Babel fish that will hide in your ear and translate all the world’s languages—in ten years.”
Raj Reddy
If you have used Siri or Alexa, then you owe a lot to Dr. Raj Reddy, an Indian-American AI pioneer. Reddy and his colleagues have been pushing the boundaries of AI for decades. He also has a habit of making friendly wagers with his peers about futuristic technologies: Reddy posits an idea or innovation that borders on the impossible, and his colleagues bet against his techno-optimism. Last week, Reddy and fellow computer scientists Gordon Bell, John Hennessy, Ed Lazowska and Andy Van Dam came together for a virtual event organised by CHM to fete Reddy. There, Reddy came up with a new wager: in ten years, we will have a 21st-century version of the Babel fish, in the form of an earpiece that can translate hundreds of languages in real time. The Babel fish is the universal translator from The Hitchhiker’s Guide to the Galaxy.
According to a record kept by Bell, who always bet against him, here are a few of Reddy’s losing predictions:
- Reddy predicted that by the year 1996, video-on-demand would be available in 5 cities with more than 250,000 people having access to the service. He missed the mark by at least a decade.
- By 2002, 10,000 workstations would communicate at a speed of GBs per second.
- Reddy thought people would embrace AI’s significance by 2003. But it was not until the publication of the seminal AlexNet paper in 2012 that people woke up to the power and impact of AI. A decade into the ImageNet competition, AI has added many feathers to its cap; it went from recognising faces to creating them (deepfakes). Today, we have language models like GPT-3 producing human-like text.
Reddy’s predictions tend to run at least a decade ahead of their time. There are a couple of things to consider to understand how realistic the idea of a Babel fish is:
1| State of multilingual translation
Language models like GPT-3 and BERT still struggle with issues like bias, even in high-resource languages like English. Nevertheless, researchers have begun extending these models to regional languages, and such work has already been incorporated into products like Google Lens and Google Translate. Image-to-text and text-to-image systems have made great progress over the last couple of years.

When it comes to speech recognition, however, ML models still fall short. For example, an Alexa device trained on American English can struggle in an Indian setting. Even so, the direction the field is moving in makes Reddy’s wager all the more interesting. For instance, researchers on Amazon’s Alexa team have moved to end-to-end deep learning models in place of separate, specialised acoustic and language models. The neural networks in these end-to-end models take acoustic speech signals as input and directly output transcribed speech, eliminating the overhead (think: latency) that comes from chaining specialised models.
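To make the idea concrete, here is a minimal sketch of what such an end-to-end recogniser looks like in code: one neural network maps acoustic features straight to character probabilities, with no separate acoustic or language model. The architecture, dimensions and vocabulary below are illustrative assumptions, not Alexa’s actual system.

```python
# A toy end-to-end speech recogniser: one network from acoustic features to
# per-frame character logits, trainable with CTC loss. Purely illustrative.
import torch
import torch.nn as nn

class TinyEndToEndASR(nn.Module):
    def __init__(self, n_mels=80, hidden=256, vocab_size=29):
        # vocab_size assumes 26 letters + space + apostrophe + CTC blank
        super().__init__()
        self.encoder = nn.LSTM(n_mels, hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, vocab_size)

    def forward(self, features):             # features: (batch, time, n_mels)
        encoded, _ = self.encoder(features)  # no hand-built acoustic/language split
        return self.classifier(encoded)      # (batch, time, vocab) logits for CTC decoding

model = TinyEndToEndASR()
dummy_features = torch.randn(1, 200, 80)     # roughly 2 seconds of log-mel frames
print(model(dummy_features).shape)           # torch.Size([1, 200, 29])
```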
According to Shehzad Mevawalla, the director of automatic speech recognition for Alexa, the fully neural representation has shrunk the model to 1/100th of its original size. “These models can then be deployed on our devices and executed using our own Amazon neural processor, AZ1 — a neural accelerator that is optimized to run deep neural networks,” said Mevawalla. The researchers are betting on semi-supervised learning to tap the huge swathes of unannotated speech data generated by millions of Alexa devices around the world. Meanwhile, researchers at Facebook AI claim their wav2vec model goes one better, learning representations directly from audio speech with far less labelled data than semi-supervised approaches require.
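For readers who want to poke at this line of work, the publicly released wav2vec 2.0 checkpoints can be run in a few lines via the Hugging Face transformers library. This is a hedged illustration: the checkpoint named below is the open-source follow-up to the wav2vec work referenced above, not necessarily the exact model behind Facebook AI’s claim.

```python
# Transcribing speech with a pretrained wav2vec 2.0 model (illustrative;
# the checkpoint choice is an assumption, not the exact model cited above).
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

speech = torch.zeros(16000)  # stand-in for 1 second of 16 kHz microphone audio
inputs = processor(speech.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits   # (batch, frames, vocab)

predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids))     # decoded transcription
```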
The mathematical side of the wager looks optimistic, but what about the physical side?
2| Compactness
The other big challenge to Reddy’s prognosis is AI’s insatiable thirst for computing power. Anyone who has used a CPU or GPU to train ML models will have run into heating issues. Though the Alexa team claims to have minimised the memory and compute footprint through quantisation techniques, the compactness of the earpiece will be an issue. On-device ML has already been realised through edge devices and frameworks like TensorFlow Lite, but a highly efficient, sleek earpiece equipped with high-speed connectivity is still a distant dream. And while the limits of Moore’s Law constrain the physical side of Reddy’s forecast, the shortage of curated language datasets poses another challenge. “If we don’t figure out how to extend Moore’s Law, you’ll put that thing in your ear, and it will burn your head up,” said Ed Lazowska, an ACM fellow and a computer scientist.
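As a concrete example of the kind of shrinking the Alexa team describes, here is a minimal sketch of post-training quantisation with TensorFlow Lite, the on-device framework mentioned above. The toy Keras model and the output file name are stand-ins, nothing like a production speech recogniser.

```python
# Post-training quantisation with TensorFlow Lite: convert a (toy) Keras model
# and let the converter quantise weights to shrink the on-device footprint.
import tensorflow as tf

# Tiny stand-in model; a real speech model would be far larger.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(80,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(29),
])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enables weight quantisation
tflite_model = converter.convert()

with open("toy_model_quantised.tflite", "wb") as f:
    f.write(tflite_model)
print(f"Quantised model size: {len(tflite_model)} bytes")
```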
“There is a spectrum of cockeyed techno-optimism which Raj occupies.”
Andy Van Dam
Andy Van Dam, another ACM fellow and a professor of computer science, said Reddy’s prediction falls in “the spectrum of cockeyed techno-optimism”. Van Dam also spoke about the financial incentives, and whether the markets are ripe for such products. Then, there is the challenge of user interface design. “I try to be a pragmatist. I think we will get close but not quite there,” concluded Van Dam.
That said, Reddy himself is a bit sceptical of his own prediction. Accessibility, according to Reddy, will be the reason why he might lose this bet. “Technology isn’t enough, you need accessibility and ease of use. It has to be completely unintrusive, like a Babel fish, to fit in our ear, recognize the language and translate it,” said Reddy.