“Amazon’s cloud-based voice service Alexa powers Amazon Echo devices and more than 140,000 models of smart speakers, lights, plugs, smart TVs, and cameras.”
At last year’s re:Invent conference, AWS announced the launch of its Inferentia chips, designed to process machine learning workloads. This week, AWS announced that Alexa services will now be powered by Inferentia, its own chip, and that the Alexa team has migrated the majority of its GPU-based ML inference workloads to Amazon Elastic Compute Cloud (EC2) Inf1 instances.
According to Amazon, tens of millions of customers interact with Alexa every month to control their home devices. The company claims that more than 100 million devices are connected to Alexa and that migrating to Inferentia chips has made Alexa services even better. Compared to GPU-based instances, Inferentia has led to 25% lower end-to-end latency and 30% lower cost for Alexa’s text-to-speech (TTS) workloads. The lower latency, says Amazon, has allowed Alexa engineers to try out more complex algorithms and to enhance the overall Alexa experience for their customers.
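To put the reported percentages in concrete terms, here is a small worked example. Only the 25% and 30% figures come from Amazon; the GPU baseline latency and cost below are hypothetical numbers chosen purely for illustration, and `apply_reduction` is a helper name made up for this sketch.

```python
# Reductions reported by Amazon for Alexa's TTS workloads.
LATENCY_REDUCTION = 0.25  # 25% lower end-to-end latency
COST_REDUCTION = 0.30     # 30% lower cost


def apply_reduction(baseline: float, reduction: float) -> float:
    """Return what is left of `baseline` after cutting it by `reduction`."""
    return baseline * (1.0 - reduction)


# Hypothetical GPU baselines (assumptions, not published figures).
gpu_latency_ms = 200.0
gpu_cost_per_million_requests = 100.0

inf1_latency_ms = apply_reduction(gpu_latency_ms, LATENCY_REDUCTION)
inf1_cost_per_million_requests = apply_reduction(
    gpu_cost_per_million_requests, COST_REDUCTION
)

print(f"Latency: {gpu_latency_ms:.0f} ms -> {inf1_latency_ms:.0f} ms")
print(
    f"Cost per million requests: {gpu_cost_per_million_requests:.0f}"
    f" -> {inf1_cost_per_million_requests:.0f}"
)
```

With these assumed baselines, a 200 ms response would drop to roughly 150 ms, and a bill of 100 units would drop to roughly 70. The actual savings depend on the real baseline figures, which Amazon has not published here.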
How Is Inferentia Helping Alexa?
“Migrating to AWS Inferentia resulted in 25% lower end-to-end latency, and 30% lower cost compared to GPU-based instances for Alexa’s text-to-speech workloads.”
Deploying machine learning models can be very resource-intensive, and inference is where most of the actual work gets done, so it is also where performance gains matter most for an application. AWS Inferentia is designed to handle these ML inference workloads.
Each AWS Inferentia chip contains four NeuronCores equipped with a large on-chip cache. This helps cut down on external memory accesses, which dramatically reduces latency and speeds up typical deep learning operations such as convolutions and transformers. Speeding up deep learning operations is critical to Alexa, which handles each request in three stages:
- Automatic Speech Recognition (ASR): First, Alexa converts the sound to text.
- Natural Language Understanding (NLU): Alexa then tries to understand what it heard.
- Text-To-Speech (TTS): Finally, Alexa generates a voice response from text.
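The three stages above can be pictured as a simple chain of functions. The sketch below is a toy illustration, not Alexa's implementation: every function name is hypothetical, and each stage is a trivial placeholder (text decoding and keyword matching) standing in for what would be a trained ML model in a real system.

```python
from dataclasses import dataclass, field


@dataclass
class Intent:
    """Structured meaning extracted from an utterance (hypothetical type)."""
    name: str
    slots: dict = field(default_factory=dict)


def recognize_speech(audio: bytes) -> str:
    """ASR placeholder: treats the 'audio' bytes as UTF-8 text."""
    return audio.decode("utf-8").strip().lower()


def understand(text: str) -> Intent:
    """NLU placeholder: keyword matching instead of a language model."""
    words = text.split()
    if "light" in words:
        state = "on" if "on" in words else "off"
        return Intent("SetLight", {"state": state})
    return Intent("Unknown")


def synthesize(reply: str) -> bytes:
    """TTS placeholder: returns the reply as bytes instead of a waveform."""
    return reply.encode("utf-8")


def handle_request(audio: bytes) -> bytes:
    """Run one request through ASR, then NLU, then TTS."""
    text = recognize_speech(audio)
    intent = understand(text)
    if intent.name == "SetLight":
        reply = f"Okay, turning the light {intent.slots['state']}."
    else:
        reply = "Sorry, I did not understand that."
    return synthesize(reply)


print(handle_request(b"Turn on the light"))
```

Because each request passes through all three stages in sequence, the latency of every stage adds to the total response time, which is why faster inference at any one stage is noticeable to the user.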
Of Alexa’s three main inference workloads (ASR, NLU, and TTS), the text-to-speech workloads initially ran on GPU-based instances. The TTS process relies heavily on machine learning models to build a phrase that sounds natural in terms of pronunciation, rhythm, connection between words, and intonation.
Alexa handles billions of inference requests every week. The whole process relies heavily on artificial intelligence to transform sound into phonemes, phonemes into words, words into phrases, and phrases into intents. Multilingual translation adds to the load. Some latency is expected, but Amazon wants to keep it to a minimum, and it is counting on AWS Inferentia to keep the service responsive.
Amazon’s Silicon Ambitions & Future Direction
Amazon made its hardware ambitions clear as early as 2015. Predicting that hardware specialization was going to be a big deal, Amazon has had a custom ASIC team focused on AWS ever since. In 2016, James Hamilton, VP at AWS, demoed the custom ASIC that powered AWS servers for many years.
Today, AWS has its own custom-built AI chip, Inferentia, and even a custom-built processor, Graviton2. So far, the majority of data centres have been powered by integrated solutions from the likes of Intel, NVIDIA and AMD. With its home-grown silicon, Amazon is gradually moving towards self-reliance, similar to what Apple has been doing with its own silicon efforts. In the last couple of years, Amazon has increased the use of its own hardware across its services, the latest example being Alexa’s workload migration to Inferentia. The data centre is a huge market for Intel and other chipmakers, and AWS is a giant when it comes to data centres. It leads the cloud segment and boasts a diverse portfolio of customers such as Netflix.
If Amazon decides to adopt its home-grown integrated solutions across its data centres, it will be a big blow to the chipmakers that rely heavily on selling silicon to them. Google has TPUs, and now AWS has Inferentia. If cloud service providers can match the performance benchmarks of top chipmakers, it will mark the beginning of a new wave in the infrastructure-as-a-service industry. For companies like Amazon that have made inroads into the consumer base, B2B services, AI research and now silicon, there could not be a better time.