Behind Hey Siri: How Apple’s AI-Powered Personal Assistant Uses a DNN



“Hey Siri” — this is the phrase you say to your Apple devices when you need a hand or are simply bored. Siri is a built-in personal voice assistant, introduced in 2011 on the iPhone 4S. Siri can help the user with tasks such as getting information from the internet, scheduling events, setting a timer and making phone calls, among other things.

Astonishingly, Siri is powered by a small speech recogniser on the phone that runs in the background all the time. This recogniser uses a Deep Neural Network (DNN) to convert the acoustic pattern of your voice at each instant into a probability distribution over speech sounds. A process called temporal integration then computes a confidence score that the phrase you uttered was “Hey Siri”. If the score is high enough, Siri is activated; otherwise it stays dormant. This article gives a brief look into the machine learning behind Siri.

Behind the scenes

How Siri works, Image courtesy : Apple

The power to use Siri hands-free is what makes it both interesting and popular. As shown in the figure above, the critical components are the voice detection hardware in the phone and the cloud servers. The on-device detector works in tandem with the cloud servers, which host the main automatic speech recognition, the natural language interpretation and other information services. The voice models on the server are updated regularly.

DNN And The Hardware

The microphone in an iPhone, or in other Apple products such as the iPad, iPod Touch and Apple Watch, turns the detected voice into a stream of instantaneous waveform samples at a rate of 16,000 per second. A spectrum analysis stage converts the waveform sample stream into a sequence of frames, each describing roughly 0.01 seconds of sound. About 20 of these frames at a time (0.2 seconds of audio) are fed to the acoustic model, a Deep Neural Network (DNN), which converts each acoustic pattern into a probability distribution over a set of speech sound classes — those used in the “Hey Siri” phrase, plus silence and other speech, for a total of close to 20 sound classes as categorised by Apple.
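The framing described above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not Apple’s code: the function names and the simple non-overlapping framing are assumptions, and a real front end would compute filter-bank features rather than pass raw samples.

```python
import numpy as np

SAMPLE_RATE = 16_000             # waveform samples per second, as stated above
FRAME_LEN = SAMPLE_RATE // 100   # 0.01 s of audio -> 160 samples per frame
WINDOW_FRAMES = 20               # ~0.2 s of audio fed to the DNN at a time

def frames_from_waveform(waveform: np.ndarray) -> np.ndarray:
    """Split a 1-D waveform into consecutive 0.01 s frames."""
    n_frames = len(waveform) // FRAME_LEN
    return waveform[: n_frames * FRAME_LEN].reshape(n_frames, FRAME_LEN)

def dnn_input_windows(frames: np.ndarray) -> np.ndarray:
    """Stack sliding windows of 20 frames, one window per time step."""
    n = frames.shape[0] - WINDOW_FRAMES + 1
    return np.stack([frames[i : i + WINDOW_FRAMES] for i in range(n)])

one_second = np.zeros(SAMPLE_RATE)        # 1 s of (silent) audio
frames = frames_from_waveform(one_second)
windows = dnn_input_windows(frames)
print(frames.shape)    # (100, 160): 100 frames of 0.01 s each
print(windows.shape)   # (81, 20, 160): one 0.2 s window per step
```

Each 20-frame window is what the acoustic model sees at a single time step.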

The DNN consists mostly of matrix multiplications and logistic nonlinearities. Each hidden layer is an intermediate representation that the DNN learns during training to convert the filter bank inputs to sound classes. The final nonlinearity is a softmax function (also known as a general logistic or normalized exponential), and the outputs are taken as logarithmic rather than linear probabilities, which makes the later computation easier.
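A minimal sketch of that final stage, assuming plain NumPy: the log-softmax turns the network’s raw scores for the sound classes into log probabilities, which can be accumulated by addition instead of multiplication.

```python
import numpy as np

def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over a vector of class scores."""
    shifted = logits - logits.max()          # subtract max to avoid overflow
    return shifted - np.log(np.exp(shifted).sum())

# Hypothetical raw scores for three sound classes
logits = np.array([2.0, 1.0, 0.1])
log_probs = log_softmax(logits)
probs = np.exp(log_probs)                    # recover linear probabilities
```

Exponentiating the outputs recovers a valid probability distribution, i.e. the linear probabilities sum to 1.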

Neural Network Structure, Image Courtesy : Apple

The networks Apple uses typically have five hidden layers, all of the same size: 32, 128, or 192 units, depending on the memory, power and hardware constraints. On an iPhone there are two networks behind Siri’s detection: an initial detector and a secondary checker. The output of the acoustic model is compared against the sequence of phonetic classes that make up the “Hey Siri” phrase. To ascertain whether the voice pattern matches the phrase, an accumulated score is computed over the sequence using the function given below

Fi,t = max { Fi,t-1 + si, Fi-1,t-1 + mi-1 } + qi,t


  • Fi,t  is the accumulated score for state i of the model
  • qi,t is the output of the acoustic model — the log score for the phonetic class associated with the ith state given the acoustic pattern around time t
  • si is a cost associated with staying in state i
  • mi is a cost for moving on from state i

The ‘s’ and ‘m’ cost terms are determined empirically and account for how long the model expects to dwell in each phonetic state.
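The recurrence above can be sketched as a small dynamic program. This is an illustrative reconstruction under stated assumptions: the function name is hypothetical, the model is forced to start in its first state, and the final accumulated score is read off the last state at the last time step.

```python
import numpy as np

def accumulate_scores(q: np.ndarray, s: np.ndarray, m: np.ndarray) -> np.ndarray:
    """Compute F[t, i] following Fi,t = max{Fi,t-1 + si, Fi-1,t-1 + mi-1} + qi,t.

    q[t, i]: log score of the phonetic class for state i at time t
    s[i]:    cost of staying in state i
    m[i]:    cost of moving on from state i
    """
    T, N = q.shape
    F = np.full((T, N), -np.inf)
    F[0, 0] = q[0, 0]                        # must begin in the first state
    for t in range(1, T):
        for i in range(N):
            stay = F[t - 1, i] + s[i]
            move = F[t - 1, i - 1] + m[i - 1] if i > 0 else -np.inf
            F[t, i] = max(stay, move) + q[t, i]
    return F
```

The score that gets compared against the detection threshold would be the accumulated value for the final state at the end of the window, i.e. `F[-1, -1]`.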

This computation is done quickly on the hardware, producing a running score for each stretch of audio. Apple compares that score against a threshold to decide whether the phrase matches “Hey Siri” or not. This is how Siri is triggered on the iPhone.

Siri not only has to be very responsive, but also accurate. On the iPhone 6S and later, this is made possible by the Always On Processor (AOP), a small, low-power auxiliary processor with access to the microphone. When the detector’s score crosses the threshold, the AOP wakes the main processor, which runs a larger DNN and then handles the complete query, such as fetching information from the Internet or helping with calling and texting, among many other features.
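The two-stage arrangement can be summarised as a simple cascade: a cheap always-on check gates a larger, more accurate one. The function name, network interfaces and threshold values below are all illustrative assumptions, not Apple’s actual parameters.

```python
def wake_word_pipeline(window, small_net, large_net,
                       low_threshold: float = 0.05,
                       high_threshold: float = 0.5) -> bool:
    """Two-stage detection cascade: a small always-on network (AOP stage)
    gates a larger, more accurate network on the main processor."""
    if small_net(window) < low_threshold:        # cheap first pass, always on
        return False                             # main processor stays asleep
    return large_net(window) >= high_threshold   # larger network confirms

# Hypothetical networks standing in for the two DNNs
triggered = wake_word_pipeline(None, lambda w: 0.9, lambda w: 0.6)
```

Keeping the first threshold low makes the cheap stage permissive, so accuracy rests with the larger network, while the main processor is woken only rarely.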

Conclusion

Apple has been utilising the manifold benefits of machine learning since its inception. Beyond Siri, it is also exploring options in other products such as the Apple Watch to make them even better and simpler.


Abhishek Sharma
