Now, Transformers are being applied for keyword spotting


The Transformer architecture has been applied successfully across domains including language processing, computer vision and time series analysis. Researchers have now found another domain where it performs well: keyword spotting.

Scientists from ARM ML Labs and Lund University presented a paper titled ‘Keyword Transformer: A Self-Attention Model for Keyword Spotting’ at the recent Interspeech conference. Departing from conventional methods, the researchers explored a range of ways to adapt the Transformer architecture to keyword spotting and introduced the Keyword Transformer, a fully self-attentional architecture that delivers state-of-the-art performance without pre-training or additional data.

Keyword spotting

The pipeline of a voice assistant (such as Google Assistant or Siri) consists of several stages, the first of which is the trigger phase. In this phase, the assistant listens for a ‘trigger phrase’ such as play or pause. The functions triggered by these phrases are less compute-intensive than automatic speech recognition (ASR), so they can run on-device with low latency. On-device keyword spotting is also useful when no internet connection is available or when data privacy is a concern.

Keyword spotting is an important aspect of speech-based user interaction on smart devices, which requires a real-time response and high accuracy to offer a superior user experience. Conventionally, machine learning techniques such as deep neural networks, convolutional neural networks and recurrent neural networks have been preferred for keyword spotting over other speech recognition algorithms because they perform better.

Attention mechanism for keyword spotting

Attention mechanisms have been used in keyword spotting before, but only as an extension to the neural networks mentioned above. With the Keyword Transformer, the researchers explored the self-attention mechanism on its own for keyword spotting. The system outperformed existing approaches on the Google Speech Commands dataset without using any additional data. They also found that applying self-attention is more effective in the time domain than in the frequency domain.
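To make the time-domain versus frequency-domain distinction concrete, here is a minimal sketch of the two ways a spectrogram could be cut into tokens. The toy spectrogram and the function names are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
# Hypothetical toy spectrogram: 4 time slots (rows) x 3 frequency bins (columns).
spectrogram = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]

def time_domain_tokens(spec):
    """One token per time slot: each token holds every frequency bin of that slot."""
    return [list(row) for row in spec]

def frequency_domain_tokens(spec):
    """One token per frequency bin: each token holds that bin across all time slots."""
    return [list(col) for col in zip(*spec)]

time_tokens = time_domain_tokens(spectrogram)
freq_tokens = frequency_domain_tokens(spectrogram)

# Self-attention in the time domain compares time slots with each other;
# in the frequency domain it compares frequency bins with each other.
print(len(time_tokens), "time tokens of length", len(time_tokens[0]))
print(len(freq_tokens), "frequency tokens of length", len(freq_tokens[0]))
```

In this framing, the researchers' finding corresponds to the first tokenisation working better than the second.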

The researchers behind the Keyword Transformer say they were heavily inspired by the Vision Transformer, which computes self-attention between different image patches. They apply the same approach to keyword spotting by feeding patches of the audio spectrogram to the model as input, partly to understand how the technique carries over to new domains.

In the Keyword Transformer, the raw audio waveform is preprocessed by dividing the signal into a set of time slots and extracting Mel-frequency cepstral coefficients (MFCCs) for each slot. Each set of MFCCs is passed to the Transformer as an input token, and audio features are extracted based on how the different time slots interact with each other. This mechanism also makes the features more descriptive than those produced by traditional neural networks. The Keyword Transformer outputs a global feature vector, which is fed into a multi-layer perceptron (MLP) that classifies the audio as a keyword or non-keyword.
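The flow described above, from per-slot tokens through self-attention to a global vector and a classifier, can be sketched in plain Python. This is an illustrative toy under stated assumptions, not the researchers' implementation: the weights are random rather than trained, random numbers stand in for real MFCCs, mean pooling is assumed as the way to form the global feature vector, a single linear layer stands in for the MLP, and all names and sizes are hypothetical.

```python
import math
import random

random.seed(0)  # reproducible toy weights

def rand_matrix(rows, cols):
    """Small random values standing in for trained parameters (assumption)."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention across time-slot tokens."""
    q, k, v = matmul(tokens, w_q), matmul(tokens, w_k), matmul(tokens, w_v)
    scale = math.sqrt(len(q[0]))
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)
    # weights[i][j]: how strongly time slot i attends to time slot j
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights

def classify(tokens, params):
    """Attention -> mean-pooled global vector -> linear head -> class probabilities."""
    attended, weights = self_attention(
        tokens, params["w_q"], params["w_k"], params["w_v"]
    )
    # Assumption: average over time slots to obtain one global feature vector.
    global_vec = [sum(col) / len(attended) for col in zip(*attended)]
    # Assumption: one linear layer stands in for the MLP classifier.
    logits = matmul([global_vec], params["w_out"])[0]
    return softmax(logits), weights

# Hypothetical sizes; class 0 = non-keyword, class 1 = keyword.
N_MFCC, DIM, N_CLASSES = 4, 8, 2
params = {
    "w_q": rand_matrix(N_MFCC, DIM),
    "w_k": rand_matrix(N_MFCC, DIM),
    "w_v": rand_matrix(N_MFCC, DIM),
    "w_out": rand_matrix(DIM, N_CLASSES),
}

# Stand-in for real MFCCs: 5 time slots, each a vector of N_MFCC coefficients.
mfcc_tokens = rand_matrix(5, N_MFCC)
probs, attn = classify(mfcc_tokens, params)
print("class probabilities:", probs)
```

Each row of `attn` shows how one time slot weighs every other slot, which is the sense in which the features depend on how time slots interact. A real model would stack several such layers and learn the weights from labelled audio.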

The researchers also point out that Transformers in general have been shown to benefit from large-scale pre-training, to achieve a 5.5-fold latency reduction through model compression, and to achieve an over 4,000-fold energy reduction through sparsity and hardware co-design.

Transformers’ growing popularity

The machine learning world is moving from traditional neural networks towards Transformers, which many see as the next big thing. Transformers are already the model of choice for advanced natural language processing and computer vision tasks, and new areas of application continue to be discovered. What works in the Transformer’s favour is the self-attention mechanism, in which features are calculated dynamically by letting different parts of the input attend to one another.

Shraddha Goled
I am a technology journalist with AIM. I write stories focused on the AI landscape in India and around the world, with a special interest in analysing its long-term impact on individuals and societies.
