The Transformer architecture has been successfully applied across various domains, including language processing, computer vision and time series analysis. Researchers have now found yet another domain where the Transformer performs well: keyword spotting.
Researchers from Arm ML Research and Lund University recently presented a paper titled 'Keyword Transformer: A Self-Attention Model for Keyword Spotting' at the Interspeech conference. Departing from conventional methods, the researchers explored a range of ways to use the Transformer architecture for keyword spotting, culminating in the Keyword Transformer: a fully self-attentional architecture that offers state-of-the-art performance without pre-training or additional data.
A voice assistant pipeline (as in Google Assistant or Siri) consists of several stages, the first of which is the trigger stage. In this stage, the assistant listens for a trigger phrase or keyword, such as 'play' or 'pause'. Detecting these keywords is far less compute-intensive than full automatic speech recognition (ASR), so it can be performed on-device with low latency. On-device keyword spotting is also useful when no internet connection is available or when data privacy is a concern.
Keyword spotting is an important aspect of speech-based user interaction on smart devices, requiring real-time response and high accuracy to offer a good user experience. Conventionally, machine learning techniques such as deep neural networks, convolutional neural networks and recurrent neural networks have been preferred over other speech recognition algorithms for keyword spotting because of their better performance.
Attention mechanism for keyword spotting
Attention mechanisms have been used in keyword spotting before, but only as an extension to the neural networks mentioned above. With the Keyword Transformer, the researchers have explored a purely self-attentional approach to keyword spotting. The system outperformed existing methods on the relatively small Google Speech Commands dataset without any additional data. They also found that applying self-attention is more effective in the time domain than in the frequency domain.
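To make the time-domain versus frequency-domain distinction concrete, here is a minimal numpy sketch. The spectrogram sizes and the bare-bones attention function (no learned projections) are illustrative assumptions, not taken from the paper:

```python
import numpy as np

# A toy "spectrogram": T time slots x F frequency (MFCC) bins.
# The sizes here are illustrative, not the paper's configuration.
T, F = 98, 40
spec = np.random.randn(T, F)

def self_attention(tokens):
    """Plain scaled dot-product self-attention (no learned projections)."""
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)          # (N, N) token-to-token scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over tokens
    return weights @ tokens                          # (N, d) attended tokens

# Time-domain attention: each time slot is one token (the choice the paper favours).
time_tokens = self_attention(spec)        # 98 tokens of dimension 40
# Frequency-domain attention: each frequency bin is one token.
freq_tokens = self_attention(spec.T)      # 40 tokens of dimension 98

print(time_tokens.shape, freq_tokens.shape)  # (98, 40) (40, 98)
```

The only difference between the two variants is which axis of the spectrogram is treated as the token axis; the attention computation itself is identical.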
The researchers behind the Keyword Transformer say they were heavily inspired by the Vision Transformer, which computes self-attention between image patches. They applied the same idea to keyword spotting, taking patches of the audio spectrogram as input, to see how the technique transfers to a new domain.
In the Keyword Transformer, the raw audio waveform is preprocessed by dividing the signal into a set of time slots and extracting Mel-frequency cepstral coefficients (MFCCs) for each slot. Each slot's MFCC vector becomes an input token to the Transformer, and audio features are extracted based on how the different time slots attend to each other. This also makes the features more expressive than those of traditional neural networks. The Keyword Transformer outputs a global feature vector that is fed into a multi-layer perceptron (MLP), which classifies the audio into keyword classes.
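The stages described above can be sketched end to end in numpy. Everything here is an illustrative assumption: the sizes, the random (untrained) weights, the single attention layer and the linear classifier head stand in for the paper's actual architecture and training:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 98 time slots, 40 MFCCs per slot,
# model dimension 64, 12 keyword classes.
T, F, D, C = 98, 40, 64, 12
mfccs = rng.standard_normal((T, F))          # one MFCC vector per time slot

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# 1. Project each time-slot MFCC vector to an input token.
W_in = rng.standard_normal((F, D)) * 0.02
tokens = mfccs @ W_in                        # (T, D)

# 2. Prepend a class token and add positional embeddings
#    (both would be learned in a real model).
cls = rng.standard_normal((1, D)) * 0.02
pos = rng.standard_normal((T + 1, D)) * 0.02
x = np.vstack([cls, tokens]) + pos           # (T+1, D)

# 3. One self-attention layer: the time slots attend to each other.
Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.02 for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv
x = x + softmax(q @ k.T / np.sqrt(D)) @ v    # attention with a residual connection

# 4. The class token serves as the global feature vector;
#    an MLP head (here a single linear layer) classifies it.
W_mlp = rng.standard_normal((D, C)) * 0.02
logits = x[0] @ W_mlp
probs = softmax(logits)                      # one probability per keyword class
print(probs.shape)
```

With trained weights, `probs` would peak at the spoken keyword's class; here the point is only the data flow from MFCC tokens through self-attention to a classification head.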
The researchers also observed that the model benefits from large-scale pre-training, achieving a 5.5x latency reduction through model compression and an over 4,000x energy reduction through sparsity and hardware co-design.
Transformers’ growing popularity
The machine learning world is moving on from traditional neural networks to Transformers, making the latter the next big thing. Transformers are already the preferred choice for advanced natural language processing and computer vision tasks, and with continued innovation, new areas of application keep being discovered. What works in the Transformer's favour is the self-attention mechanism, in which features are computed dynamically by letting different parts of the input attend to each other.