On-Device Speech Representation Using TensorFlow Lite

Representation learning is a machine learning (ML) method that trains a model to discover salient features in data. It applies to a wide range of downstream tasks, including natural language processing (BERT and ALBERT) and image analysis and classification (Inception layers and SimCLR). Last year, researchers developed a baseline for comparing speech representations, along with a new, general-purpose speech representation model, TRILL. TRILL is trained on temporal proximity: it maps segments of speech that occur close together in time to nearby points in a lower-dimensional embedding space.

Researchers have now created a new speech model for mobile devices: FRILL. The recently accepted paper, FRILL: A Non-Semantic Speech Embedding for Mobile Devices, will be presented at the Interspeech 2021 conference. It introduces a lightweight non-semantic speech embedding model based on the TRILL speech embedding.

To achieve this, the researchers made the following contributions:

  • Developing a class of non-semantic embedding models that run in real time on a mobile device.
  • Assessing the effect of performance optimisation techniques, such as quantisation-aware training, model compression, and architectural reductions, on the latency, accuracy, and size of embedding models.
  • Benchmarking the on-device representations on two mobile-health tasks: detecting face-masked speech and classifying a public dataset of human sounds.

The new model represents a significant step towards fully on-device implementations of speech ML models, resulting in better personalisation, improved user experiences, and more privacy, all of which are crucial aspects of developing AI responsibly. The FRILL code has been published on GitHub, along with a pre-trained FRILL model on TensorFlow Hub.

Hyperparameters for experimentation

The researchers performed knowledge distillation into a number of student models to obtain a lightweight non-semantic speech embedding model that runs efficiently on mobile devices, with each student trained using a distinct combination of architectural choices. They used TensorFlow Lite (TFLite) to measure each student model's latency, since this framework allows TensorFlow models to run on edge devices. Described below is the set of hyperparameters the researchers varied to investigate the trade-off between student model performance and latency.
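Measuring a model's latency with TFLite follows a standard recipe: convert the trained Keras model to a TFLite flat buffer, load it with the TFLite interpreter, and time a call to `invoke()`. A minimal sketch is below; the tiny stand-in model and its shapes are illustrative only, not FRILL's actual architecture:

```python
import time

import numpy as np
import tensorflow as tf

# Tiny stand-in model; FRILL itself uses a MobileNetV3 backbone.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(64,)),
    tf.keras.layers.Dense(32, activation="relu"),
])

# Convert the Keras model to a TFLite flat buffer and load it.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_bytes = converter.convert()
interpreter = tf.lite.Interpreter(model_content=tflite_bytes)
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]
x = np.random.randn(1, 64).astype(np.float32)

# Warm up once, then time a single inference.
interpreter.set_tensor(inp["index"], x)
interpreter.invoke()
start = time.perf_counter()
interpreter.invoke()
latency_ms = (time.perf_counter() - start) * 1000
embedding = interpreter.get_tensor(out["index"])
```

In practice the paper's latencies were measured on a phone (a Pixel 1), so desktop timings from a sketch like this indicate only relative, not absolute, cost.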

  • MobileNetV3 size and width: MobileNetV3 was published in different sizes for use in a variety of situations. The size refers to which MobileNetV3 architecture is used, while the width, also known as alpha, proportionally scales the number of filters in each layer.
  • Global average pooling: MobileNetV3 normally generates a set of two-dimensional feature maps, which are flattened, concatenated, and passed to the bottleneck layer. This bottleneck, however, is often too large to compute quickly. The researchers reduce the size of the input to the bottleneck layer by taking the global average of all 'pixels' in each output feature map.
  • Bottleneck compression: A large share of the student model's weights sit in the bottleneck layer. The researchers used a compression operator based on singular value decomposition (SVD) to learn a low-rank approximation of the bottleneck weight matrix and so minimise the size of this layer.
  • Bottleneck layer quantisation: Since the bottleneck layer contains the majority of the model weights, the researchers used quantisation-aware training (QAT) to gradually reduce the numerical precision of the bottleneck weights during training.
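The bottleneck-compression idea can be illustrated in isolation. In the sketch below, a truncated SVD replaces one large weight matrix with two thin factors, shrinking the parameter count; the matrix shapes and rank are made-up examples, not FRILL's actual dimensions:

```python
import numpy as np

def compress_bottleneck(W: np.ndarray, rank: int):
    """Replace W (m x n) with factors A (m x rank) and B (rank x n)
    via truncated SVD, so that W ~= A @ B with far fewer parameters."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]  # fold singular values into the left factor
    B = Vt[:rank, :]
    return A, B

rng = np.random.default_rng(0)
W = rng.standard_normal((2048, 512))   # hypothetical bottleneck weights
A, B = compress_bottleneck(W, rank=64)

original_params = W.size               # 2048 * 512 = 1,048,576
compressed_params = A.size + B.size    # 2048*64 + 64*512 = 163,840
```

At inference time the single dense layer becomes two smaller matrix multiplies, which is also cheaper to compute, not just to store.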

Outcome of the experiments

The researchers varied each of the hyperparameters described above to establish their impact on the student embedding models' quality, latency, and size. They tested each model on the Non-Semantic Speech Benchmark (NOSS) and two new tasks: determining whether a speaker is wearing a mask, and the human-sounds subset of the Environmental Sound Classification dataset, which contains labels like 'coughing' and 'sneezing'. After eliminating from the batch of 144 models any model that had a faster alternative of at least equal quality, they were left with eight 'frontier' models on the quality-versus-latency curve, all significantly faster than TRILL.
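The 'frontier' selection described above is a Pareto filter over the quality-versus-latency scatter: sort the candidates by latency and keep each model only if it is more accurate than every faster model. A small sketch with made-up candidates (not the paper's 144 student models):

```python
def quality_latency_frontier(models):
    """Keep each model only if no faster model matches or beats its accuracy."""
    frontier, best_acc = [], float("-inf")
    for m in sorted(models, key=lambda m: m["latency_ms"]):
        if m["accuracy"] > best_acc:
            frontier.append(m)
            best_acc = m["accuracy"]
    return frontier

# Hypothetical candidates with illustrative numbers.
candidates = [
    {"name": "small",  "latency_ms": 5.0,  "accuracy": 0.70},
    {"name": "medium", "latency_ms": 12.0, "accuracy": 0.66},  # dominated
    {"name": "large",  "latency_ms": 40.0, "accuracy": 0.81},
]
frontier = quality_latency_frontier(candidates)
# → keeps "small" and "large"; "medium" is both slower and less accurate
```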

(Figure: quality vs. latency of the student models; source: Google AI Blog)


With an inference time of 8.5 ms on a Pixel 1 (approximately 32 times quicker than TRILL), FRILL was identified as the best-performing model with sub-10 ms inference. It is also roughly 40 per cent the size of TRILL. The frontier curve flattens around 10 ms of latency: below that point, considerably better performance can be obtained at a small latency cost, while improving performance beyond 10 ms is more difficult. This result validates the researchers' hyperparameter selection for the experiment. The table below compares FRILL's per-task performance with TRILL's:

             Metric / Task         FRILL     TRILL
             Size (MB)              38.5      98.1
             Latency (ms)            8.5     275.3
             Voxceleb1*             45.5      46.8
             Voxforge               78.8      84.5
             Speech Commands        81.0      81.7
             CREMA-D                71.3      65.9
             SAVEE                  63.3      70.0
             Masked Speech          68.0      65.8
             ESC-50 HS              87.9      86.4

(Source: Google AI Blog)

Summing up 

The research is a critical step toward delivering the full benefits of speech ML technology to mobile devices. According to the researchers, future work will include benchmarking more tasks in this category. They also shared their public model, corresponding model card, and evaluation code to assist the research community in responsibly developing more applications for on-device speech representation research.

Ritika Sagar
Ritika Sagar is currently pursuing PDG in Journalism from St. Xavier's, Mumbai. She is a journalist in the making who spends her time playing video games and analyzing the developments in the tech world.
