
On-Device Speech Representation Using TensorFlow Lite


Representation learning is a machine learning (ML) method that trains a model to discover prominent features. It applies to a wide range of downstream tasks, including natural language processing (BERT and ALBERT) and image analysis and classification (Inception layers and SimCLR). Last year, researchers developed a baseline for comparing speech representations and a new, general-purpose speech representation model, TRILL. TRILL is trained using temporal proximity as a self-supervised signal: segments of speech that occur close together in time are mapped to nearby points in a lower-dimensional embedding space.
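As an illustration, the temporal-proximity idea can be expressed as a triplet loss: clips cut from nearby points in a recording should embed closer together than a clip from elsewhere. The sketch below is a minimal, illustrative rendering of that objective in TensorFlow, not TRILL's exact training code; the margin value is an assumption.

```python
import tensorflow as tf

def temporal_proximity_triplet_loss(anchor, positive, negative, margin=0.5):
    """Triplet loss over speech embeddings.

    `anchor` and `positive` are embeddings of clips cut close together in
    time; `negative` comes from a different part of the audio. The margin
    of 0.5 is illustrative, not TRILL's published value.
    """
    d_pos = tf.reduce_sum(tf.square(anchor - positive), axis=-1)
    d_neg = tf.reduce_sum(tf.square(anchor - negative), axis=-1)
    # Pull temporally close clips together, push distant ones apart.
    return tf.reduce_mean(tf.maximum(d_pos - d_neg + margin, 0.0))
```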

Researchers have now created a new speech model for mobile devices: FRILL. The recently accepted paper, FRILL: A Non-Semantic Speech Embedding for Mobile Devices, will be presented at the Interspeech 2021 conference. It introduces a new lightweight non-semantic speech embedding model distilled from the TRILL speech embedding.

To achieve this, the researchers made three main contributions:

  • Developing a class of non-semantic embedding models that can run in real time on a mobile device.
  • Assessing the effect of performance optimisations such as quantisation-aware training, model compression, and architectural reductions on the latency, accuracy, and size of embedding models.
  • Benchmarking the on-device representations on two mobile-health tasks: detecting face-masked speech and a public dataset of human sounds.

The development of a new model represents a significant step towards fully on-device speech ML models, bringing better personalisation, improved user experiences, and more privacy, all of which are crucial aspects of developing AI responsibly. The FRILL code is published on GitHub, along with a pre-trained FRILL model on TensorFlow Hub.
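For readers who want to try the released model, the snippet below sketches how an embedding might be extracted with TensorFlow Hub. The module handle and output signature are taken from TensorFlow Hub's non-semantic speech collection and should be checked against the published FRILL model card.

```python
import numpy as np
import tensorflow_hub as hub

# Module handle assumed from TensorFlow Hub's non-semantic speech collection;
# check the published FRILL model card for the exact URL and signature.
frill = hub.load("https://tfhub.dev/google/nonsemantic-speech-benchmark/frill/1")

# FRILL operates on 16 kHz mono audio; one second of zeros as a placeholder.
audio = np.zeros((1, 16000), dtype=np.float32)

outputs = frill(audio)            # assumed to return a dict of tensors
embedding = outputs["embedding"]  # see the model card for the exact shape
print(embedding.shape)
```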

Hyperparameters for experimentation

Researchers performed distillation with a number of student models to obtain a lightweight non-semantic speech embedding model that runs efficiently on mobile devices, each student trained with a different combination of architectural choices. They used TensorFlow Lite (TFLite) to measure each student model's latency, since this framework permits running TensorFlow models on edge devices (as sketched below). The hyperparameters described next were varied to investigate the trade-off between student model performance and latency.
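As a rough illustration of this kind of measurement, the sketch below converts a Keras student model to TFLite and times its inference. The benchmarking details (device, threading, warm-up policy) are assumptions, not the paper's exact setup.

```python
import time
import numpy as np
import tensorflow as tf

def tflite_latency_ms(keras_model, input_shape, runs=100):
    """Convert a Keras student model to TFLite and time its inference.

    A minimal sketch of this kind of measurement; the paper's actual
    benchmarking setup (device, threads, warm-up policy) may differ.
    """
    converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
    tflite_model = converter.convert()

    interpreter = tf.lite.Interpreter(model_content=tflite_model)
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    x = np.random.rand(*input_shape).astype(np.float32)

    # Warm-up run so one-time allocation costs are not counted.
    interpreter.set_tensor(inp["index"], x)
    interpreter.invoke()

    start = time.perf_counter()
    for _ in range(runs):
        interpreter.set_tensor(inp["index"], x)
        interpreter.invoke()
    return (time.perf_counter() - start) / runs * 1000.0
```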

  • Size and width of MobileNetV3: MobileNetV3 is published in several sizes for use in a variety of situations. The size refers to which MobileNetV3 variant is used, while the width, also known as alpha, proportionally scales the number of filters in each layer.
  • Global average pooling: MobileNetV3 normally generates a set of two-dimensional feature maps. These are flattened, concatenated, and passed to the bottleneck layer. This bottleneck, however, is frequently too large to compute quickly, so the researchers reduce the size of the bottleneck layer kernel by taking the global average of all ‘pixels’ in each output feature map.
  • Bottleneck compression: A major share of the weights in the student model sit in the bottleneck layer. The researchers used a compression operator based on singular value decomposition (SVD) to learn a low-rank approximation of the bottleneck weight matrix and minimise the size of this layer.
  • Bottleneck layer quantisation: Because the bottleneck layer contains the majority of the model weights, the researchers used quantisation-aware training (QAT) to gradually reduce the numerical precision of the bottleneck weights throughout training (a sketch combining these choices follows this list).
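The sketch below shows how these choices might combine in a single student model: a MobileNetV3 backbone with a chosen width, global average pooling, and an optionally SVD-factorised bottleneck. All names and dimensions are illustrative assumptions, not the paper's exact configuration; quantisation-aware training (e.g. via the tensorflow_model_optimization toolkit) would be layered on top and is omitted here.

```python
import tensorflow as tf

def build_student(width=1.0, bottleneck_dim=2048, svd_rank=None):
    """Sketch of a student embedding model; names and dimensions are
    illustrative assumptions, not the paper's exact configuration."""
    inputs = tf.keras.Input(shape=(96, 64, 1))  # e.g. a log-mel spectrogram patch
    backbone = tf.keras.applications.MobileNetV3Small(
        input_shape=(96, 64, 1), alpha=width,
        include_top=False, weights=None)
    feature_maps = backbone(inputs)

    # Global average pooling: collapse each 2-D feature map to one value,
    # keeping the bottleneck kernel small.
    pooled = tf.keras.layers.GlobalAveragePooling2D()(feature_maps)

    if svd_rank is None:
        embedding = tf.keras.layers.Dense(bottleneck_dim)(pooled)
    else:
        # SVD-style compression: factor the bottleneck weight matrix into
        # two low-rank layers, cutting parameters from d_in * d_out to
        # rank * (d_in + d_out).
        low_rank = tf.keras.layers.Dense(svd_rank, use_bias=False)(pooled)
        embedding = tf.keras.layers.Dense(bottleneck_dim)(low_rank)
    return tf.keras.Model(inputs, embedding)
```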

Outcome of the experiment

Researchers studied each of the hyperparameters mentioned above to establish its impact on the student embedding models’ quality, latency, and size. They tested each model on the Non-Semantic Speech Benchmark (NOSS) and on two new tasks: determining whether a speaker is wearing a mask, and the human-sounds subset of the Environmental Sound Classification dataset (ESC-50 HS), which contains labels like ‘coughing’ and ‘sneezing.’ Starting from a batch of 144 models, they eliminated every model for which a faster alternative performed at least as well, leaving eight ‘frontier’ models on the quality-versus-latency curve. The best of these, FRILL, was much faster than TRILL.
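The frontier selection amounts to keeping the Pareto-optimal models on the quality-versus-latency curve. A simple reading of that filter, with hypothetical (name, latency, accuracy) tuples, might look like this:

```python
def pareto_frontier(models):
    """Keep only models for which no alternative is both faster and at
    least as accurate. `models` holds hypothetical (name, latency_ms,
    accuracy) tuples; the numbers below are illustrative, not the paper's.
    """
    frontier = []
    for name, latency, accuracy in sorted(models, key=lambda m: m[1]):
        # Sorted by latency, so a model survives only if it improves on
        # the best accuracy achieved by any faster model.
        if not frontier or accuracy > frontier[-1][2]:
            frontier.append((name, latency, accuracy))
    return frontier

candidates = [("a", 5.0, 70.1), ("b", 8.5, 79.3), ("c", 12.0, 78.0), ("d", 20.0, 81.5)]
print(pareto_frontier(candidates))  # [('a', 5.0, 70.1), ('b', 8.5, 79.3), ('d', 20.0, 81.5)]
```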

[Figure: quality-versus-latency frontier of the student models. Source: Google AI Blog]

FRILL vs TRILL 

With an inference time of 8.5 ms on a Pixel 1 (approximately 32 times faster than TRILL), FRILL emerged as the top-performing model with sub-10 ms inference. It is also roughly 40 percent the size of TRILL. The frontier curve flattens at around 10 ms latency: below that point, considerably better performance can be obtained at a small latency cost, while improving performance beyond 10 ms is much harder. This result validates the researchers’ hyperparameter selection for the experiment. The table below shows FRILL’s per-task performance in comparison to TRILL:

                          FRILL     TRILL

  Size (MB)                38.5      98.1
  Latency (ms)              8.5     275.3
  Voxceleb1*               45.5      46.8
  Voxforge                 78.8      84.5
  Speech Commands          81.0      81.7
  CREMA-D                  71.3      65.9
  SAVEE                    63.3      70.0
  Masked Speech            68.0      65.8
  ESC-50 HS                87.9      86.4

  (Task rows, Voxceleb1 through ESC-50 HS, report accuracy in percent.)

(Source: Google AI Blog)

Summing up 

The research is a critical step toward bringing the full benefits of speech ML technology to mobile devices. According to the researchers, future work will include benchmarking more tasks in this category. They have also shared the public model, its corresponding model card, and evaluation code to help the research community responsibly develop further applications of on-device speech representation research.
