Representation learning is a machine learning (ML) approach in which a model is trained to discover salient features of its input. The learned representations can be reused across a wide range of downstream tasks, including natural language processing (BERT and ALBERT) and image analysis and classification (Inception layers and SimCLR). Last year, researchers introduced a benchmark for comparing speech representations, together with a new general-purpose speech representation model, TRILL. TRILL is based on temporal proximity: it attempts to map speech segments that occur close together in time to a lower-dimensional embedding space in which that proximity is preserved.
Researchers have now created a new speech model for mobile devices, FRILL. The recently accepted paper, ‘FRILL: A Non-Semantic Speech Embedding for Mobile Devices’, will be presented at the Interspeech 2021 conference. It introduces a new lightweight non-semantic speech embedding model based on the TRILL speech embedding.
To achieve this, the researchers focused on:
- Developing a class of non-semantic embedding models that can run in real-time on a mobile device.
- Assessing the effect of performance optimisation techniques such as quantisation-aware training, model compression, and architectural reductions on the latency, accuracy, and size of embedding models.
- Benchmarking the on-device representations on two mobile-health tasks: detecting face-masked speech and classifying human sounds from a public dataset.
The new model is a significant step towards fully on-device implementations of speech ML models, which can bring better personalisation, improved user experiences, and more privacy, all of which are crucial aspects of developing AI responsibly. The FRILL code has been published on GitHub, along with a pre-trained FRILL model on TensorFlow Hub.
Hyperparameters for experimentation
To obtain a lightweight non-semantic speech embedding model that runs efficiently on mobile devices, researchers performed distillation with a number of student models, each trained with a different combination of architectural choices. They measured each student model’s latency with TensorFlow Lite (TFLite), since this framework allows TensorFlow models to run on edge devices. The hyperparameters below were varied to investigate the trade-off between student model performance and latency.
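As a rough illustration of the distillation idea, the sketch below trains a small student to reproduce a teacher's embeddings. Everything in it is an assumption made for illustration: the teacher is a fixed random nonlinear map standing in for TRILL, the student is a single linear layer rather than a MobileNetV3, the loss is a mean squared error, and all names and sizes are hypothetical.

```python
import numpy as np

def distillation_loss(student_emb, teacher_emb):
    """Mean squared error between student and teacher embeddings."""
    return float(np.mean((student_emb - teacher_emb) ** 2))

def train_linear_student(features, teacher_emb, lr=0.1, steps=500, seed=0):
    """Fit weights W by gradient descent so that features @ W approximates teacher_emb."""
    rng = np.random.default_rng(seed)
    n, d_in = features.shape
    d_out = teacher_emb.shape[1]
    w = rng.normal(scale=0.01, size=(d_in, d_out))
    for _ in range(steps):
        residual = features @ w - teacher_emb
        # Gradient of the mean squared error with respect to w.
        grad = 2.0 * features.T @ residual / (n * d_out)
        w -= lr * grad
    return w

rng = np.random.default_rng(42)
features = rng.normal(size=(256, 16))        # hypothetical input features
teacher_w = rng.normal(scale=0.5, size=(16, 4))
teacher_emb = np.tanh(features @ teacher_w)  # stand-in teacher, NOT the real TRILL

initial_loss = distillation_loss(np.zeros_like(teacher_emb), teacher_emb)
student_w = train_linear_student(features, teacher_emb)
final_loss = distillation_loss(features @ student_w, teacher_emb)

print(f"loss before training: {initial_loss:.4f}")
print(f"loss after training:  {final_loss:.4f}")
```

The student never sees task labels; it only learns to imitate the teacher's output, which is what lets a much smaller network inherit a useful embedding.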
- Size and width of MobileNetV3: MobileNetV3 was published in different sizes for use in a variety of situations. The size refers to which MobileNetV3 architecture is used. The width, often known as alpha, scales the number of filters in each layer proportionally.
- Global average pooling: Normally, MobileNetV3 generates a set of two-dimensional feature maps. These are flattened, concatenated, and passed to the bottleneck layer. The resulting bottleneck, however, is often too large to compute quickly. Researchers reduce the size of the bottleneck layer kernel by taking the global average of all ‘pixels’ in each output feature map.
- Bottleneck compression: A large share of the weights in the student model is located in the bottleneck layer. To reduce the size of this layer, researchers used a compression operator based on singular value decomposition (SVD) to learn a low-rank approximation of the bottleneck weight matrix.
- Bottleneck layer quantisation: Because the bottleneck layer contains the majority of the model weights, researchers used quantisation-aware training (QAT) to gradually reduce the numerical precision of the bottleneck weights throughout training.
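The effect of the three bottleneck-related choices above can be sketched with NumPy. This is a simplified illustration under stated assumptions, not the researchers' implementation: the shapes are hypothetical, the low-rank factors come from a one-off truncated SVD rather than a learned compression operator, and `fake_quantize` applies a single uniform rounding step rather than gradual quantisation-aware training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: a 4x4 grid of 'pixels', 64 feature maps, 128-d embedding.
h, w, c, emb_dim = 4, 4, 64, 128
feature_maps = rng.normal(size=(h, w, c))

# Flattening keeps every pixel, so the bottleneck kernel needs h*w*c input rows.
flat = feature_maps.reshape(-1)
# Global average pooling keeps one value per feature map, so only c input rows.
pooled = feature_maps.mean(axis=(0, 1))

kernel_params_flat = flat.size * emb_dim
kernel_params_pooled = pooled.size * emb_dim

def low_rank_factors(weight, rank):
    """Return (a, b) such that a @ b is the best rank-`rank` approximation of weight."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    a = u[:, :rank] * s[:rank]   # scale the kept columns by their singular values
    b = vt[:rank, :]
    return a, b

def fake_quantize(weight, num_bits=8):
    """Round weights to 2**num_bits uniform levels over [min, max], returned as floats."""
    lo, hi = float(weight.min()), float(weight.max())
    if hi == lo:
        return weight.copy()
    step = (hi - lo) / (2 ** num_bits - 1)
    return np.round((weight - lo) / step) * step + lo

kernel = rng.normal(size=(c, emb_dim))   # stand-in bottleneck kernel after pooling
a, b = low_rank_factors(kernel, rank=16)
quantized = fake_quantize(kernel, num_bits=8)

print("kernel parameters, flattened input:", kernel_params_flat)
print("kernel parameters, pooled input:   ", kernel_params_pooled)
print("parameters after rank-16 factoring:", a.size + b.size)
print("max quantisation error:", float(np.max(np.abs(quantized - kernel))))
```

Pooling and low-rank factoring both cut the number of parameters in the bottleneck, while quantisation keeps the parameter count and lowers the precision of each value.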
Outcome of experiment
Researchers studied each of the hyperparameters mentioned above to establish its impact on the quality, latency, and size of the student embedding models. They tested each model on the Non-Semantic Speech Benchmark (NOSS) and on two new tasks: determining whether a speaker is wearing a mask, and the human-noise subset of the Environment Sound Classification dataset, which contains labels like ‘coughing’ and ‘sneezing’. From a batch of 144 models, they eliminated every model for which a faster alternative of at least equal quality existed on the quality-versus-latency curve, leaving eight ‘frontier’ models. From this process, one model emerged that was much faster than TRILL.
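The frontier selection described above can be sketched as a simple Pareto filter over latency and quality. The candidate models and their numbers below are made up for illustration, and the field names are hypothetical; only the filtering rule reflects the text.

```python
def pareto_frontier(models):
    """Keep models that no other model beats or matches on both latency and accuracy
    while being strictly better on at least one of them."""
    frontier = []
    for m in models:
        dominated = any(
            o["latency_ms"] <= m["latency_ms"]
            and o["accuracy"] >= m["accuracy"]
            and (o["latency_ms"] < m["latency_ms"] or o["accuracy"] > m["accuracy"])
            for o in models
        )
        if not dominated:
            frontier.append(m)
    return sorted(frontier, key=lambda m: m["latency_ms"])

# Hypothetical candidates (not figures from the study).
candidates = [
    {"name": "a", "latency_ms": 5.0, "accuracy": 60.0},
    {"name": "b", "latency_ms": 8.0, "accuracy": 68.0},
    {"name": "c", "latency_ms": 9.0, "accuracy": 66.0},   # slower and worse than b
    {"name": "d", "latency_ms": 20.0, "accuracy": 70.0},
    {"name": "e", "latency_ms": 25.0, "accuracy": 69.0},  # slower and worse than d
]

frontier_names = [m["name"] for m in pareto_frontier(candidates)]
print(frontier_names)
```

A model stays on the frontier only if gaining accuracy over it requires accepting more latency, which is why the surviving set traces out the quality-versus-latency curve.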
(Source: Google AI Blog)
FRILL vs TRILL
With an inference time of 8.5 ms on a Pixel 1, approximately 32 times faster than TRILL, FRILL was identified as the best-performing model with sub-10 ms inference. It is also roughly 40 percent of the size of TRILL. The frontier curve levels off at around 10 ms latency: at low latencies, considerably better performance can be obtained at minimal latency cost, whereas improving performance beyond 10 ms is more challenging. This supports the researchers’ choice of hyperparameters for the experiment. The table below shows FRILL’s per-task performance in comparison to TRILL:
Metric             FRILL    TRILL
Size (MB)          38.5     98.1
Latency (ms)       8.5      275.3
Voxceleb1*         45.5     46.8
Voxforge           78.8     84.5
Speech Commands    81.0     81.7
CREMA-D            71.3     65.9
SAVEE              63.3     70.0
Masked Speech      68.0     65.8
ESC-50 HS          87.9     86.4
(Source: Google AI Blog)
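The headline ratios can be checked directly from the table. The short script below takes the first value in each row as FRILL and the second as TRILL, which is an inference from the size and latency figures quoted in the text; the variable names are illustrative.

```python
# Values copied from the table above: (FRILL, TRILL).
size_mb = (38.5, 98.1)
latency_ms = (8.5, 275.3)
task_scores = {
    "Voxceleb1": (45.5, 46.8),
    "Voxforge": (78.8, 84.5),
    "Speech Commands": (81.0, 81.7),
    "CREMA-D": (71.3, 65.9),
    "SAVEE": (63.3, 70.0),
    "Masked Speech": (68.0, 65.8),
    "ESC-50 HS": (87.9, 86.4),
}

speedup = latency_ms[1] / latency_ms[0]   # how many times faster FRILL is
size_ratio = size_mb[0] / size_mb[1]      # FRILL size as a fraction of TRILL

# Tasks on which FRILL scores higher than TRILL.
frill_wins = [task for task, (frill, trill) in task_scores.items() if frill > trill]

frill_mean = sum(f for f, _ in task_scores.values()) / len(task_scores)
trill_mean = sum(t for _, t in task_scores.values()) / len(task_scores)

print(f"speedup: {speedup:.1f}x")
print(f"size: {size_ratio:.0%} of TRILL")
print("FRILL scores higher on:", frill_wins)
print(f"mean score gap (FRILL - TRILL): {frill_mean - trill_mean:.2f}")
```

The computed ratios are consistent with the ‘approximately 32 times’ and ‘roughly 40 percent’ figures quoted in the text.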
The research is a critical step toward delivering the full benefits of speech ML technology to mobile devices. According to the researchers, future work will include benchmarking more tasks in this category. They also shared their public model, corresponding model card, and evaluation code to assist the research community in responsibly developing more applications for on-device speech representation research.