Online speech recognition is gaining prominence as people across the world use it to control devices and get answers to their queries. One report estimates that around 55% of teenagers use voice search, and beyond search, people will increasingly use their voice to command machines across a wide range of tasks. Consequently, automatic speech recognition (ASR) is on the rise, given its enormous potential to streamline workflows. High latency, however, is impeding its adoption, since it hinders users performing tasks in real time.
With Facebook’s open-source online speech recognition framework, wav2letter@anywhere, developers can build applications that deliver a superior user experience by reducing that latency. Built on top of Facebook’s benchmarked libraries wav2letter and wav2letter++, wav2letter@anywhere is designed for speed.
Facebook’s Open-Source Online Speech Recognition
Unlike most ASR systems, which use recurrent neural networks (RNNs), wav2letter@anywhere relies on convolutional acoustic models. The firm had already benchmarked fully convolutional acoustic models trained with connectionist temporal classification (CTC) and found them faster than RNNs while still achieving a better word error rate (WER). This, in turn, improves throughput by 3x on specific inference models. The idea behind the framework is to support end-to-end speech recognition workflows that can be used in production to build robust applications. To that end, the social media giant focused on supporting concurrent audio streams, empowering developers to scale while improving performance. It also offers APIs to ensure compatibility with platforms such as Android and iOS, among others.
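To give a feel for how CTC turns per-frame acoustic predictions into text, here is a minimal sketch of greedy CTC decoding: repeated labels are collapsed and the blank token is dropped. The label indices and blank convention below are hypothetical illustrations; wav2letter’s actual beam-search decoder is far more sophisticated.

```python
BLANK = 0  # conventional CTC blank index (assumption for this sketch)

def ctc_greedy_decode(frame_labels):
    """Collapse per-frame argmax labels into an output label sequence:
    drop consecutive repeats, then drop blanks."""
    decoded = []
    prev = None
    for label in frame_labels:
        if label != prev and label != BLANK:
            decoded.append(label)
        prev = label
    return decoded

# Per-frame predictions for a 3-letter word, with repeats and blanks
# (e.g. 1='c', 2='a', 3='t'):
frames = [1, 1, 0, 2, 2, 2, 0, 0, 3, 3]
print(ctc_greedy_decode(frames))  # [1, 2, 3]
```

Because CTC lets many frame-level alignments map to the same output, the acoustic model can be trained without frame-by-frame transcripts, which is part of what makes the fully convolutional pipeline practical.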
Written in C++, wav2letter@anywhere is built for speed from the ground up, allowing developers to make applications that process audio and return results quickly. The framework was developed with streaming in mind: it uses Facebook General Matrix Multiplication (FBGEMM), a low-precision, high-performance matrix-matrix multiplication and convolution library, for its server-side inference.
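The low-precision idea behind FBGEMM can be illustrated in a few lines: quantise floating-point values to int8, accumulate products in a wide integer, then rescale back to float. This is an illustrative sketch only, with made-up scale factors, and does not reflect FBGEMM’s actual API.

```python
def quantize(xs, scale):
    """Map floats to int8 range by rounding x / scale (hypothetical scheme)."""
    return [max(-128, min(127, round(x / scale))) for x in xs]

def int8_dot(a_f, b_f, scale_a=0.05, scale_b=0.05):
    """Dot product computed on int8 values with an integer accumulator,
    then dequantised back to float."""
    a_q = quantize(a_f, scale_a)
    b_q = quantize(b_f, scale_b)
    acc = sum(x * y for x, y in zip(a_q, b_q))  # wide integer accumulator
    return acc * scale_a * scale_b              # dequantise

a = [0.5, -1.2, 0.3]
b = [1.0, 0.4, -0.7]
exact = sum(x * y for x, y in zip(a, b))
print(exact, int8_dot(a, b))  # the int8 result is close to the exact one
```

Trading a small quantisation error for int8 arithmetic is what lets such libraries exploit fast vectorised integer instructions on servers.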
Facebook also utilised time-depth separable (TDS) convolutions to reduce model size and computational FLOPs without hurting accuracy. Further, the firm used asymmetric padding, placing all of each convolution’s padding at the beginning of the input, to reduce the acoustic model’s need for future context and thereby decrease latency.
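The latency benefit of asymmetric padding can be seen in a toy 1-D convolution: when all padding is placed before the input, each output frame depends only on current and past samples, so the model never waits for future audio. The kernel size and values below are arbitrary; this is a sketch of the padding idea, not of wav2letter’s layers.

```python
def causal_conv1d(signal, kernel):
    """1-D convolution with left-only (asymmetric) padding, so
    output[t] depends only on signal[t] and earlier samples."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(signal)  # pad the past, not the future
    return [
        sum(padded[t + j] * kernel[j] for j in range(k))
        for t in range(len(signal))
    ]

# 3-tap sliding sum: output at time t uses samples t-2..t only.
out = causal_conv1d([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
print(out)  # [1.0, 3.0, 6.0]
```

With symmetric padding, the same kernel would need one future sample per output frame, forcing a streaming recogniser to buffer incoming audio before emitting each result.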
By combining the features of wav2letter++ with modern acoustic and language model architectures in both supervised and semi-supervised settings, Facebook has accelerated the speech recognition pipeline.
Impact On The Speech Recognition Community
Unlike fields with a plethora of open-source projects, speech recognition has only a few effective projects that are accessible to all, which slows the technology’s adoption. Currently, various blue-chip companies own robust ASR systems but do not open-source them, as they want to maintain a dominant position in the landscape.
Companies like Google, Amazon, and Microsoft use ASR in their virtual assistants while competing among themselves. Although this has allowed them to narrow the field, they might lose that edge now that organisations such as Mozilla, and now Facebook, have made low-latency ASR public, enabling developers to create products that can challenge the blue-chip companies.
Facebook’s open-source online speech recognition will now encourage others to contribute to the project, further improving the framework. Mozilla took the same approach, and it paid off: contributors like debinat and Carlos Fonseca were central to achieving low latency in its DeepSpeech library. Such success might motivate current leaders in the ASR landscape, like Google and Amazon, to open-source their speech recognition in the future.
Today, open source is crucial for any technology to expand rapidly; over the years, collaborative efforts have helped the data science ecosystem proliferate. With Facebook’s open-source online speech recognition, developers now have another option besides Mozilla’s DeepSpeech to apply to their use cases. Such initiatives from Facebook and Mozilla will empower developers and thereby increase competition in the ASR landscape.