How much easier would life be if there were an application that transcribed your recorded lectures or interviews into notes for you? And what if you could also easily see which parts of those hour-long lectures deserve the most attention?
This can be done with a typical slider on the screen, which users drag back and forth until they find the right spot. However, in this comfort-driven era that prizes optimisation and ease of operation, there is no room for trial and error when an algorithm can do the work for us in the background.
Google, which has spearheaded ML research over the last decade, has packaged all of these capabilities into an app and released it under the name Recorder. Right now it is only available to Pixel 4 users.
However, the machine learning techniques deployed to make this happen show how complex a workflow can be, even behind seemingly trivial tasks.
Evolution of Speech Recording
The speech recording process has changed drastically with the advent of neural networks. The pipeline has become lighter, and a wide variety of tasks can now be performed with greater ease.
Pre-neural network setup:
Traditionally, a speech recognition system was made up of several components:
- an acoustic model that maps segments of audio (~10 ms frames) to phonemes,
- then a pronunciation model that connects phonemes together to form words, and
- then a language model that expresses the likelihood of given phrases.
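As a rough illustration of how the three stages hand work to each other, here is a toy sketch in Python. Every table, name and probability in it is invented for the example; a real system would use trained models rather than lookup tables.

```python
# Toy three-stage pipeline: acoustic model -> pronunciation model -> language model.
# All tables and probabilities below are made up for illustration.

# Acoustic model stand-in: each 10 ms frame already carries its most likely phoneme.
FRAME_PHONEMES = ["HH", "HH", "AY", "AY", "AY"]

# Pronunciation model stand-in: phoneme sequences and the words they can spell.
LEXICON = {("HH", "AY"): ["hi", "high"]}

# Language model stand-in: how likely each candidate word is.
WORD_LIKELIHOOD = {"hi": 0.7, "high": 0.3}


def collapse_repeats(frames):
    """Merge consecutive identical frame labels into one phoneme sequence."""
    phonemes = []
    for phoneme in frames:
        if not phonemes or phonemes[-1] != phoneme:
            phonemes.append(phoneme)
    return tuple(phonemes)


def recognise(frames):
    """Run the three stages in order and return the likeliest word (or None)."""
    phonemes = collapse_repeats(frames)
    candidates = LEXICON.get(phonemes, [])
    return max(candidates, key=lambda w: WORD_LIKELIHOOD.get(w, 0.0), default=None)


print(recognise(FRAME_PHONEMES))  # hi
```

The point of the sketch is the hand-off: each stage is a separate component with its own data, which is part of why such systems grow large.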
With the neural network:
In 2014, researchers began to focus on training a single neural network to directly map an input audio waveform to an output sentence. This sequence-to-sequence approach had the following characteristics:
- “Attention-based” and “listen-attend-spell” models generate a sequence of words or graphemes given a sequence of audio features.
- These models typically review the entire input sequence before producing output, so they cannot stream results as the input comes in, which is a necessary feature for real-time voice transcription.
This limitation later led to the RNN-T (Recurrent Neural Network Transducer) architecture, which is also used in the latest Pixel 4 phones.
How RNN-T Is Being Used
The RNN-T recogniser outputs characters one by one, using a feedback loop that feeds the symbols predicted by the model back into it to predict the next symbols, as described in the figure above.
In the above illustration, one can see the following.
- ‘x’ denotes the input audio samples, while ‘y’ denotes the predicted symbols, which are the output of the Softmax layer.
- The prediction is fed back into the model through the Prediction network as y_{u-1}.
- The Prediction and Encoder networks at the bottom of the figure are LSTM (long short-term memory) RNNs, while the Joint model is a feedforward network.
RNN-T needs to continuously process the waveform (input) to produce a sentence (output). Unlike most sequence-to-sequence models, RNN-Ts do not employ attention mechanisms.
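The feedback loop can be sketched in a few lines of Python. This is only a toy under stated assumptions: the three functions below are hand-written stand-ins for the Encoder, Prediction and Joint networks (not trained LSTMs), and the blank symbol `_` and the `max_symbols_per_frame` cap are illustrative choices.

```python
BLANK = "_"  # symbol meaning "emit nothing, move on to the next audio frame"


def encoder(frame):
    """Stand-in for the LSTM encoder: here a frame is already a letter hint."""
    return frame


def prediction_network(prev_symbol):
    """Stand-in for the LSTM prediction network, fed the last emitted symbol."""
    return prev_symbol


def joint(enc, pred):
    """Stand-in for the feedforward joint network plus softmax.

    Emits the frame's letter unless it is blank or was just emitted.
    """
    if enc == BLANK or enc == pred:
        return BLANK
    return enc


def greedy_decode(frames, max_symbols_per_frame=3):
    """Stream over the frames, feeding each prediction back as y_{u-1}."""
    output = []
    prev = BLANK  # y_{u-1} before anything has been emitted
    for x in frames:
        enc = encoder(x)
        for _ in range(max_symbols_per_frame):
            y = joint(enc, prediction_network(prev))
            if y == BLANK:
                break  # nothing more to emit for this frame
            output.append(y)
            prev = y  # feedback: the prediction becomes the next input
    return "".join(output)


print(greedy_decode(["h", "h", "_", "i", "i"]))  # hi
```

Note that output is produced while the frames are still arriving, which is what makes the approach suitable for streaming.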
These speech recognition models are hosted directly on the device, where decoding is done by performing a beam search through a single neural network.
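To make the idea of beam search concrete, here is a minimal sketch. The function `next_symbol_probs` is a hypothetical stand-in for the neural network, and its probabilities are invented so that the greedy choice at the first step turns out not to be the best overall.

```python
import math


def next_symbol_probs(prefix):
    """Hypothetical network output: next-symbol probabilities given the prefix."""
    if prefix == "":
        return {"a": 0.6, "b": 0.4}
    if prefix.endswith("a"):
        return {"a": 0.5, "b": 0.5}
    return {"a": 0.1, "b": 0.9}


def beam_search(steps, beam_width):
    """Keep only the `beam_width` best partial hypotheses after every step."""
    beams = [("", 0.0)]  # (prefix, log-probability)
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            for symbol, prob in next_symbol_probs(prefix).items():
                candidates.append((prefix + symbol, score + math.log(prob)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams


best_prefix, best_score = beam_search(steps=2, beam_width=2)[0]
print(best_prefix)  # bb
```

With a beam width of 1 the search commits to "a" at the first step and ends with probability 0.3, whereas a width of 2 keeps "b" alive and finds "bb" with probability 0.36.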
A traditional speech recognition system, composed as a search graph (that is, edges and vertices), would still take up nearly 2 GB of memory to accommodate the acoustic, pronunciation and language models.
The RNN-T presented as an alternative here offers, once trained, the same accuracy as the traditional server-based models but is only 450 MB.
However, even on today’s smartphones, 450MB is a lot, and propagating signals through such a large network can be slow.
To tackle these latency and storage issues, the developers at Google have started using the model optimisation toolkit in the TensorFlow Lite library.
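One technique that such optimisation toolkits commonly offer is weight quantisation. The article does not say which technique was applied here, so treat the following as an assumption: a plain-Python sketch of 8-bit linear quantisation, with made-up weights, showing why storage shrinks and what accuracy is given up.

```python
def quantize(weights):
    """Map float weights to signed 8-bit integers using one shared scale."""
    max_abs = max(abs(w) for w in weights) or 1.0  # avoid a zero scale
    scale = max_abs / 127.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the 8-bit integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 0.83]  # invented example values
quantized, scale = quantize(weights)
restored = dequantize(quantized, scale)

# Each weight now fits in 1 byte instead of the 4 bytes of a float32,
# at the cost of a rounding error of at most half a quantisation step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(quantized)  # [50, -127, 0, 83]
```

Storing one byte per weight instead of four is where the size reduction comes from; integer arithmetic can also be faster on mobile hardware.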
Visualising Sounds And Many More, How The Recorder Gets It Right
Recorder, a new kind of audio recording app for Google Pixel phones, leverages the techniques discussed above to transcribe conversations and to detect sounds such as applause, laughter or even whistling.
All of these features can be run entirely on-device, without the need for an internet connection.
The model can reliably transcribe hour-long audio recordings while also indexing the conversation by mapping words to timestamps, so that the user can click on a word in the transcription and start playback from that point in the recording.
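A word-to-timestamp index of this kind can be very simple. The sketch below assumes a transcript stored as (start time, word) pairs; the data and the function names are hypothetical and are not Recorder's actual format.

```python
import bisect

# Hypothetical transcript: (start time in seconds, word), sorted by time.
TRANSCRIPT = [(0.0, "welcome"), (0.6, "to"), (0.8, "the"), (1.1, "lecture")]


def seek_time_for_word(index):
    """Tapping the word at position `index` starts playback at its timestamp."""
    return TRANSCRIPT[index][0]


def word_at_time(seconds):
    """Inverse lookup: which word is being spoken at a playback position."""
    starts = [start for start, _ in TRANSCRIPT]
    i = bisect.bisect_right(starts, seconds) - 1
    return TRANSCRIPT[max(i, 0)][1]


print(seek_time_for_word(3))  # 1.1
print(word_at_time(0.9))      # the
```

The inverse lookup is what would let a player highlight the current word while the recording plays.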
The most interesting feature of the new Recorder app is that it allows users to visually search for sections of a recording based on specific moments or sounds.
To enable this, the team at Google used convolutional neural networks (CNNs). The model is trained on the AudioSet ontology dataset, which was released a couple of years ago. This dataset contains a vocabulary of sound classes that provides a consistent level of detail across the spectrum of labelled sound events.
Based on this, Recorder represents audio visually as a coloured waveform in which each colour is associated with a different sound category.
The colourised waveform lets users understand what type of content was captured in a specific recording and navigate an ever-growing audio library more easily.
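Turning per-window sound labels into coloured stretches of a waveform can be sketched as follows. The palette and the labels are invented for the example, and the real app's colour scheme may differ.

```python
from itertools import groupby

# Invented palette: one colour per sound category.
CLASS_COLOURS = {"speech": "blue", "music": "orange", "applause": "green"}


def coloured_segments(window_labels):
    """Merge consecutive windows with the same label into runs.

    Returns a list of (colour, start_window, length_in_windows) tuples.
    """
    segments = []
    position = 0
    for label, run in groupby(window_labels):
        length = len(list(run))
        segments.append((CLASS_COLOURS.get(label, "grey"), position, length))
        position += length
    return segments


labels = ["speech", "speech", "music", "applause", "applause"]
print(coloured_segments(labels))
# [('blue', 0, 2), ('orange', 2, 1), ('green', 3, 2)]
```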
To separate the audio classes with greater accuracy, the raw audio is segmented into 50 ms windows. Each frame is then given a sigmoid score that indicates the probability of that audio belonging to a certain class of sounds.
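The windowing and scoring step can be illustrated with a short sketch. The sample rate, class list, threshold and logits are all assumptions made for the example; in the real system the logits would come from the CNN.

```python
import math

SAMPLE_RATE = 16000  # assumed sample rate in Hz
WINDOW_MS = 50       # window length described in the text
CLASSES = ["speech", "music", "applause"]  # illustrative subset of classes


def split_windows(samples, sample_rate=SAMPLE_RATE, window_ms=WINDOW_MS):
    """Cut raw audio samples into consecutive fixed-length windows."""
    size = sample_rate * window_ms // 1000
    return [samples[i:i + size] for i in range(0, len(samples), size)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def classify(logits, threshold=0.5):
    """Score every class independently, so one window can carry several labels."""
    scores = {name: sigmoid(z) for name, z in zip(CLASSES, logits)}
    labels = [name for name, score in scores.items() if score >= threshold]
    return scores, labels


windows = split_windows([0] * 1600)          # 100 ms of silence -> 2 windows
scores, labels = classify([2.0, -1.0, 0.0])  # made-up logits for one window
print(len(windows), labels)  # 2 ['speech', 'applause']
```

Because each class gets its own sigmoid score instead of sharing one softmax, overlapping sounds such as speech over applause can both be flagged for the same window.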
Once a recording is done, Recorder suggests three tags that the app deems to represent the most memorable content, enabling the user to quickly compose a meaningful title.
To suggest the tags, Recorder uses a boosted decision tree trained on conversational data. It relies on textual features such as document word frequency and specificity, and it also considers the grammatical role of each term in the sentence.
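The two textual features mentioned, word frequency and specificity, can be approximated with a TF-IDF-style score. The sketch below covers only that feature computation and ranks words by it directly; it does not implement the boosted decision tree or the grammatical-role feature, and the background documents are invented.

```python
import math
from collections import Counter


def tag_candidates(transcript, background_docs, top_k=3):
    """Rank words by frequency in the transcript times specificity.

    Specificity is a smoothed inverse document frequency over
    `background_docs`, so common words such as "the" score low.
    """
    counts = Counter(transcript.lower().split())
    doc_words = [set(doc.lower().split()) for doc in background_docs]
    n_docs = len(doc_words)
    scores = {}
    for word, count in counts.items():
        docs_with_word = sum(1 for words in doc_words if word in words)
        specificity = math.log((1 + n_docs) / (1 + docs_with_word))
        scores[word] = count * specificity
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]


background = ["the cat sat", "the dog ran", "a network of roads"]
print(tag_candidates("the neural network learns the neural weights", background))
# ['neural', 'learns', 'weights']
```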
Speech is one of the most important modes of communication, and we have just gone through what machines have to do to accomplish something as simple as telling the barking of a dog from a man whistling. The research is headed in the right direction, but it also shows why no stone should be left unturned in accelerating it if significant advances are to be made in the next decade.