The Lip Sync challenge, recently introduced by Google’s AI Experiments group, aims to teach the tech giant’s AI systems the art of reading lips. The initiative could help Google develop applications for people with speech impairments caused by conditions such as amyotrophic lateral sclerosis (ALS).
Google plans to enlist professional singers to help its AI systems learn the skill of synchronisation.
How Does It Work?
The platform is self-descriptively named Lip Sync and is built by YouTube for Chrome on desktop. It invites participants to sing a particular segment of “Dance Monkey” by Tones and I, currently the only permissible sound bite. After the performance, the video clip, stripped of its audio, is fed into Google’s AI, which uses the movement of the face, and the lips in particular, to evaluate the lip syncing.
This web experience is built on TensorFlow.js. The experiment is based on the FaceMesh model, built through a collaboration between TensorFlow.js and MediaPipe. Using the participant’s webcam, the model provides a high-density set of facial keypoints in real time, and because the processing happens locally in the browser, no data is ever saved.
The platform captures a frame-by-frame recording of the participant’s mouth shapes, lined up with the music. These frames are then compared with the mouth shapes in a pre-recorded baseline. The comparison uses the matchShapes function in the OpenCV library, which relies on a technique called Hu Moments. Hu Moments are a set of seven numbers calculated from central moments that are invariant to image transformations; they are extracted from the outline of an object in the image in question. This method is preferred because it can match mouth shapes irrespective of translation and rotation. Along with Hu Moments, a ‘mouth ratio’ (height-to-width ratio) is used to compare the two samples more accurately.
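As a rough illustration of this comparison step, the sketch below computes the first two of the seven Hu invariants from a binary mouth mask and compares two masks on a log scale, loosely mirroring what OpenCV’s matchShapes does (the real function uses all seven invariants and works on contours). The binary-mask representation and the mouth_ratio helper are illustrative assumptions, not Google’s actual code.

```python
import math

def raw_moment(img, p, q):
    # Raw image moment M_pq over a binary mask (list of rows).
    return sum((x ** p) * (y ** q) * v
               for y, row in enumerate(img)
               for x, v in enumerate(row))

def hu_invariants(img):
    m00 = raw_moment(img, 0, 0)
    xc = raw_moment(img, 1, 0) / m00  # centroid x
    yc = raw_moment(img, 0, 1) / m00  # centroid y

    def mu(p, q):
        # Central moment: invariant to translation.
        return sum(((x - xc) ** p) * ((y - yc) ** q) * v
                   for y, row in enumerate(img)
                   for x, v in enumerate(row))

    def eta(p, q):
        # Normalised central moment: also invariant to scale.
        return mu(p, q) / (m00 ** (1 + (p + q) / 2))

    # First two of the seven Hu invariants (kept short for illustration).
    h1 = eta(2, 0) + eta(0, 2)
    h2 = (eta(2, 0) - eta(0, 2)) ** 2 + 4 * eta(1, 1) ** 2
    return [h1, h2]

def shape_distance(a, b):
    # Simplified log-scale comparison of Hu invariants,
    # in the spirit of matchShapes' distance measures.
    return sum(abs(math.log(abs(x)) - math.log(abs(y)))
               for x, y in zip(hu_invariants(a), hu_invariants(b)))

def mouth_ratio(img):
    # Height-to-width ratio of the mask's bounding box.
    ys = [y for y, row in enumerate(img) for v in row if v]
    xs = [x for row in img for x, v in enumerate(row) if v]
    return (max(ys) - min(ys) + 1) / (max(xs) - min(xs) + 1)
```

Because the invariants are built from central moments, a mouth mask and a translated copy of it yield (numerically) the same Hu values, so the distance between them is near zero, which is exactly the property that makes the technique robust to head movement in the frame.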
Similar Attempts In The Past
Back in 2016, the University of Oxford and Google’s DeepMind undertook a project to use AI for lip reading. To create a lip reading system, deep learning was applied to a massive dataset of BBC programmes. The AI system was trained on 5,000 hours of video, containing 118,000 sentences from programmes such as Newsnight, BBC Breakfast, and Question Time that aired between January 2010 and December 2015.
The trained system was then gauged on programmes aired between March and September 2016. It could accurately decipher entire phrases, vastly outperforming professional lip readers: it correctly annotated 46.8 percent of all words, compared to 12.4 percent by the professionals.
In 2018, in pursuit of a more performant system, a group of researchers from Alibaba, Zhejiang University in Hangzhou, China, and the Stevens Institute of Technology in New Jersey introduced the Lip by Speech (LIBS) method. The method distils useful information from human speech at several scales, including the sequence, context, and frame levels. The acquired data is then aligned with video data by identifying the correspondence between the two, and a filtering technique is used to refine the distilled features.
According to the researchers involved, the method outperformed the baseline by 7.66 percent and 2.75 percent in terms of character error rate.
In July 2019, researchers from Imperial College London collaborated with Samsung’s AI researchers to propose a visual speech recognition mechanism that performs lip reading by designing a Generative Adversarial Network capable of extracting audio signals from videos of people speaking. The model contains a module that discriminates between real and synthesised distributions of human speech. The researchers claimed it was the first method that could synthesise speech from previously unseen speakers. It was trained on the GRID dataset, which features 1,000 phrases from 33 speakers.
The newly introduced Lip Sync experiment certainly gives Google an edge in teaching lip reading to AI systems. It is built on TensorFlow.js, which already sees wide use in domains such as emotion and movement recognition, thanks to its popular libraries for training models in web browsers. Another prominent advantage is that, according to Google, none of the data is stored, ruling out apprehensions related to privacy and security.
Lip reading, or visual speech recognition, is one of the most ambitious applications of AI, and the domain is expected to revolutionise applications ranging from improved speech recognition to biometric authentication. Unsurprisingly, a great deal of research is being conducted in this direction, and it is only expected to grow further.