Top 5 Research Papers on Visual Speech Recognition

Visual speech recognition (VSR), also known as lipreading, is considered one of the most futuristic applications of AI. So far there have been no major breakthroughs, and the domain is slowly catching up with audio-based speech recognition. VSR lets machines understand speech in noisy environments, and it can also be used in applications such as improved hearing aids and biometric authentication.



Visual speech recognition sits at the intersection of computer vision and speech recognition. The tremendous success of deep learning in both fields has already shifted VSR research away from handcrafted features and HMM-based models towards deep feature extractors and end-to-end deep architectures. Recently introduced deep learning systems beat human lip-reading experts by a large margin, at least for the constrained vocabulary defined by each database.

Here are five notable works on visual speech recognition:

Combining Residual Networks with LSTMs for Lipreading

ResNets and LSTMs have been game-changers for computer vision and NLP tasks respectively. In this work, the authors combine the benefits of spatiotemporal convolutional, residual and bidirectional Long Short-Term Memory networks. They propose an end-to-end deep learning architecture for word-level visual speech recognition that is trained and evaluated on the Lipreading In-The-Wild benchmark dataset.
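To make the role of the spatiotemporal front-end more concrete, here is a minimal sketch of how a 3D convolution changes the shape of a video clip before per-frame features are passed on to a residual network and a bidirectional LSTM. The clip size, kernel, stride and padding below are illustrative assumptions, not values taken from the article, and the helper names `conv_out` and `conv3d_out_shape` are hypothetical.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a convolution along one axis (floor convention)."""
    return (size + 2 * padding - kernel) // stride + 1


def conv3d_out_shape(in_shape, kernel, stride, padding):
    """Apply the one-axis formula to (time, height, width)."""
    return tuple(
        conv_out(n, k, s, p)
        for n, k, s, p in zip(in_shape, kernel, stride, padding)
    )


# Assumed input: 29 grayscale frames of 112x112 pixels (time, height, width).
clip = (29, 112, 112)

# Assumed front-end: kernel 5x7x7, stride 1 in time and 2 in space,
# padding chosen so that the number of frames is preserved.
front_end = conv3d_out_shape(
    clip, kernel=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)
)
print(front_end)  # (29, 56, 56)
```

With these assumed settings the time axis keeps all 29 steps while the spatial resolution halves, so a recurrent layer placed after the residual network would still see one feature vector per frame.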

The authors report that the proposed network attains a word accuracy of 83.0%, a 6.8% absolute improvement over the previous state of the art, without using information about word boundaries during training or testing.
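The two figures quoted above (83.0 accuracy and a 6.8-point absolute gain) imply a previous best accuracy and a relative reduction in error. The short sketch below works through that arithmetic; the function name `improvement_summary` is hypothetical, and the derived numbers follow only from the two quoted figures.

```python
def improvement_summary(new_acc, abs_gain):
    """Derive the earlier accuracy, the new error rate and the relative
    error reduction (all in percent) from an accuracy and its absolute gain."""
    prev_acc = new_acc - abs_gain
    prev_err = 100.0 - prev_acc
    new_err = 100.0 - new_acc
    rel_err_reduction = 100.0 * (prev_err - new_err) / prev_err
    return round(prev_acc, 2), round(new_err, 2), round(rel_err_reduction, 2)


summary = improvement_summary(83.0, 6.8)
print(summary)  # (76.2, 17.0, 28.57)
```

In other words, a 6.8-point absolute gain over 76.2% corresponds to removing a little over a quarter of the remaining errors.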


LRW-1000: A Naturally-Distributed Large-Scale Benchmark for Lip Reading in the Wild

In this paper, the researchers present a naturally distributed, large-scale benchmark for lip-reading in the wild, named LRW-1000, which contains 1,000 classes with 718,018 samples from more than 2,000 individual speakers. Each class corresponds to the syllables of a Mandarin word composed of one or several Chinese characters.

The benchmark shows large variation in the number of samples per class, video resolution, lighting conditions, and speaker attributes such as pose, age and gender.

The authors also evaluate several popular lip-reading methods on it and analyse the results from several angles.

Deep Word Embeddings for Visual Speech Recognition

This paper introduces a deep learning architecture for extracting word embeddings for visual speech recognition. The embeddings summarise the information in the mouth region that is relevant to word recognition, while suppressing other sources of variability such as speaker, pose and illumination.

The system comprises a spatiotemporal convolutional layer, a Residual Network and bidirectional LSTMs, and is trained on the Lipreading In-The-Wild database. The results show a promising 11.92% error rate on a vocabulary of 500 words. For low-shot learning, Probabilistic Linear Discriminant Analysis (PLDA) is used to model the embeddings of words unseen during training. The experiments demonstrate that word-level visual speech recognition is feasible even when the target words are not included in the training set.
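The low-shot idea can be illustrated with a much simpler stand-in than PLDA. The sketch below enrols each unseen word from a handful of example embeddings and classifies a new embedding by cosine similarity to the nearest class centroid. This is a nearest-centroid classifier, not PLDA and not the authors' method; the 3-dimensional embeddings and the words are made-up toy values.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def classify(embedding, enrolled):
    """Return the enrolled word whose centroid is most similar."""
    return max(enrolled, key=lambda word: cosine(embedding, enrolled[word]))


# Toy support set: two example embeddings per previously unseen word.
support = {
    "hello": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "world": [[0.1, 0.9, 0.2], [0.0, 0.8, 0.3]],
}
enrolled = {word: centroid(vectors) for word, vectors in support.items()}

query = [0.85, 0.15, 0.05]
print(classify(query, enrolled))  # hello
```

A probabilistic back-end such as PLDA goes further by modelling within-word and between-word variability explicitly, which a plain centroid comparison ignores.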

Read Speech Beyond Lips

In this paper, the researchers investigate one of the most overlooked areas in VSR research: reading the extraoral facial regions, i.e. the parts of the face beyond the lips. Experiments were conducted on both word-level and sentence-level benchmarks with different characteristics.

The experiments show improvements over previous methods that use only the lip region as input. The results indicate that incorporating information from extraoral facial regions, even the upper face, consistently benefits VSR performance.

Detecting Adversarial Attacks On Audio-Visual Speech Recognition

Adversarial attacks have recently become one of the most actively researched topics in deep learning. The weaknesses these attacks have exposed have made researchers look out for similar vulnerabilities even in reinforcement learning. Along similar lines, this work proposes a detection method based on the temporal correlation between the audio and video streams.

The idea is that the correlation between audio and video will be lower in adversarial examples than in benign ones, because of the added adversarial noise.
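A minimal sketch of this thresholding idea, assuming each stream has already been reduced to one number per time step (for example, audio energy and mouth opening). Pearson correlation is used here purely as a simple stand-in for whatever synchronisation measure a real detector would use, and the feature values, the function names and the 0.5 threshold are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def is_adversarial(audio_feat, video_feat, threshold=0.5):
    """Flag a clip when audio-video correlation falls below the threshold."""
    return pearson(audio_feat, video_feat) < threshold


# Toy per-frame features (made-up values).
video = [0.1, 0.4, 0.9, 0.5, 0.2, 0.7]
benign_audio = [0.2, 0.5, 1.0, 0.6, 0.3, 0.8]    # tracks the video closely
attacked_audio = [0.9, 0.1, 0.3, 0.8, 0.6, 0.2]  # no longer tracks the video

print(is_adversarial(benign_audio, video))    # False
print(is_adversarial(attacked_audio, video))  # True
```

In practice the threshold would have to be tuned on held-out benign and adversarial examples rather than fixed in advance.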

According to the authors, this is the first work to address the detection of adversarial attacks on audio-visual speech recognition models. They report experimental results showing that the proposed approach is an effective way of detecting such attacks.

As the work on extraoral facial regions points out, most existing methods equate VSR with automatic lip-reading, which attempts to recognise speech by analysing lip motion. However, human experience and psychological studies suggest that we do not always fix our gaze on each other's lips during a face-to-face conversation, but rather scan the whole face repeatedly.


Ram Sagar
I have a master's degree in Robotics and I write about machine learning advancements.
