Visual speech recognition (also known as lipreading) is considered one of the most futuristic applications of AI. So far there have been no major breakthroughs, and the field is still catching up to audio-based speech recognition. Visual speech recognition gives machines the ability to understand speech in noisy environments, and it can also support applications such as improved hearing aids and biometric authentication.
The tremendous success of deep learning in computer vision and speech recognition has already reshaped visual speech recognition, shifting research from handcrafted features and HMM-based models to deep feature extractors and end-to-end architectures. Recently introduced deep learning systems beat human lip-reading experts by a large margin, at least within the constrained vocabulary defined by each database.
Here are a few top works on visual speech recognition:
Combining Residual Networks with LSTMs for Lipreading
ResNets and LSTMs are game-changers for computer vision and NLP tasks respectively. In this work, the authors combine the benefits of spatiotemporal convolutional layers, residual connections and bidirectional Long Short-Term Memory (LSTM) networks. They propose an end-to-end deep learning architecture for word-level visual speech recognition that is trained and evaluated on the Lip Reading in the Wild (LRW) benchmark dataset.
The authors report that the proposed network attains a word accuracy of 83.0%, a 6.8% absolute improvement over the previous state of the art, without using information about word boundaries during training or testing.
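What makes such a front-end "spatiotemporal" is that the very first convolution mixes information across time as well as space, so lip motion, not just lip shape, reaches the deeper layers. A minimal illustrative sketch in plain NumPy follows; the toy tensor sizes and the averaging kernel are assumptions for illustration, not the paper's actual configuration:

```python
import numpy as np

def spatiotemporal_conv(video, kernel):
    """Valid 3D convolution of a (T, H, W) clip with a (t, h, w) kernel.

    A real model would use many learned kernels, padding and striding;
    this loop version only illustrates the time-and-space receptive field.
    """
    T, H, W = video.shape
    t, h, w = kernel.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                # Each output value sees a small window of frames AND pixels.
                out[i, j, k] = np.sum(video[i:i + t, j:j + h, k:k + w] * kernel)
    return out

frames = np.random.rand(10, 20, 20)        # toy mouth-region clip (assumed sizes)
kernel = np.ones((5, 7, 7)) / (5 * 7 * 7)  # averaging kernel as a stand-in
features = spatiotemporal_conv(frames, kernel)
print(features.shape)  # (6, 14, 14)
```

In the actual architecture, the output of this stage is fed frame-by-frame into a ResNet and then into bidirectional LSTMs that aggregate evidence over the whole clip.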
LRW-1000: A Naturally-Distributed Large-Scale Benchmark For Lip Reading In The Wild
In this paper, the researchers present a naturally-distributed, large-scale benchmark for lip-reading in the wild, named LRW-1000, which contains 1,000 classes with 718,018 samples from more than 2,000 individual speakers. Each class corresponds to the syllables of a Mandarin word composed of one or several Chinese characters.
The benchmark exhibits large variation in the number of samples per class, video resolution, lighting conditions, and speakers' pose, age and gender.
The authors have also evaluated several popular lip-reading methods on the benchmark and performed a thorough analysis of the results from several aspects.
Deep Word Embeddings
In this paper, titled Deep Word Embeddings For Visual Speech Recognition, a deep learning architecture is introduced that learns embeddings of the mouth region capturing the information relevant to word recognition, while suppressing other sources of variability such as speaker, pose and illumination.
The system comprises a spatiotemporal convolutional layer, a Residual Network and bidirectional LSTMs, and is trained on the Lip Reading in the Wild database. The results show a promising 11.92% error rate on a vocabulary of 500 words. For low-shot learning, Probabilistic Linear Discriminant Analysis (PLDA) is used to model the embeddings of words unseen during training. The experiments demonstrate that word-level visual speech recognition is feasible even when the target words are not included in the training set.
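The low-shot idea is that once the network produces good word embeddings, a new word can be recognised from just a few example clips by fitting a simple generative model per class. The sketch below uses a nearest-class-mean classifier with equal spherical covariances as a heavily simplified stand-in for PLDA, on synthetic embeddings; all data and dimensions are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy embeddings for two words unseen during training: a few example clips
# each (in the paper these would come from the trained network).
word_a = rng.normal(loc=0.0, size=(5, 16))
word_b = rng.normal(loc=3.0, size=(5, 16))

# Fit one Gaussian per class with a shared spherical covariance -- a very
# simplified stand-in for PLDA scoring.
means = np.stack([word_a.mean(axis=0), word_b.mean(axis=0)])

def classify(embedding):
    # Negative squared distance to each class mean equals the Gaussian
    # log-likelihood up to a constant when covariances are equal.
    scores = -np.sum((means - embedding) ** 2, axis=1)
    return int(np.argmax(scores))

test_clip = rng.normal(loc=3.0, size=16)  # a new clip of word B
print(classify(test_clip))  # → 1
```

PLDA goes further than this sketch by learning separate between-class and within-class covariance structure from the training vocabulary, which is what lets it score unseen words reliably from very few examples.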
Read Speech Beyond Lips
In this paper, the researchers investigate one of the most overlooked areas in VSR research — reading the extraoral facial regions, i.e. the face beyond the lips. Experiments were conducted on both word-level and sentence-level benchmarks with different characteristics.
The experiments show improvements over previous methods that use only the lip region as input, indicating that incorporating information from extraoral facial regions, even the upper face, consistently benefits VSR performance.
Detecting Adversarial Attacks On Audio-Visual Speech Recognition
Adversarial attacks have become one of the most actively researched topics in the deep learning space. The vulnerabilities these attacks expose have prompted researchers to look for such anomalies even in reinforcement learning. Along similar lines, this work proposes a detection method based on the temporal correlation between audio and video streams.
Here, the idea is that the correlation between audio and video in adversarial examples will be lower than in benign examples, due to the added adversarial noise.
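The idea above can be sketched in a few lines: extract one feature per frame from each modality, compute their Pearson correlation over the clip, and flag the clip when the correlation falls below a threshold. The synthetic signals, the 0.5 threshold and the noise scale below are illustrative assumptions, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic per-frame features: in a benign clip the audio energy and the
# mouth-opening signal follow the same underlying speech pattern.
T = 200
speech = np.sin(np.linspace(0, 12 * np.pi, T))
audio_feat = speech + 0.1 * rng.normal(size=T)
video_feat = speech + 0.1 * rng.normal(size=T)

def is_adversarial(a, v, threshold=0.5):
    # Flag the clip when the audio-video Pearson correlation drops below a
    # threshold (in practice the threshold would be tuned on benign data).
    r = np.corrcoef(a, v)[0, 1]
    return bool(r < threshold)

print(is_adversarial(audio_feat, video_feat))      # benign: correlation stays high
adv_video = video_feat + 3.0 * rng.normal(size=T)  # perturbation breaks the link
print(is_adversarial(audio_feat, adv_video))       # flagged as adversarial
```

A real attack would be a small, targeted perturbation rather than broadband noise, but the detection principle is the same: the perturbation decorrelates the two streams.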
This is the first work to touch upon detection of adversarial attacks on audio-visual speech recognition models. According to the experimental results, the proposed approach is an effective way of detecting such attacks.
Currently, most existing methods equate VSR with automatic lip-reading, which attempts to recognise speech by analysing lip motion. However, human experience and psychological studies suggest that we do not always fix our gaze at each other’s lips during a face-to-face conversation, but rather scan the whole face repetitively.