Researchers at Microsoft have developed an AI-based gaze-tracking system that works on any device. The system is described as ‘hardware-agnostic’ because it can function on any type of device. The researchers believe this lays the groundwork for using the predictive capabilities of deep neural networks to control computers, tablets, or phones with just the eyes.
The system uses a deep neural network as an appearance-based method, relying on facial imagery for constrained gaze-tracking. The imagery is captured with an ordinary RGB camera of the kind present in most modern computing devices. Microsoft’s new gaze-tracking architecture could help people with motor-neuron disabilities such as ALS and cerebral palsy control their devices, let doctors interact with patient information without touching the screen or the keyboard, and support interactive gaming, behavioural studies, and user experience research.
Eye-gaze based computers depend on complex computations that require measuring the user’s head pose, head position, eye rotation, and the distance between the user and the object. All of these variables are calculated with respect to the observer’s frame of reference, which is an assembly of an infrared light projector and a high-resolution infrared camera. The accuracy of the system is affected by several factors, such as illumination, background noise, the optical properties of the sensors, and the quality of imaging. The issue with such devices is that they are tied either to the device they are used on or to user-specific calibration, and hence require specialised hardware. Acquiring such hardware poses several challenges: availability, affordability, reliability, and ease of use.
The architecture newly developed by the Microsoft researchers aims to overcome this problem. In their experiment, the researchers relied on RGB cameras, which are present in almost all modern computing devices, together with recent advances in deep learning.
For the experiment, the researchers reproduced the iTracker network architecture for RGB-based constrained gaze-tracking as the baseline. The iTracker reproduced for this project did not employ the calibration or device-specific fine-tuning used with the original version. The architecture takes as input crops of the eyes and the face region from the original image, along with a 25×25 binary face grid that indicates the positions of all the face pixels in the original image. The input images are passed through the Eye and Face sub-networks, after which the corresponding outputs are processed by multiple fully connected layers to estimate the gaze point coordinates.
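To make the face grid concrete, here is a minimal sketch of how such a binary grid could be built from a face bounding box. The function name `face_grid`, the `(x, y, w, h)` box format, and the rule that a cell is marked when the box overlaps it at all are assumptions made for illustration; the article does not describe how the researchers computed theirs.

```python
import math


def face_grid(frame_w, frame_h, box, grid_size=25):
    """Return a grid_size x grid_size list of 0/1 values.

    A cell is 1 when the face bounding box overlaps it (an assumed rule).
    `box` is (x, y, w, h) in pixels, measured from the top-left corner.
    """
    x, y, w, h = box

    # Map pixel coordinates onto grid cells.
    col_start = math.floor(x * grid_size / frame_w)
    row_start = math.floor(y * grid_size / frame_h)
    col_end = math.ceil((x + w) * grid_size / frame_w)
    row_end = math.ceil((y + h) * grid_size / frame_h)

    # Clamp so a box that leaves the frame cannot index outside the grid.
    col_start, row_start = max(col_start, 0), max(row_start, 0)
    col_end, row_end = min(col_end, grid_size), min(row_end, grid_size)

    grid = [[0] * grid_size for _ in range(grid_size)]
    for row in range(row_start, row_end):
        for col in range(col_start, col_end):
            grid[row][col] = 1
    return grid


if __name__ == "__main__":
    # Hypothetical 500x500 frame with a 100x100 face box.
    grid = face_grid(500, 500, (100, 200, 100, 100))
    for row in grid:
        print("".join("#" if cell else "." for cell in row))
```

The grid tells the network where the face sits in the frame and roughly how large it is, which the cropped eye and face images alone do not convey.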
Notably, the researchers used the GazeCapture dataset, acquired on phones and tablets, to train their model. GazeCapture, an MIT corpus, is the largest publicly available dataset of its kind, containing data from 1,450 people. The researchers also performed data augmentation to equip the model to handle real-world variations, introducing random changes in brightness, contrast, and saturation.
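The colour augmentation described above can be sketched roughly as follows. This is an illustrative, pure-Python version that operates on pixels with channel values in [0, 1]. The function names, the `strength` parameter, and the choice of mid-grey as the contrast pivot are assumptions, not details taken from the researchers' code.

```python
import random


def _clamp(value):
    """Keep a channel value inside the valid [0, 1] range."""
    return max(0.0, min(1.0, value))


def adjust(rgb, brightness=1.0, contrast=1.0, saturation=1.0):
    """Apply brightness, contrast, and saturation factors to one RGB pixel."""
    # Brightness: scale every channel.
    r, g, b = (_clamp(c * brightness) for c in rgb)
    # Contrast: push channels away from (or towards) mid-grey.
    # Using 0.5 as the pivot is a simplification assumed here.
    r, g, b = (_clamp(0.5 + (c - 0.5) * contrast) for c in (r, g, b))
    # Saturation: blend each channel with the pixel's own grey level.
    grey = 0.299 * r + 0.587 * g + 0.114 * b
    return tuple(_clamp(grey + (c - grey) * saturation) for c in (r, g, b))


def random_jitter(image, strength=0.2, rng=random):
    """Jitter a whole image (rows of RGB tuples) with one random factor set."""
    factors = [rng.uniform(1 - strength, 1 + strength) for _ in range(3)]
    return [[adjust(pixel, *factors) for pixel in row] for row in image]


if __name__ == "__main__":
    image = [[(0.2, 0.4, 0.6), (0.9, 0.1, 0.1)]]
    print(random_jitter(image, rng=random.Random(0)))
```

Drawing one set of factors per image, rather than per pixel, mimics a change in lighting or camera settings while keeping the picture internally consistent.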
A single model was trained for both smartphones and tablets with the entire dataset. Throughout the experiment, methods such as regularisation, data augmentation, colour transformation, and data normalisation were used. The researchers have made the detailed experiment and the corresponding code available online.
Additionally, to weed out potential biases, the Grad-CAM++ technique was used to generate heat maps of the model’s internal gradient activity.
The result was a system that achieved an RMS error of 1.8073 cm on GazeCapture.
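The article does not spell out how the RMS error is computed. One common reading, assumed here, is the square root of the mean squared Euclidean distance between predicted and true gaze points on the screen. The helper `rms_error` and the sample coordinates below are purely illustrative.

```python
import math


def rms_error(predicted, actual):
    """Root-mean-square Euclidean distance between paired 2D gaze points."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [
        (px - ax) ** 2 + (py - ay) ** 2
        for (px, py), (ax, ay) in zip(predicted, actual)
    ]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    # Made-up gaze points, in the same units as the reported error.
    predicted = [(1.0, 2.0), (4.0, 6.0)]
    actual = [(1.0, 2.0), (1.0, 2.0)]
    print(rms_error(predicted, actual))
```

Because the errors are squared before averaging, a few large misses raise this metric more than many small ones do.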
This is not the first time Microsoft has experimented with gaze-tracking. In a recent study, its researchers experimented with infrared lights around the display for eye-tracking. Additionally, Microsoft’s Windows 10 was the first operating system to provide a technology, called Eye Control, that allows users to control the on-screen mouse and keyboard with eye movement alone.
I am a journalist with a postgraduate degree in computer network engineering. When not reading or writing, one can find me doodling away to my heart’s content.