Google Lens’ AI Factor


The image below is from the Kalaburagi (Karnataka) Railway Station’s PSA by the child protection team. The image on the right shows the picture captured through Google Lens, which manages to translate Kannada into English.

Google Lens was announced by CEO Sundar Pichai at the Google I/O developer conference in 2017. The announcement was part of the ‘AI first’ strategy, which was also presented at the conference. Pichai called it a key reflection of Google’s direction, highlighting it as an example of Google being at an ‘inflection point with vision’. He said, “All of Google was built because we started understanding text and web pages. So the fact that computers can understand images and videos has profound implications for our core mission.”

In this article, we list the major AI breakthroughs that have made Google Lens an efficient tool.


The advanced search mechanism

Google uses computer vision and AI technology to make phone cameras ‘smart’. This enables the phone not only to capture pictures but also to recognise what it sees and take appropriate action.

Recently, Google introduced multisearch, which combines text and image search in Google Lens. Beyond plain image search, it lets users ask additional questions about an image or refine the results, for example when shopping for clothes in a particular colour or pattern. This is the public trial of the Multitask Unified Model (MUM), which was first introduced at the Google I/O event in 2021. MUM uses the T5 text-to-text framework to understand and even generate language. According to Google, it is 1,000 times more powerful than BERT and is trained across 75 different languages. MUM is multimodal, so it understands information across text and images. Google said that it would expand the scope of MUM to modalities like video and audio.

Images captured by Lens mostly come from sources like signage, handwriting and documents, which gives rise to additional challenges: obscured text, uniquely stylised scripts or blurry images. These issues may cause the OCR engine to misinterpret characters. To overcome them, Google Lens uses the Knowledge Graph to provide contextual clues.
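To illustrate the idea of correcting OCR output with contextual clues, here is a minimal sketch in Python. It is not how Lens or the Knowledge Graph works internally; the vocabulary `KNOWN_TERMS`, the helper names and the similarity cutoff are all hypothetical, and a small word list stands in for real contextual knowledge.

```python
import difflib

# Hypothetical stand-in for contextual knowledge: terms that are plausible
# in the scene being read (an assumption, not Google's Knowledge Graph).
KNOWN_TERMS = ["railway", "station", "platform", "child", "helpline", "Kalaburagi"]


def correct_token(token, vocabulary=KNOWN_TERMS, cutoff=0.75):
    """Return the closest known term for a noisy OCR token, or the token itself."""
    # Compare case-insensitively, but return the term in its canonical spelling.
    canonical = {term.lower(): term for term in vocabulary}
    matches = difflib.get_close_matches(
        token.lower(), list(canonical), n=1, cutoff=cutoff
    )
    return canonical[matches[0]] if matches else token


def correct_line(line):
    """Correct each whitespace-separated token in a line of OCR output."""
    return " ".join(correct_token(tok) for tok in line.split())


if __name__ == "__main__":
    # "1" misread for "l", and "l" misread for "i"
    print(correct_line("rai1way statlon"))
```

Tokens that are not close to any known term are left untouched, so the sketch only overrides the OCR result when the context offers a strong candidate.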

Translation mechanisms

Lens uses Google Translate’s Neural Machine Translation (NMT) algorithm to translate an entire sentence at a time while preserving proper grammar and diction. This approach produces better results than traditional word-by-word translation. For the translation to be most useful, it should appear in the context of the original text. To accomplish this seamlessly, Google Lens redistributes the translation into lines of similar length and selects a font size to match. It also matches the colours of the translation and its background to the original, using a heuristic that assumes the background and the text differ in luminance and that the background takes up the majority of the space. This lets Lens classify each pixel as either background or text. It then samples the average colour from the two regions so that the translated text matches the original.
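The colour-matching heuristic described above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the luminance weights are the common Rec. 601 approximation, and the midpoint threshold is a simplification chosen for the example rather than anything Google has documented.

```python
def luminance(rgb):
    """Approximate perceived brightness of an (R, G, B) pixel (Rec. 601 weights)."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def average_colour(pixels):
    """Component-wise mean of a non-empty list of (R, G, B) pixels, rounded to ints."""
    n = len(pixels)
    return tuple(round(sum(p[i] for p in pixels) / n) for i in range(3))


def text_and_background_colours(pixels):
    """Split pixels into two luminance groups and average each group.

    The larger group is assumed to be the background, the smaller one the text.
    Returns (text_colour, background_colour).
    """
    lums = [luminance(p) for p in pixels]
    # Assumption: a midpoint threshold is enough to separate the two groups.
    threshold = (min(lums) + max(lums)) / 2
    dark = [p for p, lum in zip(pixels, lums) if lum < threshold]
    light = [p for p, lum in zip(pixels, lums) if lum >= threshold]
    if not dark or not light:
        raise ValueError("region has uniform luminance; no text detected")
    if len(dark) > len(light):
        background, text = dark, light
    else:
        background, text = light, dark
    return average_colour(text), average_colour(background)


if __name__ == "__main__":
    # Six white pixels and two black ones: black text on a white background.
    region = [(255, 255, 255)] * 6 + [(0, 0, 0)] * 2
    print(text_and_background_colours(region))
```

A region with no luminance contrast raises an error instead of guessing, since the heuristic has nothing to separate in that case.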

Another challenge is reading the detected text aloud. The high-fidelity audio is generated by Google’s Text-to-Speech (TTS) service, which applies machine learning to detect and disambiguate entities like addresses, dates and phone numbers, and uses that information to generate realistic speech based on DeepMind’s WaveNet.

These reading features become more contextual when they are paired with the display. To achieve this, Lens uses annotations from the TTS service that mark the beginning of each word, and highlights each word on the screen as it is being read.
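The word-by-word highlighting can be sketched in Python as a lookup over word start times. The data format here is an assumption made for the example: a sorted list of start times in seconds, one per word. The function names are hypothetical and do not correspond to any actual TTS API.

```python
from bisect import bisect_right


def word_index_at(start_times, t):
    """Index of the word being spoken at time t, or None before the first word.

    start_times must be sorted in ascending order, one entry per word.
    """
    # bisect_right counts the words whose start time is <= t.
    i = bisect_right(start_times, t) - 1
    return i if i >= 0 else None


def render_highlight(words, start_times, t):
    """Return the sentence with the currently spoken word wrapped in brackets."""
    current = word_index_at(start_times, t)
    return " ".join(
        f"[{word}]" if i == current else word for i, word in enumerate(words)
    )


if __name__ == "__main__":
    words = ["Platform", "two", "is", "closed"]
    starts = [0.0, 0.6, 0.9, 1.1]  # hypothetical timing annotations, in seconds
    print(render_highlight(words, starts, 0.7))
```

A binary search keeps each lookup cheap, which matters when the display is refreshed many times per second during playback.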

Until very recently, Lens existed only within Google Assistant. That is changing: the tool is finding applications beyond the Assistant, the camera and the Google Photos app, and it now supports other Google products such as Google Maps. In one demo, Google showed how Lens could power an augmented reality version of Maps in which notable locations and landmarks are called out with visual overlays.

Shraddha Goled
I am a technology journalist with AIM. I write stories focused on the AI landscape in India and around the world with a special interest in analysing its long term impact on individuals and societies.
