The image below is from the Kalaburagi (Karnataka) Railway Station’s PSA by the child protection team. The image on the right shows the picture captured through Google Lens, which manages to translate Kannada into English.
Google Lens was launched by CEO Sundar Pichai at the Google developer conference in 2017. This announcement was part of the ‘AI first’ strategy, which was also announced at this conference. Pichai had then called it the key reflection of Google’s direction, highlighting it as an example of Google being at an ‘inflection point with vision’. He said, “All of Google was built because we started understanding text and web pages. So the fact that computers can understand images and videos has profound implications for our core mission”.
In this article, we list out major AI breakthroughs that have been responsible for making Google Lens an efficient tool.
The advanced search mechanism
Google used underlying computer vision and AI technology to make phone cameras ‘smart’: the camera not only captures pictures but also recognises what the user is looking at and takes appropriate action.
Recently, Google introduced multisearch, which combines text and image search within Google Lens. Beyond image search, the feature lets users ask follow-up questions or add refinements, such as shopping for clothes in a particular colour or pattern. It is the public trial of the Multitask Unified Model (MUM), first introduced at the Google I/O event in 2021. MUM uses the T5 text-to-text framework to understand and even generate language. It is 1,000 times more powerful than BERT and is trained across 75 different languages. MUM is multimodal: it understands information across text and images, and Google said it would expand MUM to modalities like video and audio.
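Conceptually, a multimodal query like multisearch can be thought of as blending a visual similarity score (how close an item looks to the photo) with a textual refinement score (“like this, but green”) over a shared embedding space. The sketch below is a toy illustration of that idea, assuming pre-computed embeddings; the function names and weighting scheme are illustrative assumptions, not Google’s actual implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def multisearch_score(image_emb, text_emb, item_emb, text_weight=0.5):
    """Blend visual similarity with the textual refinement.

    image_emb: embedding of the photo the user captured
    text_emb:  embedding of the extra text query ("in green")
    item_emb:  embedding of a candidate catalogue item
    """
    visual = cosine(image_emb, item_emb)
    textual = cosine(text_emb, item_emb)
    return (1 - text_weight) * visual + text_weight * textual
```

Raising `text_weight` biases the ranking toward the textual refinement; lowering it keeps results visually closer to the original photo.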
Images captured by Lens often come from sources like signage, handwriting and documents, which gives rise to additional challenges: obscured text, uniquely stylised scripts or blurry images. These issues can cause the OCR engine to misinterpret characters. To overcome them, Google Lens uses the Knowledge Graph to provide contextual clues.
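A minimal sketch of this kind of contextual correction: try swapping commonly confused characters in an OCR result until the word matches a known entity. Here a small hard-coded vocabulary stands in for the Knowledge Graph, and the confusion table is an assumption for illustration.

```python
# Characters OCR engines commonly misread for one another (illustrative).
CONFUSABLE = {"0": "O", "1": "I", "5": "S", "8": "B"}

# Toy stand-in for Knowledge Graph entities.
VOCAB = {"HOSPITAL", "PLATFORM", "STATION"}

def correct(word, vocab=VOCAB):
    """Return a vocabulary match reachable via confusable swaps, else the word."""
    if word in vocab:
        return word
    candidates = {word}
    for raw, fixed in CONFUSABLE.items():
        # Apply each swap to everything generated so far, accumulating variants.
        candidates |= {w.replace(raw, fixed) for w in set(candidates)}
    for cand in candidates:
        if cand in vocab:
            return cand
    return word
```

A production system would score candidates probabilistically rather than accept the first dictionary hit, but the principle — letting known entities disambiguate unreliable characters — is the same.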
Lens uses Google Translate’s Neural Machine Translation (NMT) algorithm to translate entire sentences at a time while preserving grammar and diction, which is more effective than traditional word-by-word translation. For the translation to work well, the context of the original text should be retained. To accomplish this seamlessly, Google Lens redistributes the translation into lines of similar lengths and selects an appropriate font size to match. It also matches the colour of the translation and its backdrop to the original, using a heuristic that assumes the text and background differ in colour and that the background occupies the majority of the space. This lets Lens classify each pixel as background or text, sample the average colour of each region, and render the translated text so it matches the original.
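The colour-matching heuristic described above can be sketched very simply: split the pixels of a text region into two clusters by brightness, treat the larger cluster as background, and average each cluster’s colour. This is a toy approximation of the idea, assuming RGB tuples as input; Google’s actual pipeline is not public.

```python
def split_text_background(pixels):
    """Heuristically split a text-region crop into background and text colours.

    pixels: list of (r, g, b) tuples. Assumes, as the article describes,
    that text and background differ in colour and that the background
    covers the majority of the region.
    """
    # Luminance of each pixel, thresholded at the midpoint of the range.
    lum = [0.299 * r + 0.587 * g + 0.114 * b for r, g, b in pixels]
    mid = (max(lum) + min(lum)) / 2
    dark = [p for p, l in zip(pixels, lum) if l < mid]
    light = [p for p, l in zip(pixels, lum) if l >= mid]

    # The larger cluster is taken to be the background.
    background, text = (dark, light) if len(dark) >= len(light) else (light, dark)

    def avg(px):
        return tuple(sum(channel) / len(px) for channel in zip(*px))

    return avg(background), avg(text)
```

The translated text would then be rendered in the sampled text colour on top of an overlay filled with the sampled background colour.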
Another challenge is delivering the detected information by reading aloud the text. The high fidelity audio is generated by Google’s Text-to-Speech (TTS) service that applies machine learning to entities like addresses, dates, and phone numbers and uses that information to generate realistic speech, which in turn is based on DeepMind’s WaveNet.
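Before synthesis, entities such as phone numbers have to be normalised so they are read out naturally (digit by digit, rather than as one large number). The snippet below is a toy, regex-based stand-in for that entity handling; the pattern and function names are assumptions for illustration, not part of Google’s TTS service.

```python
import re

DIGIT = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
         "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def normalize_phone(text):
    """Expand phone-number-like digit runs into words before synthesis."""
    def spell(match):
        return " ".join(DIGIT[ch] for ch in match.group() if ch.isdigit())
    # Matches runs of digits with optional dashes/spaces, e.g. "555-0100".
    return re.sub(r"\d[\d\- ]{5,}\d", spell, text)
```

A real text-normalisation front end handles many more entity types (dates, addresses, currencies) and uses learned models rather than regexes, but the goal is the same: feed the waveform model text that reads the way a person would say it.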
Reading these features becomes more contextual when they are paired with the display. Towards achieving this goal, Lens utilises annotations from the TTS service that marks the beginning of each word and highlights each word on the screen as it is being read.
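One way to picture those annotations: given each word’s start offset in the generated audio, the UI can build a schedule of highlight intervals. This is a minimal sketch assuming the TTS service returns per-word start times in seconds; the data shapes here are hypothetical.

```python
def highlight_schedule(words, starts):
    """Pair each word with its start/end time so the UI can highlight it
    on screen as it is read aloud.

    words:  list of words in display order
    starts: assumed per-word start offsets (seconds) from the TTS annotations
    """
    ends = starts[1:] + [None]  # last word's end is unknown until audio finishes
    return [{"word": w, "start": s, "end": e}
            for w, s, e in zip(words, starts, ends)]
```

During playback, the UI would highlight the entry whose `start`/`end` interval contains the current audio position.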
Until very recently, Lens existed only within Google Assistant. That is changing, with the tool finding applications beyond the Assistant, camera and Google Photos app; it now supports other Google products such as Google Maps. In one interesting demo, Google showed how Lens could power an augmented reality view in which notable locations and landmarks are called out with visual overlays.