MITB Banner

How differential binarization can improve real-time scene text detection

Effective Binarization can help in better image segmentation.

Share

Illustration by Analytics India Magazine

Listen to this story

Ever wondered how Google Lens scans and reads wifi password at your favourite Starbucks? The answer lies in real time scene text detection.

In a recent paper, Real Time Scene-Text Detection with Differentiable Binarization, researchers have proposed an adaptive segmentation technique for various use cases of scene text detection. Before proceeding further, let’s understand Image segmentation and Image Binarization. 

Image segmentation is a technique of grouping similar segments of an image together based on similarity in pixel values (Semantic Segmentation) or based on similarity in Instances (Instance Segmentation) of the same object in an image. 

Fig 1: Example of Image Segmentation in autonomous driving systems 

Fig 2: How Image Segmentation fits in a Machine Learning Architecture

Image Binarization is a technique to convert any image to a binary scaled image.

Fig 3: Example of Image Binarization

In most text detection systems, segmentation is followed by Binarization and pixel grouping techniques to detect texts in real scenarios. The proposed architecture in this paper uses binarization in the segmentation network using a Differentiable Binarization Function so that the binarized image can be trained end to end in a CNN. 

Fig 4 : Model Architecture

 Three key concepts in the model architecture are:

Differentiable Binarization: Any standard binarization function like one mentioned below (P is the probability map and t is the set threshold) would be non – differentiable and hence it cannot be optimised along with segmentation.

Hence, to make binarization optimizable, a function like mentioned below (P and T are the probability map and threshold map learned from the network) would be differentiable:-

Adaptive Thresholding: Adaptive Thresholding is optimized thresholding for effective binarization within the segmentation network. Refer Adaptive Threshold under section Methodology in the paper. 

Deformable Convolution: This step helps in the convoluting over spatial images or images with larger aspect ratios. Refer here for more on Deformable Convolution Network.

To get an understanding of how the deformable convolution has been performed in the paper take a look here.

Fig 5:- Source code snippet on implementation of the Deformable Convolution layer

How does the model work?

  1. The input image is firstly fed into a feature-pyramid backbone (for e.g using RESNET). To know more about feature-pyramid backbone refer here. 
  2. Then, the obtained pyramid features are up-sampled (using image augmentation )to the same scale of the input image and cascaded to produce a feature F, which will be used to predict probability and threshold.
  3. Obtained  feature F from the previous step  is used to predict both the probability map (P) and the threshold map (T). Threshold map generates threshold for effective binarization in the following steps.
  1. After that, the approximate binary map (Bˆ) is calculated by P and F as enlisted in formula (2) of the paper.
  1. Supervision is then applied during model training on probability map, threshold map and approximate binary map with probability and binary map sharing the same supervision.
  1. In the inference period, the bounding boxes can be obtained easily from the approximate binary map or the probability map by a box formulation module using the Vatti Clipping Algorithm as enlisted in formula (10) of the paper.

Fig 6:- Some visualisation results on text instances of various shapes, including curved text, multi-oriented text, vertical text, and long text lines. For each unit, the top right is the threshold map; the bottom right is the probability map.

As per review metrics, DB has worked well with RESNET-50 and RESNET-18 as backbones for real time inference and accurate text detections. However, the method can’t handle cases of “text inside text” or if an overlapping text region is in the centre region of another text instance.

Refer the section Implementation Details to know how the model was trained , sections Limitation and Conclusion for more details on the model performance.

For accessing the entire research paper with highlighted sections and added notes, click here.

Share
Picture of Souptik Majumder

Souptik Majumder

I am an AI/ML Enthusiast and a working data scientist. I write to demystify interesting AI/ML Research papers. My inspiration to study more about AI and ML came after watching the show ‘Person of Interest!’ Go watch it if you haven’t already! Cheers!
Related Posts

CORPORATE TRAINING PROGRAMS ON GENERATIVE AI

Generative AI Skilling for Enterprises

Our customized corporate training program on Generative AI provides a unique opportunity to empower, retain, and advance your talent.

Upcoming Large format Conference

May 30 and 31, 2024 | 📍 Bangalore, India

Download the easiest way to
stay informed

Subscribe to The Belamy: Our Weekly Newsletter

Biggest AI stories, delivered to your inbox every week.

AI Forum for India

Our Discord Community for AI Ecosystem, In collaboration with NVIDIA. 

Flagship Events

Rising 2024 | DE&I in Tech Summit

April 4 and 5, 2024 | 📍 Hilton Convention Center, Manyata Tech Park, Bangalore

MachineCon GCC Summit 2024

June 28 2024 | 📍Bangalore, India

MachineCon USA 2024

26 July 2024 | 583 Park Avenue, New York

Cypher India 2024

September 25-27, 2024 | 📍Bangalore, India

Cypher USA 2024

Nov 21-22 2024 | 📍Santa Clara Convention Center, California, USA

Data Engineering Summit 2024

May 30 and 31, 2024 | 📍 Bangalore, India

Subscribe to Our Newsletter

The Belamy, our weekly Newsletter is a rage. Just enter your email below.