Extraction Of Aadhaar IDs Using OpenCV & TensorFlow - Sushil Ostwal, Head Data Science at Motilal Oswal Financial Services

The second talk of Day 1, "Automated ID Extraction From Scan Copy Of Account Opening Form", was presented at the computer vision conference of the year, CVDC 2020, by Sushil Ostwal, Head of Data Science/AI at Motilal Oswal Financial Services.

CVDC 2020 is scheduled for the 13th and 14th of August and is organised by the Association of Data Scientists (ADaSCi), the premier global professional body of data science and machine learning professionals.

Ostwal kick-started the talk by discussing how extracting IDs from lakhs of physically scanned documents is a challenging and time-consuming activity. The session was divided into two parts.


In the first part, the speaker covered the approaches taken to extract IDs (photo, PAN card, address proof) from the scanned copy of the physical account opening form, the computer vision approach used to automate the ID extraction, the deployment of the solution, and the learnings from the implementation.

The second part of the session covered the steps for masking the Aadhaar card number. During the first part, Ostwal also discussed the technical details of the libraries, packages, model-building steps and techniques used.


He mentioned the various use cases related to automated ID extraction:

  • Extracting IDs (photo, PAN, address proof) from a scanned copy of the account opening form can be challenging; the form runs to an 80-100 page document. The objective is to extract the IDs from lakhs of scanned PDF files of brokerage account openings.
  • Mask the first 8 digits of the Aadhaar number.
  • The PAN card and address proof can appear on any page of the document; the photo appears in the first 15 pages.
  • The address proof can be a driving licence, voter ID, bank passbook, electricity bill, passport, telephone bill or Aadhaar card.
  • The quality of the scanned documents varies from good to poor.
  • Solutions available in the market can be costly.

Talking about the libraries and techniques, the speaker mentioned a number of libraries that can be used in the model. However, he mainly stressed four specific libraries that were used during the extraction:

  • Pytesseract: This library is used to read text from images via OCR, based on Tesseract.
  • CV2: The Python interface to OpenCV, used for image-processing tasks such as face detection and thresholding.
  • TensorFlow: TensorFlow provides the workflows to develop and train the object detection models.
  • ImageMagick: This library is used for image editing, such as cropping, masking, etc.

Ostwal then walked through the computer vision approach taken to solve the problem, giving a detailed explanation of each process and how it works. Below is a brief explanation of the methods:

1| Photo Extraction

The pre-processing for photo extraction includes defining the initial variables, converting the PDF documents into images, converting the images into grayscale for better readability, and then defining the function to extract the photo from the image file.
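The grayscale conversion step can be sketched in plain Python using the standard ITU-R BT.601 luma weights, which are the same weights OpenCV's `cvtColor` applies for its colour-to-gray conversion (the per-pixel loop below is illustrative; the real pipeline would call the library):

```python
def to_grayscale(rgb_pixels):
    """Convert a list of (R, G, B) tuples to grayscale intensities using
    the ITU-R BT.601 luma weights, the same weighting OpenCV's cvtColor
    uses for COLOR_BGR2GRAY."""
    return [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in rgb_pixels]

# Pure white stays 255, pure black stays 0.
print(to_grayscale([(255, 255, 255), (0, 0, 0)]))  # [255, 0]
```

In practice this is a single `cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)` call; the explicit formula just shows what that call computes.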

The steps involved are:

  • Iterate through the first 15 pages of the PDF.
  • Define the model object: OpenCV's CascadeClassifier.
  • Set the model parameters via detectMultiScale3, which takes important hyperparameters such as scaleFactor, minNeighbors, minSize and outputRejectLevels.
  • Model building: face detection classification, extracting the face if the probability is greater than the threshold.
  • Enlarge the extracted face region to get the full photograph.
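The final step deserves a sketch: the cascade classifier returns a tight box around the face, but the ID photograph is larger, so the box is enlarged around its centre and clamped to the page. The helper name and scale factor below are illustrative assumptions, not from the talk:

```python
def enlarge_box(x, y, w, h, scale, img_w, img_h):
    """Enlarge a face bounding box (x, y, width, height), as returned by
    OpenCV's detectMultiScale3, by `scale` around its centre, clamped to
    the image borders, so the crop covers the full ID photograph rather
    than just the detected face."""
    cx, cy = x + w / 2, y + h / 2          # box centre
    new_w, new_h = w * scale, h * scale    # enlarged dimensions
    nx = max(0, int(cx - new_w / 2))       # clamp left/top to the page
    ny = max(0, int(cy - new_h / 2))
    nw = min(img_w, int(cx + new_w / 2)) - nx  # clamp right/bottom
    nh = min(img_h, int(cy + new_h / 2)) - ny
    return nx, ny, nw, nh

# A 100x100 face at (200, 200), doubled, inside a 1000x1000 page image.
print(enlarge_box(200, 200, 100, 100, 2.0, 1000, 1000))  # (150, 150, 200, 200)
```

The clamping matters because faces near a page edge would otherwise produce negative or out-of-range crop coordinates.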

2| Address Extraction

The second step is address extraction. The pre-processing here includes gathering the address proofs (driving licence, passport, etc.), converting the images into grayscale, image thresholding (adaptive thresholding for better results), scaling, converting the images to strings, converting the text to lowercase, and defining an address-type keyword list and a negative-word list for correct classification.
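The keyword-list classification described above can be sketched as follows. The keyword and negative-word lists are illustrative assumptions; the actual lists used in the talk were not shared:

```python
# Illustrative lists: the actual keyword and negative-word lists from
# the talk were not shared.
ADDRESS_KEYWORDS = ["driving licence", "voter", "passbook",
                    "electricity", "passport", "telephone", "aadhaar"]
NEGATIVE_WORDS = ["specimen", "sample"]

def classify_address_proof(ocr_text):
    """Classify OCR'd page text as an address-proof type: reject pages
    containing a negative word, otherwise return the first matching
    address-type keyword (or None if nothing matches)."""
    text = ocr_text.lower()
    if any(neg in text for neg in NEGATIVE_WORDS):
        return None
    for kw in ADDRESS_KEYWORDS:
        if kw in text:
            return kw
    return None

print(classify_address_proof("Republic of India PASSPORT"))  # passport
```

Lowercasing the OCR output first (as the talk describes) lets the keyword match ignore the inconsistent casing typical of scanned documents.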

3| PAN Extraction

The next step is PAN extraction, where the pre-processing steps include checking the file size of the image, converting the images to grayscale, image thresholding with a global threshold, increasing the image resolution by doubling the height and width, converting the images into text, converting the text to lowercase, defining a PAN keyword list, rotating the images, and expanding the image dimensions for TensorFlow.

In this process, a TensorFlow SSD Inception V2 model is used for PAN detection, which, according to Ostwal, returns the probability of the image containing a PAN. If the PAN probability is greater than the threshold and a face is detected, the PAN is extracted.
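The decision logic can be sketched in a few lines. A PAN number has a fixed format (five letters, four digits, one letter), so a regex over the lowercased OCR text can pick out the PAN token once the detector probability and face check pass; the threshold value and function name are illustrative assumptions:

```python
import re

# PAN format: five letters, four digits, one letter (e.g. ABCDE1234F).
PAN_PATTERN = re.compile(r"\b[a-z]{5}[0-9]{4}[a-z]\b")

def extract_pan(ocr_text, pan_prob, face_detected, threshold=0.8):
    """Return the PAN string if the detector's probability clears the
    threshold, a face was found on the card, and the OCR text contains
    a PAN-shaped token. The 0.8 threshold is illustrative."""
    if pan_prob <= threshold or not face_detected:
        return None
    match = PAN_PATTERN.search(ocr_text.lower())
    return match.group(0).upper() if match else None

print(extract_pan("Permanent Account Number ABCDE1234F", 0.95, True))  # ABCDE1234F
```

Requiring both the detector probability and a detected face, as the talk describes, filters out pages where the OCR happens to produce a PAN-shaped string by accident.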

This ended the first part of the talk. The second part, on masking the Aadhaar number, includes two models: masking the Aadhaar number on the page and masking it on the Aadhaar card itself. The pre-processing of the Aadhaar number includes the following steps:

  • Convert the PDF to images.
  • Set the image resolution to 300x300.
  • Apply adaptive thresholding and smoothening (Gaussian blur) for better results.
  • Run OCR to convert the image to text and convert the text to lowercase.
  • Define the Aadhaar keyword list.
  • Define Aadhaar-number regex patterns to extract the first 8 digits.
  • Set the page numbers to check for the Aadhaar card document.
  • For every version of the form, extract the coordinates of the first 8 digits of the Aadhaar number.
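The regex step above can be sketched at the text level. An Aadhaar number is 12 digits, usually printed in 4-4-4 groups, and only the last 4 should remain visible. This is a text-level sketch only; as the steps above note, the actual pipeline locates the digit coordinates per form version and masks them on the image (e.g. with ImageMagick):

```python
import re

# Aadhaar numbers are 12 digits, typically printed in 4-4-4 groups.
AADHAAR_PATTERN = re.compile(r"\b(\d{4})\s?(\d{4})\s?(\d{4})\b")

def mask_aadhaar(text):
    """Mask the first 8 digits of every Aadhaar-shaped number in the
    text, keeping only the last 4 digits visible."""
    return AADHAAR_PATTERN.sub(lambda m: "XXXX XXXX " + m.group(3), text)

print(mask_aadhaar("Aadhaar No: 1234 5678 9012"))  # Aadhaar No: XXXX XXXX 9012
```

A bare 12-digit regex can also match non-Aadhaar numbers, which is presumably why the talk pairs it with an Aadhaar keyword list and page-number checks before masking.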


Ostwal stated that the full deployment of the model took around two months, executing batches of 1,000 images on three high-end servers, with each server running eight instances of the algorithm in parallel.


Ambika Choudhury
A Technical Journalist who loves writing about Machine Learning and Artificial Intelligence. A lover of music, writing and learning something out of the box.


