The second talk of Day 1, “Automated ID Extraction From Scan Copy Of Account Opening Form”, was presented at CVDC 2020, the computer vision conference of the year, by Sushil Ostwal, Head of Data Science/AI at Motilal Oswal Financial Services.
CVDC 2020 is scheduled for the 13th and 14th of August and is organised by the Association of Data Scientists (ADaSCi), the premier global professional body of data science and machine learning professionals.
Ostwal kick-started the talk by discussing how extracting IDs from lakhs of physically scanned documents is a challenging and time-consuming activity. The session was divided into two parts.
In the first part, the speaker covered the approaches taken to extract IDs (photo, PAN card, address proof) from the scanned copy of a physical account opening form, the computer vision approach used to automate ID extraction, the deployment of the solution, and the learnings from the implementation.
The second part of the session covered the steps for masking the Aadhaar card number. Ostwal also discussed the technical details of the libraries, packages, model-building steps and techniques used.
He mentioned the various use cases and challenges related to automated ID extraction. They are-
- Extracting IDs (photo, PAN, address proof) from a scanned copy of an account opening form can be challenging; the form runs to an 80-100 page document. The objective here is to extract the IDs from lakhs of scanned PDF files of brokerage account opening forms.
- Mask the first 8 digits of the Aadhaar number.
- The PAN card and address proof can appear on any page of the document, while the photo appears within the first 15 pages.
- The address proof can be a driving license, voter ID, bank passbook, electricity bill, passport, telephone bill or Aadhaar card.
- The quality of the scanned document can vary from good to poor.
- Solutions available in the market can be costly.
Talking about the libraries and techniques, the speaker mentioned a number of libraries that can be used in the model. However, he mainly stressed four specific libraries that were used during the extraction. The libraries are-
- Pytesseract: This library is used to read text from images via OCR, based on Tesseract.
- CV2: This is the Python interface to the OpenCV library.
- TensorFlow: TensorFlow is used to provide workflows to develop and train object detection models.
- ImageMagick: This library is used for image editing, such as cropping, masking, etc.
Ostwal then walked through the computer vision approach taken to solve the problem, giving a detailed explanation of each process and how it works. Below is a brief explanation of the methods-
1| Photo Extraction
The pre-processing for photo extraction includes defining the initial variables, converting the PDF documents into images, converting the images into grayscale for better readability, and then defining the function to extract the photo from the image file.
The steps included here are-
- Iterate through the first 15 pages of the PDF
- Define the model object- OpenCV's CascadeClassifier
- Set the model parameters via detectMultiScale3, which takes important hyperparameters such as scaleFactor, minNeighbors, minSize and outputRejectLevels
- Model building- face detection classification, which extracts the face if the probability is greater than the threshold
- Enlarge extracted face to get the full photograph.
2| Address Extraction
The second step is address extraction. Here, the pre-processing includes identifying the type of address proof (driving license, passport, etc.), converting the images into grayscale, image thresholding (adaptive thresholding, for better image results), scaling, converting the images to strings, converting the text to lowercase, and defining an address-type keyword list and a negative-word list for correct classification.
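The keyword-list classification at the end of this chain can be sketched in plain Python; the keyword and negative-word entries below are illustrative assumptions, not the production lists:

```python
# Hedged sketch of keyword-based address-proof classification.
# Keyword and negative-word entries are illustrative assumptions.
ADDRESS_KEYWORDS = {
    "driving_license": ["driving licence", "transport department"],
    "passport": ["republic of india", "passport"],
    "voter_id": ["election commission", "elector"],
}
NEGATIVE_WORDS = ["specimen", "sample"]  # suppress false matches

def classify_address_proof(page_text):
    """Return the address-proof type matched in the OCR'd page text, or None."""
    text = page_text.lower()                 # OCR output is lower-cased first
    if any(word in text for word in NEGATIVE_WORDS):
        return None
    for proof_type, keywords in ADDRESS_KEYWORDS.items():
        if any(kw in text for kw in keywords):
            return proof_type
    return None
```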
3| PAN Extraction
The next step is PAN extraction, where the pre-processing steps include checking the file size of the image, converting the images into grayscale, image thresholding by applying a global threshold, increasing the image resolution by doubling the height and width, converting the images into text, converting the text to lowercase, defining a PAN keyword list, rotating the images, and expanding the image dimensions for TensorFlow.
In this process, a TensorFlow SSD Inception V2 model is trained and used for PAN detection, which, according to Ostwal, returns the probability of the image containing a PAN card. If the PAN probability is greater than the threshold and a face is detected, the PAN is extracted.
This ended the first part of the talk. The second part, on masking the Aadhaar number, includes two models: masking the Aadhaar number on the page and masking the Aadhaar number on the Aadhaar card itself. The pre-processing of the Aadhaar number includes the following steps-
- Convert the PDF to images
- Set the image resolution to 300×300
- Apply adaptive thresholding and smoothing (Gaussian blur) for better results
- Use OCR to convert the image to text, then convert the text to lowercase
- Define an Aadhaar keyword list
- Define Aadhaar number regex patterns to extract the first 8 digits
- Set the page numbers to check for the Aadhaar card document
- For every version of the form, extract the coordinates of the first 8 digits of the Aadhaar number
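The regex step for finding and masking the first 8 digits can be sketched as follows; the exact pattern and masking character are assumptions in the spirit of the described approach:

```python
import re

# An Aadhaar number is 12 digits, usually printed as three groups of four.
# The pattern and "X" masking character here are illustrative assumptions.
AADHAAR_RE = re.compile(r"\b(\d{4})\s?(\d{4})\s?(\d{4})\b")

def mask_aadhaar(text):
    """Mask the first 8 digits of any Aadhaar-like number, keeping the last 4."""
    return AADHAAR_RE.sub(lambda m: "XXXX XXXX " + m.group(3), text)
```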
Ostwal stated that full deployment of the model took around two months, executing batches of 1,000 images on three high-end servers, with each server running eight instances of the algorithm in parallel.