Object detection is considered one of the most challenging tasks in computer vision, a subset of artificial intelligence, as it involves both classifying objects and localising them within an image or video. It detects objects such as animals, people, cars and buildings, and is widely applied in video surveillance, self-driving cars and object tracking.
In standard image classification, we present an input image to the neural network and obtain a single class label for the most dominant object in the image, along with an associated probability score. Object detection, which builds on image classification, additionally localises each object with a bounding box and attaches a probability/confidence score to each predicted class.
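To make the contrast concrete, here is an illustrative sketch of the two kinds of output; the labels, boxes and scores below are made up for illustration only.

# Image classification: one label and one probability for the whole image (illustrative values)
classification_output = {"label": "dog", "score": 0.94}

# Object detection: a bounding box, a label and a confidence score for every detected object
detection_output = [
    {"box": [0.12, 0.20, 0.55, 0.48], "label": "dog", "score": 0.91},     # [ymin, xmin, ymax, xmax]
    {"box": [0.30, 0.60, 0.80, 0.95], "label": "person", "score": 0.87},
]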
The following are a few common architectures used for object detection:
- R-CNN
- FAST R-CNN
- FASTER R-CNN
- MASK R-CNN
- YOLO (You Only Look Once)
- SSD (Single Shot MultiBox Detector)
In this article, we are going to perform object detection using transfer learning, with InceptionResnet_v2 as our pre-trained model for this task.
A brief introduction to the InceptionResnet_v2 architecture:
Deep convolutional networks have been central to image-based tasks, and successive versions of the Inception network have shown that very high accuracy can be achieved at a relatively low computational cost. K. He et al. introduced residual connections to deep learning, demonstrating how important they are for training very deep networks. Since the Inception network is itself very deep, it is natural to replace its filter concatenation stage with residual connections. The authors argue that this allows Inception to reap all the benefits of the residual approach while retaining its computational efficiency.
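For intuition, the sketch below is a toy block (not the exact Inception-ResNet-v2 module) showing how a residual connection can be wrapped around an Inception-style group of parallel convolution branches using Keras; the layer widths and block structure are assumptions chosen for illustration.

import tensorflow as tf
from tensorflow.keras import layers

def toy_inception_resnet_block(x, filters=32):
    # Inception-style parallel branches (simplified for illustration)
    b1 = layers.Conv2D(filters, 1, padding="same", activation="relu")(x)
    b2 = layers.Conv2D(filters, 1, padding="same", activation="relu")(x)
    b2 = layers.Conv2D(filters, 3, padding="same", activation="relu")(b2)
    b3 = layers.Conv2D(filters, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(filters, 3, padding="same", activation="relu")(b3)
    b3 = layers.Conv2D(filters, 3, padding="same", activation="relu")(b3)
    mixed = layers.Concatenate()([b1, b2, b3])
    # Project back to the input depth so the residual addition is shape-compatible
    up = layers.Conv2D(x.shape[-1], 1, padding="same")(mixed)
    # Residual connection: add the block's output to its input
    return layers.Activation("relu")(layers.Add()([x, up]))

inputs = tf.keras.Input(shape=(64, 64, 128))
outputs = toy_inception_resnet_block(inputs)
toy_model = tf.keras.Model(inputs, outputs)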
To gain a deeper understanding of these architectures, you can check these papers: 1, 2.
Now it is time to implement object detection with InceptionResnet using Python.
Code Implementation of InceptionResnet_v2 Model
The following code implementation is adapted from the official implementation.
Importing all dependencies:
import tensorflow as tf
import tensorflow_hub as hub

# for downloading and displaying images
import matplotlib.pyplot as plt
import tempfile
from six.moves.urllib.request import urlopen
from six import BytesIO

# for dataframes
import pandas as pd

# for drawing onto the image
import numpy as np
from PIL import Image, ImageColor, ImageDraw, ImageFont, ImageOps

import time
Helper functions to download, visualise and draw on the image:
def disp_ima(image):
    fig = plt.figure(figsize=(18, 13))
    plt.grid(False)
    plt.imshow(image)
def get_and_reshape_img(url, width=250, height=250, display=False):
    _, name = tempfile.mkstemp(suffix=".jpg")
    response = urlopen(url)
    image_data = response.read()
    image_data = BytesIO(image_data)
    pil_ima = Image.open(image_data)
    pil_ima = ImageOps.fit(pil_ima, (width, height), Image.ANTIALIAS)
    pil_ima_rgb = pil_ima.convert("RGB")
    pil_ima_rgb.save(name, format="JPEG", quality=90)
    print("Image downloaded to %s." % name)
    if display:
        disp_ima(pil_ima)
    return name
def boxes_on_image(image, ymin, xmin, ymax, xmax, color, font, thickness=4,
                   display_str_list=()):
    draw = ImageDraw.Draw(image)
    width, height = image.size
    (left, right, top, bottom) = (xmin * width, xmax * width,
                                  ymin * height, ymax * height)
    draw.line([(left, top), (left, bottom), (right, bottom), (right, top), (left, top)],
              width=thickness, fill=color)
    display_heights = [font.getsize(ds)[1] for ds in display_str_list]
    # Each display_str has a top and bottom margin of 0.05x.
    total_height = (1 + 2 * 0.05) * sum(display_heights)
    if top > total_height:
        text_bottom = top
    else:
        text_bottom = top + total_height
    # Reverse list and print from bottom to top.
    for display_str in display_str_list[::-1]:
        text_width, text_height = font.getsize(display_str)
        margin = np.ceil(0.05 * text_height)
        draw.rectangle([(left, text_bottom - text_height - 2 * margin),
                        (left + text_width, text_bottom)],
                       fill=color)
        draw.text((left + margin, text_bottom - text_height - margin),
                  display_str, fill="black", font=font)
        text_bottom -= text_height - 2 * margin
def drawing_boxes(image, boxes, class_names, scores, max_boxes=10, min_score=0.1):
    colors = list(ImageColor.colormap.values())
    try:
        font = ImageFont.truetype(
            "/usr/share/fonts/truetype/liberation/LiberationSansNarrow-Regular.ttf", 25)
    except IOError:
        print("Font not found, using default font.")
        font = ImageFont.load_default()
    for i in range(min(boxes.shape[0], max_boxes)):
        if scores[i] >= min_score:
            ymin, xmin, ymax, xmax = tuple(boxes[i])
            display_str = "{}: {}%".format(class_names[i].decode("ascii"),
                                           int(100 * scores[i]))
            color = colors[hash(class_names[i]) % len(colors)]
            image_pil = Image.fromarray(np.uint8(image)).convert("RGB")
            boxes_on_image(image_pil, ymin, xmin, ymax, xmax, color, font,
                           display_str_list=[display_str])
            np.copyto(image, np.array(image_pil))
    return image
Download and display the image from a URL:
img_url = "http://1.bp.blogspot.com/-nn23fvzDZBw/T_HTlRZYJuI/AAAAAAAAA5U/wHSWnIySyww/s1600/best+cool+nice+cute+awesome+desktop+background+wallpapers+%252817%2529.jpg"
web_img = get_and_reshape_img(img_url, 1250, 850, True)

Running inference with the architecture:
Load the model from TensorFlow Hub:
module = "https://tfhub.dev/google/faster_rcnn/openimages_v4/inception_resnet_v2/1"
model = hub.load(module).signatures['default']
User-defined functions for loading the image and running the model:
def load_image(path):
    # Read the JPEG file from disk and decode it into an RGB image tensor
    img = tf.io.read_file(path)
    img = tf.image.decode_jpeg(img, channels=3)
    return img
def run_model(model, path):
    img = load_image(path)
    converted_img = tf.image.convert_image_dtype(img, tf.float32)[tf.newaxis, ...]
    start = time.time()
    result = model(converted_img)
    end = time.time()
    result = {key: value.numpy() for key, value in result.items()}
    print("Found %d objects." % len(result["detection_scores"]))
    print("Inference time: ", end - start)
    image_with_boxes = drawing_boxes(
        img.numpy(),
        result["detection_boxes"],
        result["detection_class_entities"],
        result["detection_scores"])
    disp_ima(image_with_boxes)
run_model(model, web_img)

With the help of pandas, we can check the score of each identified object; here, we will look at the top 10 detections.
image = load_image(web_img)
converted_image = tf.image.convert_image_dtype(image, tf.float32)[tf.newaxis, ...]
result = model(converted_image)
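The model output is a dictionary of tensors; a minimal sketch of tabulating it with pandas might look as follows. It assumes the output keys already used above ("detection_class_entities" and "detection_scores"), and the DataFrame name scores_df is ours.

# Convert the output tensors to NumPy arrays before building the DataFrame
result = {key: value.numpy() for key, value in result.items()}
scores_df = pd.DataFrame({
    "entity": [name.decode("ascii") for name in result["detection_class_entities"]],
    "score": result["detection_scores"],
})
# Show the ten highest-scoring detections
print(scores_df.sort_values("score", ascending=False).head(10))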

One more example:
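The same pipeline can be rerun on any other image; a minimal sketch is below, where the URL is only a placeholder (not from the original article) to be replaced with an image of your choice.

# Placeholder URL; substitute any publicly accessible JPEG
img_url_2 = "https://example.com/sample.jpg"
web_img_2 = get_and_reshape_img(img_url_2, 1250, 850, True)
run_model(model, web_img_2)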

Conclusion
In this article, we have seen how combining residual connections with the Inception architecture enhances the overall performance of the network and yields accurate bounding boxes for each entity in the given images. Furthermore, with this minimal code, we can easily deploy the system to web and Android platforms.