Explainable image classification using Faster R-CNN and Grad-CAM

Grad-CAM is an algorithm applied to CNN models to make computer-vision predictions explainable. In this article, we will discuss how we can apply Grad-CAM methods to Faster R-CNN in PyTorch and make the image classification explainable.

Neural networks are black-box models, so it is hard to interpret the predictions they generate. Since most deep learning models are built on neural networks, their inner workings are equally opaque. Different techniques are used to make the results of such models more interpretable, and Grad-CAM is one of them. The major points to be discussed in this article are listed below.

Table of contents 

  1. What is Grad-CAM?
  2. Explaining image classification using Grad-CAM
    1. Importing libraries
    2. Prediction function
    3. Drawing box
    4. Importing image
    5. Applying Grad-CAM

Let’s begin with understanding the Grad-CAM algorithm.

What is Grad-CAM?

In one of our earlier articles, we discussed how the Grad-CAM algorithm makes many computer vision models explainable. Grad-CAM stands for Gradient-weighted Class Activation Mapping, and it is a way to make both newly built and pre-trained CNNs interpretable. It overlays a heat map on the image, and this heat map shows which pixels the model relied on to classify the objects in the image.
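
To make the idea concrete, here is a minimal NumPy sketch of the Grad-CAM computation. It assumes we already have the activations of one convolutional layer and the gradients of the class score with respect to them. The function name grad_cam_map and the toy arrays are hypothetical, chosen only for illustration; in practice a library extracts these tensors from the model.

```python
import numpy as np

def grad_cam_map(activations, gradients):
    """Sketch of Grad-CAM for a single image.

    activations, gradients: arrays of shape (channels, height, width).
    Returns a heat map of shape (height, width) scaled to [0, 1].
    """
    # One weight per channel: the spatial average of its gradients.
    weights = gradients.mean(axis=(1, 2))
    # Weighted sum of the channel activation maps.
    cam = np.tensordot(weights, activations, axes=1)
    # ReLU keeps only the regions that push the class score up.
    cam = np.maximum(cam, 0)
    # Scale to [0, 1] so the map can be rendered as a heat map.
    peak = cam.max()
    return cam / peak if peak > 0 else cam

# Toy example with made-up numbers: 2 channels on a 2x2 grid.
activations = np.array([[[1.0, 0.0], [0.0, 0.0]],
                        [[0.0, 0.0], [0.0, 2.0]]])
gradients = np.array([[[1.0, 1.0], [1.0, 1.0]],
                      [[-1.0, -1.0], [-1.0, -1.0]]])
heatmap = grad_cam_map(activations, gradients)
```

In this toy case the first channel has positive gradients and the second has negative ones, so only the location where the first channel fires stays highlighted.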


In this article, we are going to discuss how we can apply Grad-CAM methods to pre-trained models using PyTorch. An implementation of Grad-CAM for PyTorch is available in the pytorch-grad-cam package.

This implementation includes various pixel attribution methods and works with classification, object detection and semantic segmentation. It can be used with many CNNs and vision transformers (ViT), and it also provides modules that help us define the objects required by the Grad-CAM methods. In this article, we will discuss how to apply Grad-CAM to object detection with a Faster R-CNN model.


We can install this implementation in our environment using the following line of code:

!pip install grad-cam

After installation, we are ready to work with Grad-CAM methods to make the CNN models interpretable.

Explaining image classification using Grad-CAM

Let’s start the process by importing the libraries that we need.

import cv2
import numpy as np
import torch
import torchvision

Prediction function 

The function below runs the model on the input tensor and returns the bounding boxes, class names, labels and indices of the detections whose score reaches the detection threshold.

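def predict(input_tensor, model, device, detection_threshold):
    outputs = model(input_tensor)
    pred_classes = [coco_names[i] for i in outputs[0]['labels'].cpu().numpy()]
    pred_labels = outputs[0]['labels'].cpu().numpy()
    pred_scores = outputs[0]['scores'].detach().cpu().numpy()
    pred_bboxes = outputs[0]['boxes'].detach().cpu().numpy()
    boxes, classes, labels, indices = [], [], [], []
    # Keep only the detections whose score reaches the threshold
    for index in range(len(pred_scores)):
        if pred_scores[index] >= detection_threshold:
            boxes.append(pred_bboxes[index].astype(np.int32))
            classes.append(pred_classes[index])
            labels.append(pred_labels[index])
            indices.append(index)
    boxes = np.int32(boxes)
    return boxes, classes, labels, indices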
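
The filtering step in this function can be tried without a model. The sketch below assumes a detection output in the same dictionary layout that the function reads ('boxes', 'labels', 'scores'), filled with made-up NumPy values. The helper name filter_detections is hypothetical and not part of torchvision.

```python
import numpy as np

def filter_detections(output, detection_threshold):
    """Keep the detections whose score reaches the threshold."""
    keep = np.flatnonzero(output["scores"] >= detection_threshold)
    boxes = output["boxes"][keep].astype(np.int32)
    labels = output["labels"][keep]
    return boxes, labels, keep.tolist()

# Made-up detections: two confident boxes and one weak one.
# Labels 17 and 18 are 'cat' and 'dog' in the coco_names list.
fake_output = {
    "boxes": np.array([[10.2, 20.7, 110.0, 220.0],
                       [30.0, 40.0, 60.0, 80.0],
                       [5.0, 5.0, 50.0, 50.0]]),
    "labels": np.array([17, 18, 1]),
    "scores": np.array([0.98, 0.95, 0.40]),
}
boxes, labels, indices = filter_detections(fake_output, 0.9)
```

With a threshold of 0.9, the weak third detection is dropped and only the indices of the first two are kept. Those indices are useful later for matching each heat map target to the detection it came from.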

Drawing box

The function below draws a bounding box and a class name on the image for each detection.

def draw_boxes(boxes, labels, classes, image):
    for i, box in enumerate(boxes):
        color = COLORS[labels[i]]
        # Box from its top-left corner to its bottom-right corner
        cv2.rectangle(
            image,
            (int(box[0]), int(box[1])),
            (int(box[2]), int(box[3])),
            color, 2
        )
        # Class name next to the box
        cv2.putText(image, classes[i], (int(box[0]), int(box[3] - 5)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.8, color, 2,
                    lineType=cv2.LINE_AA)
    return image

Defining class names

coco_names = ['__background__', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', \
              'bus', 'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A', 
              'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 
              'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'N/A', 'backpack', 'umbrella',
              'N/A', 'N/A', 'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard',
              'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard',
              'surfboard', 'tennis racket', 'bottle', 'N/A', 'wine glass', 'cup', 'fork',
              'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich', 'orange',
              'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch',
              'potted plant', 'bed', 'N/A', 'dining table', 'N/A', 'N/A', 'toilet',
              'N/A', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone', 'microwave',
              'oven', 'toaster', 'sink', 'refrigerator', 'N/A', 'book', 'clock', 'vase',
              'scissors', 'teddy bear', 'hair drier', 'toothbrush']

Defining box colours

COLORS = np.random.uniform(0, 255, size=(len(coco_names), 3))

With all of the above defined, we are ready to make predictions with the Faster R-CNN model.

Importing image 

To make the model predict the classes of the objects in an image, we are using the image loaded below.

from PIL import Image
image = np.array(Image.open("/content/download (2).jfif"))


Here we can see that we have a cat and a dog in the image. Now we need to prepare the image and the model so that we can make predictions.

Using the below lines of code, we can define the transform for the image.

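import torchvision
image_float_np = np.float32(image) / 255  # float copy in [0, 1] for the heat-map overlay
# Convert the image to a float tensor; ToTensor is assumed to be the only transform needed
transform = torchvision.transforms.Compose([
    torchvision.transforms.ToTensor(),
])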

In the above, we have created a float copy of the image scaled to the range 0 to 1, which is used later for the heat-map overlay, and defined the transform that converts the image into a tensor for the model.
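
As a rough illustration of what this kind of preprocessing does to the array, the NumPy sketch below mimics the usual image-to-tensor conversion: it scales 8-bit pixel values to the range 0 to 1 and moves the channel axis to the front. The helper to_chw_float is hypothetical and only stands in for the torchvision transform; it is not the library's code.

```python
import numpy as np

def to_chw_float(image_hwc_uint8):
    """Convert an (H, W, C) uint8 image to a (C, H, W) float32 array in [0, 1]."""
    scaled = image_hwc_uint8.astype(np.float32) / 255.0
    return np.transpose(scaled, (2, 0, 1))

# Made-up 4x6 image that is pure red.
image = np.zeros((4, 6, 3), dtype=np.uint8)
image[..., 0] = 255

chw = to_chw_float(image)
# Detection models expect a batch, so a leading batch axis is added.
batch = chw[np.newaxis, ...]
```

The extra leading axis plays the same role as the unsqueeze(0) call applied to the input tensor before it is passed to the model.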

Modelling and predicting 

We are going to load a pre-trained Faster R-CNN model from torchvision to predict which classes are present in the image, along with their bounding boxes.

input_tensor = transform(image)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
input_tensor = input_tensor.to(device)
# Add a batch dimension
input_tensor = input_tensor.unsqueeze(0)
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
# Inference mode, on the same device as the input
model.eval().to(device)

Let’s visualize the results.

boxes, classes, labels, indices = predict(input_tensor, model, device, 0.9)
image1 = draw_boxes(boxes, labels, classes, image)


Here we can see that the model gives good results: the predicted classes are right.

Although the results look good, what happened inside the model is not known; we can consider the whole process a black box. In development, a model may do well on some samples and worse on others, which can cause a large loss of accuracy. To make the model more accurate, we need better interpretability.

Applying Grad-CAM

For CNN and ViT models, Grad-CAM comes to our rescue. Using the methods in the Grad-CAM package, we can check which pixels are responsible for the model's predictions. In this article, we are going to use one of these methods, named EigenCAM. Using the following lines of code we can perform this.
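
For intuition, EigenCAM needs no gradients: it projects the activations of the chosen layer onto their first principal component. The NumPy sketch below illustrates that idea under simplifying assumptions (a single image and one activation array). The name eigen_cam_map is hypothetical and this is not the library's implementation.

```python
import numpy as np

def eigen_cam_map(activations):
    """Sketch of EigenCAM for a single image.

    activations: array of shape (channels, height, width).
    Returns a heat map of shape (height, width) scaled to [0, 1].
    """
    channels, height, width = activations.shape
    # Rows are spatial positions, columns are channels.
    flat = activations.reshape(channels, -1).T
    flat = flat - flat.mean(axis=0)
    # The first right-singular vector is the first principal direction.
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    projection = (flat @ vt[0]).reshape(height, width)
    # The sign of a singular vector is arbitrary; this sketch simply
    # clips the negative part and rescales what is left.
    cam = np.maximum(projection, 0)
    peak = cam.max()
    return cam / peak if peak > 0 else cam

# Made-up activations: 8 channels on a 5x5 grid.
rng = np.random.default_rng(0)
activations = rng.random((8, 5, 5))
heatmap = eigen_cam_map(activations)
```

Because no class-specific gradient is involved, a map like this highlights whatever dominates the activations, which is one reason it is convenient for detectors whose outputs are awkward to differentiate.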

from pytorch_grad_cam import AblationCAM, EigenCAM
from pytorch_grad_cam.ablation_layer import AblationLayerFasterRCNN
from pytorch_grad_cam.utils.model_targets import FasterRCNNBoxScoreTarget
from pytorch_grad_cam.utils.reshape_transforms import fasterrcnn_reshape_transform
from pytorch_grad_cam.utils.image import show_cam_on_image, scale_accross_batch_and_channels, scale_cam_image
# Apply the CAM method to the backbone of the detector
target_layers = [model.backbone]
targets = [FasterRCNNBoxScoreTarget(labels=labels, bounding_boxes=boxes)]
cam = EigenCAM(model,
               target_layers,
               reshape_transform=fasterrcnn_reshape_transform)
grayscale_cam = cam(input_tensor, targets=targets)
# Take the heat map of the first (and only) image in the batch
grayscale_cam = grayscale_cam[0, :]
cam_image = show_cam_on_image(image_float_np, grayscale_cam, use_rgb=True)
image_with_bounding_boxes = draw_boxes(boxes, labels, classes, cam_image)


In the above output, the heat map is most intense on the faces of the animals, which indicates that the model relied mainly on those pixels to make its predictions.
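
The overlay itself is a simple blend of the image and a coloured version of the heat map. The sketch below assumes a float RGB image in the range 0 to 1 and a heat map in the same range, and it uses a deliberately crude colour map (red for high values, blue for low ones) in place of the library's. The name overlay_heatmap and its image_weight parameter are hypothetical, used only for illustration.

```python
import numpy as np

def overlay_heatmap(image_rgb, heatmap, image_weight=0.5):
    """Blend an (H, W, 3) float image with an (H, W) heat map."""
    # Crude colour map: high values are red, low values are blue.
    colour = np.stack(
        [heatmap, np.zeros_like(heatmap), 1.0 - heatmap], axis=-1
    )
    blended = image_weight * image_rgb + (1.0 - image_weight) * colour
    # Rescale so the brightest value is 1, then convert to 8-bit.
    blended = blended / blended.max()
    return np.uint8(255 * blended)

# Made-up grey image with a heat map that is hot in one corner.
image_rgb = np.full((2, 2, 3), 0.5, dtype=np.float32)
heatmap = np.array([[1.0, 0.0], [0.0, 0.0]], dtype=np.float32)
result = overlay_heatmap(image_rgb, heatmap)
```

The hot corner comes out red-dominant and the remaining pixels blue-dominant, which is the same reading we apply to the heat map above: warmer colours mark the pixels the model relied on most.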

In the code above, we applied the CAM method to model.backbone. Since the backbone computes the meaningful activations, it is the suggested place to apply Grad-CAM methods in a Faster R-CNN model.
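
One practical detail: the backbone of this model is a feature pyramid, so it returns several feature maps at different resolutions rather than a single tensor, which is why the code above imports a reshape transform. The NumPy sketch below shows the general idea of such a transform: resize every map to one common size and stack them along the channel axis. The function names, the nearest-neighbour resize and the toy shapes are assumptions made for illustration, not the library's code.

```python
import numpy as np

def resize_nearest(feature_map, out_h, out_w):
    """Nearest-neighbour resize of a (C, H, W) array to (C, out_h, out_w)."""
    _, height, width = feature_map.shape
    rows = np.arange(out_h) * height // out_h
    cols = np.arange(out_w) * width // out_w
    return feature_map[:, rows[:, None], cols[None, :]]

def merge_feature_maps(feature_maps, target_key):
    """Resize all maps to the size of one level and stack their channels."""
    out_h, out_w = feature_maps[target_key].shape[-2:]
    # Magnitudes only: the sign of an activation is ignored in this sketch.
    resized = [resize_nearest(np.abs(fmap), out_h, out_w)
               for fmap in feature_maps.values()]
    return np.concatenate(resized, axis=0)

# Made-up pyramid with three levels of 4 channels each.
feature_maps = {
    "0": np.ones((4, 8, 8)),
    "1": -np.ones((4, 4, 4)),
    "pool": np.ones((4, 2, 2)),
}
merged = merge_feature_maps(feature_maps, target_key="pool")
```

After merging, the CAM method sees a single activation array with all pyramid levels side by side, which it can treat like the output of an ordinary convolutional layer.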

Final words

In this article, we introduced the Grad-CAM method for making CNNs and CNN-like models interpretable. We also discussed how to apply it to a Faster R-CNN model using the PyTorch library.



Yugesh Verma
Yugesh is a graduate in automobile engineering and worked as a data analyst intern. He completed several Data Science projects. He has a strong interest in Deep Learning and writing blogs on data science and machine learning.
