Using Grad-CAM to Visually Verify the Performance of a CNN Model


Convolutional neural networks (CNNs) have proven to be very proficient at tasks like image classification, face recognition and document analysis. But as their accuracy and complexity increase, their interpretability decreases. A solution to a problem like face recognition can involve dozens or even hundreds of layers and millions of parameters to train, which makes the model difficult to read, debug and trust. CNNs appear to be black boxes that take in inputs and give outputs with great accuracy without offering any intuition about how they work.

As a deep learning engineer, it is your responsibility to make sure the model is working correctly. Suppose you are given the task of classifying different birds. The dataset contains images of birds with plants and trees in the background. If the network is looking at the plants and trees instead of the bird, there is a good chance it will miss the features of the bird and misclassify the image. How do we know our model is looking at the right thing? In this article, we will discuss how to address the risk of treating CNN models as black boxes and how to check whether a CNN is relying on the features that actually matter for classification or recognition.

What will we discuss in this article?

  • What is Grad-CAM?
  • How to use Grad-CAM?
  • How does Grad-CAM visualize the region of interest of a CNN model?

What is Grad-CAM?

One way to check that a model is looking at the right thing is to visualize what the CNN is actually looking at, using Grad-CAM. Gradient-weighted Class Activation Mapping (Grad-CAM) produces a heat map that highlights the regions of an image that matter most for a target class (bird, elephant), using the gradients of that class score with respect to the feature maps of the final convolutional layer.

We take the feature maps of the final convolutional layer and weigh every channel by the gradient of the class score with respect to that channel. The result combines how intensely the input image activates each channel with how important each channel is for the class. Grad-CAM does not require any re-training or change to the existing architecture.
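The two steps described above are usually written as a pair of equations. In this sketch, A^k stands for the k-th feature map of the final convolutional layer, y^c for the network's score for class c, and Z for the number of spatial positions in one feature map; these symbols are introduced here for illustration and are not used elsewhere in the article.

```latex
\alpha_k^c = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial y^c}{\partial A^k_{ij}}
\qquad
L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\!\left( \sum_{k} \alpha_k^c \, A^k \right)
```

The code in this article averages over the channels instead of summing them. That changes the map only by a constant factor, which disappears once the heat map is normalized.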


We begin with a pre-trained model, VGG16, which was trained on ImageNet. ImageNet is a very large collection of annotated photographs, and the version used to train this model covers 1000 classes. In this example, we will pick a class called ‘Shades’ (sunglasses) and apply Grad-CAM to it.

Loading Pre-Trained CNN model

from keras.applications.vgg16 import VGG16, preprocess_input, decode_predictions
from keras.preprocessing import image
import numpy as np
import cv2
import keras.backend as K
from skimage import io

model = VGG16(weights="imagenet")

Loading and preparing the Image

Now, choose any image from the internet that contains sunglasses. I have chosen an image of Tony Stark here. Since we are using a pre-trained VGG16 model, the input image must be 224×224 pixels. Once the image is resized, it is converted to an array, given a batch dimension and preprocessed in the same way as the training data.

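# Fill in the path or URL of your image between the quotes
sunglasses = io.imread("")
# VGG16 expects 224x224 inputs
sunglasses = cv2.resize(sunglasses, dsize=(224, 224), interpolation=cv2.INTER_CUBIC)
x = image.img_to_array(sunglasses)
# Add a batch axis: (224, 224, 3) -> (1, 224, 224, 3)
x = np.expand_dims(x, axis=0)
x = preprocess_input(x)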
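To make the last step less opaque, here is a minimal NumPy sketch of what VGG16-style preprocessing does, assuming the Caffe-style convention (channels reordered from RGB to BGR, then the ImageNet channel means subtracted, with no rescaling). The function name `caffe_style_preprocess` is hypothetical and is only a stand-in for `preprocess_input`, not a replacement for it.

```python
import numpy as np

# Per-channel ImageNet means in BGR order, as used by Caffe-style preprocessing.
IMAGENET_BGR_MEANS = np.array([103.939, 116.779, 123.68], dtype=np.float32)

def caffe_style_preprocess(batch_rgb):
    """Hypothetical stand-in for VGG16's preprocess_input (illustration only)."""
    batch = np.asarray(batch_rgb, dtype=np.float32)
    batch_bgr = batch[..., ::-1]              # reorder channels RGB -> BGR
    return batch_bgr - IMAGENET_BGR_MEANS     # zero-center each channel, no scaling

# A dummy mid-gray 224x224 "image" with a leading batch axis, shaped like x above.
dummy = np.full((1, 224, 224, 3), 128.0)
out = caffe_style_preprocess(dummy)
print(out.shape)      # (1, 224, 224, 3)
print(out[0, 0, 0])   # per-channel values after mean subtraction
```

The practical consequence is that the array fed to the network no longer looks like an image (values are centered around zero), so keep the original `sunglasses` array around for display.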

Making Prediction

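We get the model's predictions for the image and take the output of the final convolutional layer. ImageNet has 1000 classes, and the label ‘shades/sunglasses’ is class 837, so that is the output we select. The model summary below shows that the final convolutional layer is block5_conv3.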


Model: "vgg16"
Layer (type)                 Output Shape              Param #
----------------------------------------------------------------
input_1 (InputLayer)         (None, 224, 224, 3)       0
block1_conv1 (Conv2D)        (None, 224, 224, 64)      1792
block1_conv2 (Conv2D)        (None, 224, 224, 64)      36928
block1_pool (MaxPooling2D)   (None, 112, 112, 64)      0
block2_conv1 (Conv2D)        (None, 112, 112, 128)     73856
block2_conv2 (Conv2D)        (None, 112, 112, 128)     147584
block2_pool (MaxPooling2D)   (None, 56, 56, 128)       0
block3_conv1 (Conv2D)        (None, 56, 56, 256)       295168
block3_conv2 (Conv2D)        (None, 56, 56, 256)       590080
block3_conv3 (Conv2D)        (None, 56, 56, 256)       590080
block3_pool (MaxPooling2D)   (None, 28, 28, 256)       0
block4_conv1 (Conv2D)        (None, 28, 28, 512)       1180160
block4_conv2 (Conv2D)        (None, 28, 28, 512)       2359808
block4_conv3 (Conv2D)        (None, 28, 28, 512)       2359808
block4_pool (MaxPooling2D)   (None, 14, 14, 512)       0
block5_conv1 (Conv2D)        (None, 14, 14, 512)       2359808
block5_conv2 (Conv2D)        (None, 14, 14, 512)       2359808
block5_conv3 (Conv2D)        (None, 14, 14, 512)       2359808
block5_pool (MaxPooling2D)   (None, 7, 7, 512)         0
flatten (Flatten)            (None, 25088)             0
fc1 (Dense)                  (None, 4096)              102764544
fc2 (Dense)                  (None, 4096)              16781312
predictions (Dense)          (None, 1000)              4097000
----------------------------------------------------------------
Total params: 138,357,544
Trainable params: 138,357,544
Non-trainable params: 0

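preds = model.predict(x)
# Score of class 837 (shades/sunglasses) as a symbolic output tensor
class_output = model.output[:, 837]
# Final convolutional layer of VGG16
last_conv_layer = model.get_layer("block5_conv3")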
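Before reading a heat map for class 837, it is worth confirming that the model actually ranks that class highly for the image. `decode_predictions` does this with human-readable labels; the sketch below shows the same check at the index level using only NumPy. The helper `top_k_classes` and the `fake_preds` vector are made up for illustration and stand in for the real output of `model.predict(x)`.

```python
import numpy as np

def top_k_classes(preds, k=5):
    """Return (class_index, probability) pairs for the k highest scores of one image."""
    scores = np.asarray(preds)[0]                 # drop the batch axis
    order = np.argsort(scores)[::-1][:k]          # indices sorted by descending score
    return [(int(i), float(scores[i])) for i in order]

# Made-up prediction vector standing in for model.predict(x): 1000 classes,
# with most of the probability mass placed on index 837 by construction.
fake_preds = np.full((1, 1000), 0.0005)
fake_preds[0, 837] = 0.4
fake_preds[0, 836] = 0.1
fake_preds /= fake_preds.sum()

top = top_k_classes(fake_preds, k=3)
print(top[0][0])   # 837
```

If the target class does not appear near the top for your own image, the heat map still shows what would push the network toward that class, but it should not be read as an explanation of the actual prediction.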

Visualizing the Region of Interest of the CNN Model

We now have all the information needed for the visualization. Compute the gradients of the class output with respect to the feature maps of the last convolutional layer. Then average the gradients over the batch and spatial axes to get one weight per channel, and weigh each channel of the feature map with its weight.

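# K.gradients needs graph mode; under TensorFlow 2.x eager execution it raises an error,
# and tf.GradientTape is the usual replacement
grads = K.gradients(class_output, last_conv_layer.output)[0]
# One weight per channel: average over the batch, height and width axes
pooled_grads = K.mean(grads, axis=(0, 1, 2))
iterate = K.function([model.input], [pooled_grads, last_conv_layer.output[0]])
pooled_grads_value, conv_layer_output_value = iterate([x])
# Scale each of the 512 channels by its weight
for i in range(512):
  conv_layer_output_value[:, :, i] *= pooled_grads_value[i]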
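The per-channel loop can also be written as a single broadcast multiplication. The sketch below uses small random arrays as stand-ins for the real values (VGG16 would give a 14×14×512 feature map and 512 weights); the names `feature_maps` and `channel_weights` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Small random stand-ins for the real arrays (VGG16 would give 14x14x512).
feature_maps = rng.normal(size=(4, 4, 8))    # height x width x channels
channel_weights = rng.normal(size=8)          # one pooled gradient per channel

# Loop version, mirroring the code above.
weighted_loop = feature_maps.copy()
for i in range(weighted_loop.shape[-1]):
    weighted_loop[:, :, i] *= channel_weights[i]

# Broadcasting version: the weight vector lines up with the last axis.
weighted_vec = feature_maps * channel_weights

print(np.allclose(weighted_loop, weighted_vec))   # True
```

Broadcasting avoids hard-coding the channel count, which helps when the same code is reused with a network whose last convolutional layer does not have 512 channels.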

We average the weighted feature maps along the channel dimension, which gives a heat map of size 14×14. We then set negative values to zero, normalize the map to lie between 0 and 1, resize it to the image size and overlay it on the original image.

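heatmap = np.mean(conv_layer_output_value, axis=-1)
# Keep only positive evidence for the class, then scale to [0, 1]
heatmap = np.maximum(heatmap, 0)
heatmap /= np.max(heatmap)
# Upsample the 14x14 map to the image size and color it
heatmap = cv2.resize(heatmap, (sunglasses.shape[1], sunglasses.shape[0]))
heatmap = np.uint8(255 * heatmap)
heatmap = cv2.applyColorMap(heatmap, cv2.COLORMAP_JET)
# skimage loads RGB, while the OpenCV color map and cv2_imshow use BGR
sunglasses_bgr = cv2.cvtColor(sunglasses, cv2.COLOR_RGB2BGR)
superimposed_img = cv2.addWeighted(sunglasses_bgr, 0.5, heatmap, 0.5, 0)
from google.colab.patches import cv2_imshow
cv2_imshow(superimposed_img)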
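One edge case in the normalization step is worth guarding against: if the weighted feature maps contain no positive values for the chosen class, the maximum after clipping is 0 and the division produces NaNs. A minimal sketch of a safer normalization follows; the helper name `normalize_heatmap` and the `eps` threshold are assumptions for illustration, not part of the code above.

```python
import numpy as np

def normalize_heatmap(raw_map, eps=1e-8):
    """Clip negatives (ReLU) and scale to [0, 1]; return zeros if nothing is positive."""
    heat = np.maximum(np.asarray(raw_map, dtype=np.float64), 0.0)
    peak = heat.max()
    if peak < eps:                 # avoid dividing by zero on an empty map
        return np.zeros_like(heat)
    return heat / peak

example = np.array([[-1.0, 0.5], [2.0, 1.0]])
print(normalize_heatmap(example))
print(normalize_heatmap(-np.ones((2, 2))))
```

An all-zero map is itself informative: it suggests the last convolutional layer holds no positive evidence for that class in this image.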




In the resulting image, the heat map is concentrated on the sunglasses, which shows that the network is looking where we want it to look when it predicts this class. Grad-CAM is useful not only for visualization but also for debugging and fine-tuning a model to get better results.


Bhoomika Madhukar
I am an aspiring data scientist with a passion for teaching. I am a computer science graduate from Dayananda Sagar Institute. I have experience in building models in deep learning and reinforcement learning. My goal is to use AI in the field of education to make learning meaningful for everyone.
